Distributed Proofreaders

Many of you may already have heard of Distributed Proofreaders (DP), which I just came across on the O'Reilly Radar blog. DP is a network of volunteers who perform the proofreading equivalent of the double-keying method:

During proofreading, volunteers are presented with a scanned page image and the corresponding OCR text on a single web page. This allows the text to be easily compared to the image, proofread, and sent back to the site. A second volunteer is then presented with the first volunteer's work and the same page image, verifies and corrects the work as necessary, and submits it back to the site. The book then similarly progresses through two formatting rounds using the same web interface.
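For anyone thinking about how a smaller project might track pages through a workflow like this, here is a minimal Python sketch of the four rounds described above. It is not DP's actual software: the `Page` class, the round names, and the rule that consecutive passes must come from different volunteers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical round labels: two proofreading passes, then two formatting passes.
ROUNDS = ["proofread-1", "proofread-2", "format-1", "format-2"]


@dataclass
class Page:
    image: str   # path or URL of the scanned page image
    text: str    # starts out as the raw OCR text
    completed: list = field(default_factory=list)  # (round, volunteer) pairs

    def next_round(self):
        """Return the round this page is waiting for, or None when finished."""
        if len(self.completed) < len(ROUNDS):
            return ROUNDS[len(self.completed)]
        return None

    def submit(self, volunteer, revised_text):
        """Record one volunteer's pass and move the page to the next round."""
        current = self.next_round()
        if current is None:
            raise ValueError("page has already finished all rounds")
        # Assumption: each pass is checked by someone other than the
        # volunteer who did the previous pass.
        if self.completed and self.completed[-1][1] == volunteer:
            raise ValueError("a different volunteer must do the next pass")
        self.text = revised_text
        self.completed.append((current, volunteer))


# Example: two volunteers take a page through both proofreading rounds.
page = Page(image="page_001.png", text="Tbe quick hrown fox")
page.submit("alice", "The quick hrown fox")
page.submit("bob", "The quick brown fox")
```

After the two submissions, `page.next_round()` returns `"format-1"`, so the page is ready for the first formatting round. A real service would also need to store the pages and serve the image and text side by side.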

Titles proofread by DP are submitted to Project Gutenberg, but would this idea work for smaller digitization projects? How would a library go about getting users to proofread dirty OCR?


Wow, you're way ahead of me!

Wow, you're way ahead of me! Thanks for the link to your site. It looks like we have some momentum here... I'd be interested in hearing more about how people envision a service for proofreading dirty OCR or transcribing manuscript materials. It would be neat if plugins or other add-ons could be written for the applications that institutions use to present material to users.

I have pondered this idea as well. Read more of my thoughts on the subject in my post "Archival Transcriptions: for the public, by the public."

Basically, I think it could work very well. The trick is building infrastructure that small institutions could use: either open source software that could be installed anywhere, or a central website to which small institutions could easily upload their digital documents for transcription or proofreading. I like the idea of a central site best, because it means that a community of people who enjoy doing this sort of thing could easily find projects from multiple small institutions. I also suspect that small institutions would rather not add to their own overhead by taking on a new software tool to install and support.