After the consumer-like process of scanning two books by Neel Doff, I thought I was in heaven. The resulting PDF is neat, keeps the nice texture of the old book and reads very well on an e-reader.
But to make this work accessible to machine agents, the text needs to be unformatted, in a plain text file. The Linux command I knew of seemed easy and magic: pdftotext file.pdf.
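For a whole shelf of scans, the same command can be wrapped in a small script. What follows is only a minimal sketch: it assumes pdftotext (usually shipped with poppler-utils) is installed, and the helper names build_pdftotext_command and convert are made up for this illustration. The -enc and -layout options are regular pdftotext options.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path=None, keep_layout=False):
    """Return the argument list for a pdftotext call (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    if txt_path is None:
        # Same default as the command line: file.pdf becomes file.txt
        txt_path = pdf_path.with_suffix(".txt")
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # Ask pdftotext to keep the physical layout of the page
        command.append("-layout")
    command += [str(pdf_path), str(txt_path)]
    return command


def convert(pdf_path, txt_path=None, keep_layout=False):
    """Run pdftotext and return the path of the text file it wrote."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on this system")
    command = build_pdftotext_command(pdf_path, txt_path, keep_layout)
    subprocess.run(command, check=True)
    return Path(command[-1])
```

Calling convert("file.pdf") should then do the same as typing pdftotext file.pdf in the terminal, only with the output encoding set explicitly to UTF-8.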
With the naivety of the consumer, I opened the text file expecting to only have to delete the side texts, like the introduction, the credits, etc.
But, oh, no. The content of the file is pure art, a beautiful piece of text that is hardly legible in some places!
Unfortunately the job is not perfect. Because OCR has difficulty interpreting particular elements of layout and fonts, the txt file comes with a lot of errors.
Some recurring phenomena are:
*specific letters get confused in some fonts (OCR can take m for n, or I for i, etc.)
*headers might have become part of sentences
*footnotes are placed inside the flowing text
*page numbers are not recognized as such
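Some of these problems can be reduced automatically before proofreading by hand. The sketch below is a rough heuristic, not a complete cleaner. It assumes that pages in the text file are separated by a form feed character (as pdftotext writes them by default), that a page number sits alone on its own line, and that a running header is the first non-empty line of several pages. The function name clean_ocr_text and the threshold header_min_repeats are invented for this example. Headers that have merged into a sentence, and footnotes inside the flowing text, still need a human eye.

```python
import re
from collections import Counter

# A line that contains nothing but a short number is treated as a page number
PAGE_NUMBER = re.compile(r"^\s*\d{1,4}\s*$")


def clean_ocr_text(text, header_min_repeats=3):
    """Drop standalone page numbers and repeated running headers."""
    pages = text.split("\f")

    # Count the first non-empty line of every page: a line that opens
    # many pages is probably a running header.
    first_lines = Counter()
    for page in pages:
        for line in page.splitlines():
            if line.strip():
                first_lines[line.strip()] += 1
                break
    headers = {line for line, n in first_lines.items() if n >= header_min_repeats}

    kept = []
    for page in pages:
        for line in page.splitlines():
            if PAGE_NUMBER.match(line):
                continue  # page number on its own line
            if line.strip() in headers:
                continue  # repeated running header
            kept.append(line)
    return "\n".join(kept)
```

Note that the number rule will also delete a legitimate line that consists only of a number, so it is worth comparing the result with the original before throwing the original away.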
Project Gutenberg has a collective tool for proofreading OCR'ed scans. If your book is in the public domain and you sign up as a proofreader, you can activate your own project and invite other proofreaders in. As a result your book becomes part of Project Gutenberg, a stable and durable collection, and will be freely shared all over the planet.
If your scanned book is under copyright, you can speed up the text corrections by copying the txt into an OpenOffice document. The spellcheck option will be your greatest collaborator.
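The same idea as the spellcheck can be roughed out in a script, to get a list of the most frequent suspicious words before opening the document. This is only a sketch under assumptions: the function name suspicious_words is hypothetical, and known_words stands for any word list you trust in the language of the book (on many Linux systems a file such as /usr/share/dict/words could serve, where installed). A word is flagged when it is not in the list, or when an uppercase letter follows a lowercase one, which is a typical trace of I being read for i.

```python
import re
from collections import Counter

# A word: letters only, optionally joined by an apostrophe or a hyphen
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def has_inner_capital(word):
    """True if an uppercase letter directly follows a lowercase one."""
    return any(a.islower() and b.isupper() for a, b in zip(word, word[1:]))


def suspicious_words(text, known_words):
    """Count words that are unknown or contain a capital inside the word."""
    known = {w.lower() for w in known_words}
    counts = Counter()
    for match in WORD.finditer(text):
        word = match.group(0)
        if word.lower() not in known or has_inner_capital(word):
            counts[word] += 1
    return counts
```

Sorting the result with most_common() puts the errors that occur most often at the top, and those are good candidates for a careful search and replace.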