Older books, rescued by OCR
Nan Barnes
Two of my new clients need me to perform OCR on their books. OCR, you say - is that like CPR? Sort of. OCR, or optical character recognition, can save the life of a book that would otherwise die in this digital age. It allows us to scan hard copies of books, ones that are not already on a computer, and to transcribe the text into a word processing program like Microsoft Word. Then the text can be edited and designed just like a modern book.
For the first book project, the OCR will be easier. That's because the author wrote the book very simply, using a very clear font in black on white paper. Although this book was printed out on an inexpensive home printer back in the 1980s, the type is easily recognizable. All I have to do is peel away the cover and tug it away from the interior pages, which are glued together. That glue is so old that it will be very easy, with the aid of a sharp razor knife, to slice at the binding and separate each page, until I have a clean stack of two-sided pages. I have a good document scanner that allows me to feed a stack of about 50 pages into the hopper, and it will, relatively quickly, scan both sides of each page. Unlike with a regular scan, the outcome will not be picture files such as JPEGs or TIFFs. Instead, one of the options in the scanner's software is to process the scan with OCR, which identifies the text letter by letter. With a clear, simple document, as I have in this book, I can expect an accuracy of about 99% or better. Occasionally the software may be confused by a letter (or more likely a letter combination, such as "fi" or "mn"), so I will have to fix the error manually in Word. But it sure beats typing an entire book!
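To put that 99% figure in perspective, here is a rough back-of-the-envelope sketch in Python. The page count and characters per page are made-up numbers for illustration, not measurements from this book, and the function name is my own.

```python
def expected_ocr_errors(total_characters: int, accuracy: float) -> int:
    """Estimate how many characters OCR gets wrong at a given accuracy (0 to 1)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(total_characters * (1.0 - accuracy))


# Hypothetical book size, assumed only for this example.
pages = 200
characters_per_page = 1800

total = pages * characters_per_page
errors = expected_ocr_errors(total, 0.99)

print(f"{total} characters in the book")
print(f"about {errors} characters to fix, or roughly {errors / pages:.0f} per page")
```

At 99% accuracy, one character in a hundred is wrong, so even a clean scan leaves a handful of fixes on every page. That is still a small job compared with retyping.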
In the second book I have in front of me, I see some complicating factors. It was also a home-printed book, and it has been kept in a three-ring binder for these past 30 years or so. At one time, the holes punched in the paper for those three rings would have confused my OCR software. But the more recent versions are not thrown by this non-text element.
However, graphics and photos that interrupt the regular lines of text can stymie OCR software, particularly if the type is not clear. This book also contains handwritten captions and arrows pointing to the images - not good. So I need to work around this. How? I sort through the book and locate all the pages that are text-only. They're going to be the easy ones, so I put them in one stack and feed them through my document scanner, performing OCR as I go. Then I tackle the pages with complicating elements such as handwriting and graphics separately, on my flatbed scanner. The software for my higher-end flatbed scanner allows me to select specific areas of the page. I can zoom in specifically on text alone, identify it as text, and have it go to the OCR software. While I have that page on the scanner, I next zoom in on the photograph, scan it, and save it as a separate file, to be inserted into the book later. This is a more cumbersome process, as I must work on one page at a time, but it is still far better than typing an entire book.
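The sorting step described above can be sketched in a few lines of Python. This is only an illustration of the idea: the `Page` record and both function names are hypothetical, and the batch size of 50 simply reflects the rough capacity of my document scanner's hopper.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    has_graphics: bool = False
    has_handwriting: bool = False


def sort_pages(pages):
    """Split page numbers into a feeder stack (text only) and a flatbed stack."""
    feeder, flatbed = [], []
    for page in pages:
        if page.has_graphics or page.has_handwriting:
            flatbed.append(page.number)  # handle one page at a time
        else:
            feeder.append(page.number)   # safe to batch through the hopper
    return feeder, flatbed


def batches(page_numbers, size=50):
    """Group the feeder stack into hopper-sized batches."""
    return [page_numbers[i:i + size] for i in range(0, len(page_numbers), size)]


# Hypothetical four-page sample: pages 2 and 3 need the flatbed.
sample = [Page(1), Page(2, has_graphics=True), Page(3, has_handwriting=True), Page(4)]
feeder, flatbed = sort_pages(sample)
print("feeder:", feeder)
print("flatbed:", flatbed)
```

Any page with a photo, a drawing, or handwriting goes to the flatbed stack, even if most of the page is clean type, because one stray element can throw off the OCR for the whole page.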
How can this be useful to you? In the case of the first author, he is revisiting writing that he had done earlier in his life, in order to publish it now. Back then, he couldn't find a publisher. Now, we live in a time when it is so easy to publish a book - why not? In the case of the second author, this isn't her book; it was written by her mother, who is now gone, and these are her mother's life stories. So this is a way for the daughter to honor her mother, and to preserve her writing and family photos in a way that is more lasting than a three-ring binder. We will design a beautiful cover with her mother's image. How precious.