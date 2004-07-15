I have been involved in scanning text using a digital camera for over a year now and I can steer you in the right direction. I am currently writing a rather extensive paper on the subject for publication to the web, but it is not quite done yet.



Speed is my top priority, but quality is a close second, and I do try to OCR all my results. I can tell you from experience that you will get rather good OCR results with a resolution as low as 200 dpi. I would advise you to get the highest megapixel camera you can, as you will get markedly better results. Most books at the library are not in fact as large as letter size (8.5 x 11 inches, roughly A4); they usually measure a smaller 6 inches wide. Now then, to determine DPI, take the horizontal resolution of the camera in pixels and divide by the width in inches of the area you are shooting. For example, a 9 inch wide area at 200 DPI needs a horizontal resolution of 1800. To find the vertical, divide 1800 by 4 and multiply by 3 (cameras take pics in a 4 x 3 ratio) and you get 1800 x 1350, which is 2.43 megapixels. This is a minimum value to get OCR of greater than 95% accuracy. An A3 sized area (an 8.5 x 11 book opened up showing two pages, about 17 x 11 inches) requires about 8 megapixels to get good OCR.
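The resolution arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are my own, the 4 x 3 ratio and the 200 DPI target come from the figures above, and the 17 inch width for an opened 8.5 x 11 book is an assumption (two 8.5 inch pages side by side).

```python
import math

def required_pixels(width_inches, dpi=200, aspect=(4, 3)):
    """Return (width_px, height_px, megapixels) for a 4:3 camera.

    width_inches is the width of the area being photographed;
    dpi is the target resolution on the page (200 assumed here).
    """
    width_px = math.ceil(width_inches * dpi)
    # Divide by 4 and multiply by 3 to get the other dimension.
    height_px = math.ceil(width_px * aspect[1] / aspect[0])
    return width_px, height_px, width_px * height_px / 1_000_000

def effective_dpi(horizontal_pixels, width_inches):
    """DPI you actually get: camera pixels divided by inches covered."""
    return horizontal_pixels / width_inches

print(required_pixels(9))    # (1800, 1350, 2.43)
print(required_pixels(17))   # (3400, 2550, 8.67)
```

The second call gives a little over 8 megapixels for the two-page spread, in line with the rough figure quoted above.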



Light will make or break your shots. I use Photoshop to blacken all dark gray pixels, but shadowy areas become obscured when I do. I do not use a flash with some of my digital cameras because it creates a washout effect. I have one camera that cannot image a page at all without the flash and flanking halogen lights. Kodak cameras are good at reproducing natural light. If you can read the page, so can the camera. Kodak cameras also have a macro mode that allows better close-ups, which is essential in document imaging. There is a new Kodak 6 MP camera (DX7630) that even has a document setting. If you use a flash, just tape some plastic milk jug material over it to diffuse the light. You may need several layers. This also makes the flash somewhat unobtrusive. One thing you can do to improve your results is to raise the exposure value (EV). A high EV lets in more light, making for better text in low light conditions.



Use the highest possible resolution with no compression. If you zoom in on your images, the text is actually a series of black and dark gray pixels. The more compression there is, or the lower the resolution, the more gray pixels you will have. In my experience, OCR apps do not read gray pixels as well as black ones. This is why I use Photoshop to darken my text.
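To show what darkening the text means at the pixel level, here is a minimal Python sketch. It is not what Photoshop does internally; it is a plain threshold, and the function name and the cutoff of 96 are assumptions picked for illustration. Pixels are grayscale values from 0 (black) to 255 (white).

```python
def blacken_dark_gray(rows, cutoff=96):
    """Turn every pixel darker than cutoff into pure black.

    rows is a list of rows, each a list of grayscale values 0-255.
    Lighter pixels are left untouched. The cutoff is a guess and
    would need tuning for each camera and lighting setup.
    """
    return [[0 if value < cutoff else value for value in row]
            for row in rows]

# A tiny made-up patch: page background, gray fringe, and dark text.
sample = [
    [255, 200, 90, 30],
    [250, 80, 10, 120],
]
print(blacken_dark_gray(sample))
# [[255, 200, 0, 0], [250, 0, 0, 120]]
```

A cutoff set too high will also blacken shadowed parts of the page, which is the trade-off with uneven lighting.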



I find a USB card reader works best for transferring images. The transfer software that cameras include can be very temperamental, but my USB card reader never fails. Get yourself a decent tripod or build a copystand for your camera if you are on location. If you do a web search for "digital camera" plus "genealogy" you will find many useful plans and tips for using a digital camera. I also like to connect the camera's video out to a TV set so I can see what I am taking a picture of. This is really a lot better than a tiny LCD screen.



One word of advice that took me a few books to realize: Tape your books to a tabletop so they do not move. Each turn of the page slowly moves the book, and it is a nightmare in post to correct all the misalignments.



I use a variety of software to get the final product. I use a batch renaming program to rename my files as even and odd numbers when I take pics of two pages at a time. I use ThumbsPlus as a thumbnail viewer and to apply simple rotation or grayscale conversion. I use a program called ClearImage to convert the large JPGs to bitonal TIFFs a fraction of the size (a 90% size reduction in most cases). I do everything with macros, and I use a macro program to record and play back my mouse clicks so it all works automatically. Then of course there is the OCR software and the final conversion to a PDF with text under image. I am still looking for a way to run OCR on the high resolution pictures and then replace them with low-res compressed versions (these are still human readable).
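As a rough idea of what the batch renaming step could look like in a script, here is a small Python sketch. The naming scheme is hypothetical, not the one my renaming program uses: it assumes each shot is a two-page spread with the even page on the left, that the camera's file names sort in shooting order, and that the first spread starts at page 2. It only plans the new names and does not touch any files.

```python
from pathlib import Path

def plan_spread_names(filenames, first_left_page=2, prefix="page"):
    """Pair each spread image with a name carrying its page numbers.

    Returns a list of (old_name, new_name) tuples. Nothing is renamed
    on disk; the plan can be reviewed first and applied separately.
    """
    plan = []
    for index, name in enumerate(sorted(filenames)):
        left = first_left_page + 2 * index   # even page on the left
        right = left + 1                     # odd page on the right
        suffix = Path(name).suffix.lower()
        plan.append((name, f"{prefix}_{left:04d}-{right:04d}{suffix}"))
    return plan

shots = ["DSC0003.JPG", "DSC0001.JPG", "DSC0002.JPG"]
for old, new in plan_spread_names(shots):
    print(old, "->", new)
# DSC0001.JPG -> page_0002-0003.jpg
# DSC0002.JPG -> page_0004-0005.jpg
# DSC0003.JPG -> page_0006-0007.jpg
```

Keeping the page numbers in the file name makes it easy to spot a skipped or duplicated spread before the OCR step.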



What I have written here barely scratches the surface of all this. I have 20 pages of material written that go into much greater detail. If you give me your email address, I can point you to my website when it is done.