by collective good / July 15, 2004 6:16 AM PDT

I'm new to this but plan soon to visit a few archives, where I'd like to capture text (and some images, which would ideally reproduce at publication quality, since they are to be included in a book). My primary hope is to find the simplest and most time- and cost-effective means to shoot the pages and OCR them so that I can keyword-search and edit them. I'm looking into using a digital camera since some archives won't let you lay a scanner down atop the pages (I'm assuming a camera is quicker, too, but please disabuse me of any unfounded assumptions). Note that some of these situations are in relatively low-light environments. I'd ideally like to buy a digital camera that I can quickly transfer data from so that I don't end up in an archive without the memory needed to capture potentially large amounts of data. I could bring a laptop into the archive. I'm inexperienced, but could learn if there's a fast way to transfer data straight from camera to laptop.

At the same time, it would be great if the same camera could have a zoom and fast-sequence capturing of images that would allow for good photojournalism. Am I asking too much of a single camera with these diverse desires? I would prefer to buy just one but am eager to hear of the costs and implications of all compromises.

Thanks for your consideration.

7 total posts
Re: capturing text for OCR in lowlight- ideal minimum MP + >
by R. Proffitt Forum moderator / July 15, 2004 6:54 AM PDT

For such, you'll want a large format black and white film camera. Its low-light performance is magnitudes better than that of any of your available digital cameras.

Bob

Re: capturing text for OCR in lowlight- ideal minimum MP + >
by collective good / July 15, 2004 7:10 AM PDT

Thanks for the fast response. But since I must get a digital camera for my other purpose (web journalism, so speed-of-shot and zoom capabilities outweigh megapixels here), and since I have heard of others taking digital photos of text, I'm still curious about what might be my best digital choice for my upcoming archives visits. When I say low light, what I mean is a library-like setting, usually with some natural light via windows plus fluorescent or other room lighting. I may be able to use flash in some instances, but have to assume, since that could be disturbing to other patrons, that in most cases I will not be able to use flash.

I have actually been looking at some other options in cnet since posting earlier: "pen" scanners that can be run like a highlighter or (better) a wand across the page. So maybe my envisioned use of a digital camera for this purpose is all wrong, as you seem to suggest. The image-capture element, while lower priority than the text capture because I can only use a limited number of images in the book anyway, is still a subject of interest. And the fast-sequence + zoom interests I have for getting shots for my students' webzine are still relevant to my current buying preparations. Should I be posting separate posts for all these things? I was kind of hoping for an all-in-one solution that may not be there. Anyway, if you or anyone has anything more to add now that you are armed with the additional info posted here, I'd invite any additional feedback. Thanks again.

Let's try a little math then.
by R. Proffitt Forum moderator / July 15, 2004 8:27 AM PDT

First, let me write that LOW LIGHT is the death knell for digital cameras that you can buy from the usual places.

To get any decent copy, the scan or picture of the page will have to be about 300 DPI. Let's take an 8.5 by 11 inch sheet with 1 inch borders and figure out just how many megapixels the imager will need.

With the 1 inch borders we now need a 6.5 x 9 inch image. I'm going to fib about problems with width and height for now and just find out how many pixels we'll need. So far we have 58.5 square inches to get a 300 DPI image of. That's 300 x 300 times 58.5, or 5,265,000 pixels.
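
For anyone who wants to rerun that arithmetic with other page sizes, here is a minimal sketch in Python. The function name pixels_needed is made up for illustration, and like the figures above it ignores the sensor's width to height ratio.

```python
def pixels_needed(width_in, height_in, dpi):
    """Return the pixel count needed to cover an area at a given DPI.

    Hypothetical helper: it ignores the sensor aspect ratio, just as
    the back-of-envelope estimate in the post does.
    """
    width_px = round(width_in * dpi)    # pixels across the text block
    height_px = round(height_in * dpi)  # pixels down the text block
    return width_px * height_px


# An 8.5 x 11 inch sheet with 1 inch borders leaves 6.5 x 9 inches of text.
total = pixels_needed(8.5 - 2 * 1, 11 - 2 * 1, 300)
print(total, "pixels")
```

Changing the last argument to 200 shows how much a lower DPI target relaxes the requirement.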

The problem of light notwithstanding, and setting aside the issues of the width and height ratio, you will need to move to an 8 megapixel (MP) camera to get a clean shot of the page, which you'll then have to post-process with some imaging software to flatten it back out. As a photographer, you will know how straight lines curve when you shoot too close.

At least you can buy 8 MP cameras.

Bob

There is a ray of light
by junkbokx / July 16, 2004 11:46 AM PDT

I have been involved in scanning text using a digital camera for over a year now and I can steer you in the right direction. I am currently writing a rather extensive paper on the subject for publication to the web, but it is not quite done yet.

Speed is my top priority, but quality is a close second, and I do try to OCR all my results. I can tell you from experience that you will get rather good OCR results with a resolution as low as 200 DPI. I would advise you to get the highest megapixel camera you can, as you will get markedly better results. Most books at the library are not in fact as large as letter size (8.5 x 11 inches, close to A4); they usually measure a smaller 6 inches wide. Now then, to determine DPI, take the horizontal resolution of the camera and divide by the number of inches the book is wide. Going the other way, a 9 inch wide book at 200 DPI needs a horizontal resolution of 1800. To find the vertical, divide 1800 by 4 and multiply by 3 (cameras take pics in a 4 x 3 ratio) and you get 1800 x 1350, which is 2.43 megapixels. This is a minimum value to get OCR of greater than 95% accuracy. An A3 sized area (roughly two 8.5 x 11 pages opened up side by side) requires about 8 megapixels to get good OCR.
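
The same calculation can be sketched in a few lines of Python. The function names are my own invention, and the sketch assumes the subject fills the frame edge to edge on a 4 x 3 sensor, which is a simplification.

```python
def required_resolution(width_in, dpi, aspect=(4, 3)):
    """Return (horizontal, vertical) pixels for a frame of this width.

    Assumes the frame width is filled edge to edge and that the sensor
    has a 4:3 aspect ratio; both are simplifications.
    """
    horizontal = round(width_in * dpi)
    # Round the vertical count up so the frame is never undersized.
    vertical = -(-horizontal * aspect[1] // aspect[0])
    return horizontal, vertical


def effective_dpi(horizontal_px, width_in):
    """DPI actually achieved: camera resolution divided by subject width."""
    return horizontal_px / width_in


h, v = required_resolution(9, 200)
print(h, "x", v, "=", h * v, "pixels")
print(effective_dpi(1800, 9), "DPI")
```

Calling required_resolution(17, 200) for a 17 inch wide two-page spread gives 3400 x 2550, which is in line with the "about 8 megapixels" figure.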

Light will make or break your shots. I use Photoshop to blacken all dark gray pixels, but shadowy areas become obscured. I do not use a flash with some of my digital cameras because it creates a washout effect. I have one camera that can not image a page at all without the flash and flanking halogen lights. Kodak cameras are good at reproducing natural light. If you can read the page, so can the camera. Kodak cameras also have a macro mode that allows better close-ups, essential in document imaging. There is a new Kodak 6 MP camera (DX7630) that even has a document setting. If you use a flash just tape some plastic milk jug material over the flash to diffuse it. You may need several layers. This also makes the flash somewhat unobtrusive. One thing you can do to improve your results is to raise the exposure value (EV). A high EV lets in more light making for better text in low light conditions.

Use the highest possible resolution with no compression. If you zoom in on your images the text is actually a series of black and dark gray pixels. The more compression there, or the lower the resolution, the more gray pixels you will have. Gray pixels are not read as well as black in OCR apps in my experience. This is why I use Photoshop to darken my text.
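
To make the darkening step concrete, here is a rough Python sketch of the idea, not the actual Photoshop operation. It works on a plain list of grayscale rows rather than a real image file, and the threshold of 96 is an arbitrary value chosen for illustration.

```python
def blacken_dark_pixels(rows, threshold=96):
    """Set every grayscale value below `threshold` to 0 (pure black).

    `rows` is a list of rows of 0-255 grayscale values. The threshold of
    96 is an arbitrary illustrative value, not one from the post.
    """
    return [[0 if value < threshold else value for value in row]
            for row in rows]


# A tiny 2 x 4 "image": dark gray text pixels next to light paper.
sample = [
    [40, 90, 200, 250],
    [95, 96, 130, 255],
]
print(blacken_dark_pixels(sample))
```

Note that anything darker than the threshold goes black, including shadows, so uneven lighting on the page gets blackened along with the text.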

I find a USB card reader works best in transferring images. The transfer software that cameras include can be very temperamental to use, but my USB card reader never fails. Get yourself a decent tripod or build a copystand for your camera if you are on location. If you do a websearch for "digital camera" plus "genealogy" you will find many useful plans and tips for using a digital camera. I also like to use a TV set and the camera's video out so I can see what I am taking a picture of. This is really a lot better than a tiny LCD screen.

One word of advice that took me a few books to realize: Tape your books to a tabletop so they do not move. Each turn of the page slowly moves the book and it is a nightmare in post to correct all the misalignments.

I use a variety of software to get the final product. I use a batch renaming program to rename my files as even and odd numbers when I take pics of two pages at a time. I use ThumbsPlus as a thumbnail viewer and to apply simple rotation or grayscale conversion. I use a program called ClearImage to convert the large jpgs to bitonal tiffs a fraction of the size (a 90% size reduction in most cases). I do everything with macros, and I use a macro program to record and playback my mouseclicks so it all works automatically. Then of course there is the OCR software and final conversion to a pdf with text under image. I am still looking for a way to do OCR on high resolution pictures, but then replace them with low-res compressed versions (these are still human readable).
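
As an illustration of the even and odd renaming step only, here is a small Python sketch. It is a guess at the idea, not the batch renaming program itself: it assumes each shot is a two-page spread that later gets split into a left (even) and right (odd) page, and the function name, file names, and page_NNNN naming pattern are all hypothetical.

```python
def plan_page_names(spread_files, first_left_page=2, ext=".jpg"):
    """Map each spread image to names for its left and right pages.

    Assumes the files are already in shooting order and that the left
    page of the first spread is `first_left_page` (an even number).
    """
    plan = {}
    for index, name in enumerate(spread_files):
        left = first_left_page + 2 * index   # even page number
        right = left + 1                     # odd page number
        plan[name] = (f"page_{left:04d}{ext}", f"page_{right:04d}{ext}")
    return plan


# Hypothetical camera file names, listed in the order they were shot.
shots = ["IMG_0001.jpg", "IMG_0002.jpg", "IMG_0003.jpg"]
for source, (left_name, right_name) in plan_page_names(shots).items():
    print(source, "->", left_name, right_name)
```

The sketch only builds a plan and touches no files, so it can be checked by eye before anything is actually renamed.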

What I have written here barely scratches the surface of all this. I have 20 pages of material written that goes into much greater detail. If you provide me with your email I can send you to my website when it is done. As a grad student I consider a digital camera an investment that can not be ignored. I have taken to scanning all my overpriced textbooks and returning them without paying a dime. In 3 semesters my Dimage A2 will pay for itself in book savings. I am also firmly opposed to copyright; if you disagree, suck it.

Re: There is a ray of light
by adslkjflkf / December 11, 2004 6:31 AM PST

junkbokx, please email me with a link to your site. Thank you.

I'm interested in your material
by mseidner / December 23, 2004 10:14 PM PST

I use a Ricoh Caplio 4 MP for document capture and am always looking for ways to make it better. Your help would be appreciated.

markseidner@yahoo.com

