

OCR a new world for me

Apr 28, 2004 5:16AM PDT

I have a new application at work that involves document scanning and OCR. The volume doesn't appear to justify a many-$K approach. I have been experimenting with TextBridge 11.0 and a fairly inexpensive scanner (Visioneer 8920). I need to scan some older dot-matrix printouts on letter-size and 11 x 14 paper. The data is numeric and in columns.

TextBridge, with some more learning on my part, seems able to achieve about 80% recognition.

Is this the best I can expect from an under-$200 approach? Does anyone have recommendations for other OCR software at under $100? I am considering an 11 x 17 USB scanner that is under $200. I have heard that the number of bits (e.g. 24 vs. 48) is part of the key to getting a clean scan. Is this true?

Thanks for any guidance.


Re:OCR a new world for me
Apr 28, 2004 6:22AM PDT

Jconnor,

The quality of the results depends critically on the quality of the print. Arial 12 on white paper from a laser printer is recognized very well: all lines are clear and black. Dot-matrix printouts tend to be grey (not black) and dotty (not solid lines), which makes them much harder for an OCR program to interpret.

The scanner is not the limiting factor; the resolution of even the cheapest one nowadays is more than enough. The program used might make a difference, though. You might be able to find trial versions of other programs with Google, just to try them out.

It might, however, help to experiment with the scanner settings. Set it to black and white, not color or greyscale. Vary the resolution. Or treat the result with some photo-editing software, increasing the contrast or smoothing irregularities (or both). It might even help to make a photocopy of the document before scanning it; experiment with the photocopier's settings as well.
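The contrast-and-threshold cleanup suggested above can be sketched in a few lines of Python. This is my own illustration on a tiny invented pixel grid, not anything from the thread; a real workflow would use a photo editor or an imaging library rather than hand-rolled loops.

```python
# Rough sketch (numbers invented): cleaning a faint grayscale scan by
# stretching the contrast and then thresholding to black and white.
# Convention here: 0 = black, 255 = white.

def stretch_contrast(pixels):
    """Linearly rescale so the darkest pixel becomes 0 and the lightest 255."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:
        return [row[:] for row in pixels]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in pixels]

def to_black_and_white(pixels, threshold=128):
    """Anything darker than the threshold becomes black, the rest white."""
    return [[0 if p < threshold else 255 for p in row] for row in pixels]

faint_scan = [
    [200, 120, 200],   # greyish dot-matrix marks on a light background
    [130, 110, 140],
    [210, 125, 205],
]
cleaned = to_black_and_white(stretch_contrast(faint_scan))
```

The same two steps are what "increase the contrast, then scan as black and white" amounts to, just done after the fact in software.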

The better the quality of the scan, the better the recognition of the text. 80% for numeric data is useless in practice; typing from scratch is faster than locating and correcting the differences. Typing into Excel would let you run some calculations and thus check against the printed totals.
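That totals check can be sketched simply: treat the total printed at the bottom of each column as a checksum on the recognized numbers. The function name and figures here are invented for illustration, not from the thread.

```python
# Illustrative sketch only: use the printed column total as a checksum
# on OCR'd (or hand-typed) numeric data.

def column_checks_out(ocr_values, printed_total):
    """True when the recognized numbers add up to the total printed on the page."""
    return sum(ocr_values) == printed_total

good_column = [120, 45, 335]   # adds up to the printed total of 500
bad_column = [120, 45, 385]    # a 3 misread as an 8 breaks the checksum
```

A column that fails the check is the one worth proofreading by hand.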

Hope this helps.


Kees

Re:Re:OCR a new world for me
Apr 28, 2004 9:44AM PDT

Thanks for the suggestions. I downloaded a couple of 15-day trial applications, and most yield the same or worse results. I will keep playing, as I have a lot of graphics tools to enhance the image (good suggestion). Maybe some of the tracing tools might help.

I found most of the embedded "character recognition or learning" tools are not too hot. For me, greyscale has worked better than black-and-white scanning. Funny thing: the best results so far have come from the software that came with the scanner.

Typing this in is not an option. I can scan directly into Excel, so there are some possibilities there.

Re:Re:Re: OCR suggestions
Apr 28, 2004 10:46AM PDT

As was already mentioned, dot-matrix output tends to give erratic results because of the breakup of characters (the dots) and the inconsistency of the dot density.

The important thing for OCR is the scan resolution, the number of lines per inch, if you will. Anything less than 150/inch is probably going to be too coarse for the OCR software to render properly. Anything higher than 600/inch is probably too high to be meaningful: the extra detail isn't bad, since the info is "all there", but the file becomes so large that it is a labor for the OCR software to digest just to get your info rendered.
Try about 300/inch and see what results you get. Adjust that up or down a bit (say, from 200 to 600 if possible) to see if your pages come out any better.
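As a rough sanity check on why high resolution and high bit depth blow up the file (my arithmetic, not the poster's), uncompressed scan size grows with the square of the resolution and linearly with the bit depth:

```python
# Back-of-the-envelope sketch (assumptions mine): uncompressed scan size
# in bytes = width_in * height_in * dpi^2 * bits_per_pixel / 8.

def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of a scanned page."""
    return int(width_in * height_in * dpi * dpi * bits_per_pixel / 8)

one_bit_300 = raw_scan_bytes(8.5, 11, 300, 1)    # 1-bit letter page, about 1 MB
color_600 = raw_scan_bytes(8.5, 11, 600, 24)     # 24-bit at 600 dpi, about 100 MB
```

The 600 dpi, 24-bit scan is 96 times the data of the 300 dpi, 1-bit scan (4x the pixels, 24x the depth), with nothing extra for the OCR engine to work with.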

The number of bits (i.e. 24 or 36 or 48 or whatever) is the color depth, and it is not important for scanning text. In fact, what you really want is just 1-bit color depth (black and white), since you are not trying to bring any color into your OCR software. You can achieve this in a couple of ways, or more.
See if you have a line-art setting; that is typically just 1-bit color depth, crisp black lines on white.
If you can set the color depth to one bit, or just a few bits, try that.
You might even have a density threshold, where any scanned mark exceeding the threshold registers as black and anything below it as white. That might get rid of the fuzziness around dot-matrix characters. However, if you set the threshold too high, the dots making up what should be a solid character become disjointed, and your OCR software will get confused by the broken characters.
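That trade-off can be sketched on invented numbers (my illustration, not anything from the thread): one row of darkness values through a dot-matrix stroke, where a mark registers as black only when it exceeds the threshold.

```python
# Sketch with made-up numbers: 0 = white paper, 255 = solid black ink.
# Two printer dots (about 210) with fainter bridging ink (about 180) between them.
stroke = [40, 210, 180, 205, 60, 200, 175, 215, 50]

def binarize(darkness_row, threshold):
    """A mark exceeding the threshold registers as black ('#'), else white ('.')."""
    return ''.join('#' if d > threshold else '.' for d in darkness_row)

moderate = binarize(stroke, 120)   # bridging ink merges the dots into solid marks
too_high = binarize(stroke, 190)   # only the dot centers survive: a broken character
```

At a moderate threshold the faint ink between dots joins them into solid strokes; at a too-high threshold the same stroke falls apart into isolated dots, which is exactly what confuses the OCR engine.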

Good luck.

Re:Re:Re:Re: OCR suggestions
Apr 28, 2004 12:07PM PDT

Thanks, I will give that a shot. I seem to recall that one of the packages had those type of features.

Re:OCR a new world for me
Apr 29, 2004 8:20PM PDT

As previously mentioned, I agree that the bit depth (24 vs. more) is not all that relevant when it comes to OCR. I think the OCR engine tends to use a 1-bit (bitonal) mode for its decision making. The issue you are dealing with is the broken nature of the characters in dot-matrix printing.
The key is to optimize the thresholding during scanning, or even after the scanning. If the scanner has adjustable thresholding, play with it; higher numbers usually make text bolder. If your scanner has no adjustments, you can scan in grayscale first and then optimize the image prior to OCR with an imaging program like Photoshop. Photoshop has widely adjustable thresholding and can thicken lines via a minimum filter on the grayscale image. Photoshop also has an automate function that can pull batches of images in for processing. There may well be other programs on the market as well.
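For what it's worth, the minimum-filter trick can be sketched in a few lines (my own illustration, not the poster's Photoshop recipe): each pixel is replaced by the darkest value in its neighborhood, which spreads dark pixels outward and thickens thin strokes before thresholding.

```python
# Hedged sketch of a 1-D grayscale minimum filter: 0 = black, 255 = white.
# Replacing each pixel with the minimum of its neighborhood dilates dark strokes.

def minimum_filter(row, radius=1):
    """Return a copy of row where each pixel is the darkest value within radius."""
    return [min(row[max(0, i - radius):i + radius + 1]) for i in range(len(row))]

thin_stroke = [255, 255, 30, 255, 255]   # a one-pixel-wide dark line
thickened = minimum_filter(thin_stroke)  # the dark value now spans three pixels
```

A pass or two of this before thresholding can reconnect the dots of a dot-matrix character, which is why the filter helps here.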