Tesseract

For two client projects this summer, I’ve needed an OCR solution, and I’ve ended up using Tesseract.  It seemed like the obvious choice – open source, in development since the ’80s, with development ‘sponsored by Google’ since 2006, etc.

Initial signs were good.  I installed both the command line tool and the SDK on Linux.  Within five minutes I was getting results from the command line tool, and within an hour I was also getting results from my own test program using the API.  Only a few minutes after that, I had it reading images supplied by OpenCV rather than by Leptonica, the image library it uses by default.  All was looking good.

But since then, things have gone downhill somewhat.  Maybe I’m using it in a case that it isn’t really designed for, and/or maybe I haven’t put enough time into training it with the specific font in question.

My ‘use case’ (without giving away client-specific details) is that I’m trying to recognise a sequence of numbers and letters, which may not be dictionary words – they may be acronyms, or just ‘random’ strings, and in some cases will be individual letters.

For some characters it seems to work fairly well.  Some of the cases where it doesn’t are almost understandable: an upper case letter ‘O’ does look a bit like an upper case letter ‘D’, and I can understand it confusing the upper case letter ‘I’ with the numeral ‘1’.  But in other examples, it almost always seems to confuse upper case ‘B’ and ‘E’, even when the difference (i.e. the right hand side) is clearly visible.  Why?!
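One way to turn an impression like ‘it always confuses B and E’ into numbers is to tally the mismatches between known strings and what the OCR returned.  The snippet below is only a rough sketch of that idea: the function name `confusion_counts` and the sample data are made up for illustration, and it only compares strings of equal length, since anything else would need a proper alignment step.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (expected, recognised) character mismatches.

    `pairs` is an iterable of (ground_truth, ocr_output) strings.
    Pairs of unequal length are skipped, because a position-by-position
    comparison is ambiguous once characters are inserted or dropped.
    """
    counts = Counter()
    for truth, ocr in pairs:
        if len(truth) != len(ocr):
            continue
        for expected, got in zip(truth, ocr):
            if expected != got:
                counts[(expected, got)] += 1
    return counts


# Hypothetical sample data, not real Tesseract output.
samples = [("B12", "E12"), ("OB1", "DEI"), ("ABC", "ABC"), ("XY", "X")]

# Print the confusions, most frequent first.
for (expected, got), n in confusion_counts(samples).most_common():
    print(f"{expected!r} read as {got!r}: {n}")
```

Run over a few hundred real samples, a table like this would at least show whether the B/E problem is as consistent as it looks, or whether it depends on the image.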

For customisation, it seems to want training on languages, which I can understand – but surely there should be the option to just train it on a new font and have it simply recognise on a character-by-character basis too?  There are options to switch off whole-word recognition, but they don’t seem to make much difference.
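For reference, this is roughly what ‘switching off whole-word recognition’ looks like when driving the command line tool from a script.  It is a sketch under assumptions, not a recommended recipe: the helper names `build_tesseract_command` and `recognise` are made up, the `--psm` spelling assumes a recent Tesseract (older releases used `-psm`), and how much notice the engine takes of `tessedit_char_whitelist`, `load_system_dawg` and `load_freq_dawg` varies between versions and recognition engines.

```python
import string
import subprocess

# Upper case letters and digits only: an assumption about the strings
# being read, not something Tesseract requires.
ALNUM = string.ascii_uppercase + string.digits


def build_tesseract_command(image_path, whitelist=ALNUM, psm=7):
    """Build a tesseract argument list for dictionary-free recognition.

    psm=7 asks for the image to be treated as a single text line;
    psm=10 would treat it as a single character.
    """
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
        "-c", "load_system_dawg=0",  # do not load the main word list
        "-c", "load_freq_dawg=0",    # do not load the frequent-word list
    ]


def recognise(image_path):
    """Run tesseract on one image and return the recognised text."""
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling `recognise("sample.png")` (a placeholder file name) needs the `tesseract` binary on the path.  With `check=True`, a non-zero exit at least surfaces as a Python exception rather than a silent failure.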

Finally, the whole thing is very under-documented, and unstable.  One wrong parameter, and the whole thing crashes without an error message.  The training process in particular is long and cumbersome, and then crashes without further explanation.

I’ve spent a lot of time on this recently, and am probably about to give up for now.  On the plus side, I did get it working on Android, thanks to the tess-two library, but the OCR results themselves were of course the same.

I’m hoping Google will pump some serious resource into getting Tesseract up to scratch – or that someone will come up with a good (i.e. documented, stable, and working) open source alternative.

[rant ends]