Another busy month getting some OCR work up to production standard – there’s a big difference between ‘basically working’ and ‘industrial grade’. It’s taken a lot of work, but it is there now.
I have written a test harness that allows bulk amounts of test data to be tested and retested, in as little as 1 second each time, and in an automated way after each round of training. A bit of ‘infrastructure’ and formal testing goes a long way at this stage.
Success rates in the test suite are now up to 100%, and the real production environment beckons.
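For readers who want to try something similar, here is a minimal sketch of a bulk test harness of the kind described above. It is not the harness from this project: the names `run_suite` and `fake_recogniser` are made up for illustration, and the stand-in recogniser just upper-cases its input so the example runs without any OCR engine.

```python
from typing import Callable, Iterable, List, Tuple


def run_suite(recognise: Callable[[str], str],
              samples: Iterable[Tuple[str, str]]) -> Tuple[float, List[tuple]]:
    """Run `recognise` over (input, expected) pairs.

    Returns the success rate (0.0 to 1.0) and a list of
    (input, expected, actual) tuples for every failure.
    """
    failures = []
    total = 0
    for item, expected in samples:
        total += 1
        actual = recognise(item)
        if actual != expected:
            failures.append((item, expected, actual))
    rate = (total - len(failures)) / total if total else 0.0
    return rate, failures


def fake_recogniser(item: str) -> str:
    """Hypothetical stand-in for a real OCR call: it just upper-cases the input."""
    return item.upper()


if __name__ == "__main__":
    labelled = [("ab12", "AB12"), ("x9", "X9"), ("q", "Q")]
    rate, failures = run_suite(fake_recogniser, labelled)
    print(f"success rate: {rate:.0%}, failures: {failures}")
```

In a real setup, `recognise` would wrap the OCR engine and `samples` would be loaded from the labelled test data, so the same call can be repeated after each round of training.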
In other news, I have some work to do on a ‘pure’ AI project – some software to take one player’s turn in a ‘full information’ two-player game, and associated research on scoring algorithms. The idea is to see what level of ‘intelligent behaviour’ emerges from just the raw computation, given enough time and CPU cores. It sounds like my idea of fun 🙂
For a client’s project recently I needed to be able to correlate the positions of a number of points in 2D space from one image to another. These aren’t ‘features’ as usually handled by routines such as SIFT/SURF/ORB, etc, but just 2D points with no other attributes at all. Most of the points in image A will be present in image B, but some may be missing, there may be extras, and image B may be rotated, scaled, and translated in x,y, by any amount.
It turns out to be quite an interesting problem. Luckily the number of points in the images was fairly small (<50), so a brute force approach works – and it does work well. As long as most of the points in image A are present in image B, in something close to the same positions relative to each other, then they are found correctly in image B – and importantly the algorithm knows which point correlates with which, so the position of each point can be ‘followed’ into the new image. Finally, it returns the amount of rotation, scaling and translation applied.
If there were a large number of points, the efficiency of the algorithm would be a problem, and a different approach would be needed, but for this application it has worked very well.
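One way a brute force search like this could be written is sketched below, using NumPy. This is an illustration under my own assumptions, not the client code: the function name `match_points` and the tolerance `tol` are invented, points within each image are assumed to be distinct, and `tol` is assumed to be smaller than the spacing between points in image B. The idea is that any two point pairs define a rotation, scale and translation, so every such hypothesis is tried and the one that brings the most points of A onto points of B wins.

```python
import itertools
import numpy as np


def match_points(a, b, tol=0.5):
    """Brute-force search for the rotation, scale and translation that map
    the most points of `a` onto points of `b`.

    Returns (scale, angle_degrees, (tx, ty), matches), where `matches`
    maps an index in `a` to the index of its partner in `b`.
    """
    # Treat each 2D point as a complex number, so that the whole transform
    # becomes z -> s * z + t with complex s (rotation + scale) and t (shift).
    za = np.asarray(a, dtype=float) @ np.array([1, 1j])
    zb = np.asarray(b, dtype=float) @ np.array([1, 1j])
    if len(za) < 2 or len(zb) < 2:
        raise ValueError("need at least two points in each set")
    rows = np.arange(len(za))
    best_score, best_s, best_t, best_nearest, best_ok = 0, None, None, None, None
    for i, j in itertools.combinations(range(len(za)), 2):
        for k, l in itertools.permutations(range(len(zb)), 2):
            # Hypothesis: a[i] -> b[k] and a[j] -> b[l].
            s = (zb[l] - zb[k]) / (za[j] - za[i])
            t = zb[k] - s * za[i]
            # Distance from every transformed A point to every B point.
            d = np.abs((s * za + t)[:, None] - zb[None, :])
            nearest = d.argmin(axis=1)
            ok = d[rows, nearest] <= tol
            # Count distinct B points hit, so a collapsed transform cannot win.
            score = np.unique(nearest[ok]).size
            if score > best_score:
                best_score, best_s, best_t = score, s, t
                best_nearest, best_ok = nearest, ok
    matches = {}
    for n in np.flatnonzero(best_ok):
        if int(best_nearest[n]) not in matches.values():
            matches[int(n)] = int(best_nearest[n])
    angle = float(np.degrees(np.angle(best_s)))
    return float(abs(best_s)), angle, (float(best_t.real), float(best_t.imag)), matches
```

The cost grows very quickly with the number of points (every pair in A against every ordered pair in B, with a full distance table for each), which is why this only makes sense for small point sets.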
This week I have taken delivery of a Raspberry Pi 2, and a Pi camera module: Total cost around UKP50. The aim of the experiment is to see whether the Pi is powerful enough to be used for computer vision applications in the real world. More of that over the coming days, but the short version is: Yes it is.
I also needed several other Pi-related components (again, more details of the fun we’re having at a later date). For various reasons mostly to do with who had what in stock, I split the purchases between two UK companies – 4Tronix, who supply all sorts of superb robotics stuff for Pi and Arduino, and The Pi Hut, who as the name implies sell all things Pi-related. Both orders were handled quickly, and I recommend both companies highly.
Setting up the new Pi took 2 minutes, and attaching the camera module is easy, if slightly fiddly.
I used the ‘picamera’ module and was getting images displayed on screen, and saved to the filesystem, all within a further few minutes. The ‘picamera’ module appears to be a very well written library, and the API is certainly powerful.
It was then time to build OpenCV. This is a slightly more involved process (it has to be built from the source code), which took a few minutes of hands-on time, followed by about 4 hours of waiting for it to compile. A quick experiment then showed OpenCV working properly from both C++ and Python.
The picamera module can process images in such a way that they can be handled by OpenCV – the interface between the two is straightforward. As such, within a few more minutes I was grabbing images live from the Pi camera module, and processing them with normal OpenCV Python calls. I don’t yet know what would be involved in getting images from the camera from C++, but with a Python interface this good, it may not be necessary to worry about it (Python can of course call C/C++ routines anyway).
Initial impressions are that it all works beautifully. On the *initial* setup, it seems to take about one second to capture a frame from the camera, but the good news is that OpenCV processing (standard pre-processing such as blurring, and Canny edge detection) is faster than I’d expect from a computer this size. After playing with a few settings, I am now able to increase the frame rate to many frames per second at capture, and around 4 FPS even including some OpenCV work (colour conversion, blur, and Canny edge detection) – bearing in mind some of those are compute-intensive tasks, I think that’s impressive.
So yes: The Raspberry Pi 2 and the Pi camera module are certainly suitable for computer vision tasks using OpenCV, and I have two contracts lined up already to work on this.
A busy month of OpenCV contracting for a number of clients, including some work in areas of OpenCV I’ve not used much, if at all, before (non-chargeable, of course – I only charge for productive time).
I am now more familiar than I ever thought I’d be with the HoughLines(P) and HoughCircles functions – the former of which is more complex than it first seems. Like many things in computer vision, it takes some coaxing to get good results, and even more coaxing to get really robust results across a range of ‘real live’ images in the problem domain.
I have also worked a lot this month with the whole ‘camera calibration’ suite of functions, and then followed that up by gaining experience with the ‘project image points into the plane’ routines, which can lead to some interesting ‘augmented reality’ applications. However, in my case, I’ve used them to simply determine exactly where (in the 2D image) a specific point in the 3D space would appear. It works very well, and I have a project lined up ready to put this into action.
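The core of that 3D-to-2D step is the pinhole camera model, sketched here in plain NumPy. This is a simplification: OpenCV's `projectPoints` also applies the lens distortion coefficients found during calibration, which are ignored here, and the camera matrix in the example is made up rather than taken from a real calibration.

```python
import numpy as np


def project_points(points_3d, rotation, translation, camera_matrix):
    """Project Nx3 world points to Nx2 pixel coordinates (no lens distortion).

    rotation is a 3x3 matrix and translation a 3-vector, together taking
    world coordinates into the camera frame; camera_matrix is the 3x3
    intrinsic matrix produced by calibration.
    """
    pts = np.asarray(points_3d, dtype=float)
    cam = pts @ np.asarray(rotation, dtype=float).T + np.asarray(translation, dtype=float)
    uvw = cam @ np.asarray(camera_matrix, dtype=float).T
    return uvw[:, :2] / uvw[:, 2:3]  # divide by depth to get pixels


if __name__ == "__main__":
    # Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
    K = [[800, 0, 320], [0, 800, 240], [0, 0, 1]]
    world = [(0, 0, 5), (1, 0, 4)]
    print(project_points(world, np.eye(3), [0, 0, 0], K))
```

A point straight ahead of the camera lands on the principal point, and a point one unit to the right at depth 4 lands 800 × 1/4 = 200 pixels to the right of it.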
I’ve revisited one of my ‘favourite’ (i.e. most used) parts of the library: contour finding, and associated pre- and post-processing, but this time all from Python.
During the last few days, I’ve started looking at 2D pose estimation: specifically in this case, trying to determine the location of a known set of 2D points in a target image, given possible translation, rotation and scaling. Not finished with that one, yet.
Last (but not least – this isn’t going to go away) I’ve been making an effort to learn Git. I was pleased to find this simple guide, which at least let me get on with my work while I learn the rest.
I’ve spent a month or so trying to make an effort to learn Python, mostly by forcing myself to do any new ‘prototype’ vision / OpenCV work in the language. This has cost me some money – I only charge for ‘productive’ time, not ‘learning’ time, and at times the temptation to go back to ‘nice familiar C++’ has been great. But I’ve made good progress with Python, and I’m glad I’ve stuck at it. Apart from anything else, the language itself isn’t hard to pick up.
The pros and cons from a computer vision perspective are roughly as expected. It can be slower to run, but depending on how the code is written, it’s not a big difference. Once ‘inside’ the OpenCV functions, the speed appears to be about the same (as you’d expect: it’s just a wrapper for the same code), but any code run actually in Python needs careful planning, and if large amounts of computation were going to be done, C++ would no doubt still be the best bet.
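A small experiment makes the point about code ‘run actually in Python’. The sketch below thresholds an image twice: once with an explicit per-pixel Python loop, and once with a single vectorised NumPy call that does the looping in compiled code. The function names and the image size are arbitrary choices for illustration, and the timings will vary from machine to machine.

```python
import time
import numpy as np


def threshold_loop(image, level):
    """Per-pixel threshold written as an explicit Python loop."""
    out = np.zeros_like(image)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            if image[y, x] > level:
                out[y, x] = 255
    return out


def threshold_vectorised(image, level):
    """The same threshold, done in one call inside NumPy's compiled code."""
    return np.where(image > level, 255, 0).astype(image.dtype)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(240, 320), dtype=np.uint8)
    for fn in (threshold_loop, threshold_vectorised):
        start = time.perf_counter()
        fn(image, 128)
        print(f"{fn.__name__}: {time.perf_counter() - start:.4f} s")
```

Both functions give identical results; the difference is only in where the loop runs, which is the same reason calls into OpenCV itself cost about the same from Python as from C++.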
But anything it lacks in runtime speed, it certainly makes up for in speed of development. As a prototyping language, I think I’m already more productive in Python than C++ (and that’s after 20+ years of C++, and a month of part-time Python). There will always be more to learn, of course, but I think I’m at the point where the learning curve is beginning to get less steep.
Python has been around for years (since the late 80’s, I was surprised to discover, although not in mainstream use until much more recently). I have used it occasionally for very simple scripts, usually where the larger ‘ecosystem’ of the project I’m involved with has also been Python-based.
However, it’s now becoming clear that Python has broken through as a mainstream language in the scientific community, and also specifically in computer vision and AI. OpenCV – my main area of work – has a good Python binding.
Time to learn this language more deeply then, I think. I have shied away from it a little until now, on the basis that C/C++ are bound to be faster for compute-intensive tasks such as vision. However, initial tests show that the Python binding is roughly as fast (perhaps because it is exactly that – a binding, to the core of OpenCV which is still C/C++). It may be the case that C/C++ will remain faster when much of the functionality of the application is above the OpenCV level – but if the application is mainly just calls to OpenCV, then perhaps C/C++ doesn’t have such an advantage.
I will be creating some test apps in both languages as a way of learning Python, and will post comparative results here in due course.
Almost exactly a year ago, I was invited by a client to Google’s Campus in London, to attempt to port our app to the new Google Glass device, which was still pre-release in the UK at that point. As I wrote at the time, our app worked well (and surprisingly quickly), but due to a bug at the device driver level, the images being received from the camera were garbled, so our app wasn’t able to do anything very useful.
We reported the bug to Google and to OpenCV on the day. I am pleased to say that my client now has access to a current Google Glass device, containing the latest drivers – he has just contacted me to inform me that the app works perfectly now, without any modification, and is processing images as intended.
As you can imagine, I’m very pleased to hear this – not only does it prove that our app was working properly in the first place, but it now gives us an exciting new platform to develop computer vision apps on in the future. Watch this space.
There are a few areas of computer vision and image processing where a little bit of maths is hard to avoid. Luckily for me (I’m no mathematician) these are few and far between – in most cases these days, either the maths is not too advanced, or the popular libraries (such as OpenCV) help hide the worst of it and let us get on with being ‘practitioners’.
However, one exception that keeps cropping up is Fourier Transforms. They are everywhere in computer vision, and for good reason: they help solve a lot of problems (I’ll write another post about this when time allows, but my current project has been revolutionised by using Fourier Transforms).
However, almost all explanations plunge straight into maths, involving the so-called complex numbers: the square root of minus one, and all that. The simple truth is that my school maths (hi, Mr. Feakes!) didn’t equip me for this, and I strongly suspect I’m not alone. While OpenCV helps hide the real nuts and bolts, an intuitive explanation of what is going on is essential to help decide when to use this tool, and just on a basic level, how it works.
So I was very pleased to find the following: An Intuitive Explanation of Fourier Theory, with pictures, and no hard stuff. Just enough for me to understand intuitively how this works – perfect. Thanks to Steven Lehar for writing it.
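In the same intuitive spirit, here is a tiny NumPy experiment (my own illustration, not taken from the article): build an image of plain vertical stripes, take its 2D Fourier transform, and check that the strongest component is exactly the stripe pattern that was put in. The function name `dominant_frequency` is invented for this sketch, and it throws away the direction sign of the wave to keep things simple.

```python
import numpy as np


def dominant_frequency(image):
    """Return (cycles_down, cycles_across) of the strongest wave in the image."""
    spectrum = np.abs(np.fft.fft2(image))
    spectrum[0, 0] = 0.0  # ignore the constant term (overall brightness)
    fy, fx = np.unravel_index(np.argmax(spectrum), spectrum.shape)
    h, w = image.shape
    # Indices past the halfway point represent negative frequencies.
    fy = fy - h if fy > h // 2 else fy
    fx = fx - w if fx > w // 2 else fx
    return abs(int(fy)), abs(int(fx))


if __name__ == "__main__":
    x = np.arange(64)
    # 8 full light/dark cycles across a 64-pixel-wide image.
    stripes = np.tile(128 + 100 * np.cos(2 * np.pi * 8 * x / 64), (64, 1))
    print(dominant_frequency(stripes))    # vertical stripes
    print(dominant_frequency(stripes.T))  # the same stripes, turned on their side
```

The transform is just reporting which stripe patterns, at which spacing and orientation, add up to the picture, and how strongly each one is present.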
EDIT 2014-04-16: Having been in touch with Steven to thank him personally, he has recommended a number of other articles for people who, like me, prefer ‘intuitive’ approaches to things. In particular, I’m looking forward to studying two – one of his own, and one other he recommends:
I was very pleased to be invited last weekend to a ‘wearable devices hackathon’, at Google Campus in central London. Having never been to such a thing before, I wasn’t sure what to expect.
We were asked to go with an Android app, ready to port to Google Glass. I have plenty of my own Android Apps, but I had little idea in advance of what capabilities Glass would offer in terms of what my apps need (namely: direct access to the camera, and plenty of CPU horsepower). Would I need to link to a back-end Android phone, and if so, how? Would the Mirror API offer what I needed? Would I be able to learn the GDK, which had been announced as a ‘sneak preview’ a few days earlier, quickly enough? Would I be twice the age of everyone else there? On a more serious note, how much would I be able to achieve in one day? I’ve had single bugs that have taken longer than that to fix. I am persistent, and don’t often give up on getting something working, but I rarely make promises as to how long something will take.
I went well-prepared: Two laptops, both identically configured with the entire Android SDK / OpenCV4Android / Tesseract stack (see this post), and well rehearsed in the code of my own app. It was an interesting day – we started at 9:00am, with some short presentations, followed by splitting into teams. As we had gone with our own app, my client and I formed our own team. The day ended at midnight.
On the whole, the day and the end result were positive. Having got the initial ‘Let me try it on! Someone take my picture!’ moment out of the way (I had to join the queue…), it turned out that porting our app to run on Google Glass was not as difficult as I had feared. We didn’t need any Glass-specific code, other than installing the ‘sneak preview’ API into Eclipse. With some help from a couple of real (and very jet-lagged) Glass experts, our app was installed on a Glass device. With a couple of tweaks, it was running – quite a moment to see your own app running on brand new hardware.
We had one problem, which spoiled our day to some extent: The image that comes back from the Glass camera was distorted, as if by a driver problem. A quick search revealed that other people had had the same problem in the last couple of weeks since the ‘sneak preview’ was released. Various work-arounds were suggested, but none that would work for OpenCV. As much as we tried to work around it, there was nothing to be done – the incoming image (from the point of view of our app) was garbled. A real shame, as our app itself was working well – debug windows, output to the Glass screen, and all – and furthermore, it was running reasonably fast: at least 2 frames per second, comparing favourably to the 3-4 I expect on a Samsung Galaxy S4. Not bad at all for a 1st version of a brand new headset.
So, we didn’t quite get there. However, I have since reported the bug to OpenCV, had the bug accepted, and it is slated to be fixed in the next release (2.4.8). At that point, we will be up and running on Glass, and ready to move on to tweaking the UI for Glass-specific gestures. My client has a specific market, that will open up fairly rapidly at that point.
[EDIT Nov 2014: My client recently took delivery of a new Google Glass device, with the new drivers, and the app we had developed worked immediately and very well. To quote him, “it works like a miracle”].
Summary: As soon as OpenCV 2.4.8 is out, we should be there – and we now know that Google Glass is a capable platform for running OpenCV4Android apps, on board (and how to achieve that). Exciting times ahead, I think.
For two client projects this summer, I’ve needed an OCR solution, and I’ve ended up using Tesseract. It seemed like the obvious choice – open source, been in development since the ’80’s, development ‘sponsored by Google’ since 2006, etc.
Initial signs were good. I installed both the command line tool and the SDK on Linux. Within 5 minutes I was getting results from the command line tool, and within an hour I was also getting results from my own test program using the API. Only another few minutes after that, and I had got it using images provided by OpenCV, rather than by Leptonica, which it uses by default. All was looking good.
But since then, things have gone downhill somewhat. Maybe I’m using it in a case that it isn’t really designed for, and/or maybe I haven’t put enough time into training it with the specific font in question.
My ‘use case’ (without giving away client-specific details) is that I’m trying to recognise a sequence of numbers and letters, which may not be dictionary words – they may be acronyms, or just ‘random’ strings, and in some cases will be individual letters.
For some characters it seems to work fairly well. In some of the cases it doesn’t, it’s almost understandable: An upper case letter ‘O’ does look a bit like an upper case letter ‘D’, and I can understand it confusing the upper case letter ‘I’ with the numeral ‘1’. But in other examples, it almost always seems to confuse upper case ‘B’ and ‘E’, even when the difference (i.e. the right hand side) is clearly visible. Why?!
For customisation, it seems to want training on languages, which I can understand – but surely there should be the option to just train it on a new font and have it simply recognise on a character-by-character basis too? There are options to switch off whole-word recognition, but they don’t seem to make much difference.
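For anyone wanting to experiment with those options, the sketch below shows roughly how they can be passed from Python. Several assumptions apply: it uses the `pytesseract` wrapper rather than the C++ API used in this project, the flag syntax (`--psm`) is that of more recent Tesseract releases, and the helper names `build_config` and `read_code` are invented. The variables themselves (`tessedit_char_whitelist`, `load_system_dawg`, `load_freq_dawg`) are Tesseract configuration settings for restricting the character set and disabling the dictionaries. No claim is made that this cures the confusions described above.

```python
def build_config(whitelist, single_char=False):
    """Build a Tesseract config string for short, non-dictionary codes."""
    # Page segmentation mode: 10 = a single character, 7 = a single text line.
    psm = 10 if single_char else 7
    return (f"--psm {psm} "
            f"-c tessedit_char_whitelist={whitelist} "
            "-c load_system_dawg=0 -c load_freq_dawg=0")


def read_code(image, whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"):
    """Run OCR on an image (for example an OpenCV/NumPy array)."""
    # Imported lazily: needs pytesseract and the Tesseract binary installed.
    import pytesseract
    return pytesseract.image_to_string(image, config=build_config(whitelist)).strip()


if __name__ == "__main__":
    print(build_config("ABC123", single_char=True))
```

Keeping the config construction in its own function makes it easy to try different combinations of settings against the same set of test images.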
Finally, the whole thing is very under-documented, and unstable. One wrong parameter, and the whole thing crashes without an error message. In particular, the training process is long, cumbersome, and then crashes without further explanation.
I’ve spent a lot of time on this recently, and am probably about to give up for now. On the plus side, I did get it working on Android, thanks to the tess-two library, but the OCR results themselves were of course the same.
I’m hoping Google will pump some serious resource into getting Tesseract up to scratch – or that someone will come up with a good (i.e. documented, stable, and working) open source alternative.