I recently had the opportunity to do some work on an NVidia Jetson TK1 – a customer is hoping to use these (or other powerful devices) for some high-end embedded vision tasks.
First impressions of the TK1 were good. After installing the host environment on an Ubuntu 14.04 64-bit box and ‘flashing’ the TK1 with the latest version of all the software (including Linux4Tegra), I ran some of the NVidia demos – all suitably impressive. It has an ARM quad-core CPU, but the main point of it is the CUDA GPU, with 192 cores, giving a stated peak of 326 GFLOPS – not bad for a board that is under 13cm square. It’s a SIMD (Single Instruction Multiple Data) processor, also known as a vector processor – so their claim that it is a ‘mini supercomputer’ isn’t too wildly unrealistic – although just calling it a ‘graphics card with legs’ would also be fair.
I wrote some sample OpenCV programs using OpenCV4Tegra, utilising the GPU and CPU interchangeably, so we could do some performance benchmarks. The results were OK, but not overwhelming. Some code ran up to 4x faster on the GPU than the CPU, while other programs didn’t see that much benefit. Of course, GPU programming is quite different from CPU programming, and not all tasks will ‘translate’ well to a vector processor. One task we need in particular – stereo matching – might benefit more.
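The benchmarking itself is simple in outline. The harness below is an illustrative sketch rather than the client code: it times a function with a few warm-up runs first – which matters particularly on a GPU, where the first call pays one-off transfer and setup costs – then takes the best of several runs. A numpy matrix multiply stands in for the OpenCV calls.

```python
import time

import numpy as np


def benchmark(fn, *args, warmup=3, runs=10):
    """Time fn(*args): warm-up runs first (on a GPU the first call pays for
    memory transfer and kernel setup), then return the best of `runs` timings."""
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return min(times)


# A numpy matrix multiply standing in for an OpenCV filter call.
a = np.random.default_rng(0).random((256, 256))
best = benchmark(np.matmul, a, a)
```

Taking the minimum rather than the mean gives a less noisy figure on a busy embedded board, since background activity can only make a run slower, never faster.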
We will do more work on this in due course. We will also be comparing the processing power to a Raspberry Pi 3, and some Odroids, as part of our evaluation of suitable hardware for this demanding embedded project. More results will be posted here as we get them.
Lots of work for a client recently in the area of colour matching, and moving on from there, the more challenging problem of matching patterns (i.e. groups of colours, arranged into shapes or regions of different sizes).
Although OpenCV provides the basic tools needed to work with colour (primarily, the ability to switch between colour spaces like RGB and HSV), matching colours is more challenging than it first seems. In this instance, we were trying to match colours as the human eye would perceive them, which is quite different from simply measuring how ‘close’ two colours are in the RGB cube. From there, a further complication is that much of the matching was done from photographs taken in the ‘real world’, where lighting and shading vary widely – even something which is the ‘same colour’ across all its surface may appear very different at different points in the photo, or across several photos. Deciding what is a difference in colour, as opposed to a difference in lighting, is tricky, but there are techniques that can be used to help. In our case, we also had to decide which part of the image was the ‘object’, and which was the background to be ignored completely.
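To make ‘perceptual’ distance concrete: a common first step (and only a first step – not necessarily the exact method used on this project) is to convert colours to CIE L*a*b* space, which is designed to be roughly perceptually uniform, and measure Euclidean distance there – the CIE76 ‘Delta E’. A self-contained numpy sketch:

```python
import numpy as np


def srgb_to_lab(rgb):
    """Convert an sRGB triple (0-255) to CIE L*a*b*, assuming a D65 white point."""
    c = np.asarray(rgb, dtype=float) / 255.0
    # Undo the sRGB gamma curve to get linear RGB.
    lin = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ (sRGB primaries, D65 illuminant).
    M = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = M @ lin
    xyz /= np.array([0.95047, 1.0, 1.08883])  # normalise to the white point
    # XYZ -> L*a*b* via the standard piecewise cube-root function.
    f = np.where(xyz > (6 / 29) ** 3,
                 np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    return np.array([116 * f[1] - 16,
                     500 * (f[0] - f[1]),
                     200 * (f[1] - f[2])])


def delta_e(c1, c2):
    """CIE76 colour difference: Euclidean distance in L*a*b* space."""
    return float(np.linalg.norm(srgb_to_lab(c1) - srgb_to_lab(c2)))
```

Two RGB colours that are equally far apart in the RGB cube can have very different Delta E values, which is exactly the effect described above. (Later Delta E formulas, such as CIEDE2000, refine this further.)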
Moving on to matching patterns adds further complexity. Is it more important to match colours as exactly as possible, regardless of size/shape of the pattern, or vice versa? We created a test set containing around 20,000 images to experiment with the best ways of finding a pattern that most closely matches another pattern, and have found a good balance – but it’s not a simple problem.
For a client’s project recently I needed to be able to correlate the positions of a number of points in 2D space from one image to another. These aren’t ‘features’ as usually handled by routines such as SIFT/SURF/ORB etc., but just 2D points with no other attributes at all. Most of the points in image A will be present in image B, but some may be missing, there may be extras, and image B may be rotated, scaled, and translated in x and y by any amount.
It turns out to be quite an interesting problem. Luckily the number of points in the images was fairly small (&lt;50), so a brute-force approach works – and it works well. As long as most of the points in image A are present in image B, in something close to the same positions relative to each other, they are found correctly in image B – and importantly, the algorithm knows which point correlates with which, so the position of each point can be ‘followed’ into the new image. Finally, it returns the amount of rotation, scaling and translation applied.
If there were a large number of points, the efficiency of the algorithm would be a problem, and a different approach would be needed, but for this application it has worked very well.
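For the curious, the brute-force idea can be sketched as follows – this is an illustrative reconstruction, not the client code. Every ordered pair of points in A, matched against every ordered pair in B, proposes a similarity transform (scale, rotation, translation); the hypothesis that maps the most points of A onto points of B wins, and the surviving correspondences tell you which point went where.

```python
import math

import numpy as np


def similarity_from_pair(a1, a2, b1, b2):
    """Scale, rotation and translation mapping a1->b1 and a2->b2 (or None if degenerate)."""
    da, db = a2 - a1, b2 - b1
    na = math.hypot(da[0], da[1])
    if na < 1e-9:
        return None
    s = math.hypot(db[0], db[1]) / na
    th = math.atan2(db[1], db[0]) - math.atan2(da[1], da[0])
    R = np.array([[math.cos(th), -math.sin(th)],
                  [math.sin(th),  math.cos(th)]])
    t = b1 - s * (R @ a1)
    return s, th, R, t


def match_points(A, B, tol=1e-3):
    """Brute force: try every pair-to-pair hypothesis, keep the one with most inliers.

    Returns (scale, rotation, translation, [(index_in_A, index_in_B), ...]).
    """
    best_count, best = 0, None
    for i in range(len(A)):
        for j in range(len(A)):
            if i == j:
                continue
            for k in range(len(B)):
                for l in range(len(B)):
                    if k == l:
                        continue
                    h = similarity_from_pair(A[i], A[j], B[k], B[l])
                    if h is None:
                        continue
                    s, th, R, t = h
                    mapped = (s * (R @ A.T)).T + t
                    # Distance from each mapped A-point to every B-point.
                    d = np.linalg.norm(mapped[:, None, :] - B[None, :, :], axis=2)
                    nn = d.argmin(axis=1)
                    pairs = [(m, int(nn[m])) for m in range(len(A)) if d[m, nn[m]] < tol]
                    if len(pairs) > best_count:
                        best_count, best = len(pairs), (s, th, t, pairs)
    return best
```

The O(n²m²) hypothesis count is exactly why this only suits small point sets – which is the limitation noted above.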
This week I have taken delivery of a Raspberry Pi 2 and a Pi camera module: total cost around £50. The aim of the experiment is to see whether the Pi is powerful enough to be used for computer vision applications in the real world. More on that over the coming days, but the short version is: yes, it is.
I also needed several other Pi-related components (again, more details of the fun we’re having at a later date). For various reasons mostly to do with who had what in stock, I split the purchases between two UK companies – 4Tronix, who supply all sorts of superb robotics stuff for Pi and Arduino, and The Pi Hut, who as the name implies sell all things Pi-related. Both orders were handled quickly, and I recommend both companies highly.
Setting up the new Pi took 2 minutes, and attaching the camera module is easy, if slightly fiddly.
I used the ‘picamera’ module and was getting images displayed on screen, and saved to the filesystem, all within a further few minutes. The ‘picamera’ module appears to be a very well written library, and the API is certainly powerful.
It was then time to build OpenCV. This is a slightly more involved process (building it from source), which took a few minutes of hands-on time, followed by about 4 hours of waiting for it to compile. A quick experiment then showed OpenCV working properly from both C++ and Python.
The picamera module can process images in such a way that they can be handled by OpenCV – the interface between the two is straightforward. As such, within a few more minutes I was grabbing images live from the Pi camera module, and processing them with normal OpenCV Python calls. I don’t yet know what would be involved in getting images from the camera from C++, but with a Python interface this good, it may not be necessary to worry about it (Python can of course call C/C++ routines anyway).
Initial impressions are that it all works beautifully. On the *initial* setup, it seemed to take about one second to capture a frame from the camera, but the good news is that OpenCV processing (standard pre-processing such as blurring, and Canny edge detection) is faster than I’d expect from a computer this size. After playing with a few settings, I am now able to increase the frame rate to many frames per second at capture, and around 4 FPS even including some OpenCV work (colour conversion, blur, and Canny edge detection) – bearing in mind some of those are compute-intensive tasks, I think that’s impressive.
So yes: The Raspberry Pi 2 and the Pi camera module are certainly suitable for computer vision tasks using OpenCV, and I have two contracts lined up already to work on this.
A busy month of OpenCV contracting for a number of clients, including some work in areas of OpenCV I’ve not used much, if at all, before (non-chargeable, of course – I only charge for productive time).
I am now more familiar than I ever thought I’d be with the HoughLines(P) and HoughCircles functions – the former of which is more complex than it first seems. Like many things in computer vision, it takes some coaxing to get good results, and even more coaxing to get really robust results across a range of real-life images in the problem domain.
I have also worked a lot this month with the whole ‘camera calibration’ suite of functions, and then followed that up by gaining experience with the ‘project image points into the plane’ routines, which can lead to some interesting ‘augmented reality’ applications. However, in my case, I’ve used them to simply determine exactly where (in the 2D image) a specific point in the 3D space would appear. It works very well, and I have a project lined up ready to put this into action.
I’ve revisited one of my ‘favourite’ (i.e. most used) parts of the library: contour finding, and associated pre- and post-processing, but this time all from Python.
During the last few days, I’ve started looking at 2D pose estimation: specifically in this case, trying to determine the location of a known set of 2D points in a target image, given possible translation, rotation and scale invariance. Not finished with that one, yet.
Last (but not least – this isn’t going to go away) I’ve been making an effort to learn Git. I was pleased to find this simple guide, which at least let me get on with my work while I learn the rest.
I’ve spent a month or so trying to make an effort to learn Python, mostly by forcing myself to do any new ‘prototype’ vision / OpenCV work in the language. This has cost me some money – I only charge for ‘productive’ time, not ‘learning’ time, and at times the temptation to go back to ‘nice familiar C++’ has been great. But I’ve made good progress with Python, and I’m glad I’ve stuck at it. Apart from anything else, the language itself isn’t hard to pick up.
The pros and cons from a computer vision perspective are roughly as expected. It can be slower to run, but depending on how the code is written, it’s not a big difference. Once ‘inside’ the OpenCV functions, the speed appears to be about the same (as you’d expect: it’s just a wrapper for the same code), but any code that actually runs in Python needs careful planning, and if large amounts of computation were going to be done, C++ would no doubt still be the best bet.
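The ‘careful planning’ mostly means keeping per-pixel work out of Python itself. A small illustration: the same thresholding operation written as a pure-Python loop and as a vectorised numpy expression – the latter runs in C, just like an OpenCV call does.

```python
import time

import numpy as np


def threshold_loop(img, t):
    """Per-pixel loop in pure Python -- the kind of code to avoid."""
    out = np.zeros_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = 255 if img[y, x] > t else 0
    return out


def threshold_numpy(img, t):
    """The same operation vectorised -- the inner loop runs in C."""
    return np.where(img > t, 255, 0).astype(img.dtype)


img = np.random.default_rng(1).integers(0, 256, (200, 200), dtype=np.uint8)

start = time.perf_counter()
a = threshold_loop(img, 128)
t_loop = time.perf_counter() - start

start = time.perf_counter()
b = threshold_numpy(img, 128)
t_vec = time.perf_counter() - start
```

On even this small image the vectorised version is typically orders of magnitude faster, which is why Python-plus-OpenCV holds up so well as long as the hot path stays inside library calls.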
But anything it lacks in runtime speed, it certainly makes up for in speed of development. As a prototyping language, I think I’m already more productive in Python than C++ (and that’s after 20+ years of C++, and a month of part-time Python). There will always be more to learn, of course, but I think I’m at the point where the learning curve is beginning to get less steep.
Python has been around for years (since the late 1980s, I was surprised to discover, although not in mainstream use until much more recently). I have used it occasionally for very simple scripts, usually where the larger ‘ecosystem’ of the project I’m involved with has also been Python-based.
However, it’s now becoming clear that Python has broken through as a mainstream language in the scientific community, and also specifically in computer vision and AI. OpenCV – my main area of work – has a good Python binding.
Time to learn this language more deeply then, I think. I have shied away from it a little until now, on the basis that C/C++ are bound to be faster for compute-intensive tasks such as vision. However, initial tests show that the Python binding is roughly as fast (perhaps because it is exactly that – a binding, to the core of OpenCV which is still C/C++). It may be the case that C/C++ will remain faster when much of the functionality of the application is above the OpenCV level – but if the application is mainly just calls to OpenCV, then perhaps C/C++ doesn’t have such an advantage.
I will be creating some test apps in both languages as a way of learning Python, and will post comparative results here in due course.
I was very pleased to be invited last weekend to a ‘wearable devices hackathon’, at Google Campus in central London. Having never been to such a thing before, I wasn’t sure what to expect.
We were asked to go with an Android app, ready to port to Google Glass. I have plenty of my own Android Apps, but I had little idea in advance of what capabilities Glass would offer in terms of what my apps need (namely: direct access to the camera, and plenty of CPU horsepower). Would I need to link to a back-end Android phone, and if so, how? Would the Mirror API offer what I needed? Would I be able to learn the GDK, which had been announced as a ‘sneak preview’ a few days earlier, quickly enough? Would I be twice the age of everyone else there? On a more serious note, how much would I be able to achieve in one day? I’ve had single bugs that have taken longer than that to fix. I am persistent, and don’t often give up on getting something working, but I rarely make promises as to how long something will take.
I went well-prepared: two laptops, both identically configured with the entire Android SDK / OpenCV4Android / Tesseract stack (see this post), and well-rehearsed in the code of my own app. It was an interesting day – we started at 9:00am with some short presentations, followed by splitting into teams. As we had gone with our own app, my client and I formed our own team. The day ended at midnight.
On the whole, the day and the end result was positive. Having got the initial ‘Let me try it on! Someone take my picture!’ moment out of the way (I had to join the queue…), it turned out that porting our app to run on Google Glass was not as difficult as I had feared. We didn’t need any Glass-specific code, other than installing the ‘sneak preview’ API into Eclipse. With some help from a couple of real (and very jet-lagged) Glass experts, our app was installed on a Glass device. With a couple of tweaks, it was running – quite a moment to see your own app running on brand new hardware.
We had one problem, which spoiled our day to some extent: the image that came back from the Glass camera was distorted, apparently due to a driver problem. A quick search revealed that other people had hit the same problem in the couple of weeks since the ‘sneak preview’ was released. Various work-arounds were suggested, but none that would work for OpenCV. As much as we tried to work around it, there was nothing to be done – the incoming image (from the point of view of our app) was garbled. A real shame, as our app itself was working well – debug windows, output to the Glass screen, and all – and furthermore, it was running reasonably fast: at least 2 frames per second, comparing favourably to the 3-4 I expect on a Samsung Galaxy S4. Not bad at all for a first version of a brand new headset.
So, we didn’t quite get there. However, I have since reported the bug to OpenCV, had the bug accepted, and it is slated to be fixed in the next release (2.4.8). At that point, we will be up and running on Glass, and ready to move on to tweaking the UI for Glass-specific gestures. My client has a specific market, that will open up fairly rapidly at that point.
[EDIT Nov 2014: My client recently took delivery of a new Google Glass device, with the new drivers, and the app we had developed worked immediately and very well. To quote him, “it works like a miracle”].
Summary: As soon as OpenCV 2.4.8 is out, we should be there – and we now know that Google Glass is a capable platform for running OpenCV4Android apps, on board (and how to achieve that). Exciting times ahead, I think.
For two client projects this summer, I’ve needed an OCR solution, and I’ve ended up using Tesseract. It seemed like the obvious choice – open source, in development since the 1980s, development ‘sponsored by Google’ since 2006, etc.
Initial signs were good. I installed both the command line tool and the SDK on Linux. Within 5 minutes I was getting results from the command line tool, and within an hour I was also getting results from my own test program using the API. Only another few minutes after that, and I had got it using images provided by OpenCV, rather than by Leptonica, which it uses by default. All was looking good.
But since then, things have gone downhill somewhat. Maybe I’m using it in a case that it isn’t really designed for, and/or maybe I haven’t put enough time into training it with the specific font in question.
My ‘use case’ (without giving away client-specific details) is that I’m trying to recognise a sequence of numbers and letters, which may not be dictionary words – they may be acronyms, or just ‘random’ strings, and in some cases will be individual letters.
For some characters it seems to work fairly well. In some of the cases where it doesn’t, the confusion is almost understandable: an upper-case ‘O’ does look a bit like an upper-case ‘D’, and I can understand it confusing the upper-case letter ‘I’ with the numeral ‘1’. But in other examples, it almost always seems to confuse upper-case ‘B’ and ‘E’, even when the difference (i.e. the right-hand side) is clearly visible. Why?!
For customisation, it seems to want training on languages, which I can understand – but surely there should be the option to just train it on a new font and have it simply recognise on a character-by-character basis too? There are options to switch off whole-word recognition, but they don’t seem to make much difference.
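For reference, the sort of options I mean: a config file along these lines, passed on the command line as `tesseract input.png output -psm 8 myconfig` (where `-psm 8` treats the image as a single word). The variable names are Tesseract 3.x config variables; the file name and character set here are just illustrative.

```
load_system_dawg	F
load_freq_dawg	F
tessedit_char_whitelist	ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
```

Disabling the dictionaries and whitelisting the expected characters helps a little with the non-dictionary strings, but in my tests it did not change the stubborn per-character confusions.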
Finally, the whole thing is very under-documented, and unstable. One wrong parameter, and the whole thing crashes without an error message. In particular, the training process is long, cumbersome, and then crashes without further explanation.
I’ve spent a lot of time on this recently, and am probably about to give up for now. On the plus side, I did get it working on Android, thanks to the tess-two library, but the OCR results themselves were of course the same.
I’m hoping Google will pump some serious resource into getting Tesseract up to scratch – or that someone will come up with a good (i.e. documented, stable, and working) open source alternative.
I spent a fair bit of time over the summer getting set up and acquainted with OpenCV4Android, starting with the Android SDK itself, and also including the NDK for native C++ development.
It’s quite a learning curve: you end up learning Eclipse, the ADT add-ons, the Android framework, Java, the NDK, and OpenCV4Android all more or less at the same time – and that’s assuming you’re happy with OpenCV and C++ to start with. Anyway, I have got there with plenty of help from the web. (I also have my own set of notes, taking me through the entire process step by step, which I have since followed twice on two new laptops.)
But I am now completely up and running on this environment, and have successfully ported a number of my own OpenCV apps to run on various Android devices. They run well, too – the processing horsepower available on a modern smartphone turns out to be more than enough to handle images from the on-board camera, and run ‘average’ OpenCV apps, and I am currently working on two such apps for clients.