Computer Vision – a Developer’s Viewpoint

This article, written by me, originally appeared on the website of the Institution of Analysts and Programmers.

Computer Vision A Developer’s Viewpoint

Tom Reader
Alver Valley Software Limited

Computer Vision is an exciting branch of computer science for the programmer, combining mathematics, artificial intelligence and machine learning techniques, with more traditional programming skills and problem solving. It can require large amounts of computing horsepower, and as such the theory has often been ahead of what has been practically possible. These days however, the hardware has caught up, and real, practical applications are now hitting the mainstream. Tom Reader, a computer vision expert at Alver Valley Software Limited, gives us a look back at the history, and a glance at the future of this fascinating field.


Computer vision can be summarised as the science of turning an image into usable information. The classic example is probably Automatic Numberplate Recognition (ANPR): Given an image of a car, extract the number plate text. Given a picture of a face, classify it as male or female, identify the emotion they are showing, or maybe even try to recognise the specific person. Given an image of a level crossing, identify whether or not there is a vehicle stranded on the railway line. Given an image of a station platform, estimate the number of people present. Given an image of a part on a production line, is it perfect or does it have a defect? Given a film taken from inside a shop window, estimate the number (and gender, age, etc) of people who stop to look at the window display, and for how long it keeps their attention. In all cases, we’re trying to distil an image (high dimensional data) into a simpler representation (maybe even binary or one-dimensional: yes/no, male/female, etc).

So what’s the problem?

The problems are immense. For a start, a typical image might contain 10MB of data, usually split into 3 separate colour channels. In almost all cases, the images presented will be ‘noisy’ – for example, they may be out of focus, include reflections and other distractions, and some items may be obscured by other items. Camera lenses almost always add their own ‘errors’ (photographers will be familiar with barrel distortion and chromatic aberrations, for example), and an image that has been stored as a JPEG may have extra ‘artefacts’ added due to the compression.

Even given a clear image, the task of explaining to a computer, in a conventional programming language, how to read a line of text or identify a person, can probably be imagined by anyone familiar with programming of any sort. As an added complication, humans see colour in a very different way to how computers process it, so anything to do with colour matching has to take that into account.

And it’s so easy for us humans…

Part of the problem as a computer vision professional is that humans (and other animals) are incredibly good at vision – we have very highly adapted visual systems and powerful brains. In the case of humans, some of this begins from a very early age – a baby will look towards a human face almost from birth. A few years later, when a bit of knowledge and common sense has developed, a small child will be capable of looking at a photo (or real-world scene) and answering questions like “what colour is the car?”, or “how many cats can you see?”. The ease with which we do this makes it hard to explain to clients that these are very difficult problems for a computer system.

Basic techniques

The basic building blocks of computer (and probably animal) vision include low-level mathematical techniques. Changes between colour and brightness levels are analysed with the aim of detecting ‘edges’ and the ‘regions’ that they separate. Even this is a difficult problem, involving significant computing power, and some of the theory was only developed surprising recently. In humans, some of this is achieved by very low level processing, including some in the retina itself before the signal even reaches the brain. In the computer, it’s pixels all the way, and everything has to be programmed from there.

Having done the simple processing, things get worse. Given the set of edges and/or regions, which are ‘real’, and which are image artefacts? For example, given a photo of a bowl of fruit (one of the computer vision staple test images), which edges are the actual edges of pieces of fruit, and which are just the lines of shadows? A yellow grapefruit may look dark grey when viewed in deep shadow, and almost white at the point where the light reflects off it. A child of 4 could point at the grapefruit, but it’s not a simple problem to solve in software.

Other basic techniques include blurring, sharpening, resizing, contour finding, and converting to different image representations.

Looking for ‘features’

Often, the points of interest in an image boil down to a set of ‘features’. For example, a letter ‘A’ has a sharp point, a ‘B’ and a ‘D’ have some rounded corners and some sharp corners, a ‘C’ has a rounded corner and open ends, an ‘O’ has rounded corners but no point or ends, a ‘4’ and an ‘X’ both have crossed lines. Those ‘features’ can be extracted, and then used as inputs for higher-level processes to work with.

Higher level techniques

Having extracted features from an image, interesting techniques can then be applied. Artificial Intelligence (AI) techniques such as neural networks and support vector machines can be used to automatically ‘learn’ from a large enough set of training data, and then to classify future, previously unseen cases. But choosing the correct features in the first place (and writing the code to extract them) is essential, and then all the normal AI problems (e.g. how to train and test a neural network) still apply.

Vision libraries

Not all the code has to be written from scratch. There are libraries (both commercial, and open source) that can help with a lot of the leg-work. I use the OpenCV library, which is open source, and contains a large number of highly optimised routines for doing some of the low-level work, and also has some higher-level tools in the box. It is still just a tool-box though – although it includes some ready-to-go solutions straight ‘out of the box’, it is mostly a programming library.

Development techniques

Computer vision projects present some unique challenges to the developer. I work mostly as part of a team, but I tend to be the main (or only) computer vision programmer, with other people doing other parts of the task – user interface, back-end integration, communications, etc. From the computer vision perspective, I have found myself concentrating on a number of areas apart from the problem-solving aspects of the vision work itself.

Firstly, keep the basics in place. For example, I have become increasingly keen on my software writing a proper log, either to file on disk, or via callbacks to the calling program in the case of an API. Either way, I need to be able to see everything my program is doing, at a configurable level of detail – this is essential when trying to identify problems which will inevitably crop up. Also, documentation is essential, especially when working on multiple projects concurrently. I use Doxygen (with standard in-code comments, which I use without thinking) to give me a good overview of each program down to the class, function and parameter level.

Secondly, the computer vision process is almost always a workflow. Images pass through many processes (not always the same pathway, as images get processed), and the information gradually gets extracted to higher levels. Bearing this in mind, I write a series of ‘debug’ images to disk at pre-determined points, to aid understanding problems that will occur with some images later on. For example, did a problem occur because the image was too blurred or distorted, or was the image fine but the neural network came up with a wrong classification?

Thirdly, computer vision usually involves some aspect of machine learning. One thing this entails is huge volumes of data for training and testing. A typical sub-project may require training with 10,000 images, half of which do contain a certain object, and half of which don’t. This data all has to be generated or extracted somehow. I spend quite a bit of time designing, writing, and using a variety of in-house ‘data wrangling’ tools to handle and manage all this data, and of course also using standard tools such as those built into Linux where possible.

At the ‘project’ level, things also need managing. Often, I will initially be given a few hours to see if a task is feasible at all, and I will produce a ‘quick and dirty’ test to try out a few concepts with some simple test images. If it goes well, things may proceed to the next stage – basically to see how far we can get with this idea – so the code develops further. Later on, the project will get more serious, the images will become ‘noisier’ and more realistic, and the expectation of correct results will get closer to 100%. Finally, things get ready to go into production, almost always as part of a larger software infrastructure, with all normal expectations of stability, scalability (maybe into the cloud), manageability, performance, error handling, etc.

During this process, it is essential to remember that the code needs to match the stage in the project it embodies. During the ‘proof of concept’ stage, a few lines of uncommented Python may be enough to try things out, and we can let the error checking slide for a while. But if the project proceeds to later stages, it is essential to make sure the code quality keeps up. At some point the language decision needs to be made (computer vision is compute-intensive, both in terms of CPU and RAM, so C++ is popular), integration decisions need to be made, and documentation needs to be kept up to date.

Future applications

Away from the ‘traditional’ applications, computer vision is now finding new niche markets. For example ‘augmented reality’, where a computer superimposes certain information onto a ‘Head Up Display’ in the user’s vision, can be further improved with computer vision. Smartphones are now powerful enough to run some computer vision tasks. Other platforms also are, or will be: although Google Glass was only a prototype, and appears to have gone back to the drawing board for now, I wrote a simple ANPR application on that platform two years ago. No doubt similar platforms will hit the market eventually, and there are many computer vision possibilities.

Deep learning

As computing power and storage continues to grow, artificial intelligence improves, and computer vision advances. Given every image on the Internet, almost infinite computing power, and a few hundred experts in the field, it should be possible to create some really good computer vision applications – keep an eye on what Google are coming up with for details. They would love to be able to classify a set of images into folders such as ‘the beach’, ‘the kids’, ‘Christmas photos’, etc, and they have made huge progress towards that recently. That kind of ‘deep learning’ is almost a higherlevel of abstraction still – if and how those techniques will ‘filter down’ to the handling of computer vision for specific tasks, remains to be seen.

For now, with ever increasing computing power and research results, I can’t think of any more rewarding branch of computer science in which to work.

Tom Reader

Alver Valley Software Limited

Colour and pattern matching

Lots of work for a client recently in the area of colour matching, and moving on from there, the more challenging problem of matching patterns (i.e. groups of colours, arranged into shapes or regions of different sizes).

Although OpenCV provides the basic tools needed to work with colour (primarily, the ability to switch between colour spaces like RGB and HSV), matching colours is more challenging than it first seems.  In this instance, we were trying to match colours as the human eye would perceive them, which is different from the simple measure of how ‘close’ two colours are in a simple RGB cube.  From there, a further complication is that much of the matching was done from photographs taken in the ‘real world’, where lighting and shading varies widely – even something which is the ‘same colour’ across all its surface may appear very different at different points in the photo, or across several photos.  Deciding what is a difference in colour, as opposed to a difference in lighting, is tricky, but there are techniques that can be used to help.  In our case, we also had to decide which part of the image was the ‘object’, and which was the background to be ignored completely.

Moving on to matching patterns adds further complexity.  Is it more important to match colours as exactly as possible, regardless of size/shape of the pattern, or vice versa?  We created a test set containing around 20,000 images to experiment with the best ways of finding a pattern that most closely matches another pattern, and have found a good balance – but it’s not a simple problem.

2D point matching (or pose estimation) based just on relative position

For a client’s project recently I needed to be able to correlate the positions of a number of points in 2D space from one image to another.  These aren’t ‘features’ as usually handled by routines such as SIFT/SURF/ORB, etc, but just 2D points with no other attributes at all.  Most of the points in image A will be present in image B, but  some may be missing, there may be extras, and image B may be rotated, scaled, and translated in x,y, by any amount.

It turns out to be quite an interesting problem.  Luckily the number of points in the images were fairly small (<50), so a brute force approach works – and it does work well.  As long as the most of the points in image A are present in image B, in something close to the same positions relative to each other, then they are found  correctly in image B – and importantly the algorithm knows which points correlates with which, so the position of each point can be ‘followed’ into the new image.  Finally, it returns the amount of rotation, scaling and translation applied.

If there were a large number of points, the efficiency of the algorithm would be a problem, and a different approach would be needed, but for this application it has worked very well.

Fourier Transforms for the non-mathematician

There are a few areas of computer vision and image processing where a little bit of maths is hard to avoid. Luckily for me (I’m no mathematician) these are few and far between – in most cases these days, either the maths is not too advanced, or the popular libraries (such as OpenCV) help hide the worst of it and let us get on with being ‘practitioners’.

However, one exception that keeps cropping up is Fourier Transforms. They are everywhere in computer vision, and for good reason: they help solve a lot of problems (I’ll write another post about this when time allows, but my current project has been revolutionised by using Fourier Transforms).

However, almost all explanations plunge straight into maths, involving the so-called complex numbers: the square root of minus one, and all that. The simple truth is that my school maths (hi, Mr. Feakes!) didn’t equip me for this, and I strongly suspect I’m not alone. While OpenCV helps hide the real nuts and bolts, an intuitive explanation of what is going on is essential to help decide when to use this tool, and just on a basic level, how it works.

So I was very pleased to find the following: An Intuitive Explanation of Fourier Theory, with pictures, and no hard stuff. Just enough for me to understand intuitively how this works – perfect. Thanks to Steven Lehar for writing it.

EDIT 2014-04-16: Having been in touch with Steven to thank him personally, he has recommended a number of other articles for people who, like me, prefer ‘intuitive’ approaches to things. In particular, I’m looking forward to studying two – one of his own, and one other he recommends:

A Visual, Intuitive Guide to Imaginary Numbers

Clifford Algebra: A visual introduction

Thanks again, Steven.