ONNX files in OpenCV

I have been aware of OpenCV’s ‘dnn’ module for some time: Last time we tried to use it in a project was a number of years ago, and it didn’t seem to be ready for what we needed – or perhaps we just misunderstood it and didn’t give it a good enough look.

Aside from that, I’ve been using .ONNX (Open Neural Network eXchange) files for a while now. My standard usage of these is to transport a trained model from PyTorch – for example a ResNet classifier – onto a Jetson Nano, NX or Orin. PyTorch can export as .ONNX, and TensorRT on the Jetson can import them, so it’s been literally an ‘exchange’ file format for me.

However, pulling these two things together, I have recently learned that OpenCV’s ‘dnn’ module can load directly from .ONNX files, specifically including ResNet models such as the ResNet18 classifier I have recently trained for a client.

There are a few ‘tricks’ required to prepare images to be classified, and it took me a fair amount of research (including some trial-and-error, and using ChatGPT – that was a day I can never get back…) but it works now: I can classify images, using a ONNX file, in OpenCV, from either C++ or Python.

This means that models that I originally trained for Jetson hardware can now be used on any platform with OpenCV. I will be testing this on a Raspberry Pi 5 shortly to gauge performance.

Currently, it’s using CPU only – but it does use all CPU cores available – but I believe GPU is also supported given a suitably-compiled OpenCV: I may try that next.

Cyber Essentials certification achieved

We’re pleased to say we recently achieved Cyber Essentials certification – showing that we take protection of customers’ data (including valuable machine learning training data and source code) seriously. We were happy to find we were already compliant in most areas, but we reviewed our policies and procedures, including strengthening them in a couple of specific cases. A useful process to go through.

We chose to work with CSIQ (https://www.csiq.co.uk/) as our certification partner, and strongly recommend their services.

Jetson hardware, and the ‘jetson-inference’ package

I have been involved in several projects very recently (and two ongoing) where we have used NVIDIA ‘Jetson’ hardware (Nano, Xavier / NX, and ConnectTech Rudi NX). These machines are roughly ‘credit-card sized’ (apart from the Rudi, which has a larger but very ‘rugged’ case) and are ideal for ‘edge’ or embedded systems.

The Jetson hardware is basically a small but powerful GPU, but also including a CPU and small ‘motherboard’ providing the usual USB ports, etc. They run a modified version of Ubuntu Linux.

In some cases I developed software in-house using OpenCV (C++ and Python). However, I am also making more and more use of the excellent ‘jetson-inference’ library of deep-learning tools, and have now built up quite a bit of experience in using this library and developing applications and solutions based on it.

In short, it is very good for developing solutions that need:

Image classification (e.g. cat vs dog, or labrador vs poodle, or beach vs park)
Object detection (i.e. accurate location, and classification of objects – can be trained to recognise new objects, including very small/distant)
Pose estimation (e.g. standing, sitting, walking, pointing, waving)

I have now developed a number of solutions that have ‘gone live’ using this hardware and toolkit. I am also experienced in the ‘back end’ tasks of training new ‘models’ to recognise new, specfic classes of objects, and porting those models to the Jetson hardware.

Please contact me to discuss whether I can help you with your Jetson-based project. tom [at] alvervalleysoftware.com.

How to install NVIDIA drivers and CUDA on Ubuntu 18.04

Settings -> Graphics: If it shows something ‘generic’ like ‘NV136’, then it’s not using NVIDIA drivers.

Go to ‘Software & Updates’ -> ‘Additional Drivers’. Select a recent/recommended one – I use ‘nvidia-driver-410’ at the time of writing (Mar 2019).

Let it update then reboot. If it’s worked, the command ‘nvidia-smi’ should show GPU status information.

NOTE: If it doesn’t work (i.e. comes up in low-resolution mode, and the ‘Settings->Graphics’ still shows ‘NV136’ or ‘llvm…’, then it’s not worked. IF THAT HAPPENS, the most likely fix is to DISABLE SECURE BOOT in the BIOS – secure boot stops some drivers being loaded. This was the problem on my machine, and disabling secure boot made the driver I’d already installed work.

Following CUDA installation instructions on NVIDIA site: Use Ubuntu package manager version

sudo apt-get install cuda

CUDA toolkit seems to be separate?

sudo apt install nvidia-cuda-toolkit

As per the CUDA post-install instructions – add the NVIDIA dir to PATH, install samples, compile them, run them.

OpenCV on CUDA

I recently had the opportunity to do some work on an NVidia Jetson TK1 – a customer is hoping to use these (or other powerful devices) for some high-end embedded vision tasks.

First impressions of the TK1 were good. After installing the host environment on a Ubuntu 14.04 64 bit box and ‘flashing’ the TK1 with the latest version of all the software (including Linux4Tegra), I ran some of the NVidia demos – all suitably impressive. It has an ARM quad-core CPU, but the main point of it is the CUDA GPU, with 192 cores, giving a stated peak of 326 GFLOPS – not bad for board that is under 13cm square. It’s a SIMD (Single Instruction Multiple Data) processor, also known as a vector processor – so their claim that it is a ‘minisupercomputer’ isn’t too wildly unrealistic – although just calling it a ‘graphics cards with legs’ would also be fair.

I wrote some sample OpenCV programs using OpenCV4Tegra, utilising the GPU and CPU interchangeably, so we could do some performance benchmarks. The results were OK, but not overwhelming. Some code ran up to 4x faster on the GPU than the CPU, while other programs didn’t see that much benefit. Of course, GPU programming is quite different from CPU programming, and not all tasks will ‘translate’ well to a vector processor. One task we need in particular – stereo matching – might benefit more.

We will do more work on this in due course. We will also be comparing the processing power to a Raspberry Pi 3, and some Odroids, as part of our evaluation of suitable hardware for this demanding embedded project. More results will be posted here as we get them.

Why is the recent ‘Go’ victory so important for AI? (Part 2)

(Since I wrote Part 1 of this article, the ‘AlphaGo’ AI won the 5th game in the series, giving a 4:1 victory over one of the top human players, Lee So-dol).

We have already discussed how ‘Go’ is much more difficult for a computer to play than Chess – mainly because the number of possible different moves per turn is so much bigger (and so the total ‘game space’ is even more vast), and because deciding how ‘good’ a particular board position is, is so much harder with ‘Go’.

First, let’s address one of the points the mainstream press have been making: No, the ‘artificial intelligence’ computers are not coming to get us and annihilate the human race (I’ve seen articles online that pretty much implied this was the obvious next step). Or at least, not because of this result. ‘Go’ is still a ‘full information, deterministic’ game, and these are things computers are good at, for all ‘Go’ is about as hard as such games get. This is very different from forming a good understanding of a ‘real world’ situation such as politics, business or even ‘human’ actions such as finding a joke funny, or enjoying music.

But back to ‘Go’. With Chess, the number of possible moves per turn means that looking at all possible moves beyond about 6 moves out is not a sensible approach. So, pre-programmed approaches (‘heuristics’) are used to decide which moves can safely be ignored, and which need looking at more closely.

With ‘Go’, even this is not possible, as no simple rules can be programmed. So, how did ‘AlphaGo’ tackle the problem?

The basic approach (searching the ‘game tree’) remained similar, but more sophisticated. Decisions about which parts of the tree to analyse in more detail (and which to ignore) were made by neural networks (of which more later).

Similarly, the ‘evaluation function’ which tries to ‘score’ a given board position had to be more sophisticated than for Chess. In Chess, the evaluation function is usually written (i.e. programmed into the software) by humans – indeed, in the 1997 Kasparov match won by IBM’s Deep Blue, the evaluation function was even changed between games by a human Grand Master, a cause of some controversy at the time (i.e. had the computer really won ‘alone’, or had the human operators helped out, albeit only between games).

In ‘AlphaGo’, another neural network (a ‘deep’ NN) was employed to analyse positions. And here lies the real difference. With AlphaGo, the software analysed a vast number of real games, and learned by itself what are features of good board positions. Having done this, it then played against itself in millions more games, and in doing so was able to fine tune this learning even further.

It learned how to play ‘Go’ well, rather than being programmed.

This ‘deep neural network’ approach is the hallmark of many modern ‘deep learning’ systems. ‘Deep’ is really just the latest buzzword, but the underlying concept is that the software was able to learn – and not just learn specific features, like a traditional neural network, but also to learn which features to choose in the first place, rather than having features hand-selected by a programmer.

We’ve probably got to the stage now where the perennial argument – are computers ‘really intelligent’, or just good at computing – has become fairly irrelevant. AI systems are now able to not only learn a given set of features, but to choose those features themselves – this is how human (and other animal) brains work. This is undoubtedly a very powerful technique, which will guide the future of AI for the next few years.

Why is the recent ‘Go’ victory so important for AI? (Part 1)

Anyone who has seen any news in the last few days will know that a computer has for the first time beaten a top human player at the ancient Chinese game of ‘Go’. In fact, at the time of writing, the AI (let’s call it by its name: AlphaGo) has beaten its opponent 3 times, and the human (Lee So-dol) has won one – the fifth in the series takes place shortly. But why is this such important news for AI?

After all, AI has been beating top grand-masters at Chess for a while now – Gary Kasparov was beaten by a computer in 1997, and although the exact ‘fairness’ of those matches has been questioned by some, it’s certainly been the case since about 2006 that a ‘commercially available’ computer running standard software can beat any human player on the planet.

So why is ‘Go’ so different? In many ways, it’s a very similar game. It’s ‘zero-sum’ (meaning one player’s loss exactly matches the other players gain), deterministic (meaning there is no random element to the game), partisan (meaning all moves are available to both players), and ‘perfect-information’ (meaning both players can see the whole game state – there are no hidden elements or information). Just like Chess.

From an AI point of view, two things make ‘Go’ vastly more difficult than Chess.

Firstly, the board is a lot bigger (19×19), meaning that the average number of legal moves per turn is around 200 (compared to an average of 37 for chess). This means that the ‘combinatorial explosion’ (which makes chess difficult enough) is much worse for ‘Go’: to calculate the next 4 moves (2 each for each player) would need 320,000,000,000 board positions to be analysed – and looking ahead 2 moves each would give a pathetically weak game.

The second factor is that for Chess, analysing the ‘strength’ of a board position is fairly easy. The material ‘pieces’ each player owns are all worth something that can be approximated with a simple scoring system, and that can be made more elaborate with some simple extra strategic rules (knights are more valuable near the centre, pawns are best arranged in diagonals, etc). But for ‘Go’, a simple ‘piece counting’ system is nothing like a useful enough indicator of the advantage a player has in the game, and no ‘simple rules’ can be written which help.

Instead, good human players (and even relative amateurs) can assess a board position, more or less just by using their intuition, and that intuition is where a lot of the best play comes from. Computers, of course, are not well known for their use of ‘intuition’.

I’ll write more about the approach ‘AlphaGo’ used – and why this has wider implications for AI in general – in a follow-up article in the next few days.

Computer Vision – a Developer’s Viewpoint

This article, written by me, originally appeared on the website of the Institution of Analysts and Programmers.

Computer Vision A Developer’s Viewpoint

Tom Reader
Alver Valley Software Limited

Computer Vision is an exciting branch of computer science for the programmer, combining mathematics, artificial intelligence and machine learning techniques, with more traditional programming skills and problem solving. It can require large amounts of computing horsepower, and as such the theory has often been ahead of what has been practically possible. These days however, the hardware has caught up, and real, practical applications are now hitting the mainstream. Tom Reader, a computer vision expert at Alver Valley Software Limited, gives us a look back at the history, and a glance at the future of this fascinating field.

Introduction

Computer vision can be summarised as the science of turning an image into usable information. The classic example is probably Automatic Numberplate Recognition (ANPR): Given an image of a car, extract the number plate text. Given a picture of a face, classify it as male or female, identify the emotion they are showing, or maybe even try to recognise the specific person. Given an image of a level crossing, identify whether or not there is a vehicle stranded on the railway line. Given an image of a station platform, estimate the number of people present. Given an image of a part on a production line, is it perfect or does it have a defect? Given a film taken from inside a shop window, estimate the number (and gender, age, etc) of people who stop to look at the window display, and for how long it keeps their attention. In all cases, we’re trying to distil an image (high dimensional data) into a simpler representation (maybe even binary or one-dimensional: yes/no, male/female, etc).

So what’s the problem?

The problems are immense. For a start, a typical image might contain 10MB of data, usually split into 3 separate colour channels. In almost all cases, the images presented will be ‘noisy’ – for example, they may be out of focus, include reflections and other distractions, and some items may be obscured by other items. Camera lenses almost always add their own ‘errors’ (photographers will be familiar with barrel distortion and chromatic aberrations, for example), and an image that has been stored as a JPEG may have extra ‘artefacts’ added due to the compression.

Even given a clear image, the task of explaining to a computer, in a conventional programming language, how to read a line of text or identify a person, can probably be imagined by anyone familiar with programming of any sort. As an added complication, humans see colour in a very different way to how computers process it, so anything to do with colour matching has to take that into account.

And it’s so easy for us humans…

Part of the problem as a computer vision professional is that humans (and other animals) are incredibly good at vision – we have very highly adapted visual systems and powerful brains. In the case of humans, some of this begins from a very early age – a baby will look towards a human face almost from birth. A few years later, when a bit of knowledge and common sense has developed, a small child will be capable of looking at a photo (or real-world scene) and answering questions like “what colour is the car?”, or “how many cats can you see?”. The ease with which we do this makes it hard to explain to clients that these are very difficult problems for a computer system.

Basic techniques

The basic building blocks of computer (and probably animal) vision include low-level mathematical techniques. Changes between colour and brightness levels are analysed with the aim of detecting ‘edges’ and the ‘regions’ that they separate. Even this is a difficult problem, involving significant computing power, and some of the theory was only developed surprising recently. In humans, some of this is achieved by very low level processing, including some in the retina itself before the signal even reaches the brain. In the computer, it’s pixels all the way, and everything has to be programmed from there.

Having done the simple processing, things get worse. Given the set of edges and/or regions, which are ‘real’, and which are image artefacts? For example, given a photo of a bowl of fruit (one of the computer vision staple test images), which edges are the actual edges of pieces of fruit, and which are just the lines of shadows? A yellow grapefruit may look dark grey when viewed in deep shadow, and almost white at the point where the light reflects off it. A child of 4 could point at the grapefruit, but it’s not a simple problem to solve in software.

Other basic techniques include blurring, sharpening, resizing, contour finding, and converting to different image representations.

Looking for ‘features’

Often, the points of interest in an image boil down to a set of ‘features’. For example, a letter ‘A’ has a sharp point, a ‘B’ and a ‘D’ have some rounded corners and some sharp corners, a ‘C’ has a rounded corner and open ends, an ‘O’ has rounded corners but no point or ends, a ‘4’ and an ‘X’ both have crossed lines. Those ‘features’ can be extracted, and then used as inputs for higher-level processes to work with.

Higher level techniques

Having extracted features from an image, interesting techniques can then be applied. Artificial Intelligence (AI) techniques such as neural networks and support vector machines can be used to automatically ‘learn’ from a large enough set of training data, and then to classify future, previously unseen cases. But choosing the correct features in the first place (and writing the code to extract them) is essential, and then all the normal AI problems (e.g. how to train and test a neural network) still apply.

Vision libraries

Not all the code has to be written from scratch. There are libraries (both commercial, and open source) that can help with a lot of the leg-work. I use the OpenCV library, which is open source, and contains a large number of highly optimised routines for doing some of the low-level work, and also has some higher-level tools in the box. It is still just a tool-box though – although it includes some ready-to-go solutions straight ‘out of the box’, it is mostly a programming library.

Development techniques

Computer vision projects present some unique challenges to the developer. I work mostly as part of a team, but I tend to be the main (or only) computer vision programmer, with other people doing other parts of the task – user interface, back-end integration, communications, etc. From the computer vision perspective, I have found myself concentrating on a number of areas apart from the problem-solving aspects of the vision work itself.

Firstly, keep the basics in place. For example, I have become increasingly keen on my software writing a proper log, either to file on disk, or via callbacks to the calling program in the case of an API. Either way, I need to be able to see everything my program is doing, at a configurable level of detail – this is essential when trying to identify problems which will inevitably crop up. Also, documentation is essential, especially when working on multiple projects concurrently. I use Doxygen (with standard in-code comments, which I use without thinking) to give me a good overview of each program down to the class, function and parameter level.

Secondly, the computer vision process is almost always a workflow. Images pass through many processes (not always the same pathway, as images get processed), and the information gradually gets extracted to higher levels. Bearing this in mind, I write a series of ‘debug’ images to disk at pre-determined points, to aid understanding problems that will occur with some images later on. For example, did a problem occur because the image was too blurred or distorted, or was the image fine but the neural network came up with a wrong classification?

Thirdly, computer vision usually involves some aspect of machine learning. One thing this entails is huge volumes of data for training and testing. A typical sub-project may require training with 10,000 images, half of which do contain a certain object, and half of which don’t. This data all has to be generated or extracted somehow. I spend quite a bit of time designing, writing, and using a variety of in-house ‘data wrangling’ tools to handle and manage all this data, and of course also using standard tools such as those built into Linux where possible.

At the ‘project’ level, things also need managing. Often, I will initially be given a few hours to see if a task is feasible at all, and I will produce a ‘quick and dirty’ test to try out a few concepts with some simple test images. If it goes well, things may proceed to the next stage – basically to see how far we can get with this idea – so the code develops further. Later on, the project will get more serious, the images will become ‘noisier’ and more realistic, and the expectation of correct results will get closer to 100%. Finally, things get ready to go into production, almost always as part of a larger software infrastructure, with all normal expectations of stability, scalability (maybe into the cloud), manageability, performance, error handling, etc.

During this process, it is essential to remember that the code needs to match the stage in the project it embodies. During the ‘proof of concept’ stage, a few lines of uncommented Python may be enough to try things out, and we can let the error checking slide for a while. But if the project proceeds to later stages, it is essential to make sure the code quality keeps up. At some point the language decision needs to be made (computer vision is compute-intensive, both in terms of CPU and RAM, so C++ is popular), integration decisions need to be made, and documentation needs to be kept up to date.

Future applications

Away from the ‘traditional’ applications, computer vision is now finding new niche markets. For example ‘augmented reality’, where a computer superimposes certain information onto a ‘Head Up Display’ in the user’s vision, can be further improved with computer vision. Smartphones are now powerful enough to run some computer vision tasks. Other platforms also are, or will be: although Google Glass was only a prototype, and appears to have gone back to the drawing board for now, I wrote a simple ANPR application on that platform two years ago. No doubt similar platforms will hit the market eventually, and there are many computer vision possibilities.

Deep learning

As computing power and storage continues to grow, artificial intelligence improves, and computer vision advances. Given every image on the Internet, almost infinite computing power, and a few hundred experts in the field, it should be possible to create some really good computer vision applications – keep an eye on what Google are coming up with for details. They would love to be able to classify a set of images into folders such as ‘the beach’, ‘the kids’, ‘Christmas photos’, etc, and they have made huge progress towards that recently. That kind of ‘deep learning’ is almost a higherlevel of abstraction still – if and how those techniques will ‘filter down’ to the handling of computer vision for specific tasks, remains to be seen.

For now, with ever increasing computing power and research results, I can’t think of any more rewarding branch of computer science in which to work.

Tom Reader

Alver Valley Software Limited

www.alvervalleysoftware.com

Distributed Computing

I have been an enthusiastic supporter of distributed computing projects since the early days of SETI@Home. The computing power in a modern PC is too useful to waste, and can now be harnessed to help research some of mankind’s most serious problems. As such, I now ‘crunch’ mostly for the World Community Grid – all Alver Valley Software machines are on 24×365, running life sciences projects (particularly cancer research). This work is done in the background, i.e. in addition to the normal usage of my computer, with no direct involvement from me at all. My output is usually somewhere around 100 Gflops – my statistics summary can be seen below. More importantly, if you have a PC that is switched on for more than a couple of hours a day, please consider joining the project – just click on the ‘World Community Grid’ graphic below to read more.

Computer Vision development

Often computer vision projects begin with a very early proof-of-concept, or at best a prototype, to see whether something is going to be feasible. Hopefully it turns out to be, so things then progress to a ‘see how much we can get working’ stage. As long as that goes well, then things eventually move towards a live production release. During this process, the images I’m expected to handle can become more and more ‘real world’: I start to get more noisy images, focus problems, reflections – the nice clean images I was originally testing with seem trivial now. In response, the code (and the computer vision workflow it embodies) gets ever more complex, often evolving from the original prototype, sometimes as a complete rewrite. This is a familiar pattern I’ve experienced across many projects.

Managing this process in a sensible way is something I’ve learned is very important. It’s essential to be aware of how the project is moving from proof-of-concept to production, and to make sure the code-base keeps up. Running multiple projects often in parallel, it’s even more important to keep documentation up to date, and to generally ‘run a tight ship’ on the development front.

Much of this is just good development practice, while some is more specific to computer vision projects. Some specific things I do include:

Write a program log, with variable-level tracing showing what is going on. I do this in a formalised way so I can identify problem areas in the workflow for a given image very quickly.
Write debug versions of the images as they move through the workflow, again in a defined and standard way.
Comment code in a standard way.
Use tools such as Doxygen to document the program automatically. The class and function dependencies diagrams alone are worth the small amount of effort.
Use Git to manage the version control, locally, and on the client’s system as well if appropriate. This one has been a learning curve for me, but is paying off now.
Maintain a document giving a brief overview of the whole system layout.

I don’t use truly ‘formal’ development methods – they don’t bring benefits to most projects of the type I’m involved with – but a little bit of good development practice goes a long way, and means the client ends up getting a documented, maintainable code-base and a production-ready solution.