AI That Replicates Human Vision – In 8 Steps


The process with which humans and animals use their eyes to see is fascinating. We can very quickly and accurately identify a multitude of items in our line of sight. 


Nowadays with the availability of 

  1. better and cheaper hardware & cloud (see the GPUs section)
  2. lots of data
  3. computer vision algorithms that have evolved and matured over the last 40 years
  4. more AI engineers 

we can have AI object detection models that are trained to quickly and accurately identify and locate objects in images and videos. 


Businesses can use AI and train object detection models to “see” faces, animals, plants, trees, vehicles and buildings. Objects can be counted and tracked digitally with relative ease like never before and thus computer vision is being used in products and services to solve real business problems.


The main uses of machine learning object detection are

  1. in agriculture to recognise plants and trees
  2. in medicine to recognise diseases from scans
  3. in the automotive industry to recognise vehicle dents from hail
  4. in retail to automatically identify what buyers put in their basket
  5. in geospatial remote sensing from satellites and drones – to identify buildings, crops and vehicles from satellite photos and aerial drone video footage


The lifecycle of building an AI object detection model is quite different from that of building other AI models or software platforms. In this article we’ll take you through each of the steps required, while sharing the tips and tricks we have learned over the years.


At a high level, you supply thousands of annotated images of the object you want your AI model to identify. After the AI model is trained on this annotated training data, it starts identifying (with a certain level of accuracy – not 100%) the objects that you trained it on. Here are the steps to build such a system.


1. Find The Right Team Of AI Engineers


Let’s not beat around the bush. AI is a vast subject. Machine learning (ML) is a subset of it, and deep learning is a subset of machine learning. Computer vision is one of the most popular uses of AI; it often relies on deep learning techniques, but it can also use ML techniques that do not involve deep learning.


Most AI engineers have a good knowledge of data science and are well versed in programming languages such as Python and R. These languages have a huge set of readily available libraries for AI and data science.


However, as with other skills, not all AI engineers out there are equal – both in experience and in their focus. So before you start an AI project make sure you partner up with the right team – that has experience with what you want to achieve.



2. Decide What AI To Use


The AI engineers will decide which object detection algorithm to use. This depends a lot on what the AI model is going to be used for. There are various deep learning tasks to choose from, and each successive task gives a deeper, more precise recognition of the objects in the image or video.


What You Need | Deep Learning Task
What object is present in the image/video? | Classification
Classification + where is the object? | Object Detection
Object detection + pixel-level classification | Semantic Segmentation
Semantic segmentation + classification of each instance | Instance Segmentation
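To make the difference between these four tasks more concrete, here is a minimal sketch of what the output of each one might look like for a tiny 4x3 image containing two dogs. The `Detection` class, the toy masks and the `count_instances` helper are our own illustrative assumptions, not the output format of any particular library.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure, for illustration only)."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


# Classification: a single label for the whole image.
classification = "dog"

# Object detection: a label plus a bounding box for every object found.
detections = [
    Detection("dog", (0, 0, 2, 2)),
    Detection("dog", (2, 1, 4, 3)),
]

# Semantic segmentation: one class id per pixel (0 = background, 1 = dog).
# The two dogs touch, so they merge into one "dog" region.
semantic_mask = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]

# Instance segmentation: one instance id per pixel (0 = background),
# so the two dogs stay separate even though they touch.
instance_mask = [
    [1, 1, 0, 0],
    [1, 1, 2, 2],
    [0, 0, 2, 2],
]


def count_instances(mask):
    """Count the distinct non-background instance ids in a mask."""
    return len({pixel for row in mask for pixel in row if pixel != 0})


dog_pixels = sum(sum(row) for row in semantic_mask)
print(classification, len(detections), dog_pixels, count_instances(instance_mask))
```

Note how the semantic mask can tell you how many pixels belong to dogs, but only the detection list and the instance mask can tell you that there are two of them.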


The algorithm you choose may also determine the type of labelling and the number of labelled images you will need to train the model.


3. Decide The Target Accuracy


No AI model will achieve 100% accuracy. However, given enough training data you can aim for good accuracy. In very broad terms, you can read accuracy levels as follows:


  • Around 67% – Minimum – you have a start.
  • Around 77% – Poor but ok – you can use it but you should improve it.
  • Around 87% – First acceptable accuracy. You can use such a model in production. 
  • Over 87% – Where you should target to be.


4. Do You Need To Build A Training Dataset?


An AI model is only as good as the data it is trained on. Garbage in, garbage out applies here too. So having a good training dataset is probably the most critical aspect of the whole process.


Having such a dataset does not always require procuring and labelling thousands of images. If you are lucky, there may be publicly available datasets that you can use to train your AI model. Three of the most common large-scale image datasets are COCO (Common Objects in Context), ImageNet and Google’s Open Images.


4.1 Labelling


If you cannot use a readily available dataset you will have to resort to labelling. Labelling is a laborious task. It involves humans labelling thousands of images, marking the objects that you want the model to recognise. It is an essential data preprocessing part of any supervised learning AI project.


Labelling instructions. Before you ask labellers to start labelling, make sure you have documented what they need to label and how. It is essential to prepare a document that sets out how to store the labelling results, how to tackle tricky “edge cases” and what level of detail the labeller should go to.


Bounding Boxes or Polygons

Bounding boxes are rectangles that the labellers draw around an object. When using polygons instead of bounding boxes, the annotator specifies all the vertices of the polygon. In many cases bounding boxes are enough to reach the required accuracy of the AI model, although polygons generally give better results. Choosing between boxes and polygons depends a lot on your use case and on the accuracy that you are after. The downside of using polygons is that the labelling exercise is slower, and often you do not have the luxury of investing a lot of time in labelling thousands of images.
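As a rough illustration of that trade-off, the sketch below takes a polygon annotation, derives the bounding box that would enclose it, and measures how much of the box the object actually fills. The function names and the triangular example object are hypothetical; the area is computed with the standard shoelace formula.

```python
def polygon_area(vertices):
    """Area of a simple polygon given as [(x, y), ...], via the shoelace formula."""
    twice_area = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def bounding_box(vertices):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around the polygon."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def box_area(box):
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


# Hypothetical triangular object, labelled with three polygon vertices.
triangle = [(0, 0), (10, 0), (0, 10)]
box = bounding_box(triangle)
fill_ratio = polygon_area(triangle) / box_area(box)
print(box, fill_ratio)
```

For this triangle only half of the bounding box is the object; the rest is background that a box label would still attribute to it. The lower this ratio is for your objects, the more a polygon label is likely to be worth the extra effort.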


A well-labelled dataset is the cornerstone of any AI platform.


Choosing A Labelling (Annotation) Tool


There are many labelling tools available. Some are free, some are paid. Some are SaaS platforms where the labelling is done through the browser, and with others the labelling is done locally on the labeller’s machine.


The most popular labelling tools are LabelMe, Label Studio, LabelImg, Labelbox, CVAT, VoTT, Hasty, V7, Make Sense, COCO Annotator and Scale AI.


If you want a simple free polygon labeller you can go for LabelMe. If you want a simple free bounding box labeller you can go for LabelImg.


One of the items to take care of is the output of your labelling tool. Most labelling tools output some kind of JSON, XML or plain-text file, but not all tools output the same format.


The most common formats are COCO JSON, TensorFlow TFRecord, Pascal VOC and YOLO Darknet TXT.
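Because the formats differ, you will often need a small conversion step between the labelling tool and the training code. As a minimal sketch, the snippet below converts a single bounding box from the COCO convention (`[x_min, y_min, width, height]` in pixels) to a YOLO Darknet TXT line (class id followed by the box centre and size, all normalised to the 0-1 range). The function names and the example image size are our own assumptions; a real converter would also need to read the COCO JSON file and map category ids.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """Convert a COCO box [x_min, y_min, width, height] in pixels to
    YOLO (x_center, y_center, width, height), normalised by the image size."""
    x_min, y_min, width, height = bbox
    x_center = (x_min + width / 2) / image_width
    y_center = (y_min + height / 2) / image_height
    return (x_center, y_center, width / image_width, height / image_height)


def yolo_line(class_id, bbox, image_width, image_height):
    """Format one object as a line of a YOLO Darknet .txt label file."""
    values = coco_to_yolo(bbox, image_width, image_height)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


# Hypothetical example: one object of class 0 in a 400x200 pixel image.
line = yolo_line(0, [100, 50, 200, 100], image_width=400, image_height=200)
print(line)  # 0 0.500000 0.500000 0.500000 0.500000
```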


AI-assisted labelling

Apart from getting human labellers to manually label thousands of images, you can also use AI to help you build a bigger training dataset.


Comparing Results From Annotators 


Labelling is challenging. Firstly, human labellers are prone to making mistakes. Secondly, even the most attentive labellers do not label images in exactly the same way. Even with the best intentions, they can easily misclassify parts of an image, and across thousands of images this gives rise to badly labelled data. To minimise this risk it is recommended to do two things:

  1. Before you start, make sure you have a user manual that clearly explains the classes to annotate and the process of annotation. Also make sure that each annotator is trained on at least 100 images before they go ahead and annotate thousands of images.
  2. Compare the results of different annotators. You can do this by sending a part of each batch of images to more than one labeller and then comparing the results on each of these specific images. If the labelling does not match, you’ll need to take action to resolve these inconsistencies.
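One common way to compare two annotators on the same image is intersection over union (IoU): the area where their boxes overlap divided by the total area the two boxes cover. Here is a minimal sketch, assuming for simplicity one box per image in `(x_min, y_min, x_max, y_max)` form; the 0.8 threshold and the function names are illustrative assumptions that you would tune to your own project.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when the boxes do not touch.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def flag_disagreements(labels_a, labels_b, threshold=0.8):
    """Return the images (labelled by both annotators) whose boxes overlap
    less than the threshold. Each argument maps image name -> box."""
    shared = sorted(set(labels_a) & set(labels_b))
    return [name for name in shared if iou(labels_a[name], labels_b[name]) < threshold]


# Hypothetical labels from two annotators for the same two images.
annotator_1 = {"img1.jpg": (0, 0, 10, 10), "img2.jpg": (0, 0, 10, 10)}
annotator_2 = {"img1.jpg": (0, 0, 10, 10), "img2.jpg": (5, 5, 15, 15)}

to_review = flag_disagreements(annotator_1, annotator_2)
print(to_review)  # ['img2.jpg']
```

Images that end up on the review list are good candidates for a discussion with the annotators, or for an update to the labelling manual.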



5. Building The AI Model


There are two philosophies to choose from when deciding which object detection algorithm to use. Both use convolutional neural networks (CNNs), a type of deep learning network that is typically trained with supervised learning.


  • A two-stage approach – based on classification
  • A single-shot approach


With the two-stage approach, the first step is to select the interesting regions, which are then classified using CNNs. The region-based CNN (R-CNN) family, which includes Fast R-CNN, Faster R-CNN and Mask R-CNN, is the most popular example of this. These algorithms are generally slower than the single-shot algorithms but tend to perform better on small objects.


YOLO (You Only Look Once), SSD (Single Shot Detector) and RetinaNet, on the other hand, are examples of the single-shot approach. They are based on regression: the prediction of the classes is done in one run for the whole image. These algorithms are generally faster than the two-stage algorithms and are thus used more frequently in real-time object detection. However, their accuracy on small objects is generally not as good as that of the two-stage algorithms.


A set of machine learning (ML) frameworks exists to make building an AI model faster and smoother. These frameworks bundle many of the most commonly used libraries in AI and data science. The most popular ML frameworks are


  • TensorFlow (developed by Google)
  • PyTorch (developed by Facebook, now Meta)
  • Keras


6. GPUs & The Processing Power You’ll Need


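AI models are typically trained and run on GPUs (graphics processing units). You can think of them as “special CPUs”: hardware originally designed for processing graphical data, whose ability to run many calculations in parallel also suits deep learning.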

There are a few approaches to getting GPU processing power:

  • Have a physical GPU that you own
  • Hire a server with a GPU (like hosting, but with a GPU)
  • Have access to GPUs in the cloud


The most common go-to options are Jupyter Notebooks running on your own machine or, in the cloud, Google Colab. Kaggle also offers a popular hosted notebook environment.


All the major cloud providers offer ways of getting GPU processing power and environments for Jupyter Notebooks. You can have a look at

  • Azure (Azure Machine Learning & GPU optimised VMs)
  • AWS (Amazon SageMaker)
  • Google (Vertex AI Workbench)


7. Training The AI Model


Once you finish your data labelling, have built the first version of your AI model, and have the necessary GPU processing power you can proceed to train the AI model.


This involves supplying the training set to the AI model so that it learns. 


To optimise the accuracy of the algorithm, the AI engineers will need to do some tuning, referred to as hyperparameter optimisation.
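The simplest form of this tuning is a grid search: train the model once for every combination of candidate settings and keep the combination that scores best on validation data. The sketch below shows the idea only. `train_and_evaluate` is a hypothetical stand-in that returns a made-up score instead of training a real model, and the hyperparameter names and values are assumptions chosen for illustration.

```python
from itertools import product


def train_and_evaluate(learning_rate, batch_size):
    """Stand-in for a real training run. It returns a made-up validation
    score that peaks at learning_rate=0.01 and batch_size=16."""
    return 1.0 - abs(learning_rate - 0.01) * 10 - abs(batch_size - 16) / 100


def grid_search(grid):
    """Try every combination in the grid and return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical candidate values for two common hyperparameters.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [8, 16, 32],
}
best_params, best_score = grid_search(grid)
print(best_params, best_score)
```

In a real project each call to `train_and_evaluate` is a full training run on the GPU, so the size of the grid directly drives the processing power and time you will need.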


8. Integrating The AI Model


When the AI model is built and trained, the project is not yet finished. You’ll need to integrate the output from the AI model into your product, your service, your database – wherever it is that the AI model was built for. This usually involves storing the output (what the AI model recognised in the input image) in a database, from where other systems and interfaces can access the result.
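As a minimal sketch of this integration step, the snippet below stores a model’s detections in a SQLite database using Python’s built-in `sqlite3` module, and then reads them back the way a downstream system might. The table layout, the column names and the example detections are all hypothetical; your own schema will depend on what the other systems need.

```python
import sqlite3


def save_detections(conn, image_name, detections):
    """Store one row per detected object (hypothetical schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS detections (
               id INTEGER PRIMARY KEY,
               image_name TEXT NOT NULL,
               label TEXT NOT NULL,
               confidence REAL NOT NULL,
               x_min REAL, y_min REAL, x_max REAL, y_max REAL
           )"""
    )
    conn.executemany(
        "INSERT INTO detections "
        "(image_name, label, confidence, x_min, y_min, x_max, y_max) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(image_name, d["label"], d["confidence"], *d["box"]) for d in detections],
    )
    conn.commit()


def count_by_label(conn, image_name):
    """Example of a downstream query: how many objects of each class were found."""
    rows = conn.execute(
        "SELECT label, COUNT(*) FROM detections "
        "WHERE image_name = ? GROUP BY label ORDER BY label",
        (image_name,),
    )
    return dict(rows.fetchall())


# Made-up model output for one image; an in-memory database keeps the demo self-contained.
conn = sqlite3.connect(":memory:")
save_detections(conn, "field_01.jpg", [
    {"label": "tree", "confidence": 0.93, "box": (10, 20, 110, 220)},
    {"label": "tree", "confidence": 0.88, "box": (150, 30, 240, 210)},
    {"label": "vehicle", "confidence": 0.71, "box": (300, 180, 380, 230)},
])
counts = count_by_label(conn, "field_01.jpg")
print(counts)  # {'tree': 2, 'vehicle': 1}
```

Storing the confidence alongside each detection lets the consuming system decide for itself which results are reliable enough to act on.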


Let’s Do It 

Nowadays AI models are being used by businesses to provide real-life solutions. While some of the 8 steps above might seem daunting, building and applying AI is very doable once you know the challenges you will face and have the right team to support you.


Smart Studios is here to help you all the way: we can help you with the whole process of incorporating AI within your products and services. If you would like to know more, get in touch via the Contact form and we will work with you through all the 8 steps mentioned above.