
Object Detection using TensorFlow

  • Vishnuvardhan Reddy
  • Jul 24, 2018
  • 8 min read

Today I want to discuss some of the problems in the area of Computer Vision, which I list below:

1. Semantic Segmentation

2. Image Localization and Classification

3. Object Detection

4. Instance Segmentation

I would like to give a brief idea of each problem and how complex it is, and discuss some potential approaches to solving it. So, let's get started.

Problem 1:

Semantic Segmentation:

Given an image, I want to know which objects are present in it; this can be considered a multi-class classification for every pixel in the image. In the figure below, the city image on the left is the input and the image on the right is the segmented output, where I want to predict the objects or regions present: sky, left facade, right facade, background, street, and so on. I hope that makes the problem clear.

How do we solve this problem?

Some strategies/approaches to solve this problem are:

As a computer science student, when I solve algorithmic problems I start with a brute-force approach and then incrementally improve the complexity. I learned and attacked the semantic segmentation problem in the same way.

The brute-force technique:

1. What about converting this problem of image segmentation into classification?

Approach 1:

Yes: crop a patch around each pixel, feed each crop into a Convolutional Neural Network, and predict the class label for that pixel. That gives me a class label for every pixel, and I am done. I can be fairly successful solving the problem through this exhaustive search, known as the "Sliding Window Technique" in the image processing and computer vision literature.

Shortcomings of the Sliding Window Technique:

From here on I will refer to a Convolutional Neural Network as a CNN, and although there are many brilliant architectures, I will discuss the approach using AlexNet.

OK, now am I done with the problem? Let's discuss the trade-off. The practicality of this algorithm depends heavily on speed: the faster your approach, the better your technique. How many pixels might I have? Say the image is W*H*C, where W = width of the image, H = height, and C = number of channels. I now pass W*H crops (one per pixel) through the CNN and get the W*H outputs I need, but look at the number of computations. For many real-world applications we use even more complicated models such as ResNet or Inception (GoogLeNet), which are really big, so this is computational overkill.
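To make the cost concrete, here is a minimal TensorFlow/Keras sketch of the sliding-window idea (my own illustrative code; the patch size, layer sizes, and function name are assumptions, not taken from any particular paper). The double loop makes the W*H forward passes explicit.

import numpy as np
import tensorflow as tf

PATCH = 32          # crop size around each pixel (illustrative assumption)
NUM_CLASSES = 10    # illustrative class count

# A small CNN that labels a fixed-size patch by the class of its centre pixel.
patch_classifier = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(PATCH, PATCH, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

def segment_by_sliding_window(image):
    """Label every pixel by classifying a crop centred on it."""
    h, w, _ = image.shape
    pad = PATCH // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    labels = np.zeros((h, w), dtype=np.int32)
    for y in range(h):              # one forward pass per pixel:
        for x in range(w):          # W * H passes in total -- the overkill
            crop = padded[y:y + PATCH, x:x + PATCH][None]
            labels[y, x] = int(np.argmax(patch_classifier.predict(crop, verbose=0)))
    return labels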

Approach 2:

Fully Convolutional Approach: what do I mean by fully convolutional? Let's dive in.

Maintain the same spatial size throughout the architecture, so that the output is a tensor of size W*H*P, where P = number of predictions, i.e. the number of classes in the dataset.
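As a rough sketch of what that looks like in TensorFlow/Keras (the layer widths and the class count P are illustrative assumptions): every convolution uses "same" padding with stride 1, so the W*H spatial size is preserved, and a final 1x1 convolution produces the W*H*P score tensor.

import tensorflow as tf

NUM_CLASSES = 10   # P in the text (illustrative)

def build_fcn(height, width):
    inputs = tf.keras.Input(shape=(height, width, 3))
    # Stride-1, "same"-padded convolutions keep the spatial size at W x H.
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    # A 1x1 convolution gives per-pixel class scores: an H x W x P output tensor.
    outputs = tf.keras.layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_fcn(height=224, width=224)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")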

"Credits Refereed from CS231 Stanford cnn course":

Shortcomings of this approach:

If you observe the network architecture, we keep the same spatial size throughout, and the output is a tensor of size W*H*P, where P = number of classes in the dataset. Now imagine the amount of computation: if you never reduce the spatial size, the number of multiplications explodes and the network carries millions of weights, all of which must be updated during backpropagation, so this method is not efficient either. Look at the AlexNet structure: for most classification tasks we downsample at every layer and produce the predictions with a fully connected layer, whereas this approach asks us to keep the full spatial size all the way through.

Approach 3:

Downsampling and Upsampling Architecture:

Some brilliant ideas worth mentioning in this architecture are:

1. Deconvolution or Transpose convolution.

2. Learnable upsampling and downsampling.

If you want to learn more about deconvolution, I recommend the link below. I borrowed this from Dr. Dhruv Batra, a professor at Virginia Tech, and I personally benefited from this video. (In one word: phenomenal.)

YouTube link: https://www.youtube.com/watch?v=Xk7myx9_OmU
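For a concrete picture of these two ideas, here is a small, hedged Keras sketch of a downsample-then-upsample network: strided convolutions shrink the feature map, and Conv2DTranspose layers (the learnable upsampling / "deconvolution" mentioned above) bring it back to the input resolution. All layer sizes are illustrative assumptions.

import tensorflow as tf

NUM_CLASSES = 10   # illustrative

def build_encoder_decoder(size=224):
    inputs = tf.keras.Input(shape=(size, size, 3))
    # Downsampling path: stride-2 convolutions halve the spatial size twice.
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    # Upsampling path: learnable transpose convolutions restore the resolution.
    x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
    # Per-pixel class scores at the original W x H resolution.
    outputs = tf.keras.layers.Conv2D(NUM_CLASSES, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_encoder_decoder()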

Problem 2:

Image Localization and Classification:

Let's make the problem slightly harder than the previous one.

Problem statement is:

" If i give an input image find the object in the image as well as the location of the image" by drawing a bounding box around it?

In localization and classification, in addition to predicting the class of the object, you need to identify its location within the image.

How can I do it?

Approach 1:

This is a great approach of parallel divergence (I am not sure a technique called "parallel divergence" actually exists!); to make it clear, the idea is to perform classification and regression simultaneously. The architecture below will give you a well-rounded understanding of what I am talking about.

Here you want to predict the class label of the object as well as its location in the image, so you need two loss functions: cross-entropy (multi-class log loss) for the class, and L2 loss (squared loss or Euclidean loss) for the bounding box.
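To make the two-loss idea concrete, here is a hedged Keras sketch with a shared backbone and two heads: a softmax head trained with cross-entropy for the class label, and a 4-number regression head trained with L2 (mean squared error) loss for the box. The backbone, the head names, and the box encoding are illustrative assumptions.

import tensorflow as tf

NUM_CLASSES = 20   # illustrative

inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)          # shared backbone features

class_head = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax", name="label")(x)
box_head = tf.keras.layers.Dense(4, name="box")(x)        # (x, y, w, h) of the bounding box

model = tf.keras.Model(inputs, [class_head, box_head])
model.compile(
    optimizer="adam",
    loss={"label": "sparse_categorical_crossentropy",     # multi-class log loss
          "box": "mse"},                                   # squared / L2 loss
    loss_weights={"label": 1.0, "box": 1.0},               # how the two losses are combined
)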

Shortcomings:

1. This is a computationally challenging problem; internally, many techniques from parallel and distributed computing are used to make the computations faster.

2. The dataset must contain the location of the object in each image before training, which is really expensive. If all you have are raw images, it is your responsibility to annotate them before training to realize the full potential of the algorithm.

I mention some annotation tools below; if you are interested, you can definitely give them a shot.

Fast Annotation tool: https://github.com/christopher5106/FastAnnotationTool

LabelImg: https://github.com/tzutalin/labelImg

Both are really good tools; you can try one or both of them to annotate your images. It is quite a bit of heavy lifting, but we have to bear with it.

Problem 3:

Object Detection:

This is considered the core computer vision and image processing problem. To give an analogy: in computer science (algorithms and data structures) there is a problem called the Travelling Salesman Problem, which is considered NP-hard, and I believe object detection is that kind of problem. However, I will be rather succinct about object detection with respect to deep learning.

In localization + classification you have a fixed set of objects; I mean you have one object or a fixed number of objects, whereas in object detection you have multiple objects and have to draw a bounding box around every object of interest in the image (by object of interest I mean any of the classes in the dataset).

Now I can expect a question: why can't I solve this using a technique like semantic segmentation? After all, semantic segmentation finds a class label for every pixel, and therefore labels every object in the image. Why can't I slightly change the semantic segmentation architecture to make it work for object detection, at least to predict which objects are present in the image, even if it does not produce bounding boxes?

Perfect, let's see why this does not work.

Just observe the image above: look at the two cows in the original image and in the segmented one. After segmentation there appears to be only one "cow" region in the output. That is why segmentation cannot capture the correct number of objects in the image (this is the question I always had in mind). I hope the image makes it clear: the exact count of objects is not preserved by semantic segmentation. One potential answer to this shortcoming is instance segmentation.

The image below gives you a clear picture of what I am talking about.

So, there is a need to learn object detection. Let me explain the complexity of the problem to give a better sense of its level of difficulty.

1. Firstly, we have to find the names of the objects (and preserve their count as well).

2. Secondly, we have to draw a bounding box for each predicted object within the image.

The key challenge is that the number of objects in an image is variable. Some images may have five men and two women, while others have one person and a cat; essentially, the objects vary in number and in kind from image to image.

So how can we solve the problem? What about the localization technique? Well, let's discuss.

In localization we have a fixed number of objects in the image, and we know that fixed set in advance, so we can design the network to perform classification and regression in parallel. To give an analogy, object detection is like a text classification problem where we do not know the length of the text in advance; there, the hack is to fix a sensible threshold on the vocabulary size and on the number of words taken from each document. We follow a similar idea here, but in a slightly different way.

Finally, a paradigm shift is needed to tackle this problem: we have to modify the architecture so that it exactly fits the problem of object detection.

Solutions to this problem:

Solution 1: Apply the sliding window technique. We already know this is computational overkill, so leave it.

Solution 2: R-CNN (Region-based Convolutional Neural Network), not "regular CNN".

Let's discuss the architecture and the key ideas used here.

1. In the R-CNN architecture, we know that cropping every possible region of the image and running each crop through a CNN is overkill; besides, we don't even know how many possible crops there are, and the number could be really huge. So we use a technique that generates Region of Interest (RoI) proposals: a computer vision algorithm that proposes about 2,000 regions within the image, which are then passed to a CNN. This hack is similar to the text classification hack of fixing a threshold on the top words: from an enormous number of possible regions we come down to 2,000. This is better than the sliding window technique.

2. Internally, the region proposals are generated by a technique called Selective Search. To learn more about Selective Search, I recommend studying the link below.

Selective Search: https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
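For the curious, here is a short, hedged example of generating region proposals with Selective Search through OpenCV's contrib module (it needs opencv-contrib-python installed; the image path is a placeholder, and the exact number of proposals returned varies per image).

import cv2

image = cv2.imread("example.jpg")          # placeholder path

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()           # fast mode: fewer, coarser proposals

rects = ss.process()                       # array of (x, y, w, h) region proposals
print("total proposals:", len(rects))
proposals = rects[:2000]                   # keep roughly the top 2,000, as R-CNN does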

Disadvantages:

This approach is really slow because we take the full-resolution image, find the region proposals on top of it, and pass each one through a CNN for detection. It works, but training takes a very long time.

Solution 3:

Fast R-CNN:

I would call this an engineering hack once again. Basically, R-CNN used Selective Search to reduce the region proposals to about 2,000, but it applied those proposals on the full-resolution RGB image.

What does Fast R-CNN do?

Let's discuss. Fast R-CNN first passes the image through convolutional layers, producing a feature map that is smaller than the input (in height and width, if you use no padding rather than "same" padding, and in any case different in the number of channels), and only then applies the RoI layer on top of that feature map. Because the expensive convolutions run once per image instead of once per proposal, this is dramatically faster than R-CNN.
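Here is a hedged sketch of that core trick: run the convolutional backbone once over the whole image, then pool each region proposal out of the shared feature map. I use tf.image.crop_and_resize as a stand-in for the RoI pooling layer; the backbone and the box coordinates are illustrative assumptions.

import tensorflow as tf

# Shared backbone, run once per image (layer sizes are illustrative).
backbone = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(128, 3, strides=2, padding="same", activation="relu"),
])

image = tf.random.uniform((1, 224, 224, 3))     # placeholder input image
features = backbone(image)                      # shared feature map, computed once

# Two example proposals in normalised [y1, x1, y2, x2] coordinates.
boxes = tf.constant([[0.00, 0.00, 0.50, 0.50],
                     [0.25, 0.25, 0.90, 0.90]])
box_indices = tf.zeros(2, dtype=tf.int32)       # both boxes come from image 0

# Pool each proposal from the feature map into a fixed 7x7 grid (RoI-pooling stand-in).
roi_features = tf.image.crop_and_resize(features, boxes, box_indices, crop_size=(7, 7))
print(roi_features.shape)                       # (2, 7, 7, 128): one small map per RoI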

I would like to attach the Architecture of Fast R-CNN below.

 
 
 


