Sunday, 31 March 2013

Identifying Duplicate Images

Identifying similar images and determining which are identical. 

So in this post I want to talk about how you can write a program that will be able to compare images against each other and determine if they're identical or very similar. Basicly the same thing tineye and google's reverse image search does. So if you've ever wondered how they work then this post might clear things up a bit. In a follow up post I also want to describe how we can apply a similar technique to identify duplicate videos.

The Trick?
There is no, one solution to the problem and every solution has its advantages and disadvantages,  the approach im going to be using is based off of the histogram of the images. A very simple yet very effective method, the draw back of it being false positives and it can only match images that are visually very similar. But it can identify images with varing qualities of:
  • Resolution
  • Rotation
  • Filesize
  • format
  • compression
  • quality

What we are going to do is, extract the distribution of colours that make up the image and plot the frequecy of those colours. Also known as a histogram, in short a histogram will show us how much of each colour value(0-255) in each colour range(RGB) is used in the image.

The histogram represents the frequency of each colour value.

Two images that are visually identical will have a very similar histogram(98%) depending resolution and other factors. However images can have they same histogram yet not be visually similar, but at least we are certain to find duplicates, and these false positive detection is a problem we can reduce.

As the figures above show that two identical looking images with different properties in both size and resolution share a near identical histogram.

The Algorithm

The algorithm in broken up into 3 main sections. 
  1. Finding the images
  2. Analysis
    1. Load Images
    2. Extract Colour Map
    3. Generate Histogram
    4. Normalise Histogram
  3. Comparison

Step 1 - Finding the images

We can hardcode the locations of the images into the program, this is useful when just testing the analysis of the algorithm, but for practical purposes we want the program to be able find those images for us. To do this we will use another algorithm to recursively load each directory and subdirectory and find all the images, and add those to a list.

Select a Directory
public void LoadDirectory()
 JFileChooser fileChooser = new JFileChooser();
 int result = fileChooser.showOpenDialog(null);
  URL fileURL = null;
   fileURL = fileChooser.getSelectedFile().toURL();
   String _folderPath = fileURL.toString().substring(6);
   String[] folderPath = _folderPath.split("/");
   for(int i=0; i<folderPath.length-1; i++)
    directory = directory+folderPath[i]+"/";
  catch(Exception e)
 File folder = new File(directory);
So the above snippet will allow us to pick a folder to start our search, and the last method called LoadSubDirectories(folder) will then load all the subdirectories in that folder 
Select Directory view

Load Subdirectories
public static int spc_count=-1;
private void LoadSubDirectories(File folder)
    String spcs = "";
     for (int i = 0; i &lt spc_count; i++)
       spcs += " ";
   String[] temp3 = folder.getName().split("\\.");
//Adds file location to ImageObject(my own class) you can just add
//absoluteFile parameter to an ArrayList
            ImageObject iObject = new ImageObject(folder.getAbsoluteFile());
     else if (folder.isDirectory()) 
       File[] listOfFiles = folder.listFiles();
  for (int i = 0; i &lt listOfFiles.length; i++)

  //Add logic of what to do when done searching 
This code will then load all the subdirectories and find all images ending with a jpg, jpeg and add those to a list, or in my case an ImageObject(own class), which I also use to store their features(histograms).

Step 2 - Analysis

In this step we are going to use the absolute addresses we found in step one to load those images and perform analysis on them. To perform analysis on the images we are going to have to do a few things
1. Load the image into the program
2. Extract the colour map
3. Generate Histogram
4. Normalise Histogram

2.1 - Loading the image

for(ImageObject listIfImages : iObject)

BufferedImage image = null;
//now this does use my class(ImageObject) to store and 
//load the images But all you have to do is replace
//iObject.toString() with the absolute path.toString()
//of the image 

image = File(iObject.toString()));  

        Imager.process(image);//my class that processes the image

//Just storing the histograms into the image object, 
//so that I can save and load it to disk
//easier so that I don't have to process an 
//image more than once.
iObject.redHisto = Imager.getRedHisto();
iObject.greenHisto = Imager.getGreenHisto();
iObject.blueHisto = Imager.getBlueHisto();
catch(Exception e)
So in my ImageProcessor Class I will have an a method that takes a buffered image. 2 - Extract the Colour map

2.2 - Extract Colour Map

public final void process(BufferedImage _image)
 width = _image.getWidth(this);
 height = _image.getHeight(this);
 final int[] rawPixels = new int[width * height]; 
 PixelGrabber grabber = new PixelGrabber
 catch(Exception e)
 final int[][] pixels = new int[width][height];
 alpha = new int[width][height];  
 red = new int[width][height];
 green = new int[width][height];
 blue = new int[width][height];
int offset = 0;

//This double for loop is where the colour map gets made
//This colour map will be used to generate the histogram
//Note that is section can take the longest and might cause
//Java virtual machine to crash due too large images if this 
//happens do the following select the project:
//properties->Run/Debug Settings-> {class of main method}  
//(x)=Arguments->VM arguments->-Xmx1024m
for (int i = 0, n=width; i < n; i++) 
 for (int j = 0, m=height; j < m; j++) 
     pixels[i][j] = rawPixels[offset];
     // extract the 4 fields
     alpha[i][j] = (pixels[i][j] >> 24) & 0xff;
     red[i][j]   = (pixels[i][j] >> 16) & 0xff;
     green[i][j] = (pixels[i][j]  >>  8) & 0xff;
     blue[i][j]  = pixels[i][j]  & 0xff;

redHisto = this.createHistogram(red);
greenHisto = this.createHistogram(green);
blueHisto = this.createHistogram(blue);
alphaHisto = this.createHistogram(alpha);

return feature;
In case you missed the comment in the middle if your JVM bombs out due to a memory error do the following: Right click on the project and select properties. Click Run/Debug Settings and select the class where your main method is located and click edit, select the (x)=Arguments tab and enter in the VM arguments text area: -Xmx1024

Solving JVM memory problem

2.3 - Generate Histogram

To actually generate a histogram from the colourmap is ridiculously simple
/**@_colorMap created an array of type int that represent 
 *the histogram of that colour
 * @histogram the array returned
private final int[] createHistogram(int[][] _colorMap)
 int[] histogram = new int[256];
 for(int i=0, n=width; i<n; i++)
  for(int j=0, m=height; j<m; j++)
 return histogram;

So what this code snippet above does is create a histogram from the colormap that is parsed through. A new array of type int and size of 256(0-255) is created, each array index will then be the corresponding colour value. Such that all instances of red-145 will be stored in histogram[145]. So the trick is that we access the the Array with the value of the _colormap[i][j] and just increment that index.

We then simple return that array, which now represents that colour's histogram.

2.4 - Normalise Histogram

Since we want to compare images of all sizes you have to realise that larger versions of images will have larger value in the histogram array, but relatively speaking those histograms will be near identical. This we need normalise the histogram so that ALL histograms will have the same upper limit. Now you can scale the histogram right after you've generated it and just before you returned it, or you can do it during comparison.

I did mine during comparison so that I could save a raw copy of the histogram since during normalisation certain data is lost. Of course this is slightly slower since I would have to regenerate the histogram each time.

I'm no mathematician so there probably exists a more elegant solution than mine. But in short we going to find the largest value in the histogram and see how close it is to our limit, then divide or multiple all the values by the difference depending on if its larger or smaller;

private int[] normaliseHistogram(int[] _histo)
 double scaler=0.0;
 int largestValue=0;
 //Normalise all values relative to 255
 int limit = 255;
 //Finding Largest Value
 for(int i=0; i<_histo.length; i++)
  if(_histo[i] > largestValue)
   largestValue = _histo.length;
 //Calculate the scaler value
 scaler = largestValue/limit;
 for(int i=0; i< _histo.length; i++)
  //If scaler is larger than 1 we devide all value by it
  if(scaler > 1)
   _histo[i] =  (int) (_histo[i]/scaler);
  else //if histo smaller than 1 we multiple all the values
   _histo[i] = (int)(_histo[i]*scaler);
 return _histo;

Step 3 - Comparison

And only one thing left to do and that compare the histograms with each other. So to compare to histograms with each other.
public double compareHistogram(ImageObject _image1, ImageObject _image2, int _color)
 int tempHistImage1 = _image1.getHisto(_color);
 int tempHistImage2 = _image2.getHisto(_color);
 for(int i=0; i<256; i++)
  double diff = 0;

   diff += temp2/temp1*100;    
   diff += temp1/temp2*100; 
 return diff;

Note that this method takes in two ImageObjects again this is my own class I use to represent the data of each image such as there histograms, all I do with these objects is extract there histograms so I can use them. In this code snippet I compare the red histograms with each other, this code will have to be done for both green and blue.
Now all that’s left is to establish the threshold and see if it passes it, if it does then the images are a match.
histogramThreshold = 80.0;

if(compareHistogram(image1, image2, 1)>histogramThreshold)
 return true;

My final program

The image above is of my program that uses the same algorithm as above with some modification, including faster matching and reduced false positives. As well has the ability to find duplicate videos.

No comments:

Post a Comment