Identifying similar images and determining which are identical.
So in this post I want to talk about how you can write a program that will be able to compare images against each other and determine if they're identical or very similar. Basicly the same thing
tineye and google's reverse image search does. So if you've ever wondered how they work then this post might clear things up a bit. In a follow up post I also want to describe how we can apply a similar technique to identify duplicate videos.
The Trick?
There is no, one solution to the problem and every solution has its advantages and disadvantages, the approach im going to be using is based off of the histogram of the images. A very simple yet very effective method, the draw back of it being false positives and it can only match images that are visually very similar. But it can identify images with varing qualities of:
- Resolution
- Rotation
- Filesize
- format
- compression
- quality
What we are going to do is, extract the distribution of colours that make up the image and plot the frequecy of those colours. Also known as a histogram, in short a histogram will show us how much of each colour value(0-255) in each colour range(RGB) is used in the image.
The histogram represents the frequency of each colour value.
Two images that are visually identical will have a very similar histogram(98%) depending resolution and other factors. However images can have they same histogram yet not be visually similar, but at least we are certain to find duplicates, and these false positive detection is a problem we can reduce.
As the figures above show that two identical looking images with different properties in both size and resolution share a near identical histogram.
The Algorithm
The algorithm in broken up into 3 main sections.
- Finding the images
- Analysis
- Load Images
- Extract Colour Map
- Generate Histogram
- Normalise Histogram
- Comparison
Step 1 - Finding the images
We can hardcode the locations of the images into the program, this is useful when just testing the analysis of the algorithm, but for practical purposes we want the program to be able find those images for us. To do this we will use another algorithm to recursively load each directory and subdirectory and find all the images, and add those to a list.
Select a Directory
public void LoadDirectory()
{
JFileChooser fileChooser = new JFileChooser();
int result = fileChooser.showOpenDialog(null);
if(result==JFileChooser.APPROVE_OPTION)
{
URL fileURL = null;
try
{
fileURL = fileChooser.getSelectedFile().toURL();
String _folderPath = fileURL.toString().substring(6);
String[] folderPath = _folderPath.split("/");
for(int i=0; i<folderPath.length-1; i++)
{
directory = directory+folderPath[i]+"/";
}
}
catch(Exception e)
{}
}
File folder = new File(directory);
LoadSubDirectories(folder);
}
So the above snippet will allow us to pick a folder to start our search, and the last method called LoadSubDirectories(folder) will then load all the subdirectories in that folder
|
Select Directory view |
Load Subdirectories
public static int spc_count=-1;
private void LoadSubDirectories(File folder)
{
spc_count++;
String spcs = "";
for (int i = 0; i < spc_count; i++)
spcs += " ";
if(folder.isFile())
{
String[] temp3 = folder.getName().split("\\.");
if(
temp3[1].equalsIgnoreCase("jpg")||
temp3[1].equalsIgnoreCase("jpeg")
)
{
//Adds file location to ImageObject(my own class) you can just add
//absoluteFile parameter to an ArrayList
ImageObject iObject = new ImageObject(folder.getAbsoluteFile());
}
}
else if (folder.isDirectory())
{
File[] listOfFiles = folder.listFiles();
if(listOfFiles!=null)
{
for (int i = 0; i < listOfFiles.length; i++)
LoadSubDirectories(listOfFiles[i]);
}
}
spc_count--;
if(spc_count==-1)
{
//Add logic of what to do when done searching
}
}
This code will then load all the subdirectories and find all images ending with a jpg, jpeg and add those to a list, or in my case an ImageObject(own class), which I also use to store their features(histograms).
Step 2 - Analysis
In this step we are going to use the absolute addresses we found in step one to load those images and perform analysis on them.
To perform analysis on the images we are going to have to do a few things
1. Load the image into the program
2. Extract the colour map
3. Generate Histogram
4. Normalise Histogram
2.1 - Loading the image
for(ImageObject listIfImages : iObject)
BufferedImage image = null;
try
{
//now this does use my class(ImageObject) to store and
//load the images But all you have to do is replace
//iObject.toString() with the absolute path.toString()
//of the image
image = ImageIO.read(new File(iObject.toString()));
Imager.process(image);//my class that processes the image
//Just storing the histograms into the image object,
//so that I can save and load it to disk
//easier so that I don't have to process an
//image more than once.
iObject.redHisto = Imager.getRedHisto();
iObject.greenHisto = Imager.getGreenHisto();
iObject.blueHisto = Imager.getBlueHisto();
}
catch(Exception e)
{}
So in my ImageProcessor Class I will have an a method that takes a buffered image.
2 - Extract the Colour map
2.2 - Extract Colour Map
public final void process(BufferedImage _image)
{
width = _image.getWidth(this);
height = _image.getHeight(this);
final int[] rawPixels = new int[width * height];
PixelGrabber grabber = new PixelGrabber
(
_image,
0,
0,
width,
height,
rawPixels,
0,
width
);
try
{
grabber.grabPixels();
}
catch(Exception e)
{}
final int[][] pixels = new int[width][height];
alpha = new int[width][height];
red = new int[width][height];
green = new int[width][height];
blue = new int[width][height];
int offset = 0;
//This double for loop is where the colour map gets made
//This colour map will be used to generate the histogram
//Note that is section can take the longest and might cause
//Java virtual machine to crash due too large images if this
//happens do the following select the project:
//properties->Run/Debug Settings-> {class of main method}
//(x)=Arguments->VM arguments->-Xmx1024m
for (int i = 0, n=width; i < n; i++)
{
for (int j = 0, m=height; j < m; j++)
{
pixels[i][j] = rawPixels[offset];
// extract the 4 fields
alpha[i][j] = (pixels[i][j] >> 24) & 0xff;
red[i][j] = (pixels[i][j] >> 16) & 0xff;
green[i][j] = (pixels[i][j] >> 8) & 0xff;
blue[i][j] = pixels[i][j] & 0xff;
offset++;
}
}
redHisto = this.createHistogram(red);
greenHisto = this.createHistogram(green);
blueHisto = this.createHistogram(blue);
alphaHisto = this.createHistogram(alpha);
return feature;
}
In case you missed the comment in the middle if your JVM bombs out due to a memory error
do the following: Right click on the project and select
properties. Click
Run/Debug Settings and select the class where your main method is located and click
edit, select the
(x)=Arguments tab and enter in the
VM arguments text area:
-Xmx1024
|
Solving JVM memory problem |
2.3 - Generate Histogram
To actually generate a histogram from the colourmap is ridiculously simple
/**@_colorMap created an array of type int that represent
*the histogram of that colour
* @histogram the array returned
**/
private final int[] createHistogram(int[][] _colorMap)
{
int[] histogram = new int[256];
for(int i=0, n=width; i<n; i++)
{
for(int j=0, m=height; j<m; j++)
{
histogram[_colorMap[i][j]]++;
}
}
return histogram;
}
So what this code snippet above does is create a histogram from the colormap that is parsed through. A new array of type
int and size of 256(0-255) is created, each array index will then be the corresponding colour value. Such that all instances of red-145 will be stored in histogram[145].
So the trick is that we access the the Array with the value of the _colormap[i][j] and just increment that index.
We then simple return that array, which now represents that colour's histogram.
2.4 - Normalise Histogram
Since we want to compare images of all sizes you have to realise that larger versions of images will have larger value in the histogram array, but relatively speaking those histograms will be near identical. This we need normalise the histogram so that ALL histograms will have the same upper limit. Now you can scale the histogram right after you've generated it and just before you returned it, or you can do it during comparison.
I did mine during comparison so that I could save a raw copy of the histogram since during normalisation certain data is lost. Of course this is slightly slower since I would have to regenerate the histogram each time.
I'm no mathematician so there probably exists a more elegant solution than mine. But in short we going to find the largest value in the histogram and see how close it is to our limit, then divide or multiple all the values by the difference depending on if its larger or smaller;
private int[] normaliseHistogram(int[] _histo)
{
double scaler=0.0;
int largestValue=0;
//Normalise all values relative to 255
int limit = 255;
//Finding Largest Value
for(int i=0; i<_histo.length; i++)
{
if(_histo[i] > largestValue)
largestValue = _histo.length;
}
//Calculate the scaler value
scaler = largestValue/limit;
for(int i=0; i< _histo.length; i++)
{
//If scaler is larger than 1 we devide all value by it
if(scaler > 1)
{
_histo[i] = (int) (_histo[i]/scaler);
}
else //if histo smaller than 1 we multiple all the values
{
_histo[i] = (int)(_histo[i]*scaler);
}
}
return _histo;
}
Step 3 - Comparison
And only one thing left to do and that compare the histograms with each other. So to compare to histograms with each other.
public double compareHistogram(ImageObject _image1, ImageObject _image2, int _color)
{
int tempHistImage1 = _image1.getHisto(_color);
int tempHistImage2 = _image2.getHisto(_color);
for(int i=0; i<256; i++)
{
double diff = 0;
TotalIntervals++;
if(temp1>=temp2)
{
diff += temp2/temp1*100;
}
else
{
diff += temp1/temp2*100;
}
}
return diff;
}
Note that this method takes in two ImageObjects again this is my own class I use to represent the data of each image such as there histograms, all I do with these objects is extract there histograms so I can use them. In this code snippet I compare the red histograms with each other, this code will have to be done for both green and blue.
Now all that’s left is to establish the threshold and see if it passes it, if it does then the images are a match.
histogramThreshold = 80.0;
if(compareHistogram(image1, image2, 1)>histogramThreshold)
return true;
|
My final program |
The image above is of my program that uses the same algorithm as above with some modification, including faster matching and reduced false positives. As well has the ability to find duplicate videos.