SnapMatcher - Monstrous Software

SnapMatcher is an application intended for photographers, artists, or image packrats who have very large collections of digital images, some of which may be duplicates or near duplicates. By identifying images across multiple formats, with the ability to filter out minor edits such as changes to contrast, brightness, color balance, resizing, or even the addition of text or borders, SnapMatcher can be a valuable tool in organizing and culling unwieldy collections.

The current version is recent work on an early prototype that had been sitting idle for a while. Features are now being added regularly, and a 1.0 release with optional GUI should happen within a few months. In the meantime the command line version, while a bit raw, is still effective and useful.

News

2007.03.18

SnapMatcher 0.4 released. Changes include:

Added the --match command to quickly match a small set of images (which may or may not be referenced in any image database) against an image database.
Re-fixed the bug preventing the directory scan process in database creation from working on Windows, after discovering that the fix provided in version 0.3 was flawed.
Added PCX, PPM, XPM, and XBM to the previous default image extensions (JPG, JPEG, JPE, PNG, GIF, TIF, TIFF, BMP) used when creating an image database, so now all standard image formats supported by the Python Imaging Library are included by default.
Improved reliability for database and matches file updates by writing updates to a temp file until the operation is complete, so the original files are unmodified in the case of program interruption.
Improved efficiency of database updates.

2007.03.04

SnapMatcher 0.3 released. Changes include:

Added the --updatedb command to update an image database and existing matches file by correcting paths for moved files, (re)generating signatures for new or modified files, and removing matches for modified or deleted files.
Improved match finding process for databases updated since an original set of matches was identified to only check for new matches between pairs of new images or pairs of new and old images (ignoring pairs where both images are old and have already been tested), allowing efficient match identification for large, regularly updated image libraries.
Fixed a bug preventing the directory scan process in database creation from working on Windows (thanks Steven).

2007.02.19

SnapMatcher 0.2 released. Changes include:

Switched ordering of file group names and extensions, and changed the "results" extension to "matches".
Reduced DB creation time by 30%-40%.
Reduced image DB and match results file sizes using relative directory paths and a terser image signature representation.
Changed ordering of --builddb arguments.

2007.01.27

SnapMatcher 0.1 released.

Features

SnapMatcher is a ways from being feature-complete right now, but the core capabilities are there:

Ability to create multiple image databases for later queries
Customizable matching threshold allowing results to be generated anywhere from near exact matches only to identification of distinct images with very similar appearances
Support of most standard image formats (JPEG, GIF, TIFF, PNG, BMP, PCX, PPM, XPM and XBM by default, with a few other formats readable by PIL possible using command line arguments)
Ability to specify image types by extension in image DB creation process
Output of match results into a simple text file
Update existing image databases while processing only images which have been added or modified since the last update
Search for matches only against new or modified images
Quick search for matches against a single image or small group of images

Features which might be added in later releases:

Search for matches only against images added/modified after a certain date
Multiple directory and excluded subdirectory support for image DBs (currently one DB equals one directory plus all subdirectories)
Error reporting for images which cannot be processed
Performance enhancements for very large image collections
A user friendly GUI, enabling easy management of image DBs, easy queries, and side-by-side comparison of matched images with options for renaming or deletion

There are also likely to be some bugs due to the early nature of the code, but because the current version only scans image collections and does not modify existing files in any way it is quite safe to use.

Requirements

SnapMatcher is built using these primary tools: the Python language, SciPy (scientific tools for Python), and the Python Imaging Library. A Python interpreter and the necessary modules are available for most desktop Linux, BSD, Unix, Mac OS X and Windows systems.

Usage

To create a database run the command

python SnapMatcher.py --builddb GROUPNAME ROOTDIR [EXT 1] [EXT 2] ...

GROUPNAME is the name you choose for the image database, while ROOTDIR is the relative path to the directory containing the image collection (all subdirectories will be included). If the list of file extensions is omitted, the default list of "jpg jpeg jpe gif tif tiff png bmp" will be used. The database will be stored in a file named GROUPNAME.imagedb. If any images in this set are modified, or new images are added, the DB will need to be regenerated by running this command again (existing DBs will be overwritten if the same GROUPNAME is reused).
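As a rough illustration of the directory scan described above, the following sketch walks ROOTDIR and all of its subdirectories and keeps files whose extension is in the list. It is not SnapMatcher's actual code: the function name find_images and the constant DEFAULT_EXTENSIONS are hypothetical, and case-insensitive extension matching is an assumption.

```python
import os

# Default extensions quoted in the usage text above (lowercase, no dot).
DEFAULT_EXTENSIONS = ("jpg", "jpeg", "jpe", "gif", "tif", "tiff", "png", "bmp")


def find_images(root_dir, extensions=DEFAULT_EXTENSIONS):
    """Return paths, relative to root_dir, of files with a wanted extension.

    Hypothetical helper: matching is case-insensitive (an assumption).
    """
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    found = []
    # os.walk visits root_dir and every subdirectory beneath it.
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in wanted:
                full_path = os.path.join(dirpath, name)
                found.append(os.path.relpath(full_path, root_dir))
    return sorted(found)
```

Returning paths relative to the root directory mirrors the relative directory paths mentioned in the 0.2 release notes, though the real database layout may differ.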

To scan for matching images run the command

python SnapMatcher.py --finddups GROUPNAME [TOLERANCE]

The results will be written to GROUPNAME.matches. TOLERANCE is the minimum correlation coefficient required for a successful image match, and must be a number between 0.0 and 1.0. If omitted a threshold of 0.95 will be used. Choice of tolerance has no effect on DB processing time, so higher tolerances are not recommended unless you are experiencing a lot of false positives (i.e., images which are similar but not exact matches). Of course what you decide is a false positive is up to you, since SnapMatcher is useful in identifying similar images which vary in exact pose or viewing angle of the subject.

In cases where matches have previously been identified, SnapMatcher will update the existing GROUPNAME.matches file by only looking for matches between new images (those added to the database using the --updatedb command since the matches were found) and each other as well as new images and existing images, ignoring all old-old image pairings. If the database has not been updated since matches were last identified SnapMatcher will simply report that GROUPNAME.matches is up to date.
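The pairing rule described above can be sketched as follows. This is only an illustration of which pairs need testing after an update, not SnapMatcher's implementation; the function name pairs_to_check and the file names are hypothetical.

```python
from itertools import combinations, product


def pairs_to_check(old_images, new_images):
    """Yield each pair that involves at least one new image, exactly once."""
    # New images compared with each other.
    yield from combinations(new_images, 2)
    # Each new image compared with each previously tested image.
    yield from product(new_images, old_images)


old = ["a.jpg", "b.jpg", "c.jpg"]   # already tested against each other
new = ["d.jpg", "e.jpg"]            # added since the last match run
pairs = list(pairs_to_check(old, new))
```

With 3 old and 2 new images this yields 7 pairs instead of the 10 a full comparison would need, because the 3 old-old pairs are skipped. The savings grow quickly when a large library receives small updates.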

To update an existing image database and identified matches run the command

python SnapMatcher.py --updatedb GROUPNAME

GROUPNAME.imagedb will be updated with corrected paths for images which have been moved within the database directory structure, while new signatures will be generated for new and modified images (images which are renamed are considered modified). If a GROUPNAME.matches file exists for this image set it will likewise be updated with corrected file paths, and matches containing modified or deleted images will be removed.

To check for matches between an image database and a small set of images (which need not be referenced in this or other image databases) run the command

python SnapMatcher.py --match GROUPNAME TOLERANCE IMAGE1 [IMAGE2] ...

The default tolerance of 0.95 is recommended for most situations, but values between 0.9 and 0.99 will be useful in various situations. Results are printed to the console.

Method

The method used for image matching is straightforward but effective. An image collection is processed into a database (currently stored in just a simple text file). Each image is converted to grayscale and scaled down to a very low resolution (4x4 pixels), and the 2D image is converted into a one-dimensional signature vector. During the analysis process a correlation test is performed between image signatures, and if the correlation coefficient exceeds a threshold the images are determined to match and are added to a results list.
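The steps above can be sketched in plain Python. This is a dependency-free illustration, not SnapMatcher's code, which relies on the Python Imaging Library and SciPy. It assumes the image is already grayscale and given as a list of pixel rows at least 4 pixels wide and tall, uses simple block averaging for the downscale (the real resize filter is not stated), and uses the Pearson correlation coefficient. The names signature, correlation, and is_match are hypothetical.

```python
import math


def signature(gray, size=4):
    """Shrink a 2D grayscale image (list of rows) to size x size by block
    averaging, then flatten it row by row into a 1D signature vector."""
    height, width = len(gray), len(gray[0])
    sig = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [gray[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            sig.append(sum(block) / len(block))
    return sig


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0:
        # A flat, single-color signature has no defined correlation;
        # returning 0.0 here is a choice made for this sketch.
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / denom


def is_match(sig_a, sig_b, tolerance=0.95):
    """True when the correlation reaches the tolerance (default 0.95)."""
    return correlation(sig_a, sig_b) >= tolerance


# A small diagonal gradient, a brightened low-contrast copy, and a mirror.
image = [[(x + y) * 16 for x in range(8)] for y in range(8)]
edited = [[0.5 * v + 20 for v in row] for row in image]
mirrored = [list(reversed(row))[::1] for row in reversed(image)]
```

Because the correlation coefficient is unchanged by uniform brightness shifts and positive contrast scaling, image and edited match, while mirrored (the gradient reversed in both directions) does not. This is consistent with the claim that minor edits such as brightness and contrast changes are filtered out.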

The main drawback of this method is that the analysis phase runs in O(n²) time, meaning that the time required for the analysis phase increases rapidly as an image collection grows large. On my 3-year-old machine the software can analyze about 500,000 image pairs per second. Using this as a guideline, here are estimated times for the analysis phase for various sizes of image collection:

Number of Images	Approximate Number of Pairings	Estimated Processing Time
500 or less	less than 250,000	less than 1 second
1,000	1 Million	2 seconds
3,000	10 Million	20 seconds
10,000	100 Million	3-4 minutes
30,000	1 Billion	20 minutes
100,000	10 Billion	5-6 hours
300,000	100 Billion	2-3 days
1 Million	1 Trillion	20-30 days
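The figures in the table can be reproduced with a short calculation. The rows are consistent with counting roughly n * n pairings and a rate of about 500,000 pairings per second. Both values are read off the table rather than taken from SnapMatcher's code, and the function name estimate_seconds is hypothetical.

```python
def estimate_seconds(num_images, pairs_per_second=500_000):
    """Rough analysis time, counting n * n pairings as the table does."""
    return num_images ** 2 / pairs_per_second


for n in (1_000, 10_000, 100_000):
    seconds = estimate_seconds(n)
    print(f"{n:>7} images: {seconds:,.0f} s ({seconds / 3600:.2f} h)")
```

The number of distinct unordered pairs is n(n-1)/2, about half of n * n, so these estimates are on the generous side. Either way, growing the collection tenfold multiplies the analysis time by about one hundred.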

A few methods have been tested or merely looked at for speeding this process up, but in the context of a single user desktop application these methods are not very promising. Those interested can read a more detailed discussion of image search methods and performance.

License

SnapMatcher is Free Software, and is licensed under the GPL (GNU General Public License) version 2.0.

Downloads

SnapMatcher is currently available only as Python source. Standalone binaries may be made available in the future.

SnapMatcher 0.4 Source

This package contains four source files in a self-contained directory. Simply extract the data from the archive to any desired location, then run the application by executing

python SnapMatcher.py

inside the directory from your system's command line.

Older versions available here.

Contact me at arkaein@monsterden.net with any questions, suggestions, bug reports or patches for SnapMatcher.
