# Binary File Similarity Checking

How can one determine whether two binary files (for example, executable images) are somehow similar? I started writing a program to perform this task. Such a program could be useful for determing whether a vendor had included GNU Public License (GPL) code in a propriatary product, violating the GPL license. After writing about 20 lines, I realized that I needed an accurate definition of similarity than the vague "the two files contain a number of identical subsequences" I had in mind.

The accurate definition turns out to be based on information theory. Consider two files containing data with entropy Ha and Hb respectively. If the two files are completely different, then the entropy of their concatenation Hab would be (Ha + Hb) / 2. We can derive an approximation of a file's entropy value by compressing the file. If the file were perfectly compressed, the information in the file Itotal would be equal to the compressed file's length. Then the file's entropy could be calculated as the ratio of the file's information Itotal over the original file's length L:
H = Itotal/L

Comparing the expected entropy of the file concatenation with the actual entropy, should provide us with an indication of whether the two files are similar: an entropy value significantly lower than the expected, would indicate a high degree of similariy.

For the scheme I described to work, the compression algorithm should be able to compress the combined file as a single block. It turns out that the Unix compress and gzip programs are not suitable, but that bzip2 will indeed compress blocks of up to 900K, and can therefore be used for combined files up to that size.

The following Perl program implements the algorithm I described.

```#!/usr/bin/perl
#
# (C) Copyright 2004 Diomidis Spinellis
#
# Permission to use, copy, and distribute this software and its
# documentation for any purpose and without fee is hereby granted,
# provided that the above copyright notice appear in all copies and that
# both that copyright notice and this permission notice appear in
# supporting documentation.
#
# THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
# MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.

# Return the entropy of the file passed as the argument
sub
entropy
{
my(\$file) = @_;
my(\$i, \$l);

# File length
\$l = `wc -c \$file`;
# File information (approximation)
\$i = `bzip2 -c \$file | wc -c`;
print STDERR "\$0: warning file size exceeds bzip2 block size\n" if (\$l > 900 * 1024);
return (\$i / \$l);
}

# Return the entropy of the two files passed as arguments
sub
entropy2
{
my(\$file1, \$file2) = @_;
my(\$oldrs) = (\$/);
my(\$tmp) = ("/tmp/entropy.\$\$");

undef(\$/);
open(IN, \$file1) || die "read from \$file1: \$!\n";
binmode(IN);
open(OUT, ">\$tmp") || die "write to \$tmp: \$!\n";
print OUT <IN>;
open(IN, \$file2) || die "read from \$file2: \$!\n";
binmode(IN);
print OUT <IN>;
close(IN);
close(OUT);
my(\$e) = (entropy(\$tmp));
return (\$e);
}

\$#ARGV == 1 || die "Usage \$0: file1 file2\n";

printf("%.3g - Entropy of \$ARGV[0]\n", \$e0 = entropy(\$ARGV[0]));
printf("%.3g - Entropy of \$ARGV[1]\n", \$e1 = entropy(\$ARGV[1]));
printf("%.3g - Combined predicted entropy\n", (\$e0 + \$e1) / 2);
printf("%.3g - Combined actual entropy\n", entropy2(\$ARGV[0], \$ARGV[1]));
```

My original intent was to compare the Windows-distributed network programs that were derived from BSD code (finger.exe, ftp.exe, nslookup.exe, rcp.exe, and rsh.exe) with the corresponding actual BSD implementations. As a swift way to compile the BSD versions under Windows I simply copied across the FreeBSD /usr/include directory, and changed Microsoft's C compiler INCLUDE path to use that directory instead. The process worked like a charm. To link the files, I wrote a tiny shell script to create stubs for all the missing libraries reported by the linker:

```cl  *.c |
awk '/: unresolved/{print \$8}' |
/usr/bin/sort -u |
sed '
s/^_/void /
s/\$/(){}/
' &gt;libstub.c
```
Unfortunatelly, the results were not what I expected:
```0.428 - Entropy of \windows\system32\ftp.exe
0.48 - Entropy of \dds\src\port\ftp\ftp.exe
0.454 - Combined predicted entropy
0.466 - Combined actual entropy

0.391 - Entropy of \windows\system32\rcp.exe
0.523 - Entropy of \dds\src\port\rcp\rcp.exe
0.457 - Combined predicted entropy
0.48 - Combined actual entropy

0.421 - Entropy of \windows\system32\finger.exe
0.483 - Entropy of \dds\src\port\finger\finger.exe
0.452 - Combined predicted entropy
0.471 - Combined actual entropy
```
I would expect the combined actual entropy to be a lot lower than the predicted one. I can think of two reasons for the failure:
• The Microsoft and FreeBSD implementations have diverged significantly over the years.
• I was using different compilers or compiler options, than the ones Microsoft uses to build the tools for distribution.

In other cases, the approach appears to work quite reliably. It finds the BSD-derived ftp, rcp, and rsh programs significantly similar, and fails to find a similarity between the same-sized ftp and osuninst programs.

```0.428 - Entropy of \windows\system32\ftp.exe
0.391 - Entropy of \WINDOWS\system32\rcp.exe
0.409 - Combined predicted entropy
0.377 - Combined actual entropy

0.415 - Entropy of \windows\system32\rsh.exe
0.391 - Entropy of \WINDOWS\system32\rcp.exe
0.403 - Combined predicted entropy
0.328 - Combined actual entropy

0.428 - Entropy of \windows\system32\ftp.exe
0.479 - Entropy of \WINDOWS\system32\osuninst.exe
0.453 - Combined predicted entropy
0.446 - Combined actual entropy
```

The program also detects a large similarity between the executables of version 1.1 and version 1.19 of my outwit winclip program.

```0.428 - Entropy of \dds\src\sysutil\outwit\winclip\winclip-1.1.exe
0.452 - Entropy of \dds\src\sysutil\outwit\winclip\winclip-1.19.exe
0.44 - Combined predicted entropy
0.33 - Combined actual entropy
```
This is impressive, if one considers that the source code differences between the two files are longer that the current version of the source code.