Unix systems provide numerous ways to compare files. The most common way to verify that you have received or downloaded the proper file is to compute a checksum and compare it
against one computed by a reliable source. MD5 is frequently used to compute checksums
because it is extremely unlikely that two different files will happen to have the same
checksum. Similar commands, such as sum and cksum, also compute checksums, but less
reliably. Let's look at several checksums and see why.
One of the first things you'll notice if you compare the output of the sum, cksum and md5
commands is the length of each calculated value. The sum command prints two numbers.
The first (31339 in our example) is a 16-bit checksum. This means that the checksum for
any file is one of only 65,536 possible values (0 through 65,535). The chance that two
particular files which differ will have the same checksum is small, roughly 1 in 65,536
if the values are evenly spread. If you have 65,000 files to compare, however, the chance
that some two of them have the same checksum, though different, is very high. In fact,
you'll probably have a number of false matches.
# sum /export/home/jdoe/bigfile.gz
31339 165523 /export/home/jdoe/bigfile.gz
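To put rough numbers on the risk of false matches, here is a minimal sketch. It assumes checksum values are spread evenly over all 65,536 possibilities, which is an idealization, and the function name is made up for illustration.

```python
def collision_probability(num_files: int, bits: int = 16) -> float:
    """Chance that at least two of num_files share a checksum,
    assuming every checksum value is equally likely (an idealization)."""
    space = 2 ** bits
    if num_files > space:
        return 1.0  # more files than values: a repeat is guaranteed
    p_all_distinct = 1.0
    for i in range(num_files):
        # The i-th file must avoid the i values already taken.
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct


for n in (2, 300, 1000, 65000):
    print(n, round(collision_probability(n), 6))
```

Under that assumption, the odds of at least one false match are roughly even at around 300 files, long before you reach 65,000.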
One characteristic of the sum command is that the value of the checksum bears a simple
relationship to the contents of the file. If one file contains "abc" and another contains
"abd", the checksums differ by only 1. This command is clearly using a very
simple calculation, better suited to catching accidental damage to a file than to heavy
duty or high security file checking.
# sum /tmp/ab*
304 1 /tmp/abc
305 1 /tmp/abd
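The values 304 and 305 are what you get by simply adding up the byte values of "abc" and "abd" plus a trailing newline (97 + 98 + 99 + 10 = 304). That is consistent with the System V style of sum. The sketch below is an illustration of that style of calculation, not the source of any particular sum command; the function name is hypothetical and the trailing newline is an assumption about the test files.

```python
def sysv_sum(data: bytes) -> tuple:
    """Return (checksum, blocks) in the style of a System V sum."""
    total = sum(data)  # add up every byte value
    # Fold the running total down to 16 bits.
    r = (total & 0xFFFF) + ((total & 0xFFFFFFFF) >> 16)
    checksum = (r & 0xFFFF) + (r >> 16)
    blocks = (len(data) + 511) // 512  # 512-byte blocks, rounded up
    return checksum, blocks


print(sysv_sum(b"abc\n"))
print(sysv_sum(b"abd\n"))
print(sysv_sum(b"cba\n"))  # same bytes in a different order
```

Because the bytes are only added together, rearranging them leaves the checksum unchanged, which is one reason so simple a calculation misses many kinds of differences.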
The second number that sum prints is the number of 512-byte blocks in the file.
This helps considerably to ensure that dissimilar files are recognized as dissimilar. If
two files have the same checksum but different block counts, the match can be
discounted.
The cksum command works similarly. The first number that it prints is a cyclic
redundancy check (CRC) for the file. As you can see from the sample output below, the CRC
is a fairly large number. This decreases the chance that two files will be taken as
being identical when they are not. Notice the difference in the checksums of our two
small files (four bytes each, counting the trailing newline).
# cksum /tmp/ab*
1112837078 4 /tmp/abc
1197460547 4 /tmp/abd
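For the curious, the CRC that cksum prints is defined by POSIX: a 32-bit CRC over the file's bytes followed by the file's length, with the result inverted at the end. The sketch below follows that description one bit at a time. It is written for clarity rather than speed, the function names are hypothetical, and it is not the code of any particular cksum implementation.

```python
POLY = 0x04C11DB7  # generator polynomial named in the POSIX cksum description


def _feed(crc: int, byte: int) -> int:
    """Push one byte through the CRC register, most significant bit first."""
    crc ^= byte << 24
    for _ in range(8):
        if crc & 0x80000000:
            crc = ((crc << 1) ^ POLY) & 0xFFFFFFFF
        else:
            crc = (crc << 1) & 0xFFFFFFFF
    return crc


def posix_cksum(data: bytes) -> tuple:
    """Return (crc, octets) in the style of the cksum command."""
    crc = 0
    for byte in data:
        crc = _feed(crc, byte)
    # The length is appended, least significant byte first.
    length = len(data)
    while length:
        crc = _feed(crc, length & 0xFF)
        length >>= 8
    return crc ^ 0xFFFFFFFF, len(data)


print(posix_cksum(b"abc\n"))
print(posix_cksum(b"abd\n"))
```

Unlike a simple sum, every input bit is stirred through the whole 32-bit register, so a one-character change alters the result throughout.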
Using cksum against the large file we saw earlier, we see a checksum of similar length
even though the file is dramatically larger.
# cksum /export/home/jdoe/bigfile.gz
3574185895 84747520 /export/home/jdoe/bigfile.gz
The second number in the cksum output is the number of octets (bytes) in the file. This
is a similar concept to the number of blocks, but is considerably finer grained. Two
files occupying the same number of blocks are still likely to include a different number
of octets.
The md5 command is the most reliable of the three commands and the only one recommended
for serious file checking. If you are sending a gzipped file to a customer and want the
customer to be confident that the file you have sent is both intact and the file you
intended to send, providing him with an md5 checksum is a very good idea. Notice the
length of the checksum below.
# md5 /export/home/jdoe/bigfile.gz
MD5 (/export/home/jdoe/bigfile.gz) = e1e0aec5c73eeb3bcf4cff4d5a44b067
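On the receiving end, the customer only has to compute the same checksum and compare it with the one you supplied. A minimal sketch of that check in Python follows, using the standard hashlib module; the function names and the chunk size are illustrative choices, not part of the md5 command.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 checksum of a file as 32 hexadecimal digits."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published: str) -> bool:
    """Compare a file's checksum with one supplied by the sender."""
    return file_md5(path) == published.strip().lower()
```

A call such as matches_published("/export/home/jdoe/bigfile.gz", "e1e0aec5c73eeb3bcf4cff4d5a44b067") would return True only if the file on disk produces the checksum shown above.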
This 32-digit hexadecimal number can take on any of 2 ** 128 possible values. This is a
bigger number than most of us can think about, far more than billions times billions. I
am told that it is exactly:
340,282,366,920,938,463,463,374,607,431,768,211,456
Probably so. I don't even want to think about calculating so large a number.
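For anyone who does want to check, Python's integers have no fixed size, so the figure can be confirmed in a couple of lines. This is only a quick verification; the variable name is arbitrary.

```python
values = 2 ** 128           # number of distinct 128-bit checksums
print(f"{values:,}")        # printed with thousands separators
print(values == 16 ** 32)   # 32 hex digits, 16 choices for each
```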
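The chance of two different files having the same md5 checksum by accident is vanishingly
small. (Collisions can be constructed deliberately, so md5 guards against corruption and
mix-ups rather than against a determined forger.) Looking at the two small files, we see
that the md5 checksums seem to have no similarity whatsoever.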
# md5 /tmp/ab*
MD5 (/tmp/abc) = 0bee89b07a248e27c83fc3d5951213c1
MD5 (/tmp/abd) = 8f0abafc5f8e6686a882c78cac4bcb9f
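One way to measure that lack of similarity is to count how many of the 128 bits differ between the two checksums. The sketch below does so with hashlib; the function name is hypothetical, and the inputs assume each file holds three letters and a trailing newline.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bits that differ between the MD5 checksums of a and b."""
    da = int.from_bytes(hashlib.md5(a).digest(), "big")
    db = int.from_bytes(hashlib.md5(b).digest(), "big")
    return bin(da ^ db).count("1")


print(differing_bits(b"abc\n", b"abd\n"))
```

The two inputs differ in only a few bits, yet for a well-mixed checksum one would expect something like half of the 128 output bits to change.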
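Of course, to be valuable, checksums have to compute identically on different systems.
For cksum and md5 this should always be the case, since both algorithms are standardized.
The sum command is the exception: BSD-derived and System V-derived versions use different
algorithms, so compare sum output only between systems whose sum commands behave alike.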