open.itworld.com

Unix Tip: Comparing Files with Checksums

ITworld.com 11/16/2006

Sandra Henry-Stocker, ITworld.com

Unix systems provide numerous ways to compare files. The most common way to verify that you have received or downloaded the proper file is to compute a checksum and compare it against one computed by a reliable source. MD5 is frequently used for this because it is extremely unlikely that two different files will happen to have the same MD5 checksum. Similar commands, such as sum and cksum, also compute checksums, but with less reliability. Let's look at all three and see why.

One of the first things you'll notice if you compare the output of the sum, cksum and md5 commands is the length of each calculated value. The sum command prints two numbers. The first (31339 in the example below) is a 16-bit checksum. This means that you will get one of 65,536 possible values (from 0 to 65,535) for any file. The chance that two particular files which are different will have the same checksum is very small. If you have 65,000 files to compare, however, the chance that some two of them have the same checksum, though different, is a near certainty. In fact, you'll probably have a number of false matches.
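To put numbers on that claim, here is a minimal Python sketch of the usual "birthday" calculation. It assumes every 16-bit checksum value is equally likely, which is an idealization rather than a property of sum itself, and the function name collision_probability is made up for this example.

```python
def collision_probability(num_files, num_values=2 ** 16):
    """Chance that at least two of num_files share a checksum.

    Assumes each of num_values checksums is equally likely
    (an idealization; real checksums are not perfectly uniform).
    """
    if num_files > num_values:
        return 1.0  # more files than values: a match is guaranteed
    p_all_distinct = 1.0
    for i in range(num_files):
        # The (i+1)-th file must avoid the i values already taken.
        p_all_distinct *= (num_values - i) / num_values
    return 1.0 - p_all_distinct


for n in (2, 100, 302, 1000, 65000):
    print(n, round(collision_probability(n), 6))
```

Under this assumption a false match becomes more likely than not with only a few hundred files, long before you reach 65,000.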

# sum /export/home/jdoe/bigfile.gz
31339 165523 home/jdoe/bigfile.gz
One characteristic of the sum command is that the checksum has a very simple relationship to the contents of the file. If one file contains "abc" and another contains "abd", the checksums differ by only 1. This command is clearly using a very simple calculation, better suited to catching accidental corruption of a file than to heavy-duty or high-security file checking.
# sum /tmp/ab*
304 1 /tmp/abc
305 1 /tmp/abd
The second number that sum prints is the number of 512-byte blocks that are in the file. This helps considerably to ensure that dissimilar files are recognized as dissimilar. Unless the files you are comparing are also roughly the same size, the fact that the checksums are the same can be discounted.
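The values 304 and 305 shown above are consistent with the System V style of sum, which simply adds up the byte values and folds the result into 16 bits. The Python sketch below assumes that variant (BSD-style sum uses a different calculation) and assumes each sample file holds three letters plus a newline. The function name sysv_sum is invented for this example.

```python
def sysv_sum(data):
    """Return (checksum, 512-byte blocks) in the System V sum style.

    Assumption: this mirrors the additive variant of sum; other
    variants (for example the BSD one) produce different values.
    """
    total = sum(data) & 0xFFFFFFFF            # add up every byte value
    folded = (total & 0xFFFF) + (total >> 16)  # fold the sum into 16 bits
    checksum = (folded & 0xFFFF) + (folded >> 16)
    blocks = (len(data) + 511) // 512          # round up to whole blocks
    return checksum, blocks


print(sysv_sum(b"abc\n"))  # (304, 1)
print(sysv_sum(b"abd\n"))  # (305, 1)
print(sysv_sum(b"cba\n"))  # (304, 1), same as "abc": order is ignored
```

Because the bytes are only added together, any rearrangement of the same bytes gives the same checksum, which is one reason this kind of checksum produces false matches so easily.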

The cksum command works similarly. The first number that it prints is a cyclic redundancy check (CRC) for the file. As you can see from the sample output below, the CRC is a fairly large number. This decreases the chance that two files will be taken as being identical when they are not. Notice the difference in the checksums of our two small files, which cksum reports as four bytes each (three letters plus a newline).
# cksum /tmp/ab*
1112837078      4       /tmp/abc
1197460547      4       /tmp/abd
Using cksum against the large file we saw earlier, we see a checksum of similar magnitude even though the size of the file is dramatically larger.
# cksum /export/home/jdoe/bigfile.gz
3574185895      84747520        home/tcs/bigfile.gz
The second number in the cksum output is the number of octets (bytes) in the file. This is a similar concept to the number of blocks, but is considerably finer grained. Two files occupying the same number of blocks are still likely to include a different number of octets.
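For readers curious about what cksum computes, here is a rough Python sketch of the 32-bit CRC as POSIX describes it: polynomial 0x04C11DB7, with the file length fed in after the data and the result complemented. Treat it as an illustration under that assumption, not as the source of any particular system's cksum. The names posix_cksum and CRC_TABLE are made up for this example.

```python
POLY = 0x04C11DB7  # generator polynomial named in the POSIX cksum description


def _build_table():
    table = []
    for byte in range(256):
        crc = byte << 24
        for _ in range(8):
            # Shift left one bit; apply the polynomial when the top bit falls out.
            crc = ((crc << 1) ^ POLY) if crc & 0x80000000 else (crc << 1)
            crc &= 0xFFFFFFFF
        table.append(crc)
    return table


CRC_TABLE = _build_table()


def posix_cksum(data):
    """Return (crc, octet count) for a bytes object, cksum style."""
    crc = 0
    for byte in data:
        crc = ((crc << 8) & 0xFFFFFFFF) ^ CRC_TABLE[(crc >> 24) ^ byte]
    length = len(data)
    while length:
        # The length is appended to the data, least significant byte first.
        crc = ((crc << 8) & 0xFFFFFFFF) ^ CRC_TABLE[(crc >> 24) ^ (length & 0xFF)]
        length >>= 8
    return crc ^ 0xFFFFFFFF, len(data)


print(posix_cksum(b"abc\n"))
print(posix_cksum(b"abd\n"))
```

Unlike a simple sum, every input byte is mixed through the whole 32-bit register, so changing a single letter changes the result in many digits, and the results can be compared against the cksum output for /tmp/abc and /tmp/abd.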

The md5 command is the most reliable of the three commands and the only one recommended for serious file checking. If you are sending a gzipped file to a customer and want the customer to be confident that the file you have sent is both intact and the file you intended to send, providing him with an md5 checksum is a very good idea. Notice the length of the checksum below.
# md5 /export/home/jdoe/bigfile.gz
MD5 (/export/home/jdoe/bigfile.gz) = e1e0aec5c73eeb3bcf4cff4d5a44b067
This string of thirty-two hexadecimal digits can take on any of 2 ** 128 possible values. This is a bigger number than most of us can think about. It's more than billions times billions big. I am told it is exactly:
340,282,366,920,938,463,463,374,607,431,768,211,456
Probably so. I don't even want to think about calculating so large a number.
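Fortunately, the calculation is painless in a language with arbitrary-precision integers. This short Python sketch checks the figure: each hexadecimal digit carries 4 bits, so 32 digits give 128 bits.

```python
hex_digits = 32          # length of an MD5 checksum as printed
bits = hex_digits * 4    # each hexadecimal digit encodes 4 bits

print(bits)              # 128
print(f"{2 ** bits:,}")  # 340,282,366,920,938,463,463,374,607,431,768,211,456
```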

The chance of two files having the same md5 checksum by accident is vanishingly small. Looking at the two small files, we see that the md5 checksums seem to have no similarity whatsoever.
# md5 /tmp/ab*
MD5 (/tmp/abc) = 0bee89b07a248e27c83fc3d5951213c1
MD5 (/tmp/abd) = 8f0abafc5f8e6686a882c78cac4bcb9f
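The same comparison can be sketched in Python with the standard hashlib module. The byte strings below assume each sample file holds three letters followed by a newline, and the helper names md5_hex and differing_positions are invented for this example.

```python
import hashlib


def md5_hex(data):
    """Return the MD5 checksum of a bytes object as 32 hex digits."""
    return hashlib.md5(data).hexdigest()


def differing_positions(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


first = md5_hex(b"abc\n")   # assumed contents of /tmp/abc
second = md5_hex(b"abd\n")  # assumed contents of /tmp/abd
print(first)
print(second)
print(differing_positions(first, second), "of", len(first), "digits differ")
```

A one-letter change in the input alters most of the digits, in contrast to the sum output, where the checksums differed by exactly 1.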
Of course, to be valuable, checksums have to compute identically on different systems. Fortunately for us, this should always be the case.
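Because the checksum depends only on the bytes in the file, the person receiving a file can recompute it with any tool and compare. Here is a minimal Python sketch of that check. The function names file_md5 and matches_published are hypothetical, and the file is read in chunks so that a large archive does not have to fit in memory.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Return True if the file's checksum equals the published one."""
    # Ignore surrounding whitespace and letter case in the published value.
    return file_md5(path) == published_hex.strip().lower()
```

A customer who received bigfile.gz could call matches_published with the path to the local copy and the checksum the sender supplied, and treat a False result as a sign that the file is damaged or is not the one intended.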

Sandra Henry-Stocker has been administering Unix systems for more than 18 years. She describes herself as "USL" (Unix as a second language) but remembers enough English to write books and buy groceries. She currently works for TeleCommunication Systems, a wireless communications company, in Annapolis, Maryland, where no one else necessarily shares any of her opinions. She lives with her second family on a small farm on Maryland's Eastern Shore. Send comments and suggestions to bugfarm@gmail.com.





Copyright © Computerworld, Inc. All rights reserved

Reproduction in whole or in part in any form or medium without express written permission of Computerworld Inc. is prohibited. Computerworld and Computerworld.com and the respective logos are trademarks of International Data Group Inc.