Following last week's column on file extensions, several readers wrote in to
mention the other ways in which Unix systems determine file types and, as a
consequence, how to handle files when you work with them. In particular, they
mentioned the /etc/magic file and the file signatures that it provides to identify
file types regardless of how the files are named -- even, in fact, when no file
extensions are used.
To demonstrate the file typing operation, let's examine a file named "unknown"
and see what we can learn about it.
First, here's a listing of the file:
> ls -l unknown
-rw-r--r-- 1 henrystocker staff 12578 Dec 4 11:10 unknown
When we ask the file command to identify it, it has no trouble determining
that this particular file is a JPEG file:
> file unknown
unknown: JPEG file
JPEG files, like other image file types (PNG, GIF, TIFF etc.), contain a form
of file identifier in addition to the data that comprises the image itself.
They start with a particular sequence of bytes. For a JPEG, this sequence is
typically \377\330\377\340 (0xffd8ffe0) or \377\330\377\341 (0xffd8ffe1). The
first two bytes mark the start of the image; the last two (0xffe0 or 0xffe1)
indicate whether the image uses JFIF or EXIF. You might think
of these formats as extensions of the JPEG format, designed to support image
details not specified by the JPEG standard.
If we examine the beginning of our "unknown" JPEG file, for example,
we might see something like this:
> od -bc unknown | head -2
0000000 377 330 377 340 000 020 112 106 111 106 000 001 001 000 000 001
        377 330 377 340  \0 020   J   F   I   F  \0 001 001  \0  \0 001
As you can see from the second line of output, this file uses JFIF (JPEG File
Interchange Format).
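To make the two rows of od output easier to read, here is a minimal Python sketch that renders the same bytes both ways. The function names are made up for this illustration, and the character row is simplified: it only special-cases NUL, whereas od -c also has escapes for newline, tab and so on.

```python
def octal_dump(data: bytes) -> str:
    """Format bytes as three-digit octal values, like the od -b row."""
    return " ".join(f"{byte:03o}" for byte in data)


def char_dump(data: bytes) -> str:
    """Format bytes roughly like the od -c row (simplified)."""
    cells = []
    for byte in data:
        if byte == 0:
            cells.append("\\0")          # NUL shows up as backslash-zero
        elif 0x20 < byte < 0x7F:
            cells.append(chr(byte))      # printable ASCII is shown as is
        else:
            cells.append(f"{byte:03o}")  # anything else stays in octal
    return " ".join(cells)


# The first ten bytes of the "unknown" file, taken from the od output above.
header = bytes([0o377, 0o330, 0o377, 0o340, 0o000, 0o020,
                0o112, 0o106, 0o111, 0o106])

print(octal_dump(header))
print(char_dump(header))
```

Octal 112, 106, 111 and 106 are the ASCII codes for J, F, I and F, which is why the second row spells out the format name.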
The second variety of JPEG uses EXIF (Exchangeable Image File Format). Most
digital cameras store image files using this format.
% od -bc myphoto.jpg | head -2
0000000 377 330 377 341 077 376 105 170 151 146 000 000 111 111 052 000
        377 330 377 341   ? 376   E   x   i   f  \0  \0   I   I   *  \0
EXIF markers store additional information about images such as an optional
thumbnail and sometimes even audio information.
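The distinction between the two signatures can be sketched in a few lines of Python. This is only an illustration, not how the file command is implemented: the function names and the "other JPEG" label are assumptions, and it looks at nothing beyond the first four bytes.

```python
JPEG_SOI = b"\xff\xd8"  # start-of-image marker, \377\330 in octal

# The two bytes that follow the start-of-image marker in the examples above.
VARIANTS = {
    b"\xff\xe0": "JFIF",  # \377\340
    b"\xff\xe1": "EXIF",  # \377\341
}


def jpeg_variant(header: bytes):
    """Return "JFIF", "EXIF", "other JPEG", or None if this is not a JPEG."""
    if not header.startswith(JPEG_SOI):
        return None
    return VARIANTS.get(header[2:4], "other JPEG")


def jpeg_variant_of_file(path):
    """Read the first four bytes of a file and classify them."""
    with open(path, "rb") as handle:
        return jpeg_variant(handle.read(4))
```

Calling jpeg_variant_of_file("unknown") on the file examined earlier should return "JFIF", while a typical camera image such as myphoto.jpg should return "EXIF".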
The /etc/magic file identifies file types by capturing this type of information
and expressing it in four or five fields -- the byte offset (distance from the
start of the file), the type of the identifying value (e.g., string or short),
an optional operator, the value to be matched, and the string to be printed by
commands, such as file, that identify files.
Here are the Solaris /etc/magic descriptors for both forms of JPEG files:
0 string \377\330\377\340 JPEG file
0 string \377\330\377\341 JPEG file
The offset for both JPEG identifiers is zero. In other words, the identifying
information is at the beginning of the file. The value to be matched is classified
as a string, though it is expressed as four bytes in octal format, and the "string
to be printed" is "JPEG file". The description may vary slightly
from one OS to another. My Mac OS X system, for example, uses the description
"JPEG image data" while Solaris uses "JPEG file", but both
systems share the same basic knowledge about the file types and how to identify
them.
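As a rough illustration of how such entries could be applied, here is a minimal Python sketch that parses the two Solaris descriptors and matches them against the start of a file. It is not the file command's actual implementation: it assumes the simple four-field form with no operator, handles only the string type, and all function names are invented for the example.

```python
import re

OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def decode_octal_escapes(text: str) -> bytes:
    """Convert a magic value with octal escapes into the bytes it stands for."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        match = OCTAL_ESCAPE.match(text, pos)
        if match:
            out.append(int(match.group(1), 8) & 0xFF)
            pos = match.end()
        else:
            out.append(ord(text[pos]) & 0xFF)  # plain character, assumed ASCII
            pos += 1
    return bytes(out)


def parse_magic_line(line: str):
    """Split a four-field entry into (offset, type, value bytes, message)."""
    offset, kind, value, message = line.split(None, 3)
    return int(offset), kind, decode_octal_escapes(value), message


def identify(data: bytes, entries):
    """Return the message of the first matching entry, or "data" if none match."""
    for offset, kind, value, message in entries:
        if kind == "string" and data[offset:offset + len(value)] == value:
            return message
    return "data"


MAGIC_LINES = [
    r"0 string \377\330\377\340 JPEG file",
    r"0 string \377\330\377\341 JPEG file",
]
ENTRIES = [parse_magic_line(line) for line in MAGIC_LINES]

print(identify(b"\xff\xd8\xff\xe0\x00\x10JFIF", ENTRIES))  # JFIF header
print(identify(b"MZ\x90\x00", ENTRIES))                    # no matching entry
```

The first call prints "JPEG file". The second falls through to "data", which is the same answer the file command gives when nothing in its magic file matches.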
Ask a Unix system about a .doc or .docx file, on the other hand, and the file command
is likely to tell you simply that the file is "data". While some Windows
files may have embedded identifiers, they're not as obvious as those associated
with standards-based image files, and they don't seem to be heavily relied upon
for file identification. Windows systems depend on file extensions to a greater
degree than their Unix counterparts, but they do have some identifiers. Some
.exe files, for example, start with the letters "MZ" (the initials of Mark
Zbikowski, one of the MS-DOS developers), as shown in this example looking at
the PuTTY executable:
bash-2.05a$ od -bc putty.exe | head -2
0000000 115 132 220 000 003 000 000 000 004 000 000 000 377 377 000 000
          M   Z 220  \0 003  \0  \0  \0 004  \0  \0  \0 377 377  \0  \0
In both Windows Explorer and the Solaris File Manager, when you double-click a
file to open it, the system examines the file name extension. If it recognizes
the extension, it opens the file with whatever program is associated with it.
If Windows doesn't recognize a file's extension, it asks you what program it
should use. If Solaris doesn't recognize a file's extension, or the file doesn't
have one, it subjects the file to greater scrutiny and seems to assign the proper
icon and action to it.
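The "extension first, contents second" behavior described here can be sketched in Python as follows. This is a guess at the general approach, not a description of how either desktop is actually implemented; the association tables and handler names are hypothetical.

```python
import os

# Hypothetical associations, for illustration only.
BY_EXTENSION = {
    ".jpg": "image viewer",
    ".jpeg": "image viewer",
    ".exe": "program loader",
}
BY_SIGNATURE = [
    (b"\xff\xd8\xff", "image viewer"),  # JPEG, as in the magic entries above
    (b"MZ", "program loader"),          # DOS/Windows executable
]


def choose_handler(name: str, header: bytes):
    """Pick a handler by extension, falling back to the leading bytes."""
    extension = os.path.splitext(name)[1].lower()
    if extension in BY_EXTENSION:
        return BY_EXTENSION[extension]
    for signature, handler in BY_SIGNATURE:
        if header.startswith(signature):
            return handler
    return None  # nothing matched; a desktop might ask the user here
```

With these tables, a file named "unknown" that begins with the JPEG signature is still routed to the image viewer, which mirrors the greater scrutiny described above.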
Identifying the type of files on the systems you manage isn't really magic.
Instead, it's a carefully crafted system that makes use of the ways in which
file types are defined.