file - determine file type
file [-c] [-f namefile] [-m magicfile] file ...
The file command tests each argument in an attempt to classify it. There are four sets of tests, performed in this order: filesystem tests, script tests, magic number tests, and language tests. The first test that succeeds causes the file type to be printed.
The type printed will usually contain one of the words text (the file contains only printable ASCII characters), executable, or data, meaning anything else (usually 'binary' or non-printable).
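The "first test that succeeds wins" ordering can be sketched in Python as follows. This is only an illustration, not the program's own code: classify and the test callables passed to it are hypothetical names, and falling back to 'data' when nothing matches is an assumption based on the description of 'data' above.

```python
def classify(path, tests):
    """Run each test in order and return the first description produced.

    `tests` is a list of callables (hypothetical) that take a path and
    return a description string, or None when the test does not apply.
    """
    for test in tests:
        description = test(path)
        if description is not None:
            return description  # first success ends the search
    return "data"  # assumption: nothing matched, so report 'data'
```

A caller would pass the tests in the documented order: filesystem, script, magic number, language.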
The filesystem tests are based on examining the return from a stat system call. The program checks to see if the file is empty, or if it's some sort of special file. Any known file types appropriate to the system you are running on (sockets, symbolic links, or named pipes (FIFOs) on those systems that implement them) are intuited.
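A minimal Python sketch of a stat-based check along these lines is shown here. The function name and the strings it returns are illustrative assumptions, not the program's actual output.

```python
import os
import stat


def filesystem_test(path):
    """Return a description for empty or special files, else None.

    Hypothetical helper; the returned wording is illustrative only.
    """
    st = os.lstat(path)  # lstat, so a symbolic link itself is examined
    mode = st.st_mode
    if stat.S_ISLNK(mode):
        return "symbolic link to " + os.readlink(path)
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISFIFO(mode):
        return "fifo (named pipe)"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISCHR(mode):
        return "character special"
    if stat.S_ISBLK(mode):
        return "block special"
    if stat.S_ISREG(mode) and st.st_size == 0:
        return "empty"
    return None  # regular, non-empty file: leave it to the later tests
```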
The script tests are used when the file is an executable text file. If the first line is a '#!' line, then the name of the program is reported; otherwise, the file is reported as 'commands text'.
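The script test might look roughly like the following Python sketch. Several details are assumptions made for illustration: "executable" is taken to mean that any execute permission bit is set, "text" is judged from the first 512 bytes, and the wording of the '#!' result is invented (the text above only says that the program name is reported).

```python
import os

# Printable ASCII plus tab, newline, form feed and carriage return.
TEXT_BYTES = set(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0C, 0x0D}


def script_test(path):
    """Return a description for an executable text file, else None.

    Hypothetical helper; assumes `path` is a regular file that the
    filesystem tests have already passed over.
    """
    if not os.stat(path).st_mode & 0o111:
        return None  # no execute bit set for anyone
    with open(path, "rb") as handle:
        block = handle.read(512)
    if not all(byte in TEXT_BYTES for byte in block):
        return None  # not a text file
    first_line = block.split(b"\n", 1)[0]
    if first_line.startswith(b"#!"):
        words = first_line[2:].decode("ascii").split()
        if words:
            # Illustrative wording; words[0] is the interpreter named on the line.
            return "executable " + words[0] + " script"
    return "commands text"
```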
The magic number tests are used to check for files with data in particular fixed formats. Such files have a 'magic number' stored in a particular place near the beginning of the file that indicates its type. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way.
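To illustrate the idea of an invariant identifier at a small fixed offset, here is a short Python sketch. MAGIC_TABLE and magic_test are hypothetical names, and the table is a hand-written stand-in for a real magic file; the byte signatures themselves are the well-known ones for ELF, PNG, PDF and POSIX tar.

```python
# Hypothetical table: (offset, expected bytes, description).
MAGIC_TABLE = [
    (0, b"\x7fELF", "ELF"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (0, b"%PDF-", "PDF document"),
    (257, b"ustar", "POSIX tar archive"),
]


def magic_test(data):
    """Return the description of the first matching signature, else None."""
    for offset, expected, description in MAGIC_TABLE:
        if data[offset:offset + len(expected)] == expected:
            return description
    return None
```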
Finally, if all of the previous tests fail and the file appears to be an ASCII file, file attempts to guess its language using a crude search for common tokens associated with certain languages and file types. These tests are less reliable than the previous three groups, so they are performed last.
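A crude token search of this kind could be sketched in Python as below. TOKEN_TABLE is a made-up three-entry table, not the token list the program actually uses, and reporting plain 'text' when no token matches is an assumption.

```python
# Hypothetical token table: (token, description).
TOKEN_TABLE = [
    ("#include", "C program text"),
    ("subroutine", "FORTRAN program text"),
    ("From:", "mail text"),
]


def language_test(block):
    """Guess a language from whitespace-separated tokens in an ASCII block."""
    try:
        text = block.decode("ascii")
    except UnicodeDecodeError:
        return None  # not an ASCII file, so this test does not apply
    words = text.split()
    for token, description in TOKEN_TABLE:
        if token in words:
            return description
    return "text"  # assumption: ASCII, but no known token found
```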
file accepts the following options:
-m magicfile  Specify an alternate magic file containing magic numbers.
-c  Cause a debug checking printout of the parsed form of the magic file and information regarding the magic file match process for any arguments. This is usually used in conjunction with -m to debug a new magic file before installing it.
-f namefile  Read the names of the files to be examined from namefile (one per line) before the argument list.
Follow symbolic links.
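The effect of reading a namefile before the argument list can be sketched in Python as follows. files_to_examine is a hypothetical name, and skipping blank lines is an assumption that the text above does not state.

```python
def files_to_examine(namefile_lines, argument_list):
    """Return names from the namefile first, then the command-line names."""
    names = [line.rstrip("\n") for line in namefile_lines]
    names = [name for name in names if name]  # assumption: skip blank lines
    return names + list(argument_list)
```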
The default magic file is ../share/magic, located in the distribution relative to the path of the program. If that is not found, then an attempt is made to open /etc/magic, a common location for a system magic file on many UNIX systems. Magic file formats vary. This version supports the BSD format, including big-endian and little-endian numerics, ordered comparison of strings, and use of numerics as dates. In particular, some magic file formats interpret '<' or '>' as a literal character when matching a string, but this implementation treats them as operators.
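The following Python sketch shows how byte order and a leading comparison operator affect a single magic test. It is a simplified illustration and not this program's parser: check_entry and NUMERIC_FORMATS are hypothetical names, only the '=', '<' and '>' operators are handled, dates are not covered, and strings are compared over the length of the test string only.

```python
import struct

# struct format codes for a few magic numeric types.
NUMERIC_FORMATS = {
    "byte": "B",
    "beshort": ">H",  # big-endian 16-bit
    "leshort": "<H",  # little-endian 16-bit
    "belong": ">I",   # big-endian 32-bit
    "lelong": "<I",   # little-endian 32-bit
}


def check_entry(data, offset, type_name, test):
    """Evaluate one magic test against `data`; return True on a match."""
    op = "="
    if test[:1] in ("=", "<", ">"):
        # A leading '<' or '>' is taken as an operator, never as a literal.
        op, test = test[0], test[1:]
    if type_name == "string":
        expected = test.encode("ascii")
        actual = data[offset:offset + len(expected)]
    else:
        fmt = NUMERIC_FORMATS[type_name]
        size = struct.calcsize(fmt)
        chunk = data[offset:offset + size]
        if len(chunk) < size:
            return False  # file too short for this test
        actual = struct.unpack(fmt, chunk)[0]
        # Accepts decimal and 0x hex; C-style octal such as 0177 is not handled.
        expected = int(test, 0)
    if op == "=":
        return actual == expected
    if op == "<":
        return actual < expected
    return actual > expected
```

With this sketch, check_entry(b"ZZZZ", 0, "string", "<ar>") is true because '<' is read as "sorts before 'ar>'", whereas the test "=<ar>" compares against the literal string '<ar>' and fails.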
Multiple levels of sub-tests are supported.
The environment variable MAGIC can be used to override the default location of the magic file. Command line options still take precedence.
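The precedence described here (command-line option first, then MAGIC, then the default locations) can be sketched in Python as below. choose_magic_file and its parameters are hypothetical, and reading "relative to the path of the program" as "relative to the program's directory" is an assumption.

```python
import os


def choose_magic_file(option_m, environ, program_path, exists=os.path.exists):
    """Pick the magic file: -m option, then MAGIC, then the defaults."""
    if option_m is not None:
        return option_m  # command-line option takes precedence
    if environ.get("MAGIC"):
        return environ["MAGIC"]
    # Assumption: ../share/magic is resolved from the program's directory.
    bundled = os.path.normpath(
        os.path.join(os.path.dirname(program_path), "..", "share", "magic")
    )
    if exists(bundled):
        return bundled
    return "/etc/magic"
```

The exists parameter is there only so the lookup can be exercised without touching the real filesystem.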
file cannot read from standard input.
This implementation is significantly slower than the C version. Much of the time is startup, followed by the overhead of parsing the magic file. Once the magic file is loaded after evaluating the first input file, then subsequent evaluations are a little faster. I try to speed the operation by only loading new entries from the magic file as I need them and only parsing the subtests as needed, but this doesn't help much.
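The idea of loading magic entries only as they are needed can be illustrated with a small Python sketch. LazyMagic is a hypothetical class, not the program's code; it assumes top-level entries made of whitespace-separated offset, type, test and message fields, treats lines starting with '#' as comments, and ignores sub-tests entirely.

```python
class LazyMagic:
    """Parse magic lines on demand and remember what was already parsed."""

    def __init__(self, lines):
        self._pending = iter(lines)  # lines not yet looked at
        self._loaded = []            # entries parsed so far
        self.parse_count = 0         # how many entries have been parsed

    def entries(self):
        """Yield cached entries first, then parse further lines as needed."""
        index = 0
        while True:
            if index < len(self._loaded):
                yield self._loaded[index]
                index += 1
                continue
            line = next(self._pending, None)
            if line is None:
                return  # magic file exhausted
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            self.parse_count += 1
            self._loaded.append(tuple(line.split(None, 3)))
```

A file that matches an early entry never causes the later lines to be parsed, and a second input file reuses the entries already loaded.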
Some simpler versions of magic (e.g. Solaris') only allow the '=' operator for strings. Thus, the following line from the Solaris /etc/magic will be misinterpreted by this implementation of file (an '=' should be prepended to the string):
0 string <ar> System V R1 archive
The BSD version of file has a few bugs which make it more tolerant of bogus entries including:
>168 belong &=0x00000004 dynamically linked
0 string ^!<arch>\n_______64E Alpha archive
This implementation accepts bogus numerics without complaining, and only complains about bogus operators if -c is enabled.
Special identification of pre-POSIX tar files is not included.
Many magic files include elaborate attempts to match the starting line of executable scripts. This implementation will not usually consider these magic conditions because it identifies executable scripts according to their '#!' line in a special test before considering magic. This is faster and typically more reliable than attempts at exact string matching on the first line of the script.
This program is copyright by dkulp 1999.
This program is free and open software. You may use, copy, modify, distribute and sell this program (and any modified variants) in any way you wish, provided you do not restrict others to do the same, except for the following consideration.
I read some of Ian F. Darwin's BSD C implementation, to try to determine how some of this was done since the specification is a little vague. I don't believe that this perl version could be construed as an "altered version", but I did grab the tokens for identifying the hard-coded file types in names.h and copied some of the man page.
Here's his notice:
* Copyright (c) Ian F. Darwin, 1987.
* Written by Ian F. Darwin.
*
* This software is not subject to any license of the American Telephone
* and Telegraph Company or of the Regents of the University of California.
*
* Permission is granted to anyone to use this software for any purpose on
* any computer system, and to alter it and redistribute it freely, subject
* to the following restrictions:
*
* 1. The author is not responsible for the consequences of use of this
* software, no matter how awful, even if they arise from flaws in it.
*
* 2. The origin of this software must not be misrepresented, either by
* explicit claim or by omission. Since few users ever read sources,
* credits must appear in the documentation.
*
* 3. Altered versions must be plainly marked as such, and must not be
* misrepresented as being the original software. Since few users
* ever read sources, credits must appear in the documentation.
*
* 4. This notice may not be removed or altered.