NAME

File::Unpack - A strong bz2/gz/zip/tar/cpio/rpm/deb/cab/lzma/7z/rar/... archive unpacker, based on mime-types

VERSION

Version 0.70

SYNOPSIS

This Perl module comes with an executable script:

/usr/bin/file_unpack -h

/usr/bin/file_unpack [-1] [-m] ARCHIVE_FILE ...

File::Unpack is an unpacker for archives and files (bz2/gz/zip/tar/cpio/iso/rpm/deb/cab/lzma/7z/rar ... pdf/odf) based on MIME types. We call it strong, because it is not fooled by file suffixes, or multiply wrapped packages. It recursively descends into each archive found until it finally exposes all unpackable payload contents.

A logfile can be written, precisely describing MIME types and unpack actions.

    use File::Unpack;

    my $log;
    my $u = File::Unpack->new(logfile => \$log);

    my $m = $u->mime('/etc/init.d/rc');
    print "$m->[0]; charset=$m->[1]\n";
    # text/x-shellscript; charset=us-ascii

    print "$_->{name}\n" for @{ $u->mime_helper() };
    # application/%rpm
    # application/%tar+gzip
    # application/%tar+bzip2
    # ...

    $u->unpack("inputfile.tar.bz2");
    while ($log =~ m{^\s*"(.*?)":}g) # it's JSON.
      {
        print "$1\n";   # report all files unpacked
      }

    ...

Most of the known archive file formats are supported. Shell-script-style plugins can be added to support additional formats.

Helper shell-scripts can be added to support additional mime-types. Example:

$ echo 'ar x "$1"' > /usr/share/File-Unpack/helper/application=x-debian-package

$ chmod a+x /usr/share/File-Unpack/helper/application=x-debian-package

This example creates a trivial external equivalent of the builtin MIME helper for *.deb packages. For details see the documentation of the unpack() method.

unpack examines the contents of an archive file or directory using an extensive mime-type analysis. The contents are unpacked recursively to the given destination directory; a listing of the unpacked files is reported through the built-in logging facility during unpacking. Most common archive file formats are handled directly; more can easily be added as mime-type helper plugins.

SUBROUTINES/METHODS

new

my $u = File::Unpack->new(destdir => '.', logfile => \*STDOUT, maxfilesize => '2G', verbose => 1, world_readable => 0, one_shot => 0, no_op => 0, archive_name_as_dir => 0, follow_file_symlinks => 0, log_params => {}, log_type => 'JSON');

Creates an unpacker instance. The parameter destdir must be a writable location; all output files and directories are placed inside this destdir. Subdirectories will be created in an attempt to reflect the structure of the input. Destdir defaults to the current directory; relative paths are resolved immediately, so that chdir() after calling new is harmless.

The parameter logfile can be a reference to a scalar, a filename, or a file descriptor. The logfile starts with a JSON formatted prolog, where all lines start with printable characters. For each file unpacked, a one line record is appended, starting with a single whitespace ' ', and terminated by "\n". The format is a JSON-encoded "key": {value},\n pair, where key is the filename, and value is a hash including 'mime', 'size', and other information. The logfile is terminated by an epilog, where each line starts with a printable character. As part of the epilog, a dummy file named "\" with an empty hash is added to the list. It should be ignored while parsing. By default, the logfile is sent to STDOUT.

The parameter maxfilesize is a safeguard against compressed sparse files and test-files for archivers. Such files could easily fill up any available disk space when unpacked. Files hitting this limit will be silently truncated. Check the logfile records or epilog to see if this has happened. BSD::Resource is used to manipulate RLIMIT_FSIZE.

The parameter one_shot can optionally be set to non-zero, to limit unpacking to one step of unpacking. Unpacking of well known compressed archives like e.g. '.tar.bz2' is considered one step only. Whether uncompressing an archive is considered an extra step before unpacking the archive depends on the configured mime helpers.

The parameter no_op causes unpack() to only print one shell command to STDOUT and exit. This implies one_shot=1.

The parameter world_readable causes unpack() to change all directories to 0755, and all files to 0444. Otherwise 0700 and 0400 (user readable) is asserted.

The parameter follow_file_symlinks causes some or all symlinks to files to be included. A value of 1 follows symlinks that exist in the input directory and point to a file. This has no effect if the input is an archive file. A value of 2 also follows symlinks that were extracted from archives. CAUTION: This may cause unpack() to visit files or archives elsewhere in the local filesystem. Directory symlinks are always excluded.

The parameter archive_name_as_dir causes the unpacker to store all unpacked files inside a directory with the same name as their archive.

The default depends on how many files are unpacked from the archive: If exactly one file (or one toplevel directory) is unpacked, then no extra directory is used. E.g. foo.tar.gz would unpack to foo.tar or foo-1.0.zip would unpack to foo-1.0/* and no files outside this directory. If multiple files (or directories) are unpacked, and the suffix of the archive can be removed with the suffix_re of its mime_helper, then the shortened name is used as a directory. E.g. foo.tar would unpack to foo/*. Otherwise ._ is appended to the archive name. E.g. foo.tar would unpack to foo.tar._/*.

In any case, the suffix ._ or ._NNN is used to avoid conflicts with already existing names where NNN is a numeric value.
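The default naming rule described above can be sketched as a small function. This is only an illustration under assumptions: the name default_dest_name is hypothetical, and the way suffix_re is applied (a dot followed by the pattern at the end of the archive name) is a guess rather than something taken from the module's source. Conflict handling with ._NNN is left out.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of File::Unpack.
# $n_toplevel: number of toplevel entries the archive unpacked to.
# $suffix_re:  suffix pattern of the MIME helper, or undef if there is none.
sub default_dest_name {
    my ($archive, $n_toplevel, $suffix_re) = @_;

    # Exactly one file or toplevel directory: no extra directory is used.
    return undef if $n_toplevel == 1;

    # Several entries: try to strip the suffix (assumed form: ".<suffix>" at the end).
    my $short = $archive;
    return $short if defined $suffix_re && $short =~ s/\.$suffix_re\z//;

    # Suffix not removable: append "._" to the archive name.
    return "$archive._";
}

printf "%s\n", default_dest_name('foo.tar', 3, qr/tar/);
printf "%s\n", default_dest_name('foo.tar', 3, undef);
```

The two calls correspond to the two foo.tar cases in the text: one with a usable suffix pattern, one without.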

exclude

exclude(add => ['.svn', '*.orig' ], del => '.svn', force => 1, follow_file_symlinks => 0)

Defines the exclude-list for unpacking. This list is advisory for the MIME helpers. The exclude-list items are shell glob patterns, where '*' or '?' never match '/'.

You can use force to have any of these removed after unpacking. Use (vcs => 1) to exclude a long list of known version control system directories, use (vcs => 0) to remove them. The default is exclude(empty => 1), which is the same as exclude(empty_file => 1, empty_dir => 1) -- having the obvious meaning.

(re => 1) returns the active exclude-list as a regexp pattern. Otherwise exclude always returns the list as an array ref.
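How a glob in which '*' and '?' never match '/' maps to a regexp can be sketched as follows. The function glob_to_re is hypothetical and only illustrates the matching rule; the pattern that exclude(re => 1) actually returns may be built differently, and matching against the whole name is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own code: translate one shell glob
# into an anchored regexp in which '*' and '?' never match '/'.
# Bracket expressions such as [a-z] are not handled; they are matched literally.
sub glob_to_re {
    my ($glob) = @_;
    my $re = join '', map {
          $_ eq '*' ? '[^/]*'
        : $_ eq '?' ? '[^/]'
        :             quotemeta($_)
    } split //, $glob;
    return qr/\A$re\z/;
}

for my $name ('foo.orig', 'dir/foo.orig', '.svn') {
    my @hits = grep { $name =~ glob_to_re($_) } ('.svn', '*.orig');
    print "$name: ", (@hits ? "excluded by @hits" : "kept"), "\n";
}
```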

Some symbolic links are included if {follow_file_symlinks} is nonzero. For details see new().

If exclude patterns were effective, or if symlinks, fifos, sockets, ... were encountered during unpack(), the logfile contains an additional 'skipped' keyword with statistics.

unpack

$u->unpack($archive, [$destdir])

Determines the contents of an archive and recursively extracts its files. An archive may be the pathname of a file or directory. The extracted contents will be stored in destdir/$subdir/$dest_name, where dest_name is the filename component of archive without any leading pathname components, and possibly stripped or added suffix. (Subdir defaults to ''.) If archive is a directory, then dest_name will also be a directory. If archive is a file, the type of dest_name depends on the type of packing: If the archive expands to multiple files, dest_name will be a directory, otherwise it will be a file. If a file of the same name already exists in the destination subdir, an additional subdir component is created to avoid any conflicts.

For each extracted file, a record is written to the logfile. When unpacking is finished, the logfile contains one valid JSON structure. Unpack achieves this by writing suitable prolog and epilog lines to the logfile. The logfile can also be parsed line by line. Each file record is one line, starts with a ' ' whitespace, and ends in a ',' comma. Everything else is prolog or epilog.
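The line-by-line parsing described above can be sketched with the core module JSON::PP. The function name parse_log_records and the sample log are made up for illustration; in particular, the prolog and epilog shown here are placeholders and not the exact text the module writes.

```perl
use strict;
use warnings;
use JSON::PP;

# Collect the per-file records from a File::Unpack style logfile.
# A record line starts with a single space and ends with a comma;
# everything else is prolog or epilog and is skipped.
sub parse_log_records {
    my ($log_text) = @_;
    my %files;
    for my $line (split /\n/, $log_text) {
        next unless $line =~ /\A (\S.*),\s*\z/;
        my $rec = JSON::PP->new->decode('{' . $1 . '}');   # "name": {...}
        @files{ keys %$rec } = values %$rec;
    }
    return \%files;
}

# Made-up sample log; prolog, epilog and field values are illustrative only.
my $sample = <<'EOF';
{"unpacked":{
 "foo/README": {"mime":"text/plain","size":120},
 "foo/logo.png": {"mime":"image/png","size":3456},
"\\": {}}}
EOF

my $files = parse_log_records($sample);
print "$_ ($files->{$_}{mime})\n" for sort keys %$files;
```

Because the dummy entry belongs to the epilog and does not start with a space, this approach skips it without special handling.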

The actual unpacking is dispatched to MIME type specific helpers, selected using mime. A MIME helper can either be built-in code, or an external shell-script found in a directory registered with mime_helper_dir. The standard place for external helpers is /usr/share/File-Unpack/helper; it can be changed by the environment variable FILE_UNPACK_HELPER_DIR or the new parameter helper_dir.

The naming of helper scripts is described under mime_helper().

A MIME helper must have executable permission and is called with 6 parameters: source_path, destfile, destination_path, mimetype, description, and config_dir. Note that destination_path is a freshly created empty working directory, even if the unpacker is expected to unpack only a single file. The unpacker is called after chdir into destination_path, so you usually do not need to evaluate the third parameter.
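As a rough sketch, an external helper could map the six positional parameters to names before doing its work. The example below is written in Perl and assumes that any executable is accepted as a helper, which the text above does not state explicitly; the names @PARAMS and helper_args are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Names of the six positional parameters, in the order given above.
my @PARAMS = qw(source_path destfile destination_path
                mimetype description config_dir);

# Map the positional arguments to named fields.
sub helper_args {
    my @argv = @_;
    die "expected 6 arguments, got " . scalar(@argv) . "\n"
        unless @argv == @PARAMS;
    my %arg;
    @arg{@PARAMS} = @argv;
    return \%arg;
}

# Only act when invoked with arguments, as a helper would be.
if (@ARGV) {
    my $arg = helper_args(@ARGV);
    # The working directory is already destination_path, so files
    # extracted here land in the right place. 'ar x' unpacks a *.deb.
    my $status = system('ar', 'x', $arg->{source_path});
    exit($status == 0 ? 0 : 1);
}
```

The exit status is 0 only when the external command succeeded, matching the convention that a helper returns 0 upon success.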

The directory config_dir contains unpack configuration in .sh, .js and possibly other formats. A MIME helper may use this information, but need not. All data passed into new is reflected there, as well as the active exclude-list. Using the config information can help a MIME helper to skip unwanted work or otherwise optimize unpacking.

unpack monitors the available filesystem space in destdir. If there is less space than configured with minfree, a warning can be printed and unpacking is optionally paused. It also monitors the MIME helper's progress reading the archive at source_path and reports percentages to STDERR (if verbose is 1 or more).

After the MIME helper is finished, unpack examines the files it created. If it created no files in destdir, an error is reported, and the source_path may be passed to other unpackers, or finally be added to the log as is.

If the MIME helper wants to express that source_path is already unpacked as far as possible and should be added to the log without any error messages, it creates a symbolic link destdir pointing to source_path.

The system considers replacing the directory with a file, if all of the following conditions are met:

  • There is exactly one file in the directory.

  • The file name is identical with the directory name, except for one changed or removed suffix-word. (*.tar.gz -> *.tar; or *.tgz -> *.tar)

  • The file must not already exist in the parent directory.
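The three conditions can be expressed as one predicate. This is a sketch, not the module's own logic: the function name is hypothetical, and a "suffix-word" is assumed to be the last dot-separated component of a name.

```perl
use strict;
use warnings;

# Hypothetical check, not the module's own code.
# $dirname:    name of the directory the helper unpacked into
# $entries:    array ref of the names found inside that directory
# $parent_has: code ref, true if a name already exists in the parent directory
sub may_replace_dir_with_file {
    my ($dirname, $entries, $parent_has) = @_;

    # Condition 1: exactly one file in the directory.
    return 0 unless @$entries == 1;
    my $file = $entries->[0];

    # Condition 2: names differ by one removed or changed suffix word.
    # Assumption: a suffix word is the last dot-separated component.
    (my $dir_stem  = $dirname) =~ s/\.[^.]+\z//;
    (my $file_stem = $file)    =~ s/\.[^.]+\z//;
    return 0 if $dir_stem eq $dirname;             # directory has no suffix
    my $removed = $file eq $dir_stem;              # foo.tar.gz -> foo.tar
    my $changed = $file_stem eq $dir_stem && $file ne $dirname;   # foo.tgz -> foo.tar
    return 0 unless $removed || $changed;

    # Condition 3: the file must not already exist in the parent directory.
    return $parent_has->($file) ? 0 : 1;
}

my $none = sub { 0 };   # pretend the parent directory is empty
print may_replace_dir_with_file('foo.tar.gz', ['foo.tar'], $none), "\n";
print may_replace_dir_with_file('foo.tgz',    ['foo.tar'], $none), "\n";
print may_replace_dir_with_file('foo.zip',    ['a', 'b'],  $none), "\n";
```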

unpack prepares 20 empty subdirectory levels and chdirs the unpacker in there. This number can be adjusted using new(dot_dot_safeguard => 20). A directory 20 levels up from the current working dir has mode 0 while the MIME helper runs. unpack can optionally chmod(0) the parent of the subdirectory after it chdirs the unpacker inside. Use new(jail_chmod0 => 1) for this, default is off. If enabled, a MIME helper trying to place files outside of the specified destination_path may receive 'permission denied' conditions.

These are special hacks to keep badly constructed tar-balls, cpio-, or zip-archives at bay.

Please note that this can help against archives containing relative paths (like starting with '../../../foo'), but will be ineffective with absolute paths (starting with '/foo'). It is the responsibility of MIME helpers to not create absolute paths; unpack should not be run as the root user, to minimize the risk of compromising the root filesystem.

A missing MIME helper is skipped, and subsequent helpers may take effect. A MIME helper is expected to return an exit status of 0 upon success. If it runs into a problem, it should print lines starting with the affected filenames to stderr. Such errors are recorded in the log with the unpacked archive, and as far as files were created, also with these files.

Symbolic links are ignored while unpacking.

Currently you can call unpack only once.

run

$u->run([argv0, ...], @redir, ... { init => sub ..., in, out, err, watch, every, prog, ... })

A general purpose fork-exec wrapper, based on IPC::Run. STDIN is closed, unless you specify an in => as described in IPC::Run. STDERR and STDOUT are both printed to STDOUT, prefixed with 'E: ' and 'O: ' respectively, unless you specify out =>, err =>, or out_err => ... for both.

Using redirection operators in @redir takes precedence over the above in/out/err redirections. See also IPC::Run. If you use the options in/out/err, you should restrict your redirection operators to the forms '<', '0<', '1>', '2>', or '>&' due to limitations in the precedence logic. Piping via '|' is properly recognized, but background execution '&' may confuse the precedence logic.

This run method is completely independent of the rest of File::Unpack. It works both as a static function and as a method call. It is used internally by unpack, but is exported to be of use elsewhere.

Init is run after construction of redirects. Calling chdir() in init thus has no effect on redirects with relative paths.

Return value in scalar context is the first nonzero result code, if any. In list context all return values are returned.

fmt_run_shellcmd

File::Unpack::fmt_run_shellcmd( $m->{argvv} )

Static function to pretty print the return value $m of method find_mime_helper(); It formats a command array used with run() as a properly escaped shell command string.

mime_helper_dir mime_helper

$u->mime_helper_dir($dir, ...)

$u->mime_helper($mime_name, $suffix_regexp, \@argv, @redir, ...)

Registers one or more directories where external MIME helper programs are found. Helper plugins are shell scripts that serve as specialized MIME type handlers for unpacking. A list of built-in helpers interfaces with most well-known archivers. This list can be appended to using the mime_helper_dir() or mime_helper() methods. Multiple directories can be registered; they are searched in reverse order, i.e. last added takes precedence. Any external MIME helper takes precedence over built-in code.

The suffix_regexp is used to derive the destination name from the source name. It is not used for selecting helpers.

When collecting external helper scripts via mime_helper_dir(), there is no suffix_regexp. Instead, external helper scripts can explicitly create a toplevel directory with the desired name.

Helpers are mapped to MIME types by their mime_name. The name can be constructed from the MIME type by replacing the '/' with a '=' character, and by using the word 'ANY' as a wildcard component. The '=' character is interpreted as an implicit '=ANY+' if needed.

 Examples:

  Mimetype                   helper names tried from top to bottom
  -----------------------------------------------------------------
  image/png                  image=png 
                              image=ANY 
                               image
                                ANY=png
                                 ANY=ANY
                                  ANY

  application/vnd.oasis+zip  application=vnd.oasis+zip 
                              application=ANY+zip
                               application=ANYzip
                                application=zip
                                 application=ANY
                                      ...
  

A trailing '=ANY' is implicit, as shown by these examples. The rules for precedence are these:

  • Search in the latest directory is exhausted first, then the previously added directory is considered in turn, until all directories have been traversed, or until a matching helper is found.

  • A matching name with wildcards has lower precedence than a matching name without.

  • A wildcard before the '=' sign lowers precedence more than one after it.
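For a plain major/minor type, the lookup order from the image/png example can be generated mechanically. The sketch below uses a hypothetical function name and covers only types without a '+'; the additional expansion steps for types such as application/vnd.oasis+zip are not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical helper: candidate helper names for a plain "major/minor"
# MIME type, most specific first. Types containing '+' follow additional
# expansion steps that are not reproduced here.
sub helper_name_candidates {
    my ($mimetype) = @_;
    my ($major, $minor) = $mimetype =~ m{\A([^/+]+)/([^/+]+)\z}
        or die "unsupported MIME type '$mimetype'\n";
    return (
        "$major=$minor",   # exact match
        "$major=ANY",      # wildcard after '='
        $major,            # trailing '=ANY' is implicit
        "ANY=$minor",      # wildcard before '='
        'ANY=ANY',
        'ANY',
    );
}

print "$_\n" for helper_name_candidates('image/png');
```

The order reflects the precedence rules: names without wildcards come first, and a wildcard before the '=' ranks below one after it.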

The mapping takes place when mime_helper_dir is called. Adding helper scripts to a directory afterwards has no effect. mime_helper does not do any implicit expansions. Call it multiple times with the same helper command and different names if needed. The default argument list is "%(src)s %(destfile)s %(destdir)s %(mime)s %(descr)s %(configdir)s" -- this is applied, if no args are given and no redirections are given. See also unpack for more semantics and how a helper should behave.
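To illustrate the %(name)s placeholders in the default argument list, here is a minimal expansion sketch. The function expand_args is hypothetical, and splitting the template into words before substituting is an assumption about a sensible order of operations, not a description of the module's internals.

```perl
use strict;
use warnings;

# Expand %(name)s placeholders in an argument template.
# The template is split on whitespace first, so each placeholder becomes
# exactly one argument even if its value contains spaces.
sub expand_args {
    my ($template, %val) = @_;
    my @args;
    for my $word (split ' ', $template) {
        $word =~ s{%\((\w+)\)s}{ $val{$1} // die "unknown placeholder '$1'\n" }ge;
        push @args, $word;
    }
    return @args;
}

my $default = '%(src)s %(destfile)s %(destdir)s %(mime)s %(descr)s %(configdir)s';
my @argv = expand_args($default,
    src       => '/in/a b.deb',
    destfile  => 'a b',
    destdir   => '/out/tmp',
    mime      => 'application/x-debian-package',
    descr     => 'Debian binary package (format 2.0)',
    configdir => '/out/cfg',
);
print scalar(@argv), " arguments\n";
```

The paths and values passed in are invented for the example; the six placeholder names are the ones listed in the default argument list.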

Both methods return an ARRAY-ref of HASHes describing all known (old and newly added) mime helpers.

list

Returns an ARRAY of preformatted patterns and MIME helpers.

Example:

  printf @$_ for $u->list(); 

find_mime_helper

$u->find_mime_helper($mimetype)

Returns a MIME helper suitable for unpacking the given $mimetype. If called in list context, a second return value indicates which mime helpers would be suitable, but could not be found in the system.

minfree

$u->minfree(factor => 10, bytes => '100M', percent => '3%', warning => sub { .. })

THESE TESTS ARE TO BE IMPLEMENTED.

Guard the filesystem (destdir) against becoming full during unpack. Before unpacking each source archive, the free space is measured and compared against three conditions:

  • The archive size multiplied with the given factor must fit into the filesystem.

  • The given number of bytes (in optional K, M, G, or T units) must be free.

  • The filesystem must have at least the given free percentage. The '%' character is optional.

The warning method is called if any of the above conditions fail. Its signature is: &warning->($pathname, $full_percentage, $free_bytes, $free_inodes); It is expected to print an appropriate warning message, and delay a few seconds. It should return 0 to cause a retry. It should return nonzero to continue unpacking. The default warning method prints a message to STDERR, waits 30 seconds, and returns 0.

The filesystem may still become full and unpacking may fail, if e.g. factor was chosen lower than the average compression ratio of the archives.
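Since these tests are not implemented yet, the following is only a sketch of how the three conditions might be evaluated. Both function names are hypothetical, the units are assumed to be powers of 1024, and the free and total byte counts are passed in rather than measured.

```perl
use strict;
use warnings;

# Convert a size such as '100M' into bytes (assumption: K/M/G/T are powers of 1024).
sub parse_bytes {
    my ($size) = @_;
    my %mult = (K => 1024, M => 1024**2, G => 1024**3, T => 1024**4);
    $size =~ /\A(\d+(?:\.\d+)?)\s*([KMGT])?\z/i or die "bad size '$size'\n";
    my ($number, $unit) = ($1, $2);
    return $number * (defined $unit ? $mult{ uc $unit } : 1);
}

# Return 1 if all three minfree conditions hold, 0 otherwise.
# Expected keys: archive_size, free_bytes, total_bytes, factor, bytes, percent.
sub space_ok {
    my (%a) = @_;
    (my $percent = $a{percent}) =~ s/%\z//;      # the '%' character is optional

    return 0 if $a{archive_size} * $a{factor} > $a{free_bytes};
    return 0 if $a{free_bytes} < parse_bytes($a{bytes});
    return 0 if 100 * $a{free_bytes} / $a{total_bytes} < $percent;
    return 1;
}

my $MiB = 1024**2;
print space_ok(
    archive_size => 10 * $MiB, free_bytes => 200 * $MiB, total_bytes => 1000 * $MiB,
    factor => 10, bytes => '100M', percent => '3%',
), "\n";
```

A warning callback as described above would be invoked whenever space_ok returns 0.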

mime

$u->mime($filename)

$u->mime(file => $filename)

$u->mime(buf => "#!/bin ...", file => "what-was-read")

$u->mime(fd => \*STDIN, file => "what-was-opened")

Determines the MIME type (and optionally additional information) of a file. The file can be specified by filename, by a provided buffer or an opened file descriptor. For the latter two cases, specifying a filename is optional, and used only for diagnostics.

mime uses libmagic by Christos Zoulas exposed via File::LibMagic and also uses the shared-mime-info database from freedesktop.org exposed via File::MimeInfo::Magic, if available. Either one is sufficient, but having both is better. LibMagic sometimes says 'text/x-pascal', although we have a .desktop file, or says 'text/plain', but has contradicting details in its description.

File::MimeInfo::Magic::magic is consulted where the libmagic output is dubious. E.g. when the description says something interesting like 'Debian binary package (format 2.0)' but the mimetype says 'application/octet-stream'. The combination of both libraries gives us excellent reliability in the critical field of MIME type recognition.

This implementation also features multi-level MIME type recognition for efficient unpacking. When e.g. unpacking a large bzipped tar archive, this saves us from creating a huge temporary tar-file which unpack would extract in a second step. The multi-level recognition returns 'application/x-tar+bzip2' in this case, and allows for a MIME helper to e.g. pipe the bzip2 contents into tar (which is exactly what 'tar jxvf' does, making a very simple and efficient MIME helper).

mime returns a 3 or 4 element arrayref with mimetype, charset, description, diff; where diff is only present when the libmagic and shared-mime-info methods disagree.

In case of 'text/plain', an additional rule based on file name suffix is used to allow recognition of well known plain text pack formats. We return 'text/x-suffix-XX+plain', where XX is one of the recognized suffixes (in all lower case and without the dot). E.g. a plain mmencoded file has no header and looks like 'text/plain' to all the known magic libraries. We recognize the suffixes .mm, .b64, and .base64 for this (case-insensitive). A similar rule exists for 'application/octet-stream'. It may trigger e.g. for LZMA compressed files which fail to provide a magic number.
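The suffix rule for 'text/plain' can be sketched like this. The function refine_plain is hypothetical and knows only the three suffixes named above; the module may recognize more.

```perl
use strict;
use warnings;

# Hypothetical refinement step: turn 'text/plain' into
# 'text/x-suffix-XX+plain' when the file name carries a known suffix.
# Only the three suffixes named in the text are listed here.
sub refine_plain {
    my ($mimetype, $filename) = @_;
    return $mimetype unless $mimetype eq 'text/plain';
    return "text/x-suffix-\L$1\E+plain"
        if $filename =~ /\.(mm|b64|base64)\z/i;   # suffix match ignores case
    return $mimetype;
}

print refine_plain('text/plain', 'mail.B64'),  "\n";
print refine_plain('text/plain', 'notes.txt'), "\n";
```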

Examples:

 [ 'text/x-perl', 'us-ascii', 'a /usr/bin/perl -w script text']

 [ 'text/x-mpegurl', 'utf-8', 'M3U playlist text', 
   [ 'text/plain', 'application/x-mpegurl']]

 [ 'application/x-tar+bzip2', 'binary', 
   "bzip2 compressed data, block size = 900k\nPOSIX tar archive (GNU)", ...]

AUTHOR

Juergen Weigert, <jnw at cpan.org>

BUGS

The implementation of mime is an ugly hack. We suffer from the existence of multiple file magic databases, and multiple conflicting implementations. With Perl we have at least 5 modules for this; here we use two.

The builtin list of MIME helpers is incomplete. Please submit your handler code.

Please report any bugs or feature requests to bug-file-unpack at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Unpack. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

RELATED MODULES

While designing File::Unpack, a range of other perl modules were examined. Many modules provide valuable service to File::Unpack and became dependencies or are recommended. Others exposed drawbacks during closer examination and may find some of their wheels re-invented here.

Used Modules

File::LibMagic

This is the preferred mimetype engine. It disregards the suffix, recognizes more types than any of the alternatives, and uses exactly the same engine as /usr/bin/file in openSUSE systems. It also returns charset and description information. We cross-reference the description with the mimetype to detect weaknesses, and consult File::MimeInfo::Magic and some logic of our own, e.g. for detecting LZMA compression which fails to provide any recognizable magic. Required if you use mime; otherwise not a hard requirement.

File::MimeInfo::Magic

Uses both magic information and file suffixes to determine the mimetype. Its magic() function is used in a few cases, where File::LibMagic fails. E.g. as of June 2010, libmagic does not recognize 'image/x-targa'. File::MimeInfo::Magic may be slower, but it features the shared-mime-info database from freedesktop.org. Recommended if you use mime.

String::ShellQuote

Used to call external MIME helpers. Required.

BSD::Resource

Used to reliably restrict the maximum file size. Recommended.

File::Path

mkpath(). Required.

Cwd

fast_abs_path(). Required.

JSON

Used for formatting the logfile. Required.

Modules Not Used

Archive::Extract

Archive::Extract tries first to determine what type of archive you are passing it, by inspecting its suffix. 'Maybe this module should use something like "File::Type" to determine the type, rather than blindly trust the suffix'. [quoted from perldoc]

Set $Archive::Extract::PREFER_BIN to 1, which will prefer the use of command line programs and won't consume so much memory. Default: use "Archive::Tar".

Archive::Zip

If you are just going to be extracting zips (and/or other archives) you are recommended to look at using Archive::Extract. [quoted from perldoc] It is pure Perl, so it's a lot slower than your '/usr/bin/zip'.

Archive::Tar

It is pure Perl, so it's a lot slower then your "/bin/tar". It is heavy on memory, all will be read into memory. [quoted from perldoc]

File::MMagic, File::MMagic::XS, File::Type

Compared to File::LibMagic and File::MimeInfo::Magic, these three are inferior. They often say 'text/plain' or 'application/octet-stream' where the latter two report useful mimetypes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc File::Unpack

You can also look for information at:

SOURCE REPOSITORY

http://search.cpan.org/search?query=File%3A%3AUnpack

http://github.com/jnweiger/perl-File-Unpack

git clone https://github.com/jnweiger/perl-File-Unpack.git

ACKNOWLEDGEMENTS

MIME type recognition relies heavily on libmagic by Christos Zoulas. I had long hesitated implementing File::Unpack, but set to work, when I discovered that File::LibMagic brings your library to perl. Thanks Christos. And thanks for tcsh too.

LICENSE AND COPYRIGHT

Copyright 2010,2011,2012,2013 Juergen Weigert.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.