NAME

File::Locate::Iterator -- read "locate" database with an iterator

SYNOPSIS

 use File::Locate::Iterator;
 my $it = File::Locate::Iterator->new;
 while (defined (my $entry = $it->next)) {
   print $entry,"\n";
 }

DESCRIPTION

File::Locate::Iterator reads a "locate" database file in iterator style. Each next() call on the iterator returns the next entry from the database.

    /
    /bin
    /bin/bash
    /bin/cat

Locate databases normally hold filename strings as a way of finding files by name faster than searching through all directories. Optional glob, suffix and regexp options on the iterator can restrict the entries returned.

Although it's called a database, the format is really just a long list of filenames with some "front coding" compression to save space. There's no random access and any search requires a scan through the file from the start. Generally this is still much faster than an equivalent traversal through the directory structure of an entire file system (find etc).
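To make the "front coding" idea concrete, here is a minimal pure-Perl sketch of a decoder for a LOCATE02 byte string, following the layout described in locatedb(5). It is only an illustration and not the code this module uses; the name decode_locate02 is made up for the example.

```perl
use strict;
use warnings;

# Illustrative only: decode a LOCATE02 byte string into a list of entries.
# decode_locate02() is a hypothetical helper, not part of this module.
sub decode_locate02 {
    my ($db) = @_;
    my $header = "\0LOCATE02\0";
    substr($db, 0, length($header)) eq $header
      or die "not a LOCATE02 database";
    my $pos = length($header);

    my @entries;
    my $prev   = '';   # the previous entry, in full
    my $shared = 0;    # how many leading bytes of $prev are reused

    while ($pos < length($db)) {
        # Signed one-byte change in the shared prefix length.
        my $delta = unpack 'c', substr($db, $pos, 1);
        $pos++;
        if ($delta == -128) {
            # 0x80 escape: a signed 16-bit big-endian change follows.
            $delta = unpack 's>', substr($db, $pos, 2);
            $pos += 2;
        }
        $shared += $delta;
        ($shared >= 0 && $shared <= length($prev))
          or die "bad shared prefix count";

        # The remainder of the entry runs up to the next NUL byte.
        my $end = index($db, "\0", $pos);
        $end >= 0 or die "unterminated entry";
        $prev = substr($prev, 0, $shared) . substr($db, $pos, $end - $pos);
        $pos  = $end + 1;
        push @entries, $prev;
    }
    return @entries;
}

print "$_\n" for decode_locate02("\0LOCATE02\0\0/hello\0\006/world\0");
# prints "/hello" then "/hello/world"
```

Each entry depends on the one before it, which is why a reader has to start from the beginning of the file.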

See examples/native.pl for a simple sample read, or examples/mini-locate.pl for a whole program like the real locate.

Only "LOCATE02" format files are supported, per current versions of GNU locate, not the previous "slocate" format.

Iterators from this module are stand-alone and don't need any of the Perl iterator frameworks. But see Iterator::Locate, Iterator::Simple::Locate and MooseX::Iterator::Locate to inter-operate with those others. Those frameworks include ways to grep, map and otherwise manipulate iterations.

Forks and Threads

If an iterator using a file handle is cloned to a new thread or to a process level fork() then generally it can be used by the parent or the child but not both. The underlying file descriptor position is shared by parent and child, so when one of them reads it will upset the position for the other. This sort of thing affects almost all code working with file handles across fork() and threads. Perhaps some thread CLONE code here could let threads work correctly (but slower), but a fork() is probably doomed.
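The shared file position can be seen without this module at all. The following sketch, which assumes a POSIX-style fork(), has a child read from an inherited handle and then checks where the parent's descriptor has ended up. The function name offset_after_child_read is made up for the example.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_CUR);
use File::Temp ();
use POSIX ();

# Illustrative only: return the parent's file offset after a forked child
# has read $nbytes from the same inherited descriptor.
sub offset_after_child_read {
    my ($nbytes) = @_;

    my $tmp = File::Temp->new;        # scratch file, removed on destroy
    print {$tmp} 'x' x 100;
    close $tmp or die "close: $!";

    open my $fh, '<:raw', $tmp->filename or die "open: $!";

    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: unbuffered read on the inherited descriptor.
        sysread $fh, my $buf, $nbytes;
        POSIX::_exit(0);              # skip END blocks and destructors
    }
    waitpid $pid, 0;

    # Parent: query the current offset without moving it.
    return sysseek($fh, 0, SEEK_CUR) + 0;   # "0 but true" becomes 0
}

print offset_after_child_read(10), "\n";
```

On a POSIX system the parent should see an offset of 10 even though it never read anything itself, which is the effect that upsets an iterator shared across a fork().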

Iterators using mmap work correctly for both forks and threads, except that the size calculation and sharing for if_sensible is not thread-aware beyond the mmaps existing when the thread is spawned. (File::Map knows the mmaps across all threads, but currently does not reveal them.)

Taint Mode

Under taint mode (see "Taint mode" in perlsec), strings read from a file or file handle are always tainted, the same as other file input. Taintedness of a database_str string propagates to the entry strings returned.

For database_str_ref, the initial taintedness of the database string propagates to the entries. If you untaint it during iteration then subsequent entries returned are still tainted because the front-coding of the database format means subsequent entries may still use data back from when the input was tainted. Perhaps entries should follow an untaint of the database string, but normally you'd expect an untaint to be worked out before beginning iteration. In all cases a rewind() will reset to the new taintedness of the database string.

For reference, taint mode is only a small slowdown for the XS iterator code, and usually (it seems) only a little more for the pure Perl.

Other Notes

The locate database format is only designed to be read forwards, hence no prev() method on the iterator. The start of a previous record can't be distinguished by its content, and the "front coding" means the state at a given point may depend on records an arbitrary distance back too. A "tell" which gave file position plus state would be possible, though perhaps some "clone" of the whole iterator would be more use.

On some systems, mmap() may be a bit too effective, giving a process more of the CPU than other processes which make periodic read() system calls. This is a matter of OS scheduling, but you might have to apply some nice or ionice if doing a lot of mmapped work (see nice(1), ionice(1), "setpriority" in perlfunc, and ioprio_set(2)).

FUNCTIONS

Constructor

$it = File::Locate::Iterator->new (key=>value,...)

Create and return a new locate database iterator object. The following optional key/value pairs can be given,

database_file (filename)
database_fh (handle ref)

The file to read, either as filename or file handle. The default file is the default_database_file() below.

    $it = File::Locate::Iterator->new
            (database_file => '/foo/bar.db');

A filehandle is read with the usual PerlIO so it can use layers and come from various sources, but it should be in binary mode (see "binmode" in perlfunc and ":raw" in PerlIO).

database_str (string)
database_str_ref (ref to string)

The database contents to read in the form of a byte string.

    $it = File::Locate::Iterator->new
      (database_str => "\0LOCATE02\0\0/hello\0\006/world\0");

database_str is copied in the iterator. database_str_ref can be used to act on a given scalar without copying,

    my $str = "\0LOCATE02\0\0/hello\0\006/world\0";
    $it = File::Locate::Iterator->new
            (database_str_ref => \$str);

For database_str_ref, if the underlying scalar is tied or has other magic then that is re-run for each access, in the usual way. This might be a good thing, or you might prefer the copying of database_str in that case.

suffix (string)
suffixes (arrayref of strings)
glob (string)
globs (arrayref of strings)
regexp (string or regexp object)
regexps (arrayref of strings or regexp objects)

Restrict the entries returned to those with given suffix(es) or matching the given glob(s) or regexp(s). For example,

    # C code files on the system, .c and .h
    $it = File::Locate::Iterator->new
            (suffixes => ['.c','.h']);

If multiple patterns or suffixes are given then entries matching any of them are returned.

Globs are in the style of the locate program which means fnmatch with no options (see File::FnMatch). The match is on the full entry if there are wildcards ("*", "?" or "["), or on any part if the pattern is a fixed string.

    glob => '*.c'  # .c files, no .cxx files
    glob => '.c'   # fixed str, foo.cxx matches too

Globs should be byte strings (not wide chars) since that's how the database entries are handled. fnmatch probably has no notion of charset coding for its strings and patterns.
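The wildcard versus fixed-string rule can be sketched in plain Perl as follows. This is a simplified illustration rather than the module's matching code: glob_matches_sketch is a made-up name, only "*" and "?" are translated, and bracket expressions and backslash escapes are not handled the way a real fnmatch would handle them.

```perl
use strict;
use warnings;

# Illustrative only: approximate the locate-style glob rule.
# glob_matches_sketch() is hypothetical and handles just "*" and "?".
sub glob_matches_sketch {
    my ($pattern, $entry) = @_;

    # No wildcard characters: a fixed string matches anywhere in the entry.
    unless ($pattern =~ /[*?\[]/) {
        return index($entry, $pattern) >= 0 ? 1 : 0;
    }

    # Wildcards present: the pattern must match the whole entry.
    # With no fnmatch options, "*" and "?" can match "/" too.
    my $re = '';
    for my $ch (split //, $pattern) {
        $re .= $ch eq '*' ? '.*'
             : $ch eq '?' ? '.'
             :              quotemeta($ch);   # "[" is taken literally here
    }
    return $entry =~ /\A$re\z/s ? 1 : 0;
}

for my $case (['*.c', '/src/foo.c'],
              ['*.c', '/src/foo.cxx'],
              ['.c',  '/src/foo.cxx']) {
    my ($pattern, $entry) = @$case;
    printf "%-4s %-13s %s\n", $pattern, $entry,
      glob_matches_sketch($pattern, $entry) ? 'match' : 'no match';
}
```

The three cases correspond to the two glob examples above: '*.c' accepts foo.c and rejects foo.cxx, while the fixed string '.c' accepts foo.cxx.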

use_mmap (string, default "if_sensible")

Whether to use mmap() to access the database. mmap is fast and resource-efficient, when available. To use mmap you must have the File::Map module (version 0.38 or higher), the file must fit in available address space, and a database_fh handle must not have any transforming PerlIO layers. The use_mmap choices are

    undef           \
    "default"       | use mmap if sensible
    "if_sensible"   /
    "if_possible"   use mmap if possible, otherwise file I/O
    0               don't use mmap
    1               must use mmap, croak if cannot
    

Settings default, undef or omitted mean if_sensible. if_sensible uses mmap if available, and the file size is reasonable, and for database_fh if it isn't already using an :mmap layer. if_possible uses mmap whenever it can be done, without those qualifiers.

    $it = File::Locate::Iterator->new
            (use_mmap => 'if_possible');

If multiple iterators access the same file then they share the mmap. The size check for if_sensible counts space in all distinct File::Locate::Iterator mappings and won't go beyond 1/5 of available data space. Data space is assumed to be a quarter of the address space given by the word size, so on a 32-bit system the mmaps total at most about 200 MB.
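As a worked version of that arithmetic, the sketch below assumes "data space" means a quarter of 2**bits for a given word size, and takes a fifth of that. The name mmap_limit_sketch is made up for the example and the module's real calculation may differ in detail.

```perl
use strict;
use warnings;

# Illustrative only: 1/5 of an assumed data space of (2**bits)/4 bytes.
# mmap_limit_sketch() is a hypothetical name, not part of this module.
sub mmap_limit_sketch {
    my ($bits) = @_;
    my $data_space = 2 ** $bits / 4;    # a quarter of the address space
    return int($data_space / 5);        # a fifth of that for mmaps
}

my $limit = mmap_limit_sketch(32);
printf "%d bytes, about %.1f MiB\n", $limit, $limit / 2**20;
# prints "214748364 bytes, about 204.8 MiB"
```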

if_possible and if_sensible both only mmap ordinary files because generally the file size on char specials is not reliable.

$filename = File::Locate::Iterator->default_database_file()

Return the default database file used for new above. This is meant to be the same as the locate program uses and currently means

    $ENV{'LOCATE_PATH'}            if that env var set
    /var/cache/locate/locatedb     otherwise

Perhaps in the future it might be possible to check how findutils has been installed rather than assuming /var/cache/locate/.
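The lookup order amounts to something like the following sketch. Whether the real method tests the variable for definedness or for a non-empty value is an assumption here, and default_database_file_sketch is a made-up name.

```perl
use strict;
use warnings;

# Illustrative only: LOCATE_PATH if set, else the conventional path.
# Assumes "set" means defined; the module itself may test differently.
sub default_database_file_sketch {
    return defined $ENV{LOCATE_PATH}
      ? $ENV{LOCATE_PATH}
      : '/var/cache/locate/locatedb';
}

print default_database_file_sketch(), "\n";
```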

Operations

$entry = $it->next()

Return the next entry from the database, or no values at end of file. No values means undef in scalar context or an empty list in list context, so you can loop with either

    while (defined (my $filename = $it->next)) ...

or

    while (my ($filename) = $it->next) ...

The return is a byte string since it's normally a filename and Perl handles filenames as byte strings.
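The reason for the defined() test or the list assignment in those loops is that an entry can be a false-valued string such as "0". The sketch below shows both loop forms against a stand-in iterator; make_iterator_sketch is a hypothetical closure used only so the example runs without a database, and it imitates just the return convention of next().

```perl
use strict;
use warnings;

# Illustrative only: a stand-in with the same end-of-data convention
# as next(), namely undef in scalar context and () in list context.
sub make_iterator_sketch {
    my @entries = @_;
    return sub {
        return unless @entries;     # bare return: undef or empty list
        return shift @entries;
    };
}

my @names = ('/', '0', '/bin');     # "0" is a valid entry but false

# Scalar context: defined() keeps the loop going past "0".
my $next = make_iterator_sketch(@names);
my @scalar_style;
while (defined (my $name = $next->())) {
    push @scalar_style, $name;
}

# List context: the assignment is true while it receives one element.
$next = make_iterator_sketch(@names);
my @list_style;
while (my ($name) = $next->()) {
    push @list_style, $name;
}

print scalar(@scalar_style), " ", scalar(@list_style), "\n";
# prints "3 3"
```

A plain "while (my $name = $it->next)" would stop early at the "0" entry, which is why neither recommended form uses it.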

$it->rewind()

Rewind $it back to the start of the database. The next $it->next call will return the first entry.

This is only possible when the underlying database file or handle is seekable, i.e. seek() works. This means a plain file, a seekable char special, or PerlIO layers with seek support.

ENVIRONMENT VARIABLES

LOCATE_PATH

Default locate database.

FILES

/var/cache/locate/locatedb

Default locate database, if LOCATE_PATH environment variable not set.

OTHER WAYS TO DO IT

File::Locate reads a locate database with callbacks instead. Whether you want callbacks or an iterator is generally a matter of personal preference. Iterators let you write your own loop, and can have multiple searches in progress simultaneously.

The speed of an iterator is about the same as callbacks when File::Locate::Iterator is built with its XS code.

Iterators are good for cooperative coroutining like POE or Gtk where state must be held in some sort of variable to be progressed by calls from the main loop. Note that next() will block on reading from the database, so the database should generally be a plain file rather than a socket or something, so as not to hold up a main loop.

If you have the recommended File::Map module then iterators share an mmap() of the database file. Otherwise the database file is a separate open handle in each iterator, meaning a file descriptor and PerlIO buffering each. Sharing a handle and having each seek to its desired position would be possible, but a seek drops buffered data so would be slower. Maybe some hairy PerlIO or IO::Handle trickery could transparently share an fd and keep buffered blocks from multiple file positions.

SEE ALSO

Iterator::Locate, Iterator::Simple::Locate, MooseX::Iterator::Locate

File::Locate, locate(1), locatedb(5), GNU Findutils manual, File::FnMatch, File::Map

HOME PAGE

http://user42.tuxfamily.org/file-locate-iterator/index.html

COPYRIGHT

Copyright 2009, 2010, 2011, 2012, 2013, 2014, 2017, 2018, 2019 Kevin Ryde

File-Locate-Iterator is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3, or (at your option) any later version.

File-Locate-Iterator is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with File-Locate-Iterator. If not, see http://www.gnu.org/licenses/