NAME

DataStore::CAS::FS - Virtual Filesystem backed by Content-Addressable Storage

VERSION

version 0.0100

SYNOPSIS

  # Create a new empty filesystem
  my $casfs= DataStore::CAS::FS->new(
    store => DataStore::CAS::Simple->new(
      path => './foo/bar',
      create => 1,
      digest => 'SHA-256'
    )
  );
  
  # Open an existing root directory on an existing store
  $casfs= DataStore::CAS::FS->new( store => $cas, root_dir => $digest_hash );
  
  # --- These pass through to the $cas module
  
  $hash= $casfs->put("Blah"); 
  $hash= $casfs->put_file("./foo/bar/baz");
  $file= $casfs->get($hash);
  
  # Open a path within the filesystem
  $handle= $casfs->path('1','2','3','myfile')->open;
  
  # Make some changes
  $casfs->apply_path(['1', '2', 'myfile'], { ref => $some_new_file });
  $casfs->apply_path(['1', '2', 'myfile_copy'], { ref => $some_new_file });
  # Commit them
  $casfs->commit();

DESCRIPTION

DataStore::CAS::FS extends the content-addressable API to support directory objects which let you store traditional file hierarchies in the CAS and look up files by path name (so long as you know the hash of the root).

The methods provided allow you to traverse the virtual directory hierarchy, make changes to it, and commit the changes to create a new filesystem snapshot. The DataStore::CAS backend provides readable and seekable file handles. There is *not* any support for access control, since those concepts are system dependent. The module DataStore::CAS::FS::Fuse (not yet written) will have an implementation of permission checking appropriate for Unix.

The directories can contain arbitrary metadata, making them suitable for backing up filesystems from Unix, Windows, or other environments. You can also pick directory encoding plugins to more efficiently encode just the metadata you care about.

Each directory is serialized into a file which is stored in the CAS like any other, resulting in a very clean implementation. You cannot determine whether a file is a directory or not without the context of the containing directory, and you need to know the digest hash of the root directory in order to browse the full filesystem. On the up side, you can store any number of filesystems in one CAS by maintaining a list of roots.

The root's digest hash is affected by all the content of the entire tree, so the root hash will change each time you alter any directory in the tree. But, any unchanged files in that tree will be re-used, since they still have the same digest hash. You can see great applications of this design in a number of version control systems, notably Git.

ATTRIBUTES

store

Read-only. An instance of a class implementing DataStore::CAS.

root_entry

A DataStore::CAS::FS::DirEnt object describing the root of the tree. Must be of type "dir". Should have a name of "", though this is not required. You can pick an arbitrary directory for a chroot-like effect, but beware of broken symlinks.

root_entry refers to an **immutable** directory. If you make in-memory overrides to the filesystem using apply_path or the various convenience methods, root_entry will continue to refer to the original static filesystem. If you then commit() those changes, root_entry will be updated to refer to the new filesystem.

You can create a list of filesystem snapshots by saving a copy of root_entry each time you call commit(). They will all continue to exist within the CAS. Cleaning up the CAS is left as an exercise for the reader. (though utility methods to help with this are in the works)
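
For example, a minimal sketch of keeping such a snapshot list (the @snapshots array, the loop, and the DirEnt fields shown are purely illustrative, not part of the API):

  my @snapshots;
  for my $step (1..3) {
    $casfs->set_path( ['', 'backups', "file$step"], { ref => $casfs->put("data $step") } );
    $casfs->commit();                       # write the new directories to the CAS
    push @snapshots, $casfs->root_entry;    # root_entry now refers to the new tree
  }
  
  # Re-open an older snapshot later:
  my $old= DataStore::CAS::FS->new( store => $casfs->store, root_entry => $snapshots[0] );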

case_insensitive

Read-only. Defaults to false. If set to true in the constructor, this causes all directory entries to be compared in a case-insensitive manner, and all directory objects to be loaded with case-insensitive lookup indexes.

hash_of_null

Read-only. Passes through to store->hash_of_null

hash_of_empty_dir

This returns the canonical digest hash for an empty directory. In other words, the return value of

  put_scalar( DataStore::CAS::FS::DirCodec::Minimal->encode([],{}) ).

This value is cached for performance.

It is possible to encode empty directories with any plugin, so not all empty directories will have this key, but any time the library knows it is writing an empty directory, it will use this value instead of recalculating the hash of an empty dir.

dir_cache

Read-only. A DataStore::CAS::FS::DirCache object which holds onto recently used directory objects. This object can be used in multiple CAS::FS objects to make the most of the cache.

METHODS

new

  $fs= $class->new( %args | \%args )

Parameters:

store - required

An instance of (a subclass of) DataStore::CAS

root_entry - required

An instance of DataStore::CAS::FS::DirEnt, or a hashref of DirEnt fields, or an empty hashref if you want to start from an empty filesystem, or a DataStore::CAS::FS::Dir which you want to be the root directory, or a DataStore::CAS::File object that contains a serialized Dir, or the digest hash of such a File within the store.

root - alias for root_entry
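
For example, a brief sketch of two common forms (assuming $cas is an existing DataStore::CAS store and $digest is the hash of a serialized directory):

  # Start from an empty filesystem
  my $fs= DataStore::CAS::FS->new( store => $cas, root_entry => {} );
  
  # Open an existing filesystem by the digest hash of its root directory
  my $fs2= DataStore::CAS::FS->new( store => $cas, root_entry => $digest );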

get

Alias for store->get

get_dir

  $dir= $fs->get_dir( $digest_hash_or_File, \%optional_flags );

This returns a de-serialized directory object found by its hash. It is a shorthand for calling 'get' on the Store and then deserializing enough of the result to create a usable Dir object (or subclass).

Also, this method caches recently used directory objects, since they are immutable. (but woe to those who break the API and modify their directory objects!)

Returns undef if the digest hash isn't in the store, but dies if an error occurs while decoding one that exists.
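
A short sketch of the undef-vs-die behavior described above ($digest is assumed to be the hash of a serialized directory):

  my $dir= $fs->get_dir($digest);
  if (defined $dir) {
    # a (possibly cached) directory object; treat it as immutable
  }
  else {
    # $digest is not present in the store
  }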

put

Alias for store->put

put_scalar

Alias for store->put_scalar

put_file

Alias for store->put_file

put_handle

Alias for store->put_handle

validate

Alias for store->validate

path

  $path= $fs->path( @path_names )

Returns a DataStore::CAS::FS::Path object which provides friendly object-oriented access to several other methods of CAS::FS. This object does *nothing* other than curry parameters, for your convenience. In particular, the path isn't resolved until you try to use it, and might not be valid.

See "resolve_path" for notes about @path_names. Especially note that your path needs to start with the volume name, which will usually be ''. Note that you get this already if you take an absolute path and pass it to File::Spec->splitdir.

resolve_path

  $path_array= $fs->resolve_path( \@path_names, \%optional_flags )
  $path_array= $fs->resolve_path( $path_string, \%optional_flags )

Returns an arrayref of DataStore::CAS::FS::DirEnt objects corresponding to the canonical absolute specified path, starting with the root_entry.

First, a note on @path_names: you need to specify the volume, which for UNIX is the empty string ''. While volumes might seem like an unnecessary concept, and I wasn't originally going to include them in my design, they helped in two major ways: they allow us to store a regular ::DirEnt for the root directory (which is useful for things like permissions and timestamps), and they allow us to record general metadata for the filesystem as a whole, within the ->metadata of the volume_dir. As a side benefit, Windows users might appreciate being able to save backups of multiple volumes in a way that preserves their view of the system. As another side benefit, it is compatible with File::Spec->splitdir.

Next, a note on resolving paths: This function will follow symlinks in much the same way Linux does. If the path you specify ends with a symlink, the result will be a DirEnt describing the symlink. If the path you specify ends with a symlink and a "" (equivalent of ending with a '/'), the symlink will be resolved to a DirEnt for the target file or directory. (and if it doesn't exist, you get an error)

Also, it's worth noting that the directory objects in DataStore::CAS::FS are strictly a tree, with no back-reference to the parent directory. So, ".." in the path will be resolved by removing one element from the path. HOWEVER, this still gives you a kernel-style resolve (rather than a shell-style resolve), because if you specify "/1/foo/.." and foo is a symlink to "/1/2/3", the ".." will back you up to "/1/2/" and not "/1/".

The tree-with-no-parent-reference design is also why we return an array of the entire path, since you can't take a final directory and trace it backwards.

If the path does not exist, or cannot be resolved for some reason, this method will either die or return undef, based on whether you provided the optional 'no_die' flag.

Flags:

no_die => $bool

Return undef instead of dying

error_out => \$err_variable

If set to a scalar-ref, the scalar ref will receive the error message, if any. You probably want to set 'no_die' as well.

partial => $bool

If the path doesn't exist, any missing directories will be given placeholder DirEnt objects. You can test whether the path was resolved completely by checking whether $result->[-1]->type is defined.

mkdir => 1 || 2

If mkdir is 1, missing directories will be created on demand.

If mkdir is 2,
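
Putting the flags together, a minimal sketch of non-fatal resolution (the error handling shown is illustrative):

  my $err;
  my $ents= $fs->resolve_path( ['', 'home', 'user', 'missing.txt'],
                               { no_die => 1, error_out => \$err } );
  if (defined $ents) {
    my $final= $ents->[-1];    # DirEnt of the last path element
  }
  else {
    warn "could not resolve path: $err";
  }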

set_path

  $fs->set_path( \@path, $Dir_Entry, \%optional_flags )
  # always returns '1'

Temporarily override a directory entry at @path. If $Dir_Entry is false, this will cause @path to be unlinked. If the name of Dir_Entry differs from the final component of @path, it will act like a rename (which is the same as unlinking the old path and creating the new path). If Dir_Entry is missing a name, it will default to the final element of @path.

The path may be either an arrayref of names, or a string which will be split by File::Spec.

$Dir_Entry is either an instance of DataStore::CAS::FS::DirEnt, or a hashref of the fields to create one.

No fields of the old dir entry are used; if you want to preserve some of them, you need to do that yourself (see clone) or use the update_path() method.

If @path refers to nonexistent directories, they will be created as with a virtual "mkdir -p", and receive the default metadata of $flags{default_dir_fields} (by default, nothing). If @path travels through a non-directory (aside from symlinks, unless $flags{follow_symlinks} is set to 0), this will throw an exception, unless you specify $flags{force_create}, which causes the offending directory entry to be overwritten by a new subdirectory.

Note in particular that if you specify

  apply_path( "/a_symlink/foo", $Dir_Entry, { follow_symlinks => 0, force_create => 1 })

"a_symlink" will be deleted and replaced with an actual directory.

None of the changes from apply_path are committed to the CAS until you call commit(). Also, root_entry does not change until you call commit(), though the root entry shown by "resolve_path" does.

You can return to the last committed state by calling rollback(), which is conceptually equivalent to $fs= DataStore::CAS::FS->new( store => $fs->store, root_entry => $fs->root_entry ).
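
As a sketch of that override-then-commit workflow (the DirEnt fields shown are illustrative):

  # Stage an override: point 'report.txt' at content already stored in the CAS
  my $hash= $fs->put_file('./report.txt');
  $fs->set_path( ['', 'docs', 'report.txt'], { type => 'file', ref => $hash } );
  
  # The override is visible via resolve_path, but root_entry is unchanged
  # until commit() writes the new directories and updates root_entry:
  $fs->commit();
  
  # ...or discard the staged overrides instead:
  # $fs->rollback();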

update_path

  $fs->update_path( \@path, \%changes, \%optional_flags )
  $fs->update_path( \@path, \@changes, \%optional_flags )

Like "set_path", but it applies a hashref (or arrayref) of $changes to the directory entry which exists at the named path. Use this to update a few attributes of a directory entry without overwriting the entire thing.

mkdir

  $fs->mkdir( \@path )

Convenience method to create an empty directory at path.

touch

  $fs->touch( \@path )

Convenience method to update the timestamp of the directory entry at path, possibly creating it (as an empty file).

unlink

  $fs->unlink( \@path )

Convenience method to remove the directory entry at path.

rmdir

Alias for unlink

rollback

  $fs->rollback();

Revert the FS to the state of the last commit, or the initial state.

This basically just discards all the in-memory overrides created with "apply_path" or its various convenience methods.

commit

  $fs->commit();

Merge all in-memory overrides from "apply_path" with the directories they override to create new directories, and store those new directories in the CAS.

After this operation, the root_entry will be changed to reflect the new tree.

PATH OBJECTS

path_names

Arrayref of path parts

path_ents

Arrayref of DirEnt objects resolved from the path_names. Lazy-built, so it might die when accessed.

filesystem

Reference to the DataStore::CAS::FS it was created by.

path_name_list

Convenience list accessor for path_names arrayref

path_ent_list

Convenience list accessor for path_ents arrayref

final_ent

Convenience accessor for final element of path_ents

type

Convenience accessor for the type field of the final element of path_ents

resolve

  $path->resolve()

Call "resolve_path" for path_names, and cache the result in the path_ents attribute. Also returns path_ents.

path

  $path->path( \@sub_path )

Get a sub-path from this path. Returns another Path object.

file

  $file= $path->file();

Returns the DataStore::CAS::File of the final element of path_ents, or dies trying.

open

  $handle= $path->open

Alias for $path->file->open
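
Tying these together, a short usage sketch of a Path object (the file names shown are illustrative):

  my $path= $fs->path('', 'docs', 'report.txt');
  
  $path->resolve;                   # dies here if the path cannot be resolved
  print $path->type, "\n";          # e.g. 'file'
  
  my $handle= $path->open;          # same as $path->file->open
  my $data= do { local $/; <$handle> };
  
  my $subdir= $fs->path('', 'docs')->path(['notes', '2013']);   # another Path object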

DIRECTORY CACHE

Directories are uniquely identified by their hash, and directory objects are immutable. This creates a perfect opportunity for caching recent directories and reusing the objects.

When you call $fs->get_dir($hash), $fs keeps a weak reference to that directory which will persist until the directory object is garbage collected. It will ALSO hold a strong reference to that directory for the next N calls to $fs->get_dir($hash), where the default is 64. You can change how many references $fs holds by setting $fs->dir_cache->size(N).

The directory cache is *not* global, and a fresh one is created during the constructor of the FS, if needed. However, many FS instances can share the same dir_cache object, and FS methods that return a new FS instance will pass the old dir_cache object to the new instance.

If you want to implement your own dir_cache, don't bother subclassing the built-in one; just create an object that meets this API (a minimal sketch follows the method list below):

new

  $cache= $class->new( %fields )
  $cache= $class->new( \%fields )

Create a new cache object. The only public field is size.

size

Read/write accessor that returns the number of strong-references it will hold.

clear

Clear all strong references and clear the weak-reference index.

get

  $cached_dir= $cache->get( $digest_hash )

Return a cached directory, or undef if that directory has not been cached.

put

  $dir= $cache->put( $dir )

Cache the Dir object (and return it)
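
As a rough illustration, a minimal class meeting the API above might look like the following sketch (the ring-buffer eviction strategy is an assumption, as is the ->hash accessor used to obtain a Dir object's digest):

  package My::DirCache;
  use strict;
  use warnings;
  use Scalar::Util 'weaken';
  
  sub new {
    my $class= shift;
    my %fields= (@_ == 1 && ref $_[0] eq 'HASH')? %{$_[0]} : @_;
    bless { size => 64, %fields, by_hash => {}, recent => [], pos => 0 }, $class;
  }
  
  # Read/write accessor for the number of strong references to hold
  sub size {
    my $self= shift;
    if (@_) { $self->{size}= shift; $self->{recent}= []; $self->{pos}= 0; }
    $self->{size};
  }
  
  # Drop the strong references and the weak-reference index
  sub clear {
    my $self= shift;
    %{ $self->{by_hash} }= ();
    @{ $self->{recent} }= ();
    $self->{pos}= 0;
  }
  
  # Return a cached Dir, or undef if never cached (or already garbage collected)
  sub get {
    my ($self, $digest_hash)= @_;
    $self->{by_hash}{$digest_hash};
  }
  
  # Cache a Dir object and return it
  sub put {
    my ($self, $dir)= @_;
    my $key= $dir->hash;                         # assumes the Dir exposes its digest hash this way
    $self->{by_hash}{$key}= $dir;
    weaken( $self->{by_hash}{$key} );            # weak ref: lives as long as anyone else uses it
    $self->{recent}[ $self->{pos}++ % $self->{size} ]= $dir
      if $self->{size};                          # strong ref held for the next 'size' insertions
    return $dir;
  }
  
  1;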

UNICODE vs. FILENAMES

Background

Unix operates on the philosophy that filenames are just bytes. Much of Unix userspace operates on the philosophy that these bytes should probably be valid UTF-8 sequences (but of course, nothing enforces that). Other operating systems, like modern Windows, operate on the idea that everything is Unicode, with some backward-compatible APIs which can represent the Unicode as Latin-1 or whatnot on a best-effort basis. I think the "Unicode everywhere" philosophy is arguably the better way to go, but as this tool is primarily designed with Unix in mind, and since it is intended for saving backups of real filesystems, it needs to be able to accurately store exactly what it finds in the filesystem. Essentially this means it needs to be *able* to store invalid UTF-8 sequences, -or- encode the octets as unicode codepoints up to 0xFF and later know to write them out to the filesystem as octets instead of UTF-8.

Use Cases

The primary concern is the user's experience when using this module. While Perl has decent support for Unicode, it requires all filenames to be strings of bytes (i.e. strings with the unicode flag turned off). Any time you pass a unicode string to a Perl function like open() or rename(), perl converts it to a UTF-8 string of octets before performing the operation. This gives you the desired result on Unix. Unfortunately, Perl on Windows doesn't fare so well, because it uses Windows' non-unicode API. Reading filenames with non-latin1 characters returns garbage, and creating files from unicode strings containing non-latin1 characters creates garbled filenames. To properly handle unicode outside of latin1 on Windows, you must avoid the Perl built-ins and tap directly into the wide-character Windows API.

This creates a dilemma: should filenames be passed around the DataStore::CAS::FS API as unicode, or octets, or some auto-detecting mix? The dilemma is further complicated because users of the library might not have read this section of the documentation, and it would be nice if The Right Thing happened by default.

Imagine a scenario where a user has a directory named "\xDC" (U with an umlaut in latin-1) and another directory named "\xC3\x9C" (U with an umlaut in UTF-8). "readdir" will report these as the strings I've just written, with the unicode flag off. Modern Unix will render the first as a "?" and the other as the U with umlaut, because it expects UTF-8 in the filesystem.

If you have the perl string "\xDC" with the UTF-8 flag off, and you try creating that file, it will create the file named "\xDC". However, if you have that same logical string with the UTF-8 flag on, it will become the file named "\xC3\x9C"!

If a user is *unaware* of unicode issues, it might be better to pass around strings of octets. Example: the user is in "/home/\xC3\x9C", and calls "Cwd". They get the string of octets "/home/\xC3\x9C". They then concatenate this string with the unicode "\x{1234}". Perl combines the two as "/home/\x{C3}\x{9C}/\x{1234}", and the C3 and 9C just silently went from octets to unicode codepoints. When the user tries opening the file, it surprises them with "No such file or directory", because it actually tried opening "/home/\xC3\x83\xC2\x9C/\xE1\x88\xB4".
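
A tiny sketch of that silent upgrade (using the same byte values as above):

  my $cwd=  "/home/\xC3\x9C";       # octets from Cwd, UTF-8 flag off
  my $name= "\x{1234}";             # a wide character, UTF-8 flag on
  my $file= "$cwd/$name";           # \xC3 and \x9C are upgraded to codepoints U+00C3 and U+009C
  open my $fh, '<', $file or die;   # the kernel is asked for
                                    # "/home/\xC3\x83\xC2\x9C/\xE1\x88\xB4" -- not what was intended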

On Windows, perl is just generally broken for high-unicode filenames. Pure-ascii works fine, but ascii is a non-issue either way. Those who need unicode support will have found it from other modules, and be looking for this section of documentation.

Interesting reading for Windows: http://www.perlmonks.org/?node_id=526169

However, all this conjecture assumes a person is trying to read and write virtual items out to their filesystem. Since this module also provides that, maybe people will use the ready-built implementation and this is a non-issue.

Storage Formats

The storage format is supposed to be platform-independent. JSON seems like a good default encoding; however, it requires strings to be in Unicode. When you encode a mix of unicode and octet strings, Perl's unicode flag is lost, and when reading them back you can't tell which were which. This means that if you take a unicode-as-octets filename, encode it with JSON, and decode it again, perl will mangle it when you attempt to open the file, and the open will fail. It also means that unicode-as-octets filenames take extra bytes to encode.

The other option is to use a plain unicode string where possible, but names which are not valid UTF-8 are encoded as structures which can be restored when decoding the JSON.

Conclusion

In the end, I came up with a module called DataStore::CAS::FS::InvalidUTF8. It takes a filename in native encoding, and tries to parse it as UTF-8. If it succeeds, it returns the string. If it fails, it returns the string wrapped by InvalidUTF8, with special concatenation and comparison operators.
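
The idea can be sketched roughly like this (a simplified illustration, not the module's actual code; wrap_invalid() is a hypothetical stand-in for the InvalidUTF8 wrapper):

  sub name_from_octets {
    my $octets= shift;
    my $chars= $octets;
    return $chars if utf8::decode($chars);   # valid UTF-8: return the decoded unicode string
    return wrap_invalid($octets);            # otherwise wrap the raw octets (hypothetical helper)
  }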

The directory coders are written to properly save and restore these objects.

The scanner for Windows platforms will read the UTF-16 from the Windows API, and convert it to UTF-8 to match the behavior on Unix. The Extractor on Windows will reverse this process. Extracting files with invalid UTF-8 on Windows will fail.

The default storage format uses a Unicode-only format, with a special notation to represent strings which are not unicode (see TO_JSON in InvalidUTF8). Other formats (Minimal and Unix) always store octets, and then re-detect UTF-8 when decoding the directory.

SEE ALSO

Brackup - A similar-minded backup utility written in Perl, but without the separation between library and application and with limited FUSE performance.

http://git-scm.com - The world-famous version control tool

http://www.fossil-scm.org - A similar but lesser known version control tool

https://github.com/apenwarr/bup - A fantastic idea for a backup tool, which operates on top of git packfiles, but has some glaring misfeatures that make it unsuitable for general purpose use. (doesn't save metadata? no way to purge old backups??)

http://rdiff-backup.nongnu.org/ - A popular incremental backup tool that works great on the small scale but fails badly at large-scale production usage. (exit 0 sometimes even when the backup fails? chance of leaving the backup in a permanently broken state if interrupted? record deleted files... with files, causing spool directory backups to contain 600,000 files in one directory? nothing to optimize the case where a user renames a dir with 20GB of data in it?)

AUTHOR

Michael Conrad <mconrad@intellitree.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Michael Conrad, and IntelliTree Solutions llc.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.