The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Crypt::IDA - Michael Rabin's Information Dispersal Algorithm

SYNOPSIS

  use Crypt::IDA ":default";

  $source=fill_from_string ($string,$align);
  $source=fill_from_fh     ($fh,$align,$offset);
  $source=fill_from_file   ($filename,$align,$offset);
  $sink=empty_to_string    (\$string);
  $sink=empty_to_fh        ($fh,$offset);
  $sink=empty_to_file      ($filename,$mode,$offset);
  ($key,$mat,$bytes) = ida_split ( ... );
  $bytes             = ida_combine ( ... );

DESCRIPTION

This module splits a secret into one or more "shares" which have the property that if a certain number of shares (the "quorum" or "threshold") are presented in the combine step, the secret can be recovered. The algorithm should be cryptographically secure in the sense that if fewer shares than the quorum are presented, no information about the secret is revealed.

EXPORT

No methods are exported by default. All methods may be called by prefixing the method names with the module name, eg:

 $source=Crypt::IDA::fill_from_string($string,$align)

Alternatively, routines can be exported by adding ":default" to the "use" line, in which case the routine names do not need to be prefixed with the module name, ie:

  use Crypt::IDA ":default";
  
  $source=fill_from_string ($string,$align);
  # ...

Some extra ancillary routines can also be exported with the ":extras" (just the extras) or ":all" (":extras" plus ":default") parameters to the use line. See the section "ANCILLARY ROUTINES" for details.

GENERAL OPERATION

Naming conventions

The following variable names will be used throughout this documentation:

  $n        number of shares to create
  $k        number of shares needed to combine (ie, the "quorum")
  $w        width (treat input/output as being 1, 2 or 4-byte words)
  $align    null-pad input up a multiple of $align bytes
  $key      key parameter used to create transform matrix
  $mat      transform matrix/inverse transform matrix

Stream-processing design pattern

Rather than implement separate ida_split_string or ida_split_file methods, a more flexible design which decouples the details of the input and output streams from the actual processing is used. In terms of design patterns, the ida_split and ida_combine routines operate as stream processors, and use user-supplied callbacks to read some bytes of input or write some bytes of output. Therefore, the split and combine tasks break down into three steps:

1. Set up fill handler(s)
2. Set up empty handler(s)
3. Call processing routine (ida_split/ida_combine)

For the ida_split routine, a single "fill" handler is required, while one or more "empty" handlers are required (one for each share to be output). For the ida_combine routine, the opposite is true: one or more "fill" handlers are required (corresponding to the shares to be combined), while a single "empty" handler (for the recombined secret) is required.

The fill_from_* and empty_to_* routines create callbacks which can be used by ida_split/ida_combine. Routines are provided for creating callbacks given a string, filename or open file handle. Custom callbacks (such as for reading/writing on a network socket connection or an IPC message queue) can also be written quite easily. See the section "Writing Custom Callbacks" for more details.

Keys and transform matrices

Both the split and combine algorithms operate by performing matrix multiplication over the input. In the case of the split operation, the transform matrix has n rows (n=number of shares) and k columns (k=quorum), while in the case of combine operation, the transform matrix has k rows and k columns. Either operation is described simply as the matrix multiplication:

 transform x input = output

The input matrix always has k rows, while the output matrix will have n rows for the split operation, or k rows for the combine operation. Both input and output matrices will have a number of columns related to the length of the secret being split/combined. (Actually, in this implementation, the number of columns in the input/output matrix is unimportant, since the matrices are treated as circular buffers with columns being reused when necessary, but the general idea still stands).

The transform matrix must have the property that any subset of k rows represent linearly independent basis vectors. If this is not the case then the transform cannot be reversed. This need not be a concern for most users of this module, since if the ida_split routine is not provided a transform matrix parameter, it will generate one of the appropriate form, which guarantees (in the mathematical sense) that the process is reversible. However, understanding the requirement of linear independence is important in case a user-supplied matrix is provided, and also for understanding the "key" parameter to the split/combine routines. A "key" is defined as a list of field elements (ie, 8-bit, 16-bit or 32-bit values):

 x , x ,  ... , x , y , y , ... , y
  1   2          n   1   2         k

whose values must all be distinct. If a key is supplied to the split routine, these values are used to create a Cauchy-form transform matrix:

              k columns
 
 |     1        1             1     |
 |  -------  -------  ...  -------  |
 |  x1 + y1  x1 + y2       x1 + yk  |
 |                                  |
 |     1        1             1     |
 |  -------  -------  ...  -------  |
 |  x2 + y1  x2 + y2       x2 + yk  |  n rows
 |                                  |
 |     :        :      :      :     |
 |                                  |
 |     1        1             1     |
 |  -------  -------  ...  -------  |
 |  xn + y1  xn + y2       xn + yk  |

This matrix has the desired property that any subset of k rows are linearly independent, which leads to the property that the transform can be reversed (via being able to create an inverse matrix from those k rows). The actual requirements for a Cauchy-form matrix are slightly more complicated than saying that all xi, yi values be distinct, but they reduce to exactly that for Galois Fields.

If the same key is supplied to the combine routine (along with a list of rows to be created), the appropriate submatrix is generated and inverted. This inverted matrix is then used as the transform matrix for the combine algorithm.

For more information, see the sections for "KEY MANAGEMENT", "SPLIT OPERATION", "COMBINE OPERATION" and the "SEE ALSO" section.

FILL/EMPTY CALLBACKS

"Fill" Callbacks

Fill callbacks are created by a call to one of:

 $source=fill_from_string ($string,$align);
 $source=fill_from_fh     ($fh,$align,$offset);
 $source=fill_from_file   ($filename,$align,$offset);

The $string, $fh and $filename parameters should be self-explanatory. The $offset parameter, where supplied, specifies an offset to seek to in the file/file handle before any reading takes place. The $offset parameter may be omitted, in which case it defaults to 0, i.e., the start of the file.

The $align parameter specifies that the input is to be null-padded up to a multiple of $align bytes, should it not already be so-aligned. The ida_split routine requires that the input be a multiple of $k * $w bytes in length, so this is the usual value to pass for the $align parameter. If $align is not specified, it defaults to 1, meaning the input is byte-aligned, and no padding bytes will be placed at the end of the file/string. Also note that if an $align parameter is used in conjunction with the $offset parameter, that the input will be aligned to be a multiple of $align bytes starting from $offset, and not from the start of the file.

Also, be aware that if the input secret needs to be padded before calling ida_split that you will need to have some way of removing those padding bytes after recovering the secret with ida_combine. This can be accomplished by removing null bytes from the end of the secret (provided trailing nulls in the original secret are prohibited), or (as is preferable for splitting/combining binary files), by recording the original (unpadded) size of the secret and truncating the reconstituted secret down to that size after calling ida_combine.

"Empty" Callbacks

The following routines are available for creating "empty" callbacks:

  $sink=empty_to_string    (\$string);
  $sink=empty_to_fh        ($fh,$offset);
  $sink=empty_to_file      ($filename,$perm,$offset);

All parameters with the same name are the same as in the "fill" callbacks. Additionally, note that:

  • empty_to_string requires a reference to a string variable (to enable the callback to modify the string);

  • empty handlers do not pad the output stream, so they don't need an $align parameter; and

  • empty_to_file takes a $perm (file permission) parameter, which defaults to 0644 if not specified.

As with the fill handler routines, these routines return a hash reference (which contains the actual callback) if successful, or undef if there was a problem (in which case file-related problems will be reported in $!).

The empty_to_file routine will create the output file if it does not already exist.

SPLIT OPERATION

The template for a call to ida_split, showing all default values, is as follows:

 ($key,$mat,$bytes) = ida_split (
     quorum => undef,
     shares => undef,
     width => undef,      # operate on words of 1, 2 or 4 bytes
     # supply a key, a matrix, or  neither (in
     # which case a key will be randomly generated)
     key => undef,
     matrix => undef,
     # optionally specify which shares to produce
     sharelist => undef,  # [ $row1, $row2, ... ]
     # source, sinks
     filler => undef,
     emptiers => undef,   # [ $empty1, $empty2, ... ]
     # misc. options
     rand => "/dev/urandom",
     bufsize => 4096,
     bytes => 0,
     # byte order flags
     inorder => 0,
     outorder => 0,
 );

Many of the parameters above have already been described earlier. The new parameters introduced here are:

  • If the key and matrix parameters are unset, then a random key/transform matrix will be generated and used to create the shares. For a discussion of cases where you might want to override this behaviour and supply your own values for these parameters, see the "Keys and transform matrices" and "KEY MANAGEMENT" sections.

  • sharelist is a reference to a list of rows to be operated on; if specified, only shares corresponding to those rows in the transform matrix will be created.

  • emptiers is a reference to a list of empty callbacks. The list should contain one empty callback for each share to be produced.

  • rand can be set to "rand", in which case Perl's default rand() function will be used for generating keys. Otherwise, the parameter is taken to be a file containing the random number to be used (/dev/urandom and /dev/random are special devices on Linux systems which generate different random numbers each time they are read; see their man pages for details).

  • bufsize specifies the size of the input and output buffer matrices in terms of the number of columns. Any integer value greater than 0 is allowed. Larger buffer size should improve the performance of the algorithm, as fewer I/O calls and matrix multiply calls (on larger chunks of data) will be required.

  • bytes specifies how many bytes of input to read. This may be set to zero to indicate that all bytes up to EOF should be read. This value must be a multiple of quorum x width

  • inorder and outorder can be used to specify the byte order of the input and output streams, respectively. The values can be set to 0 (stream uses native byte order), 1 (stream uses little-endian byte order) or 2 (stream uses big-endian byte order). If these values are set to 1 or 2 and that byte order is different from the system's byte order then bytes within words will be swapped to the correct order before being written to the input buffer or written out from the output buffer. These options have no effect when the width is set to 1 byte.

The function returns three return values, or undef if there was an error. The return values are:

  • $key, which will be undef if the user supplied a matrix parameter, or the value of the key used to create the transform matrix otherwise

  • $mat, which is the matrix used to create the shares. If the user specified a sharelist option, then this returned matrix will include only the rows of the transform matrix specified in the sharelist (in the order specified).

  • $bytes, which is the number of input bytes actually read from the input.

Since the "key" parameter may not be returned in all cases, the preferred method for detecting failure of the routine is to check whether the $mat parameter returns undef, as in the following:

 ($key,$mat,$bytes) = ida_split ( ... );
 unless (defined ($mat)) {
  # handle ida_split failure
  # ...
 }

This should work in all cases, regardless of whether a key or matrix was supplied or automatically generated.

COMBINE OPERATION

The template for a call to ida_combine is as follows:

 $bytes_read = ida_combine (
     quorum => undef,
     width => undef,
     shares => undef,     # use in conjunction with key
     # supply either a key or a pre-inverted matrix
     key => undef,
     matrix => undef,
     sharelist => undef,  # use in conjunction with key
     # sources, sink
     fillers => undef,    # [$filler1, $filler2, ... ]
     emptier => undef,
     # misc options
     bufsize => 4096,
     bytes => 0,
     # byte order flags
     inorder => 0,
     outorder => 0,
 );

Most options should be obvious, but note:

  • fillers should be a reference to a list of k fill handlers, corresponding to the shares to be combined

  • if a matrix is supplied, it must be the inverse of some submatrix (some k rows) of the original transform matrix; the ida_combine routine does not invert a supplied matrix.

  • if a key parameter is supplied, both the shares parameter and sharelist parameter must be given

  • shares is the total number of shares that were created when calling ida_split. This is required when passing a key parameter in order to facilitate error checking, since it provides a means of checking which values in the key should represent x_i values, and which should represent y_i values.

  • sharelist is also required when a key is passed, as otherwise the routine does not know which k shares are being combined. The sharelist parameter should be a list of k distinct row numbers. Also, each element in the sharelist array should match up to the appropriate element in the fillers array.

  • if a key parameter is supplied, the ida_combine routine will generate the appropriate transform matrix and its inverse (in contrast to the case where a matrix parameter is supplied).

The return value is the number of bytes actually written to the output stream, or undef if there was an error. As noted earlier, this value may be larger than the initial size of the secret due to padding to align the input to quorum x width bytes. Strategies for dealing with this have also been discussed.

ANCILLARY ROUTINES

The extra routines are exported by using the ":extras" or ":all" parameter with the initial "use" module line. With the exception of key-generation and key-testing routines, these will probably not be needed for most applications. The extra routines are as follows:

 $rng=ida_rng_init($bytes,$source)
 $val=$rng->();                 # get a new random value
 $rng->("anything");            # decommision $rng

This routine initialises a source of random numbers. The $bytes parameter should be 1, 2 or 4. The (optional) $source parameter has the same semantics as the "random" parameter to ida_split (ie, use "rand" or the name of a file to read random bytes from).

 ida_fisher_yates_shuffle(\@list,$howmany);

Shuffles elements of @list1 and (optionally) deletes all but $howmany of those elements. Shuffling (and deletion) is done in-place on the passed list.

 $keyref=ida_generate_key($k, $n, $w, $rng);

Generates a new random key. All parameters are required. Result is a reference to a list of $k + $n distinct elements.

 if (ida_check_key($keyref)) {
   # check of key failed
   # ...
 }

Takes a reference to a list of $k + $n elements and checks that the list is a valid key. Returns 0 to indicate that the key is valid.

Note that version 0.05 of Math::FastGF2 added two new constructors, new_cauchy and new_inverse_cauchy, to create Cauchy form transform matrices and their inverse from a given "key". These routines can be used as an alternative to using ida_generate_key to randomly generate a key as above, or in conjunction with it. More on keys follows in the next section.

KEY MANAGEMENT

The ida_split routine takes a secret and creates several "shares" from it. However, unless information about the contents of the original transform matrix (or the "key" from which it is derived) is available, it will be impossible to reconstruct the secret from these shares at a later time. This module does not implement any particular key management strategy. It is up to the application using the module to implement their own key management strategy by saving information about the transform matrix (or key) either passed into or returned from the call to ida_split. Broadly speaking, there are two approaches which may be used:

1. Caller creates/saves the $key parameter at a secure central location, and associates each of the created shares with a row number (0 .. $n-1); or
2. Caller creates/saves the 'matrix' parameter/$mat return value and stores an association between each share and the corresponding row of that matrix.

In the first approach, a central authority is required to securely store the key (in either $key or $mat forms). This will mitigate against fully exploiting the "Distributed/Dispersed" part of the algorithm, though, since the protocol has a single point of failure/attack and the central authority is required to reconstruct the secret.

In the second approach, the matrix rows can be effectively associated with the shares by transmitting them to each recipient shareholder along with the shares themselves. Shareholders can then reconstruct the secret without the participation of the Dealer (the creator of the shares).

An implementation of the second approach (sans any distribution mechanism) is provided in the Crypt::IDA::ShareFile module.

Adding extra shares at a later time

Adding extra shares to an IDA system is possible, and might be desireable if the goal is to increase the level of redundancy of stored data. If the Dealer keeps a copy of the original key, then it is possible to create extra shares at a later time. It is, however, impossible to modify the quorum value without destroying all shares and recreating the threshold system with the new quorum value. In order to add new shares at a later time, it is also assumed that the Dealer continues to have access to the secret.

The procedure for adding a share is to randomly determine a new value x_n+1 which is different from any existing x_i, y_i, and to insert it into the correct position in the key list, ie:

 x , x ,  ... , x , x   , y , y , ... , y
  1   2          n   n+1   1   2         k

The new key can be passed to ida_split in order to create the new share (or shares; several new x-values may be inserted). The sharelist option may be passed to ida_split to instruct the routine to create only the new shares, and avoid re-creating any existing shares.

If the original matrix was stored, but not the key which it was created from, the situation is more complex. Creating new shares can be accomplished by creating a new matrix row, and then testing that each subset of k rows from this new matrix are linearly independent (eg, by proving that each of the submatrices can be inverted). This module does not implement such a feature, so it is up to the user to provide their own code to do this, should it be required. Alternatively, consider storing the key instead of the resulting matrix, as above.

In the event of lost or stolen shares

The security of this system relies on an attacker not being able to collect k shares. If some shares are lost or thought to have been stolen (or copied), then the security of the secret may be at risk. This is not a matter of insecurity in the algorithm. Rather, it increases the risk since an attacker (who may be a shareholder) now potentially has to find or steal fewer shares in order to reconstruct the secret. In cases like this, the pragmatic approach will be to destroy any existing shares of the secret and to recreate a new set of shares and distribute them to the shareholders. Different parameters for the quorum and number of shares may be chosen at this time, as may the distribution of shares among shareholders.

TECHNICAL DETAILS

This section is intended mainly for the curious: mainly those wishing to extend the module or write code which can inter-operate with the implementation presented here.

Field implementation

All operations are carried out by treating input bytes/words as being polynomials in GF(2^8), GF(2^16) or GF(2^32). The implementation of arithmetic in these fields is handled by the Math::FastGF2 and Math::FastGF2::Matrix modules. The irreducible field polynomials used by those modules are:

                8    4    3
   GF(2^8)     x  + x  + x  + x + 1     (0x11b)
 
                16    5    3
   GF(2^16)    x   + x  + x  + x  + 1   (0x1002b)
 
                32    7    3    2
   GF(2^32)    x   + x  + x  + x  + 1   (0x10000008d)

Anyone wishing to implement compatible encoders/decoders should ensure that the polynomials used in their implementation match these.

Reed-Solomon Encoding

Although this module was not written specifically with Reed-Solomon Erasure Codes in mind, the underlying arithmetic and matrix multiplication is the same as Rabin's IDA. The module can, with a little work, be made to work as a Reed-Solomon encoder/decoder. The basic idea (which may be implemented in a future release) is to first create either a Cauchy-form matrix as described above, or a Vandermonde matrix of the following form:

 |     0    1    2          k-1  |
 |    0    0    0   ...    0     |
 |                               |
 |     0    1    2          k-1  |
 |    1    1    1   ...    1     |
 |                               |
 |    :    :    :    :     :     |
 |                               |
 |     0    1    2          k-1  |
 |  n-1  n-1  n-1   ...  n-1     |

All arithmetic operations should, of course, be performed on Galois Field elements, rather than integers. Whichever matrix form is chosen, the next step is to use elementary column operations on the matrix until the top k rows form an identity matrix (again, all arithmetic must treat the numbers as Galois Field polynomials). The matrix can then be passed along to ida_split, which will generate n shares as usual, with the first k of these being unencrypted slices of the original file (ie, containing every k'th character/word, starting at offset 0, 1, ... , k-1) with the remaining n - k shares being the Erasure Codes. As with Rabin's scheme, any k of the shares may be presented to ida_combine (along with the appropriate inverse matrix calculated from k rows of the split transform matrix) to reconstruct the original file.

Note that version 0.05 of Math::FastGF2 added a new new_vandermonde constructor methods to help with performing Reed-Solomon encoding.

Writing Custom Callbacks

The following code can be used as a template for writing routines which create fill/empty callbacks which can be passed to ida_split and ida_combine:

 sub create_my_ida_callback {
   my ($arg1, $arg2, ... ) = @_;

   # do some initialisation based on args, such as opening files,
   # connecting to a database, saving details of an IPC message queue,
   # etc.
 
   # for "fill" callbacks:
   return {
           SUB => sub {
             my $len=shift;
             my $string;

             # some code to get/produce up to $len bytes of input and
             # place it in $string.  Since this is a closure, the
             # callback has access to the initial $arg1, $arg2 and any
             # other variables created in the enclosing routine.

             if ($some_error_occurred) {
               return undef;
             } elsif ($no_more_input) {
               return "";
             } else {
               return $string;
             }
           },
          };
 
   # for "empty" callbacks:
   return {
           SUB => sub {
             my $string=shift;

             # some code to save/consume new data bytes passed in
             # $string. The routine should return the number of bytes
             # actually processed, or undef if there was a (write)
             # error.

             if ($some_error_occurred) {
               # optionally set $! with error message
               return undef;
             } else {
               return $number_of_bytes_actually_saved;
             }
           },
          };
 }

Note that callbacks are written using Perl's closure mechanism rather than by passing simple code (subroutine) references. This is done to allow the callback to be "customised" to use a particular file handle, string, etc., as input or output. See the "What's a closure?" section in the Perl FAQ for more details on this construct. Also, a hashref is used rather than a simple coderef since the processing routines use that hash to store bookkeeping information such as buffer fill levels, read/write pointers and so on.

The ida_split and ida_combine routines are written in such a way that they handle most of the logic involved with buffering of input and output and the details of reading/writing values in the input/output matrices. This means that callbacks can usually be kept very simple and, for the most part, stateless. Specifically, if an empty handler cannot process all bytes of input presented to it in one call, it simply returns the number of bytes it actually processed. There is no need for it to save any unprocessed part of the input, and doing so will probably result in an error. Since the calling routines know how many unprocessed bytes remain in the output buffer, it will arrange so that the next time that particular empty handler is called, it will receive those unprocessed bytes at the start of its input string.

The one exception to this "stateless" operation is in the case where a fill handler must pad the input stream to be a multiple of $align bytes. This can be accomplished by either pre-padding the input stream in the callback initialisation phase (when the stream length is known in advance, such as with a string input), or by maintaining "bytes read" and "EOF" state variables, updating them after every call to the callback, and using them to return extra padding bytes after the stream's natural end-of-file, where appropriate.

Please consult the source code for the existing fill_from_* and empty_to_* callback creation code for working examples.

KNOWN BUGS

There may be a weakness in the current implementation of the random key generation routine when a 1-byte word size is used, in that regardless of the random-number generator parameter passed to ida_split, the routine will always use Perl's internal rand() function. The code currently includes a workaround which should at least prevent a sequence-prediction attack on the RNG. While I can see no way to effectively attack the current implementation (due to the security afforded by the simple act of dispersing shares), it is clearly desirable to use the highest-quality RNG available in all cases, and this should be implemented in a future release. For the moment, if the possibility of a weakness in this implementation is unacceptable, it can be avoided simply by using a width parameter of 2 or 4, in which cases the high-quality RNG will always be used.

On a related note, defaulting to the use of /dev/urandom instead of /dev/random may be considered a bug by some people.

SEE ALSO

"Efficient dispersal of information for security, load balancing, and fault tolerance", by Michael O. Rabin. JACM Volume 36, Issue 2 (1989).

Description of the Information Dispersal Algorithm, which this module implements. This should be a faithful implementation of the original idea, although the issue of padding is not covered sufficiently in the paper, and this may be a point of divergence from the original intention.

http://parchive.sourceforge.net/

A similar project, in that it uses 16-bit Galois Field arithmetic to perform the same kinds of arithmetic operations on input files. The polynomial used in parchive (0x1100b) is different from the one used here (0x1002b), however. Also, parchive uses the more traditional Reed-Solomon mode of operation, with an emphasis on forward error-correction, whereas this module focuses more on the creation of distributed shares in order to achieve secure distribution of a secret.

http://www.cs.utk.edu/~plank/plank/papers/SPE-9-97.html "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems", by James S. Plank.

The description of the use of a particular Vandermonde matrix to guarantee linear independence of each row in the transform matrix (as described above) is due to a later erratum to this paper. My description of Reed-Solomon coding in general also follows this paper. The use of Cauchy-form matrices for guaranteeing linear independence (in both IDA and RS modes) also seems to be widely-known as well.

FUTURE VERSIONS

It is likely that the following changes/additions will be made in future versions:

  • Update the matrix processing code so that it can detect when padding of input is required and handle it by itself. The changes required to implement this can be made in such a way as to preserve compatibility with any code implemented using the current semantics.

  • Offer the choice of padding input with random padding rather than null padding. While it's beyond the scope of this document to present an analysis of the algorithm from a cryptographic standpoint, it may be possible that padding with predictable zero bytes may weaken the security of this implementation. Padding with random data should remove that potential weakness.

  • Force or give the option of always using the highest-quality RNG available (see "KNOWN BUGS").

  • Give the option of using other RNG sources (such as for inferior platforms which do not have an equivalent of /dev/[u]random)

AUTHOR

Declan Malone, <idablack@sourceforge.net>

COPYRIGHT AND LICENSE

Copyright (C) 2009-2019 by Declan Malone

This package is free software; you can redistribute it and/or modify it under the terms of version 2 (or, at your discretion, any later version) of the "GNU General Public License" ("GPL").

Please refer to the file "GNU_GPL.txt" in this distribution for details.

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.