=head1 NAME

Catmandu::Introduction - An introduction to Catmandu

=head1 INTRODUCTION

Importing, transforming, storing and indexing data should be easy.

Catmandu provides a suite of Perl modules to ease the import, storage,
retrieval, export and transformation of metadata records. Combine Catmandu
modules with web application frameworks such as PSGI/Plack, document stores
such as MongoDB or CouchDB, and full-text indexes such as ElasticSearch or
Solr to create a rapid development environment for digital library services
such as institutional repositories and search engines.

=head1 WHERE DO WE USE IT?

In the LibreCat project it is our goal to provide an open source set of
programming components to build digital library services suited to your
local needs.  Here are some of the projects we are working on:

LibreCat-Catalog : a next generation institutional repository (in development).

LibreCat-Citation : a CSL based citation list (in development).

LibreCat-Search : an ElasticSearch based front-end for institutional repositories.

LibreCat-Grim : a Solr/IIPImage based image database.

LibreCat-Archive : a Fedora Commons based digital repository (in development).

LibreCat-Images : a MediaMosa based digitization workflow engine (in development).

=head1 WHY DO WE USE IT?

=head2 Extract, Transform and Load

                 +--<--+               +--<--+
                 | Fix |               | Fix |
  +----------+   +-->--+   +-------+   +-->--+   +----------+
  | Importer |------------>| Store |------------>| Exporter |
  +----------+             +-------+             +----------+

To create a search engine, one of your first tasks will be to import data from
various sources, map the fields to a common data model and post it to a
full-text search engine.  Perl modules such as L<WebService::Solr> or
L<ElasticSearch> provide easy access to your favorite document stores, but you
keep writing a lot of boilerplate code to create the connections, transform the
incoming data into the correct format, and validate, upload and index the data
in the database. The next morning you are asked to provide a fast dump of
records into an Excel worksheet.  After some fixes are applied you are asked to
upload it into your database. Again you hit Emacs or Vi and write an ad-hoc
script. In our LibreCat group we saw this workflow over and over.  We tried to
abstract this problem into a set of Perl tools which can work with library data
formats such as MARC, Dublin Core and RIS, protocols such as OAI-PMH and SRU,
and repositories such as DSpace and Fedora.  In data warehouses these processes
are called B<ETL>: B<E>xtract, B<T>ransform, B<L>oad.  Many tools currently
exist for ETL processing but none address typical library data models and
services.
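
In Catmandu such an ETL pipeline can be wired together in a few lines of Perl.
Here is a minimal sketch (the file names and the fix file are hypothetical;
the importer, fixer and exporter classes used are introduced later in this
document):

  use Catmandu::Importer::JSON;
  use Catmandu::Fix;
  use Catmandu::Exporter::CSV;

  # Extract: read records from a (hypothetical) JSON file
  my $importer = Catmandu::Importer::JSON->new(file => 'records.json');

  # Transform: apply the fixes listed in a (hypothetical) fix file
  my $fixer = Catmandu::Fix->new(fixes => ['transform.fix']);

  # Load: write the transformed records to a CSV file
  my $exporter = Catmandu::Exporter::CSV->new(file => 'records.csv');
  $exporter->add_many($fixer->fix($importer));
  $exporter->commit;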

=head2 Copy and Paste

As programmers, we would like to reuse our code and algorithms as easily as
possible.  In rapid application development you typically want to copy and
paste parts of existing code into a new project.  In Catmandu we use a
functional style of programming to keep our code tight, clean and suitable for
copying and pasting.  When working with library data models we use native Perl
hashes and arrays to pass data around.  In this way we adhere to the rationale
of Alan J. Perlis: "It is better to have 100 functions operate on one data
structure than to have 10 functions operate on 10 data structures." Our
functions are all based on a few primary data structures on which we define
many functions (map, count, each, first, take, ...), as the sketch below
illustrates.
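
A small sketch of this style, using the Catmandu::Importer::Mock module that
is introduced later in this document (the chained methods are all part of
L<Catmandu::Iterable>):

  use Catmandu::Importer::Mock;

  # One data structure (a stream of Perl hashes), many functions
  my $it = Catmandu::Importer::Mock->new(size => 10);

  # Chain generic list operations instead of writing custom loops
  my $count = $it->map(sub { { n => shift->{n} * 2 } })
                 ->select(sub { shift->{n} > 5 })
                 ->count;

  printf "Items with doubled n > 5: %d\n" , $count;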

=head2 Schemaless databases

Working with native Perl hashes and arrays we would like an easy mechanism to
store and index this data in a database of choice.  In the past it was a
nuisance to create database schemas and indexes to store and search your
data.  Certainly in institutional repositories this can be an ongoing job for a
programmer because the metadata schemas are not fixed in time. Any new report
will require you to add new data fields and new relations for which you need to
change your database schema.  With the introduction of schemaless databases the
storage of complex records is really easy.  Create a Perl hash, execute the
C<add> function and your record is stored in the database.  Execute C<get> to
load a Perl hash from the database into memory. With our ElasticSearch plugin
we can even provide you with a CQL-style query language for retrieval.

  my $obj = { name => { last => 'Bond' , full => 'James Bond' }, occupation => 'Secret Agent' };
  $store->bag->add($obj);

  $store->bag->search(cql_query => 'name.last = Bond')->each(sub {
    my $obj = shift;
    printf "%s\n", $obj->{name}->{full};
  });

=head1 GETTING STARTED

To get Catmandu running on your system you need to clone the code from
GitHub, then build and install it:

  $ git clone git@github.com:LibreCat/Catmandu.git
  $ cd Catmandu
  $ perl Build.PL
  $ ./Build
  $ ./Build test
  $ ./Build install
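
Alternatively, Catmandu releases are also published on CPAN, so (assuming you
have a CPAN client such as cpanm installed) you can install the latest release
directly:

  $ cpanm Catmandu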

=head1 Importer

L<Importers|Catmandu::Importer> are Catmandu packages to read data into an
application.  We provide importers for MARC, JSON, YAML, BibTeX, CSV and
Excel, but also for Atom and OAI-PMH endpoints. As an example, let's create a
Perl script to read a YAML file containing a list of records.  We use the
C<each> function to loop over all the items:

  #!/usr/bin/env perl

  use v5.10;
  use Catmandu::Importer::YAML;

  my $importer = Catmandu::Importer::YAML->new(file => "./test.yaml");

  # each() executes the callback for every item and returns the item count
  my $count = $importer->each(sub {
     my $obj = shift;
     # .. your code ..
  });

  say "Read: $count YAML items";

Running this script on the following test.yaml file

  first: Albert
  last: Einstein
  job: Physicist
  ---
  first: Super
  last: Mario
  job: Action hero
  ...

you should see as output:
C<Read: 2 YAML items>

Here is an example script to read 10 records from an OAI-PMH endpoint into an
application:

  #!/usr/bin/env perl

  use v5.10;
  use Catmandu::Importer::OAI;

  my $importer = Catmandu::Importer::OAI->new(url => 'http://biblio.ugent.be/oai');

  # take(10) limits the stream to the first 10 records
  my $count = $importer->take(10)->each(sub {
     my $obj = shift;
     # .. your code ..
  });

  say "Read sample of $count OAI items";

=head1 Iterable

The L<Catmandu::Iterable> package provides many list methods to process large
streams of records.  Most of the methods are lazy if the underlying datastream
supports it.  While all of the data in Catmandu consists of native Perl hashes
and arrays, it can be impractical to load a result set of thousands of records
into memory.  Most Catmandu packages such as L<Catmandu::Importer>,
L<Catmandu::Exporter> and L<Catmandu::Store> therefore provide a
L<Catmandu::Iterable> implementation.

Using a Mock importer we can generate some Perl hashes on-the-fly and show the
functionality provided by Iterable:

  use Catmandu::Importer::Mock;
  my $it = Catmandu::Importer::Mock->new(size => 10);

  # With each you can loop over all the items in an iterator:

  $it->each(sub {
     printf "My n is %d\n" , shift->{n};
  });

Using C<any>, C<many> and C<all> you can test for the existence of items in an
Iterator:

  my $answer = $it->any(sub { shift->{n} > 4 });
  printf "Iterator contains n > 4 = %s\n" , $answer ? 'TRUE' : 'FALSE';

  $answer = $it->many(sub { shift->{n} > 8 });
  printf "Iterator contains n > 8 = %s\n" , $answer ? 'TRUE' : 'FALSE';

  $answer = $it->all(sub { shift->{n} =~ /^\d+$/ });
  printf "Iterator contains only digits = %s\n" , $answer ? 'TRUE' : 'FALSE';

Map and reduce are functions that evaluate a function on all the items in an
iterator to produce a new iterator or a summary of the results:

  # $it contains: [ { n => 1 } , { n => 2 } , ... ];
  my $ret = $it->map(sub {
       my $hash = shift;
       { n => $hash->{n} * 2 }
  });

  # $ret contains : [ { n => 2 } , { n => 4 } , ... ];

  my $result = $it->reduce(0,sub {
       my $prev = shift;
       my $this = shift->{n} * 2;
       $prev + $this;
  });
  printf "SUM [ Iterator * 2] = %d\n" , $result

The Iterable package provides many more functions such as: C<to_array>,
C<count>, C<each>, C<first>, C<slice>, C<take>, C<group>, C<tap>, C<detect>,
C<select>, C<reject>, C<any>, C<many>, C<all>, C<map>, C<reduce> and C<invoke>.
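
A few of these in action (a short sketch; C<$it> is the Mock iterator from
above):

  # first returns the first item of the iterator
  my $first = $it->first;

  # select keeps only the items matching a predicate ...
  my $even = $it->select(sub { shift->{n} % 2 == 0 });

  # ... and the result is again an Iterable
  printf "Number of even items: %d\n" , $even->count;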

=head1 Exporter

L<Exporters|Catmandu::Exporter> are Catmandu packages to export data from an
application.  As input they accept native Perl hashes or arrays, but also
Iterators to stream huge data sets.

Here is an example using our Mock importer to stream 1 million Perl hashes
through an Exporter:

  use Catmandu::Exporter::YAML;
  use Catmandu::Importer::Mock;

  my $exporter = Catmandu::Exporter::YAML->new();
  $exporter->add_many(Catmandu::Importer::Mock->new(size => 1000000));

Catmandu provides exporters for BibTeX, CSV, JSON, RIS, XLS and YAML.  If you
need a special exporter for your own format you could use the Template exporter
which uses Template Toolkit.

As an example, let's create an exporter for a Perl array of hashes C<$data>
using a template:

  use Catmandu::Exporter::Template;

  my $data = [
   { name => { first => 'James' , last => 'Bond' } , occupation => 'Secret Agent' } ,
   { name => { first => 'Ernst' , last => 'Blofeld' } , occupation => 'Supervillain' } ,
  ];
  my $exporter = Catmandu::Exporter::Template->new(template => '/home/me/example.tt');
  $exporter->add_many($data);

The template example.tt will be rendered for every hash in the array C<$data>
(or for every item in an Iterable C<$data>).

  <character>
   <name>[% name.last %], [% name.first %]</name>
   <occupation>[% occupation %]</occupation>
  </character>
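
For the first hash in C<$data> the rendered output looks like this:

  <character>
   <name>Bond, James</name>
   <occupation>Secret Agent</occupation>
  </character>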

=head1 Fix

L<Fixes|Catmandu::Fix> can be used for easy data manipulation by
non-programmers.  Using a small Perl DSL, librarians can use Fix routines to
manipulate data objects.  A plain text file of fixes can be created to specify
all the data manipulations that need to be executed to transform the data into
the desired format.
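
Such a fix file is just a list of routines, one per line. A hypothetical
sketch (add_field adds a key, upcase uppercases a value and remove_field
drops a key; all three appear in the function list at the end of this
section):

  add_field('status', 'public');
  upcase('title');
  remove_field('internal_note');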

As an example we will import data from a MARC file and change some metadata
fields using Fix routines. Here is the code to run the example:

  use Catmandu::Fix;
  use Catmandu::Importer::MARC;
  use Data::Dumper;

  my $fixer = Catmandu::Fix->new(fixes => ['marc.fix']);
  my $it    = Catmandu::Importer::MARC->new(file => 'marc.txt', type => 'ALEPHSEQ');

  $fixer->fix($it)->each(sub {
     my $obj = shift;
     print Dumper($obj);
  });

The output of this script should generate something like this:

  $VAR1 = {
            '_id' => '000000043',
            'my' => {
                      'authors' => [
                                     'Patrick Hochstenbachhttp://ok',
                                     'Patrick Hochstenbach2My bMy eMy codeGhent1971',
                                     'Patrick Hochstenbach3',
                                     'Stichting Ons Erfdeel'
                                   ],
                      'language' => 'dut',
                      'subjects' => [
                                      'MyTopic1',
                                      'MyTopic2',
                                      'MyTopic3',
                                      'MyTopic4'
                                    ],
                      'stringy' => 'MyTopic1; MyGenre1; MyTopic2; MyGenre2; MyTopic3; MyTopic4; MyGenre4'
                    }
          };

We need two files as input: marc.txt is a file containing MARC records and
marc.fix contains the fixes that need to be applied to each MARC record.
Let's take a look at the contents of this marc.fix file:

  marc_map('100','my.authors.$append');
  marc_map('710','my.authors.$append');
  marc_map('600x','my.subjects.$append');
  marc_map('008_/35-37','my.language');
  marc_map('600','my.stringy', -join => "; ");
  marc_map('199','my.condition', -value => 'ok');

  remove_field('record');

The fixes in this file are specialized in MARC processing.  In line 1 we map
the contents of the MARC-100 field into a deeply nested Perl hash with the key
'authors'.  In line 3 we map the contents of the MARC-600 x-subfield into the
'subjects' field.  In line 4 we read characters 35 to 37 from the MARC-008
control field into the 'language' key.

A Catmandu Fix also provides many functions to manipulate Perl hashes.  The
remove_field fix, as shown above in the fix file, will remove a key from a
Perl hash.  Other fix functions are: add_field, capitalize, clone, collapse,
copy_field, downcase, expand, join_field, move_field, remove_field,
replace_all, retain_field, set_field, split_field, substring, trim and upcase.
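
As a sketch of how these hash fixes behave, take a hypothetical record:

  { name => { first => 'James', last => 'Bond' } }

Applying these fixes:

  copy_field('name.last', 'surname');
  upcase('name.last');
  remove_field('name.first');

yields:

  { name => { last => 'BOND' }, surname => 'Bond' }

copy_field duplicated the value before upcase transformed the original in
place, and remove_field deleted the first name.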

=head1 Store

As explained in the introduction, one of the rationales for creating Catmandu
is to ease the serialization of records in our database of choice. The
introduction of schemaless databases made the storage of complex records quite
easy. Before we delve into this type of database we need to show you the
syntax Catmandu uses to store data.

As an example, let's create the simplest storage mechanism possible, an
in-memory hash. We use this mock 'database' to show some of the features that
any L<Catmandu::Store> has. First we will create a YAML importer as shown
above to import records into an in-memory hash store:

  use Catmandu::Importer::YAML;
  use Catmandu::Store::Hash;
  use Data::Dumper;

  my $importer = Catmandu::Importer::YAML->new(file => "./test.yaml");
  my $store    = Catmandu::Store::Hash->new();

  # Store an iterable
  $store->bag->add_many($importer);

  # Store an array of hashes
  $store->bag->add_many([ { name => 'John' } , { name => 'Peter' }]);

  # Store one hash
  $store->bag->add( { name => 'Patrick' });

  # Commit all changes
  $store->bag->commit;

Each L<Catmandu::Store> has one or more compartments (e.g. tables) to store
data, called bags. We use the C<add_many> function to store each item from the
importer Iterable in the Store. We can also store an array of Perl hashes with
the same command, or store a single hash with the C<add> method.

Each bag is an Iterator so you can apply any of the C<each>, C<any>, C<all>,...
methods shown above to read data from a bag.

  $store->bag->take(3)->each(sub {
    my $obj = shift;
    #.. your code
  });

When you store a Perl hash in a L<Catmandu::Store>, an identifier field '_id'
gets added to your hash that can be used to retrieve the item at a later
stage. Let's take a look at the identifier and how it can be used.

  # First store a perl hash and return the stored item which includes the stored identifier
  my $item = $store->bag->add( { name => 'Patrick' });

  # This will show you a UUID like '414003DC-9AD0-11E1-A3AD-D6BEE5345D14'...
  print $item->{_id} , "\n";

  # Now you can use this identifier to retrieve the object from the store
  my $item2 = $store->bag->get('414003DC-9AD0-11E1-A3AD-D6BEE5345D14');
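
Deleting items is just as terse (a sketch; C<delete> takes the identifier of a
stored item and C<delete_all> empties the whole bag):

  # Remove one item by its '_id'
  $store->bag->delete('414003DC-9AD0-11E1-A3AD-D6BEE5345D14');

  # Remove all items in the bag
  $store->bag->delete_all;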

And that is how it works. A L<Catmandu::Store> has some more functionality to
query the store (if the backend supports it), but this is how you can store
very complex Perl structures in memory or on disk with just a few lines of
code. As a complete example we can show how easy it is to store data in a
full-text search engine like L<ElasticSearch>.

In this example we will download ElasticSearch version 0.19.3 and install it on
our system:

  $ wget https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.19.3.tar.gz
  $ tar zxvf elasticsearch-0.19.3.tar.gz
  $ cd elasticsearch-0.19.3
  $ bin/elasticsearch

After running the last command C<bin/elasticsearch> we have started the search
daemon. Now we can index some data with Catmandu:

  use Catmandu::Importer::YAML;
  use Catmandu::Store::ElasticSearch;

  my $importer = Catmandu::Importer::YAML->new(file => './test.yaml');
  my $store    = Catmandu::Store::ElasticSearch->new(index_name => 'demo');

  $store->bag->add_many($importer);

  $store->bag->commit;

All records in the file 'test.yaml' should now be available in the index. We
can test this by executing a new script that reads all records from the store:

  use Catmandu::Store::ElasticSearch;
  use Data::Dumper;

  my $store = Catmandu::Store::ElasticSearch->new(index_name => 'demo');

  $store->bag->each(sub {
    my $obj = shift;
    print Dumper($obj);
  });

If everything works correctly you should see something like this:

  $VAR1 = {
            'first' => 'Charly',
            '_id' => '96CA6692-9AD2-11E1-8800-92A3DA44A36C',
            'last' => 'Parker',
            'job' => 'Artist'
          };
  $VAR1 = {
            'first' => 'Joseph',
            '_id' => '96CA87F8-9AD2-11E1-B760-84F8F47D3A65',
            'last' => 'Ratzinger',
            'job' => 'Pope'
          };
  $VAR1 = {
            'first' => 'Albert',
            '_id' => '96CA83AC-9AD2-11E1-B1CD-CC6B8E6A771E',
            'last' => 'Einstein',
            'job' => 'Physicist'
          };

The ElasticSearch store even provides an implementation of the Lucene and CQL
query languages:

  my $hits = $store->bag->searcher(query => 'first:Albert');

  $hits->each(sub {
    my $obj = shift;
    printf "%s %s\n" , $obj->{first} , $obj->{last};
  });

This last example will print 'Albert Einstein' as the result. Clinton Gormley
did some great work in providing a Perl client for ElasticSearch. Searching
complex objects can be done using a dot syntax, e.g.
C<record.titles.0.subtitle:"My Funny Valentine">. The beauty of ElasticSearch
is that it is completely painless to set up and requires no schema: indexing
data is simply done by using JSON over HTTP. All your fields are indexed
automatically.
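
A sketch of such a dot-syntax query (the nested field names here are
hypothetical; C<searcher> returns an Iterator, so C<each> and C<count> work as
before):

  # Query a nested field with the Lucene dot syntax
  my $nested_hits = $store->bag->searcher(
    query => 'name.last:Einstein'
  );

  printf "Found %d matching records\n" , $nested_hits->count;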

=cut