Module Version: 1.20

NAME

Net::OAI::Record::NamespaceFilter - general filter class based on namespace URIs

SYNOPSIS

 $plug = Net::OAI::Record::NamespaceFilter->new(); # Noop

 $multihandler = Net::OAI::Record::NamespaceFilter->new(
    'http://www.openarchives.org/OAI/2.0/oai_dc/' => 'Net::OAI::Record::OAI_DC',
    'http://www.openarchives.org/OAI/2.0/provenance' => 'MySAX::ProvenanceHandler'
   );

 $saxfilter = SOME_SAX_Filter->new();
 ...
 $filter = Net::OAI::Record::NamespaceFilter->new(
    '*' => $saxfilter, # '*' for any namespace
   );

 $filter = Net::OAI::Record::NamespaceFilter->new(
    '*' => sub { my $x = "";
                 return XML::SAX::Writer->new(Output => \$x);
               },
   );

DESCRIPTION

This filter forwards any element belonging to a namespace from its list to the associated SAX filter, together with all of the element's children (regardless of their respective namespaces). It can be used either as a metadataHandler or as a recordHandler.

This SAX filter takes a hash of namespaces as its argument, with namespace URIs as keys ('*' for "any namespace"). Each value is one of the following:

undef

Matching elements and their subelements are suppressed.

If the list of namespaces is empty, or if undef is connected to the filter, it effectively acts as a plug to Net::OAI::Harvester. This might come in handy if you plan to get at the raw result by other means, e.g. by tapping the user agent or by accessing the result's xml() method:

 $plug = Net::OAI::Record::NamespaceFilter->new();
 $harvester = Net::OAI::Harvester->new(
     baseURL => ...,
     );

 $tapped_by_ua = "";
 open ($TAP, ">", \$tapped_by_ua);
 $harvester->userAgent()->add_handler(response_data => sub { 
        my($response, $ua, $h, $data) = @_;
        print $TAP $data;
     });

 $list = $harvester->listRecords( 
    metadataPrefix  => 'a_strange_one',
    recordHandler => $plug,
  );

 print $tapped_by_ua; # complete OAI response
 print $list->xml();  # should be exactly the same

Comment: This is quite an efficient way of not processing the XML content of OAI records received.

a class name of a SAX filter

As usual, a new instance is created for each record element of the OAI response.

  # end_document() of instances of MyWriter returns something meaningful...
  $consumer = Net::OAI::Record::NamespaceFilter->new('*'=> 'MyWriter');

  $filter = Net::OAI::Record::NamespaceFilter->new(
      '*' => $consumer
    );
 
  $list = $harvester->listAllRecords( 
     metadataPrefix  => 'oai_dc',
     recordHandler => $filter,
   );

  while( $r = $list->next() ) {
     next if $r->status() eq "deleted";
     $xmlstringref = $r->recorddata()->result('*');
     ...
  };

Note: The handlers are instantiated for each single OAI record in the response and will in any case see exactly one start_document() and one end_document() event. This behavior differs from that of handler class names specified directly as metadataHandler or recordHandler for a request: instances created that way never see such events.
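The MyWriter class used in the example above is not shown. As a minimal sketch of what such a handler class could look like, the following hypothetical package collects the text of all title elements and returns them from end_document(), which is the value result() later hands back. It assumes that the filter only needs a new() constructor and the usual SAX2 callback methods; in practice a handler would more commonly subclass XML::SAX::Base. The events fed in at the end are made up for illustration and stand in for what the filter would deliver.

```perl
package MyWriter;    # hypothetical class name, borrowed from the example above
use strict;
use warnings;

sub new {
    my ($class) = @_;
    return bless { titles => [], buffer => undef }, $class;
}

# Reset the state for every new record.
sub start_document {
    my ($self) = @_;
    $self->{titles} = [];
    $self->{buffer} = undef;
    return;
}

# Start collecting text when a title element opens.
sub start_element {
    my ($self, $el) = @_;
    $self->{buffer} = '' if $el->{LocalName} eq 'title';
    return;
}

# Append character data only while inside a title element.
sub characters {
    my ($self, $chars) = @_;
    $self->{buffer} .= $chars->{Data} if defined $self->{buffer};
    return;
}

# Store the collected text when the title element closes.
sub end_element {
    my ($self, $el) = @_;
    if ($el->{LocalName} eq 'title' && defined $self->{buffer}) {
        push @{ $self->{titles} }, $self->{buffer};
        $self->{buffer} = undef;
    }
    return;
}

# The return value is the "result" for this record.
sub end_document {
    my ($self) = @_;
    return $self->{titles};
}

package main;

# Feed the handler by hand with made-up events for one record.
my $title_el = {
    Name         => 'dc:title',
    LocalName    => 'title',
    NamespaceURI => 'http://purl.org/dc/elements/1.1/',
    Prefix       => 'dc',
    Attributes   => {},
};
my $writer = MyWriter->new();
$writer->start_document({});
$writer->start_element($title_el);
$writer->characters({ Data => 'An example ' });
$writer->characters({ Data => 'title' });
$writer->end_element($title_el);
my $titles = $writer->end_document({});
print join(', ', @{$titles}), "\n";
```

Because character data may arrive in several chunks, the sketch buffers it until the closing tag instead of storing each characters() event separately.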

a code reference for a constructor

Must return a SAX filter ready to accept a new document.

The following example returns a string serialization for each single record:

 # end_document() events will return \$x
 $constructor = sub { my $x = ""; 
                      return XML::SAX::Writer->new(Output => \$x);
                    };
 $filter = Net::OAI::Record::NamespaceFilter->new(
      '*' => $constructor
   );
 
 $list = $harvester->listRecords( 
     metadataPrefix  => 'oai_dc',
     recordHandler => $filter,
  );

 while( $r = $list->next() ) {
     $xmlstringref = $r->recorddata()->result('*');
     ...
  };

Comment: This example shows an approach to isolate the "true contents" of individual response records without having to provide a SAX handler class of one's own (the only additional prerequisite is XML::SAX::Writer). But what you get is a serialized XML document, which then has to be parsed again for further processing ...

an already instantiated SAX filter

As usual, in this case no start_document() or end_document() events are forwarded to the filter.

 open($fh, ">", $some_file) or die "Cannot open $some_file: $!";
 $builder = XML::SAX::Writer->new(Output => $fh);
 $builder->start_document();
 $rootEL = { Name => 'collection',
           LocalName => 'collection',
        NamespaceURI => "http://www.loc.gov/MARC21/slim",
              Prefix => "",
          Attributes => {}
              };
 $builder->start_element( $rootEL );

 # filter for the MARC21 namespace in records: forward all matching elements
 $filter = Net::OAI::Record::NamespaceFilter->new(
      'http://www.loc.gov/MARC21/slim' => $builder);

 $list = $harvester->listRecords( 
     metadataPrefix  => 'a_strange_one',
     metadataHandler => $filter,
  );
 # handle resumption tokens if more than the first
 # chunk shall be stored into $fh ....

 $builder->end_element( $rootEL );
 $builder->end_document();
 close($fh);
 # ... process contents of $some_file

In this example calling the result() method for individual records in the response will probably not be of much use.

Caution: Depending on the namespaces specified, even handlers that are freshly instantiated for each OAI record might be fed more than one top-level XML element.
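To make the caution concrete, here is a small sketch of a handler that records every element it receives at depth zero, so that a caller can check whether a record delivered more than one top-level element. The package name TopLevelCounter and the element names are hypothetical, and the events are fed in by hand rather than by the filter.

```perl
package TopLevelCounter;    # hypothetical helper class
use strict;
use warnings;

sub new {
    my ($class) = @_;
    return bless { depth => 0, roots => [] }, $class;
}

sub start_document {
    my ($self) = @_;
    $self->{depth} = 0;
    $self->{roots} = [];
    return;
}

# Remember the name of every element that opens at depth 0.
sub start_element {
    my ($self, $el) = @_;
    push @{ $self->{roots} }, $el->{Name} if $self->{depth} == 0;
    $self->{depth}++;
    return;
}

sub end_element {
    my ($self) = @_;
    $self->{depth}--;
    return;
}

sub characters { return }

# Return the list of top-level element names seen for this record.
sub end_document {
    my ($self) = @_;
    return $self->{roots};
}

package main;

# Two sibling elements arriving within one "document" (made-up names).
my $counter = TopLevelCounter->new();
$counter->start_document({});
for my $name ('a:first', 'b:second') {
    my $el = { Name => $name, Attributes => {} };
    $counter->start_element($el);
    $counter->start_element({ Name => 'a:child', Attributes => {} });
    $counter->end_element({ Name => 'a:child' });
    $counter->end_element($el);
}
my $roots = $counter->end_document({});
warn "more than one top-level element\n" if @{$roots} > 1;
print join(', ', @{$roots}), "\n";
```

A handler that builds a single tree or writes a single XML document per record may need a check of this kind, or an artificial wrapper element, to stay well-formed.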

METHODS

new( [%namespaces] )

Creates a handler suitable as recordHandler or metadataHandler. %namespaces has namespace URIs for keys and values according to the four types described above.

result ( [namespace] )

If called with a namespace, it returns the result of the handler, i.e. what end_document() returned for the record in question. Otherwise it returns a hashref for all the results with the corresponding namespaces as keys.
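When result() is called without a namespace, the caller has to walk the returned hashref. The following sketch shows one way to do that; flatten_results() is a hypothetical helper, and the hashref is built by hand as a stand-in for what result() might return when the handlers are XML::SAX::Writer instances that yield scalar references, as in the constructor example above.

```perl
use strict;
use warnings;

# Hypothetical helper: map each namespace to a plain string,
# dereferencing scalar refs and skipping namespaces without a result.
sub flatten_results {
    my ($results) = @_;
    my %flat;
    for my $ns (sort keys %{ $results || {} }) {
        my $value = $results->{$ns};
        next unless defined $value;
        $flat{$ns} = ref($value) eq 'SCALAR' ? ${$value} : "$value";
    }
    return \%flat;
}

# Hand-built stand-in for the hashref returned by result().
my $serialized = '<title>Example</title>';
my $mock_results = {
    '*' => \$serialized,
    'http://www.openarchives.org/OAI/2.0/provenance' => undef,
};

my $flat = flatten_results($mock_results);
print "$_ => $flat->{$_}\n" for sort keys %{$flat};
```

With a real harvest, the hand-built hashref would be replaced by the value of $r->recorddata()->result().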

AUTHOR

Thomas Berger <ThB@gymel.com>
