NAME

Data::Walk::Extracted - An extracted dataref walker

SYNOPSIS

This is a contrived example! For a more functional (complex/useful) example see the roles in this package.

        package Data::Walk::MyRole;
        use Moose::Role;
        requires '_process_the_data';
        use MooseX::Types::Moose qw(
                        Str
                        ArrayRef
                        HashRef
                );
        my $mangle_keys = {
                Hello_ref => 'primary_ref',
                World_ref => 'secondary_ref',
        };

        #########1 Public Method      3#########4#########5#########6#########7#########8

        sub mangle_data{
                my ( $self, $passed_ref ) = @_;
                @$passed_ref{ 'before_method', 'after_method' } =
                        ( '_mangle_data_before_method', '_mangle_data_after_method' );
                ### Start recursive parsing
                $passed_ref = $self->_process_the_data( $passed_ref, $mangle_keys );
                ### End recursive parsing with: $passed_ref
                return $passed_ref->{Hello_ref};
        }

        #########1 Private Methods    3#########4#########5#########6#########7#########8

        ### If you are at the string level merge the two references
        sub _mangle_data_before_method{
                my ( $self, $passed_ref ) = @_;
                if(
                        is_Str( $passed_ref->{primary_ref} ) and
                        is_Str( $passed_ref->{secondary_ref} )          ){
                        $passed_ref->{primary_ref} .= " " . $passed_ref->{secondary_ref};
                }
                return $passed_ref;
        }

        ### Strip the reference layers on the way out
        sub _mangle_data_after_method{
                my ( $self, $passed_ref ) = @_;
                if( is_ArrayRef( $passed_ref->{primary_ref} ) ){
                        $passed_ref->{primary_ref} = $passed_ref->{primary_ref}->[0];
                }elsif( is_HashRef( $passed_ref->{primary_ref} ) ){
                        $passed_ref->{primary_ref} = $passed_ref->{primary_ref}->{level};
                }
                return $passed_ref;
        }

        package main;
        use MooseX::ShortCut::BuildInstance qw( build_instance );
        my      $AT_ST = build_instance(
                        package         => 'Greeting',
                        superclasses    => [ 'Data::Walk::Extracted' ],
                        roles           => [ 'Data::Walk::MyRole' ],
                );
        print $AT_ST->mangle_data( {
                        Hello_ref =>{ level =>[ { level =>[ 'Hello' ] } ] },
                        World_ref =>{ level =>[ { level =>[ 'World' ] } ] },
                } ) . "\n";



        #################################################################################
        #     Output of SYNOPSIS
        # 01:Hello World
        #################################################################################

DESCRIPTION

This module takes a data reference (or two) and recursively travels through it (or them). Where the two references diverge the walker follows the primary data reference. At the beginning and end of each branch or node in the data the code will attempt to call a method on the remaining unparsed data.

Acknowledgement of MJD

This is an implementation of the concept of extracted data walking from Higher-Order Perl Chapter 1 by Mark Jason Dominus. The book is well worth the money! With that said I diverged from MJD purity in two ways. First, this is object oriented code not functional code. Second, when taking action the code will search for class methods provided by (your) role rather than acting on passed closures. There is clearly some overhead associated with both of these differences. I made those choices consciously and if that upsets you do not hassle MJD!

What is the unique value of this module?

With the recursive part of data walking extracted the various functionalities desired when walking the data can be modularized without copying this code. The Moose framework also allows diverse and targeted data parsing without dragging along a kitchen sink API for every use of this class.

Extending Data::Walk::Extracted

All action taken during the data walking must be initiated by implementation of action methods that do not exist in this class. It usually also makes sense to build an initial action method. The initial action method can do any data-preprocessing that is useful as well as providing the necessary set up for the generic walker. All of these elements can be combined with this class using a Moose role, by extending the class, or by joining them to the class at run time. See MooseX::ShortCut::BuildInstance or Moose::Util for more class building information. See the parsing flow to understand the details of how the methods are used. See methods used to write roles for the available methods to implement the roles.

Then, write some tests for your role!

Recursive Parsing Flow

Initial data input and scrubbing

The primary input method added to this class for external use is referred to as the 'action' method (ex. 'mangle_data'). This action method needs to receive data and organize it for sending to the start method for the generic data walker. Remember if more than one role is added to Data::Walk::Extracted for a given instance then all methods should be named with consideration for other (future?) method names. The '$conversion_ref' allows for multiple uses of the core data walker's generic functions. The $conversion_ref is not passed deeper into the recursion flow.

Assess and implement the before_method

The class next checks for an available 'before_method' using the test;

        exists $passed_ref->{before_method};

If the test passes then the next sequence is run.

        $method = $passed_ref->{before_method};
        $passed_ref = $self->$method( $passed_ref );

If the $passed_ref is modified by the 'before_method' then the recursive parser will parse the new ref and not the old one. The before_method can set;

        $passed_ref->{skip} = 'YES'

Then the flow checks for the need to investigate deeper.
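As a rough illustration of a 'before_method' that sets the skip flag, here is a minimal sketch. The method name and the 'private' hash key are made up for the example, and it assumes that the last branch_ref element describes the step that led into the current node (as described under the position trace below). It is not code from this package.

```perl
use strict;
use warnings;

# Hypothetical before_method: stop descending once the walker has just
# stepped through a hash key named 'private'.
sub _skip_private_before_method {
    my ( $self, $passed_ref ) = @_;
    my $branch    = $passed_ref->{branch_ref} || [];    # may be empty at the zeroth node
    my $last_step = $branch->[-1];
    if(     $last_step and
            $last_step->[0] eq 'HASH' and
            $last_step->[1] eq 'private'    ){
        $passed_ref->{skip} = 'YES';
    }
    return $passed_ref;    # always hand the (possibly changed) ref back
}
```

The important parts are that the method receives ( $self, $passed_ref ) and returns the $passed_ref, since the walker continues with whatever comes back.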

Test for deeper investigation

The code now checks whether deeper investigation is required. It tests whether the 'skip' key eq 'YES' in the $passed_ref and whether the node is a base ref type. If either case is true the process jumps to the after method, otherwise it begins to investigate the next level.

Identify node elements

If the next level in is not skipped then a list is generated for all paths in the node. For example a 'HASH' node would generate a list of hash keys for that node. SCALAR nodes will generate a list with only one element containing the scalar contents. UNDEF nodes will generate an empty list.

Sort the node as required

If the list should be sorted then the list is sorted. ARRAYS are hard sorted. This means that the actual items in the (primary) passed data ref are permanently sorted.

Process each element

For each identified element of the node a new $data_ref is generated containing data that represents just that sub element. The secondary_ref is only constructed if it has a matching type and element to the primary ref. Matching for hashrefs is done by key matching only. Matching for arrayrefs is done by position exists testing only. No position content compare is done! Scalars are matched on content. The list of items generated for this element is as follows;

    before_method => -->name of before method for this role here<--

    after_method => -->name of after method for this role here<--

    primary_ref => the piece of the primary data ref below this element

    primary_type => the lower primary (walker) ref type

    match => YES|NO (This indicates if the secondary ref meets matching criteria)

    skip => YES|NO Checks the three skip attributes against the lower primary_ref node. This can also be set in the 'before_method' upon arrival at that node.

    secondary_ref => if match eq 'YES' then built like the primary ref

    secondary_type => if match eq 'YES' then calculated like the primary type

    branch_ref => stack trace
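The matching rules above (key exists for hashrefs, position exists for arrayrefs, content for scalars) can be sketched as a small stand-alone function. The name element_match and its argument order are hypothetical; this only restates the rules in code and is not how the class is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: work out the 'match' value for one element.
# $type is the node type of the primary ref, $item is the hash key,
# array position, or scalar value being tested.
sub element_match {
    my ( $type, $item, $secondary ) = @_;
    my $matched = 0;
    if( $type eq 'HASH' ){
        # Key matching only, the values are not compared
        $matched = ( ref( $secondary ) eq 'HASH' and exists $secondary->{$item} ) ? 1 : 0;
    }elsif( $type eq 'ARRAY' ){
        # Position exists testing only, no content compare
        $matched = ( ref( $secondary ) eq 'ARRAY' and $item <= $#$secondary ) ? 1 : 0;
    }elsif( $type eq 'SCALAR' ){
        # Scalars are matched on content
        $matched = ( defined $secondary and !ref( $secondary ) and $secondary eq $item ) ? 1 : 0;
    }
    return $matched ? 'YES' : 'NO';
}
```

Note that a hash whose keys line up but whose values differ still reports 'YES' at that level. The difference only shows up further down the walk.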

A position trace is generated

The current node list position is then documented and pushed onto the array at $passed_ref->{branch_ref}. The array reference stored in branch_ref can be thought of as the stack trace that documents the node elements directly between the current position and the initial (or zeroth) level of the parsed primary data_ref. Past completed branches and future pending branches are not maintained. Each element of the branch_ref contains four positions used to describe the node and selections used to traverse that node level. The values in each sub position are;

        [
                ref_type, #The node reference type
                the list item value or '' for ARRAYs,
                        #key name for hashes, scalar value for scalars
                element sequence position (from 0),
                        #For hashes this is only relevant if sort_HASH is called
                level of the node (from 0),
                        #The zeroth level is the initial data ref
        ]
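To make the four positions concrete, here is a small sketch that turns a branch_ref into a readable path string. The helper name branch_path is hypothetical, and the sample trace is hand written to follow the layout above (it mirrors the Hello_ref data from the SYNOPSIS) rather than captured from a real run.

```perl
use strict;
use warnings;

# Hypothetical helper: render each branch_ref element as {key} or [position]
sub branch_path {
    my ( @branch ) = @_;
    my $path = '';
    for my $step ( @branch ){
        my ( $ref_type, $item, $position, $level ) = @$step;
        if( $ref_type eq 'HASH' ){
            $path .= '{' . $item . '}';        # hashes are described by key name
        }elsif( $ref_type eq 'ARRAY' ){
            $path .= '[' . $position . ']';    # arrays are described by position
        }
    }
    return $path;
}

my @trace = (
    [ 'HASH',  'level', 0, 0 ],
    [ 'ARRAY', '',      0, 1 ],
    [ 'HASH',  'level', 0, 2 ],
);
print branch_path( @trace ) . "\n";
```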

Going deeper in the data

The down level ref is then passed as a new data set to be parsed and it starts at the before_method again.

Actions on return from recursion

When the values are returned from the recursion call the last branch_ref element is popped off and the returned data ref is used to replace the sub elements of the primary_ref and secondary_ref associated with that list element in the current level of the $passed_ref. If there are still pending items in the node element list then the program processes them too.

Assess and implement the after_method

After the node elements have all been processed the class checks for an available 'after_method' using the test;

        exists $passed_ref->{after_method};

If the test passes then the following sequence is run.

        $method = $passed_ref->{after_method};
        $passed_ref = $self->$method( $passed_ref );

If the $passed_ref is modified by the 'after_method' then the recursive parser will parse the new ref and not the old one.

Go up

The updated $passed_ref is passed back up to the next level.

Attributes

Data passed to ->new when creating an instance. For modification of these attributes see Public Methods. The ->new function will either accept fat comma lists or a complete hash ref that has the possible attributes as the top keys. Additionally some attributes that have the following prefixed methods; get_$name, set_$name, clear_$name, and has_$name can be passed to _process_the_data and will be adjusted for just the run of that method call. These are called one shot attributes. Nested calls to _process_the_data will be tracked and the attribute will remain in force until the parser returns to the calling 'one shot' level. Previous attribute values are restored after the 'one shot' attribute value expires.
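The save, override, and restore cycle behind one shot attributes can be pictured with a plain hash standing in for the instance attributes. This is only a sketch of the idea: with_one_shot and its arguments are hypothetical names, and the tracking of nested calls done by the class is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Data::Walk::Extracted.
# %$attributes stands in for the attributes of an instance.
sub with_one_shot {
    my ( $attributes, $overrides, $code ) = @_;
    my %saved;
    for my $name ( keys %$overrides ){
        # Remember whether the attribute was set and what it held
        $saved{$name} = [ exists $attributes->{$name}, $attributes->{$name} ];
        $attributes->{$name} = $overrides->{$name};
    }
    my @result = $code->();    # the walk runs with the temporary values
    for my $name ( keys %saved ){
        my ( $was_set, $old_value ) = @{ $saved{$name} };
        if( $was_set ){ $attributes->{$name} = $old_value }
        else{ delete $attributes->{$name} }
    }
    return @result;
}

my $attributes = { skip_level => 2 };
my ( $seen ) = with_one_shot(
    $attributes,
    { skip_level => 5, fixed_primary => 1 },
    sub{ return $attributes->{skip_level} },
);
```

During the call the code ref sees skip_level as 5. Afterwards skip_level is back to 2 and fixed_primary is unset again, which is the behavior described for an expired 'one shot' value.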

sorted_nodes

    Definition: If the primary_type of the $element_ref is a key in this attribute hash ref then the node list is sorted. If the value of that key is a CODEREF then the sort function will be called as follows.

            @node_list = sort $coderef @node_list

    For the type 'ARRAY' the node is sorted (permanently) by the element values. This means that if the array contains a list of references it will effectively sort against the ASCII of the memory pointers. Additionally the 'secondary_ref' node is not sorted, so prior alignment may break. In general ARRAY sorts are not recommended.

    Default {} #Nothing is sorted

    Range This accepts a HashRef.

    Example:

            sorted_nodes =>{
                    ARRAY   => 1,#Will sort the primary_ref only
                    HASH    => sub{ $b cmp $a }, #reverse sort the keys
            }
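The following stand-alone snippet shows both points from the definition using only core Perl: a CODEREF handed to sort, and what happens when a list of references is sorted without one. The sample data is invented for the illustration.

```perl
use strict;
use warnings;

my $coderef   = sub{ $b cmp $a };           # same comparator as the HASH example
my @node_list = qw( alpha gamma beta );
@node_list    = sort $coderef @node_list;   # reverse string order

# Sorting references compares their stringified form, e.g. 'HASH(0x...)'
my @refs    = ( { name => 'b' }, { name => 'a' } );
my @by_addr = sort @refs;                   # order depends on memory layout
my @by_name = sort { $a->{name} cmp $b->{name} } @refs;
```

The @by_addr order is not stable between runs, which is why ARRAY sorts over nested data are rarely useful without a custom CODEREF.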

skipped_nodes

    Definition: If the primary_type of the $element_ref is a key in this attribute hash ref then the 'before_method' and 'after_method' are run at that node but no parsing is done.

    Default {} #Nothing is skipped

    Range This accepts a HashRef.

    Example:

            skipped_nodes =>{
                    OBJECT => 1,#skips all object nodes
            }

skip_level

    Definition: This attribute is set to skip (or not) node parsing at the set level. Because the process doesn't start checking until after it enters the data ref it effectively ignores a skip_level set to 0 (The base node level). The test checks against the value in the last position of the prior trace array ref + 1.

    Default undef = Nothing is skipped

    Range This accepts an integer

skip_node_tests

    Definition: This attribute contains a list of test conditions used to skip certain targeted nodes. The test can target an array position, match a hash key, even restrict the test to only one level. The test is run against the latest branch_ref element so it skips the node below the matching conditions not the node at the matching conditions. Matching is done with '=~' and so will accept a regex or a string. The attribute contains an ArrayRef of ArrayRefs. Each sub_ref contains the following;

      $type - This is any of the identified reference node types

      $key - This is either a scalar or regex to use for matching a hash key

      $position - This is used to match an array position. It can be an integer or 'ANY'

      $level - This restricts the skipping test usage to a specific level only or 'ANY'

    Example:

            [
                    [ 'HASH', 'KeyWord', 'ANY', 'ANY'],
                    # Skip the node below the value of any hash key eq 'KeyWord'
                    [ 'ARRAY', 'ANY', '3', '4'],
                    # Skip the node stored in arrays at position three on level four
            ]

    Range An infinite number of skip tests added to an array

    Default [] = no nodes are skipped
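A sketch of how one branch_ref element could be compared against such a list is shown here. The helper name step_is_skipped is hypothetical, and the exact comparison rules ('ANY' as a wildcard, a pattern match for the key, numeric equality for position and level) are my reading of the description above rather than the code in the class.

```perl
use strict;
use warnings;

# Hypothetical helper: does this branch_ref element trigger any skip test?
# $step is [ ref_type, item, position, level ] and each test is
# [ $type, $key, $position, $level ].
sub step_is_skipped {
    my ( $step, $skip_tests ) = @_;
    my ( $ref_type, $item, $position, $level ) = @$step;
    TEST: for my $test ( @$skip_tests ){
        my ( $type, $key, $want_position, $want_level ) = @$test;
        next TEST if $ref_type ne $type;
        next TEST if $key ne 'ANY' and $item !~ $key;    # string or regex
        next TEST if $want_position ne 'ANY' and $position != $want_position;
        next TEST if $want_level ne 'ANY' and $level != $want_level;
        return 1;    # every condition of this test was met
    }
    return 0;
}

my $skip_node_tests = [
    [ 'HASH',  'KeyWord', 'ANY', 'ANY' ],
    [ 'ARRAY', 'ANY',     '3',   '4' ],
];
```

Because the key is used as a pattern, a plain string such as 'KeyWord' also matches longer keys that contain it. Use an anchored regex (for example qr/^KeyWord$/) when an exact match is wanted.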

change_array_size

    Definition: This attribute will not be used by this class directly. However the Data::Walk::Prune role may share it with other roles in the future so it is placed here so there will be no conflicts. This is usually used to define whether an array size shrinks when an element is removed.

    Default 1 (This probably means that the array will shrink when a position is removed)

    Range Boolean values.

fixed_primary

    Definition: This means that no changes made at lower levels will be passed upwards into the final ref.

    Default 0 = The primary ref is not fixed (and can be changed) 0 -> effectively deep clones the portions of the primary ref that are traversed.

    Range Boolean values.

Methods

Methods used to write roles

These are methods that are not meant to be exposed to the final user of a composed role and class but are used by the role to exercise the class.

_process_the_data( $passed_ref, $conversion_ref )

    Definition: This method is the gate keeper to the recursive parsing of Data::Walk::Extracted. This method ensures that the minimum requirements for the recursive data parser are met. If needed it will use a conversion ref (also provided by the caller) to change input hash keys to the generic hash keys used by this class. This function then calls the actual recursive function. For an overview of the recursive steps see the flow outline.

    Accepts: ( $passed_ref, $conversion_ref )

      $passed_ref this ref contains key value pairs as follows;

        primary_ref - a dataref that the walker will walk - required

          review the $conversion_ref functionality in this function for renaming of this key.

        secondary_ref - a dataref that is used for comparison while walking. - optional

          review the $conversion_ref functionality in this function for renaming of this key.

        before_method - a method name that will perform some action at the beginning of each node - optional

        after_method - a method name that will perform some action at the end of each node - optional

        [attribute name] - supported attribute names are accepted with temporary attribute settings here. These settings are temporarily set for a single "_process_the_data" call and then the original attribute values are restored.

      $conversion_ref This allows a public method to accept different key names for the various keys listed above and then convert them later to the generic terms used by this class. - optional

      Example

              $passed_ref ={
                      print_ref =>{
                              First_key => [
                                      'first_value',
                                      'second_value'
                              ],
                      },
                      match_ref =>{
                              First_key       => 'second_value',
                      },
                      before_method   => '_print_before_method',
                      after_method    => '_print_after_method',
                      sorted_nodes    =>{ ARRAY => 1 },#One shot attribute setter
              }
      
              $conversion_ref ={
                      primary_ref     => 'print_ref',# generic_name => role_name,
                      secondary_ref   => 'match_ref',
              }

    Returns: the $passed_ref (only) with the key names restored to the ones passed to this method using the $conversion_ref.

_build_branch( $seed_ref, @arg_list )

    Definition: There are times when a role will wish to reconstruct the data branch that led from the 'zeroth' node to where the data walker currently is. This private method takes a seed reference and uses data found in the branch ref to recursively append to the front of the seed until a complete branch to the zeroth node is generated. The branch_ref list must be explicitly passed.

    Accepts: a list of arguments starting with the $seed_ref to build from. The remaining arguments are just the array elements of the 'branch ref'.

    Example:

            $ref = $self->_build_branch(
                    $seed_ref,
                    @{ $passed_ref->{branch_ref}},
            );

    Returns: a data reference with the current path back to the start pre-pended to the $seed_ref
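To show the idea behind the method, here is a stand-alone sketch that wraps a seed in one layer per branch_ref element, working from the deepest element back to the zeroth node. The function name build_branch_sketch is hypothetical, and details such as leaving earlier array positions undefined are assumptions of this sketch, not documented behavior of _build_branch.

```perl
use strict;
use warnings;

# Hypothetical re-creation of the idea, not the code in this class.
sub build_branch_sketch {
    my ( $seed_ref, @branch ) = @_;
    return $seed_ref if !@branch;    # reached the zeroth node
    my ( $ref_type, $item, $position ) = @{ pop @branch };    # deepest step first
    my $wrapped;
    if( $ref_type eq 'HASH' ){
        $wrapped = +{ $item => $seed_ref };
    }elsif( $ref_type eq 'ARRAY' ){
        my @list;
        $list[$position] = $seed_ref;    # earlier positions stay undef
        $wrapped = \@list;
    }else{
        $wrapped = $seed_ref;            # base types add no layer
    }
    return build_branch_sketch( $wrapped, @branch );
}

my $rebuilt = build_branch_sketch(
    'leaf',
    [ 'HASH',  'a', 0, 0 ],
    [ 'ARRAY', '',  1, 1 ],
);
```

With the two element trace above the result has the shape { a => [ undef, 'leaf' ] }.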

_extracted_ref_type( $test_ref )

    Definition: In order to manage data types necessary for this class a data walker compliant 'Type' tester is provided. This is necessary to support a few non perl-standard types not generated in standard perl typing systems. First, 'undef' is the UNDEF type. Second, strings and numbers both return as 'SCALAR' (not '' or undef). Much of the code in this package runs on dispatch tables that are built around these specific type definitions.

    Accepts: It receives a $test_ref that can be undef.

    Returns: a data walker type or it confesses.
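The type rules described above can be approximated with core Perl as follows. This is a sketch under assumptions: the function name is hypothetical, only the four supported walking types plus 'OBJECT' are handled, and the real method may recognize more types or word its failure differently.

```perl
use strict;
use warnings;
use Carp qw( confess );
use Scalar::Util qw( blessed );

# Hypothetical stand-in for the type tester described above.
sub extracted_ref_type_sketch {
    my ( $test_ref ) = @_;
    return 'UNDEF'  if !defined $test_ref;    # undef gets its own type
    return 'OBJECT' if blessed( $test_ref );  # any blessed reference
    my $type = ref( $test_ref );
    return 'SCALAR' if $type eq '';           # plain strings and numbers
    return $type if $type eq 'ARRAY' or $type eq 'HASH';
    confess "Unsupported reference type: $type";
}
```

Returning 'SCALAR' and 'UNDEF' instead of an empty string means every value maps to a usable dispatch table key.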

_get_had_secondary

    Definition: During the initial processing of data in _process_the_data the existence of a passed secondary ref is tested and stored in the attribute '_had_secondary'. On occasion a role might need to know if a secondary ref existed at any level even if it is not represented at the current level.

    Accepts: nothing

    Returns: True|1 if the secondary ref ever existed

_get_current_level

    Definition: On occasion you may need one of the methods to know what level is currently being parsed. This will provide that information in integer format.

    Accepts: nothing

    Returns: the integer value for the level

Public Methods

add_sorted_nodes( NODETYPE => 1, )

    Definition: This method is used to add nodes to be sorted to the walker by adjusting the attribute sorted_nodes.

    Accepts: Node key => value pairs where the key is the Node name and the value is 1. This method can accept multiple key => value pairs.

    Returns: nothing

has_sorted_nodes

    Definition: This method checks if any sorting is turned on in the attribute sorted_nodes.

    Accepts: Nothing

    Returns: the count of sorted node types listed

check_sorted_nodes( NODETYPE )

    Definition: This method is used to see if a node type is sorted by testing the attribute sorted_nodes.

    Accepts: the name of one node type

    Returns: true if that node is sorted as determined by sorted_nodes

clear_sorted_nodes

    Definition: This method will clear all values in the attribute sorted_nodes and therefore turn off all sorts.

    Accepts: nothing

    Returns: nothing

remove_sorted_node( NODETYPE1, NODETYPE2, )

    Definition: This method will clear the key / value pairs in sorted_nodes for the listed items.

    Accepts: a list of NODETYPES to delete

    Returns: In list context it returns a list of values in the hash for the deleted keys. In scalar context it returns the value for the last key specified
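The return values described here mirror what Perl's built-in delete does on a hash slice. The snippet below demonstrates that built-in behavior on a plain hash; whether the method is implemented with exactly this call is an assumption, and the values 1, 2, 3 are only there to make the results easy to tell apart.

```perl
use strict;
use warnings;

my %sorted_nodes = ( ARRAY => 1, HASH => 2, SCALAR => 3 );

# List context: one value per deleted key, in the order requested
my @removed = delete @sorted_nodes{ qw( ARRAY HASH ) };

# Scalar context: only the value for the last key requested
my %second_copy = ( ARRAY => 1, HASH => 2, SCALAR => 3 );
my $last_value  = delete @second_copy{ qw( ARRAY HASH ) };
```

Here @removed holds ( 1, 2 ) and $last_value holds 2. In both cases only the SCALAR key is left in the hash.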

set_sorted_nodes( $hashref )

    Definition: This method will completely reset the attribute sorted_nodes to $hashref.

    Accepts: a hashref of NODETYPE keys with the value of 1.

    Returns: nothing

get_sorted_nodes

    Definition: This method will return a hashref of the attribute sorted_nodes

    Accepts: nothing

    Returns: a hashref

add_skipped_nodes( NODETYPE1 => 1, NODETYPE2 => 1 )

    Definition: This method adds additional skip definition(s) to the skipped_nodes attribute.

    Accepts: a list of key value pairs as used in 'skipped_nodes'

    Returns: nothing

has_skipped_nodes

    Definition: This method checks if any nodes are set to be skipped in the attribute skipped_nodes.

    Accepts: Nothing

    Returns: the count of skipped node types listed

check_skipped_node( $string )

    Definition: This method checks if a specific node type is set to be skipped in the skipped_nodes attribute.

    Accepts: a string

    Returns: Boolean value indicating if the specific $string is set

remove_skipped_nodes( NODETYPE1, NODETYPE2 )

    Definition: This method deletes specifically identified node skips from the skipped_nodes attribute.

    Accepts: a list of NODETYPES to delete

    Returns: In list context it returns a list of values in the hash for the deleted keys. In scalar context it returns the value for the last key specified

clear_skipped_nodes

    Definition: This method clears all data in the skipped_nodes attribute.

    Accepts: nothing

    Returns: nothing

set_skipped_nodes( $hashref )

    Definition: This method will completely reset the attribute skipped_nodes to $hashref.

    Accepts: a hashref of NODETYPE keys with the value of 1.

    Returns: nothing

get_skipped_nodes

    Definition: This method will return a hashref of the attribute skipped_nodes

    Accepts: nothing

    Returns: a hashref

set_skip_level( $int )

    Definition: This method is used to reset the skip_level attribute after the instance is created.

    Accepts: an integer (negative numbers and 0 will be ignored)

    Returns: nothing

get_skip_level()

    Definition: This method returns the current skip_level attribute.

    Accepts: nothing

    Returns: an integer

has_skip_level()

    Definition: This method is used to test if the skip_level attribute is set.

    Accepts: nothing

    Returns: $Bool value indicating if the 'skip_level' attribute has been set

clear_skip_level()

    Definition: This method clears the skip_level attribute.

    Accepts: nothing

    Returns: nothing (always successful)

set_skip_node_tests( ArrayRef[ArrayRef] )

    Definition: This method is used to change (completely) the 'skip_node_tests' attribute after the instance is created. See skip_node_tests for an example.

    Accepts: an array ref of array refs

    Returns: nothing

get_skip_node_tests()

    Definition: This method returns the current master list from the skip_node_tests attribute.

    Accepts: nothing

    Returns: an array ref of array refs

has_skip_node_tests()

    Definition: This method is used to test if the skip_node_tests attribute is set.

    Accepts: nothing

    Returns: The number of sub array refs there are in the list

clear_skip_node_tests()

    Definition: This method clears the skip_node_tests attribute.

    Accepts: nothing

    Returns: nothing (always successful)

add_skip_node_tests( ArrayRef1, ArrayRef2 )

    Definition: This method adds additional skip_node_test definition(s) to the skip_node_tests attribute list.

    Accepts: a list of array refs as used in 'skip_node_tests'. These are pushed onto the existing list.

    Returns: nothing

set_change_array_size( $bool )

    Definition: This method is used to (re)set the change_array_size attribute after the instance is created.

    Accepts: a Boolean value

    Returns: nothing

get_change_array_size()

    Definition: This method returns the current state of the change_array_size attribute.

    Accepts: nothing

    Returns: $Bool value representing the state of the 'change_array_size' attribute

has_change_array_size()

    Definition: This method is used to test if the change_array_size attribute is set.

    Accepts: nothing

    Returns: $Bool value indicating if the 'change_array_size' attribute has been set

clear_change_array_size()

    Definition: This method clears the change_array_size attribute.

    Accepts: nothing

    Returns: nothing

set_fixed_primary( $bool )

    Definition: This method is used to change the fixed_primary attribute after the instance is created.

    Accepts: a Boolean value

    Returns: nothing

get_fixed_primary()

    Definition: This method returns the current state of the fixed_primary attribute.

    Accepts: nothing

    Returns: $Bool value representing the state of the 'fixed_primary' attribute

has_fixed_primary()

    Definition: This method is used to test if the fixed_primary attribute is set.

    Accepts: nothing

    Returns: $Bool value indicating if the 'fixed_primary' attribute has been set

clear_fixed_primary()

    Definition: This method clears the fixed_primary attribute.

    Accepts: nothing

    Returns: nothing

Definitions

node

Each branch point of a data reference is considered a node. The possible paths deeper into the data structure from the node are followed 'vertically first' in recursive parsing. The original top level reference is considered the 'zeroth' node.

base node type

Recursion 'base' node types are considered to not have any possible deeper branches. Currently that list is SCALAR and UNDEF.

Supported node walking types

ARRAY
HASH
SCALAR
UNDEF

Other node support

Support for Objects is partially implemented and as a consequence '_process_the_data' won't immediately die when asked to parse an object. It will still die but on a dispatch table call that indicates where there is missing object support, not at the top of the node. This allows for some of the skip attributes to use 'OBJECT' in their definitions.

Supported one shot attributes

See Attributes for an explanation of one shot attributes.

sorted_nodes
skipped_nodes
skip_level
skip_node_tests
change_array_size
fixed_primary

Dispatch Tables

This class uses the role Data::Walk::Extracted::Dispatch to implement dispatch tables. When there is a decision point, that role is used to make the class extensible.

Caveat utilitor

This is not an extension of Data::Walk.

The core class has no external effect. All output comes from additions to the class.

This module uses the 'defined or' ( //= ) and so requires perl 5.010 or higher.
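For readers who have not met the operator, this short snippet contrasts 'defined or' assignment with the older '||=' idiom. The hash keys are borrowed from the attribute names in this document purely as sample data.

```perl
use 5.010;
use strict;
use warnings;

my %settings = ( change_array_size => 0 );
$settings{change_array_size} //= 1;    # stays 0 because 0 is defined
$settings{fixed_primary}     //= 0;    # was undefined, so it is set to 0

my %with_or = ( change_array_size => 0 );
$with_or{change_array_size} ||= 1;     # becomes 1 because 0 is false
```

The difference matters for Boolean attributes, where an explicit 0 from the caller must not be replaced by a default.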

This is a Moose based data handling class. Many coders will tell you Moose and data manipulation don't belong together. They are most certainly right in speed intensive circumstances.

Recursive parsing is not a good fit for all data since very deep data structures will fill up a fair amount of memory! As the module recursively parses through the levels it leaves behind snapshots of the previous level that allow it to keep track of its location.

The passed data references are effectively deep cloned during this process. To leave the primary_ref pointer intact see fixed_primary.

Build/Install from Source

1. Download a compressed file with the code

2. Extract the code from the compressed file. If you are using tar this should work:

        tar -zxvf Data-Walk-Extracted-v0.xx.xx.tar.gz

3. Change (cd) into the extracted directory

4. Run the following commands

    (For Windows find what version of make was used to compile your perl)

            perl  -V:make

    (then for Windows substitute the correct make function (ex. s/make/dmake/g))

        >perl Makefile.PL

        >make

        >make test

        >make install # As sudo/root

        >make clean

SUPPORT

TODO

    1. provide full recursion through Objects

    2. Support recursion through CodeRefs (Closures)

    3. Add a Data::Walk::Diff Role to the package

    4. Add a Data::Walk::Top Role to the package

    5. Add a Data::Walk::Thin Role to the package

    6. Convert test suite to Test2 direct usage

AUTHOR

Jed Lund
jandrew@cpan.org

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

This software is copyrighted (c) 2012, 2016 by Jed Lund.

Dependencies

SEE ALSO