NAME

HTML::TreeBuilderX::ASP_NET - Scrape ASP.NET/VB.NET sites which utilize Javascript POST-backs.

SYNOPSIS

        use LWP::UserAgent;
        use HTML::TreeBuilder;
        use HTML::TreeBuilderX::ASP_NET;

        my $ua   = LWP::UserAgent->new;
        my $resp = $ua->get('http://uniqueUrl.com/Server.aspx');
        my $root = HTML::TreeBuilder->new_from_content( $resp->content );

        ## find the anchor whose href calls __doPostBack
        my $a = $root->look_down( _tag => 'a', id => 'nextPage' );
        my $aspnet = HTML::TreeBuilderX::ASP_NET->new({
                element   => $a
                , baseURL => $resp->request->uri ## takes into account posting redirects
        });

        ## ->httpRequest returns the HTTP::Request for the post-back
        my $next = $ua->request( $aspnet->httpRequest );

        ## or the easy cheating way; see the SEE ALSO section for links
        my $aspnet = HTML::TreeBuilderX::ASP_NET->new_with_traits( traits => ['htmlElement'] );
        $form->look_down(_tag=> 'a')->httpRequest

DESCRIPTION

Scrape ASP.NET sites which use the framework's __VIEWSTATE, __EVENTTARGET, __EVENTARGUMENT, __LASTFOCUS, et al. This module builds an HTTP::Request from the form, which you obtain with the method ->httpRequest.

In this scheme many of the links on a webpage will appear to be JavaScript functions. The default JavaScript function is __doPostBack(eventTarget, eventArgument). ASP.NET has two hidden fields which record state: __VIEWSTATE and __LASTFOCUS. It abstracts each link with a method that uses an HTTP post-back to the server. The JavaScript behind __doPostBack simply appends __EVENTTARGET=$eventTarget&__EVENTARGUMENT=$eventArgument onto the POST request from the parent form and submits it. When the server receives this request it decodes and decompresses the __VIEWSTATE and uses it along with the new __EVENTTARGET and __EVENTARGUMENT to perform the action, which is often no more than serializing the data back into the __VIEWSTATE.
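To make the post-back concrete, here is a minimal core-Perl sketch of the request body that __doPostBack effectively produces. The helper names form_escape and build_postback_body are hypothetical and are not part of this module; the sketch also assumes byte strings and sorts the fields only so the output is reproducible.

```perl
use strict;
use warnings;

# Percent-encode one value for application/x-www-form-urlencoded.
# Assumes a byte string: safe characters pass through, a space becomes '+',
# and every other byte becomes %XX.
sub form_escape {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-_.* ])/sprintf('%%%02X', ord $1)/ge;
    $value =~ tr/ /+/;
    return $value;
}

# Hypothetical helper: merge the form's hidden fields with the post-back
# target and argument, then serialize everything as a POST body.
sub build_postback_body {
    my ($hidden, $target, $argument) = @_;
    my %fields = (
        %$hidden,
        __EVENTTARGET   => $target,
        __EVENTARGUMENT => $argument,
    );
    return join '&',
        map { form_escape($_) . '=' . form_escape( $fields{$_} ) }
        sort keys %fields;
}

my $body = build_postback_body(
    { __VIEWSTATE => 'encryptedXML-FOO' },
    'gotopage', '2',
);
print "$body\n";
```

A real form usually carries more hidden fields than __VIEWSTATE; they would all go into the first argument unchanged.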

Sometimes developers cloak the __doPostBack(target,arg) with names akin to changepage(arg) which simply call __doPostBack("target", arg). This module handles this use case as well, using an explicit eventTriggerArgument in the constructor.
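As a rough illustration of the cloaked case, the sketch below turns a wrapper call such as changepage(2) into a { target => argument } HashRef, the shape that eventTriggerArgument is documented to take. The %wrapper_target table and the wrapper_to_trigger function are hypothetical; you would have to read the page's script to learn which target a given wrapper passes along.

```perl
use strict;
use warnings;

# Hypothetical map from a site's wrapper function to the __EVENTTARGET it
# hands to __doPostBack; filled in by hand after reading the page's script.
my %wrapper_target = ( changepage => 'gotopage' );

# Turn "javascript:changepage(2)" into { gotopage => '2' }.
# Returns undef when the href does not call a known wrapper.
sub wrapper_to_trigger {
    my ($href) = @_;
    my ($name, $arg) =
        $href =~ /^javascript:\s*(\w+)\(\s*['"]?([^'")]*?)['"]?\s*\)/
        or return;
    my $target = $wrapper_target{$name} // return;
    return +{ $target => $arg };
}

my $trigger = wrapper_to_trigger('javascript:changepage(2)');
print join( '=', %$trigger ), "\n";
```

The resulting HashRef could then be supplied as eventTriggerArgument, together with the form, instead of an element.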

This flow is a bane on RESTful HTTP and makes no sense whatsoever. Thanks Microsoft.

      .-------------------------------------------------------------------.
      |                            HTML FORM 1                            |
      | <form action="Server.aspx" method="post">                         |
      | <input type="hidden" name="__VIEWSTATE" value="encryptedXML-FOO"> |
      | <a>1</a> |                                                        |
      | <a href="javascript:__doPostBack('gotopage','2')">2</a>           |
      | ...                                                               |
      '-------------------------------------------------------------------'
                                        |
                                        v
                       _________________________________
                       \                                \
                        ) User clicks the link named "2" )
                       /________________________________/
                                        |
                                        v
   .------------------------------------------------------------------------.
   | POST http://aspxnonsensery/Server.aspx                                 |
   | Content-Length: 2659                                                   |
   | Content-Type: application/x-www-form-urlencoded                        |
   |                                                                        |
   | __VIEWSTATE=encryptedXML-FOO&__EVENTTARGET=gotopage&__EVENTARGUMENT=2  |
   '------------------------------------------------------------------------'
                                        |
                                        v
    .----------------------------------------------------------------------.
    |                             HTML FORM 2                              |
    |                       (different __VIEWSTATE)                        |
    | <form action="Server.aspx" method="post">                            |
    | <input type="hidden" name="__VIEWSTATE" value="encryptedXML-BAR">    |
    | <a href="javascript:__doPostBack('gotopage','1')">1</a> |            |
    | <a>2</a>                                                             |
    | ...                                                                  |
    '----------------------------------------------------------------------'

METHODS

IN ADDITION TO ALL OF THE METHODS FROM HTTP::Request::Form

->new({ hashref })

Takes a HashRef and returns a new instance. Some of the possible key/values are:

form => $htmlElement

optional: Explicitly supplies the HTML::Element representing the form. If you do not supply one, it will be deduced implicitly from $self->element, which makes element=>$htmlElement a requirement.

eventTriggerArgument => $hashRef

Not needed if you supply an element. This takes a HashRef and creates HTML::Elements that mimic hidden input fields, which are then tacked onto $self->form.

element => $htmlElement

Not needed if you send an eventTriggerArgument. Attempts to deduce the __EVENTARGUMENT and __EVENTTARGET from the 'href' attribute of the element just as if the two were supplied explicitly. It will also be used to deduce a form by looking up in the HTML tree if one is not supplied.

debug => *0|1

optional: Passes the debug flag on to HTTP::Request::Form (H:R:F); default is off.

baseURL => $uri

optional: Sets the base of the URL for the post action

->httpRequest

Returns an HTTP::Request object for the HTTP POST

->hrf

Explicitly returns the underlying HTTP::Request::Form object. All methods fall back to it anyway, but this returns the object directly.

FUNCTIONS

None of these are exported...

createInputElements( {eventTarget => eventArgument} )

Helper function that takes a single key/value pair in a HashRef. It assumes the key is the __EVENTTARGET and the value is the __EVENTARGUMENT, and returns two HTML::Element pseudo-input fields carrying that information.

parseDoPostBack( $str )

Accepts a string that is often the "href" attribute of an HTML::Element. It simply parses out the call to JavaScript, using regexes, and makes the two args usable to Perl in the form of a HashRef.
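The kind of parsing described here might look like the following. This is an illustrative stand-in written for this document, not the module's actual code; the function name parse_do_postback_sketch is hypothetical, and it assumes both arguments are quoted string literals with no escaped quotes inside.

```perl
use strict;
use warnings;

# Illustrative stand-in for parseDoPostBack (not the module's own code).
# Extracts the two quoted arguments of a __doPostBack call and returns them
# as { eventTarget => eventArgument }, or undef when no such call is found.
sub parse_do_postback_sketch {
    my ($str) = @_;
    return unless $str =~ m{
        __doPostBack \s* \( \s*
        (['"]) (.*?) \1      # first argument, in matching quotes
        \s* , \s*
        (['"]) (.*?) \3      # second argument, in matching quotes
        \s* \)
    }x;
    return +{ $2 => $4 };
}

my $parsed = parse_do_postback_sketch(
    q{javascript:__doPostBack('gotopage','2')}
);
print "$_ => $parsed->{$_}\n" for keys %$parsed;
```

An empty second argument, which ASP.NET pages emit often, comes back as an empty string rather than undef.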

SEE ALSO

HTML::TreeBuilderX::ASP_NET::Roles::htmlElement

For an easy way to glue the two together

HTTP::Request

For the object the method httpRequest returns

HTTP::Request::Form

For the underlying class, whose methods can all be called on this object

HTML::Element

For the base class of all HTML tokens

AUTHOR

Evan Carroll, <me at evancarroll.com>

BUGS

None, though *much* more support should be added to ->element. Not everything is a simple anchor tag.

SUPPORT

You can find documentation for this module with the perldoc command.

        perldoc HTML::TreeBuilderX::ASP_NET

COPYRIGHT & LICENSE

Copyright 2008 Evan Carroll, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.