Toby Inkster > Web-Magic-0.009 > Web::Magic::Examples

Download:
Web-Magic-0.009.tar.gz

Annotate this POD

Website

View/Report Bugs
Source  

NAME ^

Web::Magic::Examples - using Web::Magic in practice

ASSUMPTIONS ^

Most of these examples assume you have something like this near the top of your script:

 use 5.010;
 use strict;
 use utf8::all;
 use Web::Magic
   -quotelike => 'web',
   -feature   => [qw(HTML XML RDF JSON YAML Feeds)];

WEB::MAGIC VERSUS MECH ^

Bulk Downloading

Mech:

 #!/usr/bin/perl -w
 
 use strict;
 use WWW::Mechanize;
 
 my $start = "http://www.stevemcconnell.com/cc2/cc.htm";
 
 my $mech = WWW::Mechanize->new( autocheck => 1 );
 $mech->get( $start );
 
 my @links = $mech->find_all_links( url_regex => qr/\d+.+\.pdf$/ );
 
 for my $link ( @links ) {
   my $url = $link->url_abs;
   my $filename = $url;
   $filename =~ s[^.+/][];
   
   print "Fetching $url";
   $mech->get( $url, ':content_file' => $filename );
   
   print "   ", -s $filename, " bytes\n";
 }

Web::Magic:

 web <http://www.stevemcconnell.com/cc2/cc.htm>
   -> assert_success
   -> findnodes('~links')
   -> map(sub { Web::Magic->new($_) })
   -> grep(sub { $_->uri =~ /.pdf$/ })
   -> foreach(sub {
        printf("Fetching %s", $_->uri);
        my $filename = ($_->uri->path_segments)[-1]
        $_->save_as($filename);
        printf("   %d bytes\n", -s $filename);
      });

This example (from the Mech documentation) doesn't actually work now, as Steve McConnell has reorganised his website.

WEB::MAGIC VERSUS MOJO ^

Web scraping

Mojo:

  # Fetch web site
  my $ua = Mojo::UserAgent->new;
  my $tx = $ua->get('mojolicio.us/perldoc');

  # Extract title
  say 'Title: ', $tx->res->dom->at('head > title')->text;

  # Extract headings
  $tx->res->dom('h1, h2, h3')->each(sub {
    say 'Heading: ', shift->all_text;
  });

Web::Magic doesn't look radically different:

 my $tx = web <http://mojolicio.us/perldoc>;
 
 say "Title: ", $tx->querySelector('head > title')->textContent;
 
 $tx->querySelectorAll('h1, h2, h3')->foreach(sub {
   say 'Heading: ', $_->textContent;
 });

One interesting feature is that it's not until the "say" line that Web::Magic actually performs a network request. That means that it's "cheap" to do things like this in Web::Magic:

 my $r1 = web <http://example.com/r1>;
 my $r2 = web <http://example.com/r2>;
 say($config->{choice} ? $r1 : $r2);

Web::Magic won't waste resources by fetching both $r1 and $r2. Instantiating a Web::Magic object is not much more expensive than a URI object.

JSON web services

Mojo:

  # Fresh user agent
  my $ua = Mojo::UserAgent->new;

  # Fetch the latest news about Mojolicious from Twitter
  my $search = 'http://search.twitter.com/search.json?q=Mojolicious';
  for $tweet (@{$ua->get($search)->res->json->{results}}) {

    # Tweet text
    my $text = $tweet->{text};

    # Twitter user
    my $user = $tweet->{from_user};

    # Show both
    my $result = "$text --$user";
    utf8::encode $result;
    say $result;
  }

Web::Magic:

 my $search = web <http://search.twitter.com/search?q=Mojolicious>;
 
 # Show tweets.
 say "$_->{text} -- $_->{from_user}"
   for @{ $search->{results} };

Note that in the Web::Magic example, the URL doesn't include the ".json". Web::Magic knows you want JSON because you tried to dereference $search as a hashref, so it tells Twitter you want JSON (via the HTTP Accept request header), and thus you get JSON automatically.

We could have just as easily done:

 my $search = web <http://search.twitter.com/search?q=Mojolicious>;
 
 # Show tweets.
 printf("%s -- %s\n", $_->title, $_->author)
   for $search->entries;

And you'd get Atom automatically. Or maybe you prefer RSS... the code is slightly more involved, but end result is the same:

 my $search = web <http://search.twitter.com/search?q=Mojolicious>;
 
 # Show tweets.
 printf("%s -- %s\n", $_->title, $_->author)
   for $search->assert_content_type('application/rss+xml')->entries;

Twitter do supply an awful lot of formats, don't they?

MORE EXAMPLES ^

XML web services

Get the temperature from Yahoo Weather.

 my %opts = (
   'w'  => 26191, # Yahoo "Where on Earth ID"
   'u'  => 'c',   # Unit: 'c' or 'f'
   );
 
 my ($temperature) = Web::Magic
   -> new(q<http://weather.yahooapis.com/forecastrss>, %opts)
   -> findnodes('//yweather:condition/@temp')
   -> map(sub { $_->value });
 
 say "Temperature is $temperature";

Find links

Finds all links on Google's homepage.

 web <http://www.google.co.uk/>
   -> assert_success
   -> assert_content_type('text/html')
   -> make_absolute_urls
   -> findnodes('~links')
   -> foreach(sub {
           printf "%s <%s>\n",
             $_->{title} || $_->textContent,
             $_->{href},
      })
   ;

Semantic Web

 use RDF::Trine qw/variable statement/;
 
 my $foaf    = RDF::Trine::Namespace->new('http://xmlns.com/foaf/0.1/');
 my $pattern = RDF::Trine::Pattern->new(
     statement(variable('w'), $foaf->name, variable('x')),
     statement(variable('w'), $foaf->homepage, variable('y')),
     );
 
 web <http://example.com/foaf.rdf>
   -> get_pattern($pattern)
   -> each(sub {
           my $result = shift;
           printf("%s has homepage %s.\n", $result->{x}, $result->{y});
      })
   ;

Tapping

Web::Magic provides a Ruby-inspired tap method (see Object::Tap). This enables coolness like:

 web <http://www.perlmonks.org/>
   -> assert_success
   -> assert_content_type('text/html')
   -> make_absolute_urls
   -> tap(sub {
           $_ -> findnodes('~links')
              -> foreach(sub {
                      printf "%s <%s>\n",
                      $_->{title} || $_->textContent,
                      $_->{href},
                 })
      })
   -> tap(sub {
           $_ -> findnodes('~images')
              -> foreach(sub {
                      printf "IMG: <%s>\n",
                      $_->{src},
                 })
      })
   ;

Basically, the coderef passed to tap is executed after assigning $_ to point to the Web::Magic object. Then the Web::Magic object itself is returned so that tap can be chained.

Arguably this takes chaining too far. ;-)

Modifying a document with WebDAV

Simple example of downloading an existing HTML document, using the DOM to modify it, and then re-upload it via HTTP PUT.

  my $doc = Web::Magic
    -> new('http://example.com/mydoc.html')
    -> to_dom;
  
  $doc->querySelector('div#footer')->appendText($disclaimer);
  
  Web::Magic
    -> new('http://example.com/mydoc.html')
    -> PUT($doc)
    -> Content_Type('text/html');

Another example. Here we download JSON, add some data and re-upload in YAML to a different server.

  my $data = Web::Magic
    -> new('http://tobyink.example.com/me.json')
    -> to_hashref;
  
  push @{ $data->{tel} }, $phone_number;
  
  Web::Magic
    -> new('http://team.example.com/tobyink.yaml')
    -> PUT($data)
    -> Content_Type('text/x-yaml');

POSTing data

A cool thing about Web::Magic is that it will happily accept references, even blessed objects, as body data for HTTP POST, PUT, etc requests. How exactly that is serialized depends on what sort of reference you've given it, and what request Content-Type header you set.

"application/x-www-form-urlencoded" (yes, that's a horrible name) is the most commonly used content type. It's the default for HTML form submissions. It's basically a bunch of key-value pairs, but unlike Perl hashes, keys may be repeated. You can submit this format using:

  web <http://example.com/>
    -> POST({ key1 => 'value1', key2 => 'value2' })
    -> Content_Type('application/x-www-form-urlencoded');

Though that's the default Content-Type, so you don't need to set it explicitly.

  web <http://example.com/>
    -> POST({ key1 => 'value1', key2 => 'value2' });

You may pass an arrayref instead of a hashref. (This helps if you need to repeat a key.)

  web <http://example.com/>
    -> POST([ key1 => 'value1', key2 => 'value2' ]);

Most web browsers do support another media type for POSTing data: "multipart/form-data". This has an advantage over "application/x-www-form-urlencoded" in that it allows file uploads. To use "multipart/form-data", you need to specify the Content-Type explicitly.

  web <http://example.com/>
    -> POST([ key1 => 'value1', key2 => 'value2' ])
    -> Content_Type('multipart/form-data');

As per "application/x-www-form-urlencoded", you may supply either a hashref or arrayref.

Note how so far all values have been simple scalars; to upload a file, your value must be an arrayref. The first element of the array is required. It is the filename of the file to read data from. The second (optional) element of the array if the filename you want to provide to the server. Further elements are treated as key=>value pairs which set additional headers for the uploaded file. (Each uploaded file has its own set of headers.) The most useful header to set is Content-Type.

  my @upload = ('/etc/motd', 'motd.txt', Content_Type => 'text/plain');
  web <http://example.com/>
    -> POST([ key1 => 'value1', key2 => \@upload ])
    -> Content_Type('multipart/form-data');

A future version of Web::Magic will hopefully add more flexibility, such as the ability to pass a file handle.

While "application/x-www-form-urlencoded" and "multipart/form-data" are common, HTTP does allow arbitrary media types to be POSTed. Here's an example of POSTing "text/plain":

  web <http://example.com/>
    -> POST("Hello world!\n")
    -> Content_Type('text/plain');

Note that this is not the same as uploading a plain text file using "multipart/form-data". Only POST types other than the "big two" if you know that you're definitely supposed to.

As per the "text/plain" example above, the general pattern is that you supply the HTTP body as a simple string. However, for certain content types, you can supply hashrefs, arrayrefs or blessed objects.

Here's a simple Atom Publishing Protocol example, taking the first entry from a local Atom feed, and POSTing it to an APP collection.

  my $dom   = XML::LibXML->load_xml(location => 'file.atom');
  my $entry = $dom->getElementsByTagName('entry')->get_node(1);
  my $title = $entry->getElementsByTagName('title')->get_node(1);
  web <http://example.com/app/>
    -> POST($entry)
    -> Content_Type('application/atom+xml; type=entry')
    -> Slug($title->textContent);

Of course, the POST method is nothing special. All of this works with other HTTP methods too, such as PUT. However for some HTTP methods such as GET and HEAD it is highly unusual (though permissible) to include a request body.

SEE ALSO ^

Web::Magic.

AUTHOR ^

Toby Inkster <tobyink@cpan.org>.

COPYRIGHT AND LICENCE ^

This document is copyright (c) 2012 by Toby Inkster.

This is document is part of a free software project; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

This document is additionally available under the Creative Commons Attribution-ShareAlike 2.0 UK: England & Wales (CC BY-SA 2.0) licence.

DISCLAIMER OF WARRANTIES ^

THIS DOCUMENT IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.

syntax highlighting: