LWPng

This note describes the redesign of the LWP Perl modules in order to add full support for the HTTP/1.1 protocol. The main change is the adoption of an event driven framework. This allows us to support multiple connections within a single client program. It was also a prerequisite for supporting HTTP/1.1 features like persistent connections and pipelining.

This note assumes that you are familiar with the concepts and programming interface of the previous versions of the libwww-perl modules (referred to as "LWP5" below). LWPng will probably become LWP6 when it is ready for general use.

HTTP/1.1

RFC 2068 is the proposed standard for the Hypertext Transfer Protocol version 1.1, usually denoted HTTP/1.1. The document is currently being revised by the IETF, and a draft standard document is expected soon. The latest draft is currently <draft-ietf-http-v11-spec-rev-03.txt>.

The HTTP/1.1 protocol uses the same basic message format as earlier versions of the protocol, and HTTP/1.1 clients/servers can easily adapt to peers which only know about the older versions of the protocol. HTTP/1.1 adds some new methods, some new status codes, and some new headers (it also eliminates some old unused or broken headers). One important change is that the Host header is now mandatory, which makes it possible to serve multiple domains from a single server without allocating a new IP address to each. Another change is support for partial content. The support for caching and proxies has also been much improved. There is also a standard mechanism for switching from HTTP/1.1 to some other (hopefully more suitable) protocol on the wire.

IMHO, the most important change with HTTP/1.1 is the introduction of persistent connections. This means that more than one request/response exchange can take place on a single TCP connection between a client and a server. This improves performance and generally interacts much better with how TCP works underneath. This also means that the peers must be able to tell the extent of the messages on the wire. In HTTP/1.0 the only way to do this was by using the Content-Length header and by closing the connection (which was only an option for the server). Use of the Content-Length header is not appropriate when the length of the message cannot be determined in advance. HTTP/1.1 introduces two new ways to delimit messages: the chunked transfer encoding and self-delimiting multipart content types. The chunked transfer encoding means that the message is broken into chunks of arbitrary sizes and that each chunk is preceded by a line specifying the number of bytes in the chunk. The multipart types use a special boundary byte pattern as a delimiter for the messages.
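
To make the chunked framing concrete, here is a minimal sketch of a decoder for a complete chunked body that is already held in memory. The function name decode_chunked() is made up for this note and is not part of LWPng. The sketch relies on the fact that the chunk size line is written in hexadecimal, and it simply ignores chunk extensions and trailers.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a complete chunked body held in one string.
sub decode_chunked {
    my ($wire) = @_;
    my $body = '';
    while (1) {
        # Chunk-size line: hex digits, optional extensions, then CRLF.
        $wire =~ s/^([0-9A-Fa-f]+)[^\r\n]*\r\n//
            or die "malformed chunk-size line";
        my $size = hex($1);
        last if $size == 0;                     # a zero-size chunk ends the body
        die "truncated chunk" if length($wire) < $size + 2;
        $body .= substr($wire, 0, $size, '');   # cut $size bytes off the front
        $wire =~ s/^\r\n// or die "missing CRLF after chunk data";
    }
    # What is left is the optional trailer and the final CRLF; ignored here.
    return $body;
}

my $wire = "4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n";
print decode_chunked($wire), "\n";
```

A real connection object cannot assume that the whole body has arrived, so it would have to keep the parser state between reads. That is left out here.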

With persistent connections one can improve performance even more by the use of a technique called "pipelining". This means that the client sends multiple requests down the connection without waiting for the response to the first request before sending the second. This can have a dramatic effect on the throughput for high latency links. [NOTE-pipelining-970624]

Event driven programming model

A prerequisite for any sensible support for HTTP persistent connections is to be able to handle both reading and writing at the same time. You would also like to manage idle connections with some timeout. This requires the adoption of an event driven model or a model based on separate threads of control. I have chosen to use the event driven model. Another benefit is that we get support for multiple parallel connections for free, and that it should integrate much more easily with GUI toolkits like Tk.

Let's investigate what impact the event driven framework has on the programming model. The basic model for sending requests and receiving responses in LWP5 used to be:

  $res = $ua->request($req);   # return when response is available
  if ($res->is_success) {
      #...
  }

With the new event driven framework it becomes:

  $ua->spool($req1);   # returns immediately
  $ua->spool($req2);   # can send multiple requests in parallel
  #...

  mainloop->run;       # returns when all connections are gone

Request objects are created and then handed off to the $ua which will queue them up for processing. As you can see, there is no longer any natural place to test the outcome of the requests. What happens is that the requests live their own lives and they will be notified (through a method call) when the corresponding responses are available. You, the application programmer, will have to set up event handlers (in the requests) that react to these events.

Luckily, this does not mean that all old programs must be rewritten. The following shows one way to emulate something very close to the old behavior:

  my $res;
  my $req = LWP::Request->new(GET => $url);
  $req->{'done_cb'} = sub { $res = shift; };

  $ua->spool($req);
  mainloop->one_event until $res;

  if ($res->is_success) {
      #...
  }

and this will in fact be used to emulate the old $ua->request() and $ua->simple_request() interfaces. The goal is to be completely backwards compatible with the LWP5 modules.

CLASSES

LWP::Request

As you can see from the example above we use the class name LWP::Request (as opposed to HTTP::Request) for the requests created. LWP::Request is a subclass of HTTP::Request, so it has all the same methods and attributes as HTTP::Request and then some more. The most important of these additions are two callback methods that will be invoked as the response is received:

   $req->response_data($data, $res);
   $req->response_done($res);

The response_data() callback method is invoked repeatedly as parts of the content of the response become available. The first time it is invoked is right after the headers in the response message have been parsed. At this point $res will be a reference to an HTTP::Response object with response code and headers initialized, but the message content will be empty. The default implementation of response_data() just appends the data passed to the content of the $res object. It also supports a registered callback function ('data_cb') that will be invoked if defined.

The response_done() callback method is invoked when the whole response has been received. It is guaranteed that it will be invoked exactly once for each request spooled (even for requests that fail). The default implementation will set up the $res->request and $res->previous links and will handle redirects and unauthorized responses by automatically respooling a (slightly) modified copy of the original request. It also supports a registered callback function ('done_cb') that will be invoked if defined, but only for the last response in case of redirect chains.

As an application programmer you can either subclass LWP::Request, to provide your own versions of response_data() and response_done(), or you can just register callback functions. The last option is probably to be preferred.
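
The two options can be compared with a small stand-in class. Everything in the Sketch:: name space below is hypothetical and only mimics the behavior described above: the real LWP::Request does more (redirects, authentication), the response is a plain hash here instead of an HTTP::Response object, and the arguments passed to 'data_cb' are an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for LWP::Request, reduced to the two callbacks.
package Sketch::Request;

sub new { my ($class, %args) = @_; return bless {%args}, $class }

sub response_data {
    my ($self, $data, $res) = @_;
    $res->{content} .= $data;                          # default: accumulate
    $self->{data_cb}->($data, $res) if $self->{data_cb};
}

sub response_done {
    my ($self, $res) = @_;
    $self->{done_cb}->($res) if $self->{done_cb};
}

# Option 1: subclass and override a callback method.
package Sketch::CountingRequest;
our @ISA = ('Sketch::Request');

sub response_data {
    my ($self, $data, $res) = @_;
    $self->{bytes} += length $data;                    # extra bookkeeping
    $self->SUPER::response_data($data, $res);
}

package main;

# Option 2: register a callback function when creating the request.
my $final;
my $req = Sketch::CountingRequest->new(done_cb => sub { $final = shift });

# Simulate what a connection does as a response arrives.
my $res = { content => '' };
$req->response_data($_, $res) for 'Hello, ', 'world';
$req->response_done($res);

print "$req->{bytes} bytes: $final->{content}\n";
```

Registering a callback keeps the request class untouched, which is why it is usually the lighter choice for simple programs.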

The LWP::Request object also provides a few more attributes that might be of interest. The $req->priority is a number that can be used to select which request goes first when multiple requests are spooled at the same time. Requests with the smallest numbers go first. The default priority happens to be 99.
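
The ordering rule can be sketched with a plain array standing in for the request queue. The function names are invented for this example, and serving requests of equal priority in arrival order is an assumption; the text above only says that the smallest number goes first.

```perl
use strict;
use warnings;

my @queue;
my $seq = 0;                  # arrival counter, used only to break ties

sub enqueue {
    my ($name, $priority) = @_;
    $priority = 99 unless defined $priority;       # default from the text
    push @queue, { name => $name, priority => $priority, seq => $seq++ };
}

sub next_request {
    @queue = sort { $a->{priority} <=> $b->{priority}
                    or $a->{seq} <=> $b->{seq} } @queue;
    return shift @queue;
}

enqueue('page.html');
enqueue('style.css', 10);     # lower number, so it jumps the queue
enqueue('logo.gif');

while (my $req = next_request()) {
    print "$req->{name} (priority $req->{priority})\n";
}
```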

The $req->proxy attribute tells us if we are going to pass the request to a proxy server instead of the server implied by the URL. If $req->proxy is TRUE, then it should be the URL of the proxy to send the request to.

LWP::MainLoop

The event oriented framework is based on a single common object provided by the LWP::MainLoop module that will watch external IO descriptors (sockets) and timers. When events occur, then registered functions are called and these might in turn call other event handling functions at a higher level and so on.

In order for this to work, the mainloop object needs to be in control when nothing else happens and especially when you expect some kind of protocol handling to take place. This is achieved by repeatedly calling the mainloop->one_event method until we are satisfied. Each call will wait until the next event is available, then invoke the corresponding callback function and then return. The one_event() interface is handy because it can be applied recursively and you can set up inner event loops in event handlers invoked by some outer event loop.

The call mainloop->run is a shorthand for a common form of this loop. It will call mainloop->one_event until there are no registered IO handles (sockets) and no timers left.
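
The relationship between one_event() and run() can be illustrated with a toy loop built on IO::Select. This is not the LWP::MainLoop source: the MiniLoop package, its after(), forget() and empty() methods, and the convention of passing the file handle to the callback are all assumptions made for the sketch.

```perl
use strict;
use warnings;

package MiniLoop;            # hypothetical name, not LWP::MainLoop itself
use IO::Select;
use Time::HiRes ();

sub new {
    my $class = shift;
    return bless { select => IO::Select->new, readers => {}, timers => [] }, $class;
}

sub readable {               # watch $fh; call $cb->($fh) when it has data
    my ($self, $fh, $cb) = @_;
    $self->{readers}{ fileno($fh) } = $cb;
    $self->{select}->add($fh);
}

sub forget {                 # stop watching $fh
    my ($self, $fh) = @_;
    delete $self->{readers}{ fileno($fh) };
    $self->{select}->remove($fh);
}

sub after {                  # call $cb once, $delay seconds from now
    my ($self, $delay, $cb) = @_;
    my $timers = $self->{timers};
    @$timers = sort { $a->[0] <=> $b->[0] }
               @$timers, [ Time::HiRes::time() + $delay, $cb ];
}

sub empty {
    my ($self) = @_;
    return !$self->{select}->count && !@{ $self->{timers} };
}

sub one_event {              # wait for one event, dispatch it, return
    my ($self) = @_;
    my $timers = $self->{timers};
    until ($self->empty) {
        my $wait;            # undef means: block until a handle is ready
        if (@$timers) {
            $wait = $timers->[0][0] - Time::HiRes::time();
            if ($wait <= 0) {                      # earliest timer is due
                my $timer = shift @$timers;
                return $timer->[1]->();
            }
        }
        if ($self->{select}->count) {
            if (my ($fh) = $self->{select}->can_read($wait)) {
                return $self->{readers}{ fileno($fh) }->($fh);
            }
        }
        else {
            select(undef, undef, undef, $wait);    # only timers left: sleep
        }
    }
    return;
}

sub run {                    # loop until no handles and no timers are left
    my ($self) = @_;
    $self->one_event until $self->empty;
}

package main;

my $loop = MiniLoop->new;
my @log;

pipe(my $rd, my $wr) or die "pipe: $!";
syswrite($wr, "hello\n");            # make $rd readable right away

$loop->readable($rd, sub {
    my ($fh) = @_;
    sysread($fh, my $line, 512);
    chomp $line;
    push @log, "read '$line'";
    $loop->forget($fh);              # nothing more expected on this handle
});
$loop->after(0.05, sub { push @log, 'timer fired' });

$loop->run;                          # returns once both events are handled
print "$_\n" for @log;
```

Because one_event() handles exactly one event per call, a callback may itself call one_event() to run an inner loop, which is the recursive use mentioned above.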

The following program shows how you can register your own callbacks. For instance, as here, the application might want to read commands from the terminal.

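  use LWP::MainLoop qw(mainloop);

  # $ua is assumed to be an LWP::UA object created earlier
  mainloop->readable(\*STDIN, \&read_and_do_cmd);
  mainloop->run;

  sub read_and_do_cmd
  {
     my $cmd;
     my $n = sysread(STDIN, $cmd, 512);
     exit unless $n;      # EOF or read error on the terminal
     chomp($cmd);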

     if ($cmd eq "q") {
         exit;
     } elsif ($cmd =~ /^(get|head|trace)\s+(\S+)/i) {
         $ua->spool(LWP::Request->new(uc($1) => $2));
     } ...

  }

Currently LWPng uses its own private event loop implementation. The plan is to adopt the event loop implementation used by the Tk extension, or more precisely Perl will hopefully get a standard event loop that both Tk and LWPng can use. This should allow applications to happily mix the Tk and LWPng modules.

LWP::UA

The LWP::UA represents the main programmer interface towards the network. (We currently use a different name for this object than LWP5 because we are not providing a backwards compatible interface yet. Besides, you would probably like to be able to have both LWP5 and LWPng installed side by side for some time still.)

The main interface to the LWP::UA object consists of the following methods:

 $ua->spool
 $ua->reschedule
 $ua->stop
 

The $ua->spool() method takes one or more LWP::Request object references as arguments and arranges for these requests to be sent to the servers specified by the URL attribute of the requests.

The $ua->reschedule method is usually called automatically as requests are spooled. It will invoke the scheduler, whose task is to determine when and how many connections are set up to the different servers we want to communicate with. The scheduler can also decide to kill idle connections. The scheduler is a separate entity because you should be able to replace it when you want a different scheduling policy.

The $ua->stop method will propagate to all active and idle connections and kill them. All currently spooled requests will return with an error.

The LWP::UA maintains a structure that makes it possible to attach attributes to the URL name space at various hierarchical levels. This is implemented by an object of the URI::Attr class. The main function of this object is to look up attributes that apply to a given URL. The following attribute names are currently used:

default_headers

The default_headers attribute is a hash that is used to initialize static default values for the given headers in requests spooled. More specific values take precedence over less specific. This is how both User-Agent and From headers are added to the requests now.

realms

At the server level we maintain a hash of authentication objects indexed by "realm". Authentication objects are able to initialize the Authorization header in a request.

realm

At the path level we can assign the attribute "realm". This is used to look up one of the authentication objects in the server level realms hash. This effectively assigns the given path as the top of some protection space named by the realm value.

proxy

Set which proxy to use for the URLs.

proxy_realms

Same as "realms" but only applies when the server is accessed as a proxy.

proxy_realm

You can assign parts of the URL space to different proxy "realms". This only makes sense when the proxy attribute is used.
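
The lookup rule for these attributes (more specific levels win over less specific ones) can be sketched with plain string prefixes. This is not the URI::Attr interface: attr_set() and attr_get() are invented names, and matching on raw string prefixes is a simplification of real server and path levels.

```perl
use strict;
use warnings;

my %attrs;    # URL prefix => { attribute name => value }

sub attr_set {
    my ($prefix, $name, $value) = @_;
    $attrs{$prefix}{$name} = $value;
}

# Return the value from the longest matching prefix that defines $name.
sub attr_get {
    my ($url, $name) = @_;
    for my $prefix (sort { length($b) <=> length($a) } keys %attrs) {
        next unless index($url, $prefix) == 0;     # not a prefix of $url
        return $attrs{$prefix}{$name} if exists $attrs{$prefix}{$name};
    }
    return;
}

attr_set('', default_headers => { 'User-Agent' => 'sketch/0.1' });   # global
attr_set('http://www.example.com/private/',       realm => 'staff');
attr_set('http://www.example.com/private/board/', realm => 'board');

for my $url ('http://www.example.com/index.html',
             'http://www.example.com/private/notes.html',
             'http://www.example.com/private/board/minutes.html') {
    my $realm = attr_get($url, 'realm');
    printf "%-52s %s\n", $url, defined $realm ? $realm : '(none)';
}
```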

Explain spool request hooks... All modifications to requests as they are spooled are handled by "spool_request" hooks that can be installed and removed as you please. A hook can also complete the handling of a request without any real spooling taking place. For instance, a cache check hook could return a response right away if it was available. If it was not available in the cache, it could register hooks on the request object that would update the cache as the response came back. Other possibilities are hooks that add some headers or that block some part of the URL space.
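
As a rough illustration of the hook idea, the sketch below runs a list of hooks for every spooled request and lets a hook finish the request by returning a response. The data structures are plain hashes and strings, and the calling convention (a hook returns a response to complete the request, or nothing to let it pass) is an assumption, not the LWPng hook interface.

```perl
use strict;
use warnings;

my @spool_hooks;      # run in order for every spooled request
my %cache;            # URL => cached response (plain strings here)
my @sent;             # requests that reached "real" spooling

sub spool {
    my ($req) = @_;
    for my $hook (@spool_hooks) {
        my $res = $hook->($req);
        if (defined $res) {               # the hook completed the request
            $req->{done_cb}->($res) if $req->{done_cb};
            return;
        }
    }
    push @sent, $req;                     # no hook answered: spool for real
}

# Hook 1: add a default header unless the request already has one.
push @spool_hooks, sub {
    my ($req) = @_;
    $req->{headers}{'User-Agent'} ||= 'sketch/0.1';
    return;
};

# Hook 2: answer from the cache when possible.
push @spool_hooks, sub {
    my ($req) = @_;
    return $cache{ $req->{url} };
};

$cache{'http://www.example.com/'} = 'cached copy';

my @done;
spool({ url => 'http://www.example.com/',    done_cb => sub { push @done, shift } });
spool({ url => 'http://www.example.com/new', done_cb => sub { push @done, shift } });

print "answered from cache: @done\n";
print "sent to network: $_->{url} ($_->{headers}{'User-Agent'})\n" for @sent;
```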

The LWP::UA also maintains a set of 'connection parameters' that can be adjusted. The parameters can be set for individual servers or globally. Individual settings override the global ones. Not all connection types care about all of them.

ReqLimit

This is a number indicating how many requests a single connection will be used for. When the limit has been reached, the connection will close itself. You can get non-persistent connections by setting this value to 1.

ReqPending

How many requests can be sent on a single connection before we have to wait for a response from the server. This controls the degree of pipelining.

Timeout

How long we will wait with no activity on the line before signaling an error (and closing the connection).

IdleTimeout

When the request queue is empty, connections go to the idle state. This specifies how long before the connection is closed if no new work arrives.
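
The override rule for these parameters can be sketched as a two-level lookup. The function conn_param() and all numbers below are invented for illustration; they are not LWPng defaults or its actual interface.

```perl
use strict;
use warnings;

# All values are made-up illustration values, not LWPng defaults.
my %global     = (ReqLimit => 50, ReqPending => 5, Timeout => 120, IdleTimeout => 30);
my %per_server = ('old.example.com:80' => { ReqLimit => 1, ReqPending => 1 });

# Individual server settings override the global ones.
sub conn_param {
    my ($server, $name) = @_;
    my $own = $per_server{$server} || {};
    return exists $own->{$name} ? $own->{$name} : $global{$name};
}

for my $server ('www.example.com:80', 'old.example.com:80') {
    printf "%-20s ReqLimit=%-3d Timeout=%d\n", $server,
        conn_param($server, 'ReqLimit'), conn_param($server, 'Timeout');
}
```

With ReqLimit set to 1 for old.example.com, every connection to that server would carry a single request, which is the non-persistent behavior described under ReqLimit.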

Internals overview

For each server that the $ua is going to talk to it maintains an LWP::Server object. This object holds a queue of requests not yet processed. The $ua->spool() method mainly moves the request to the correct queue.

An LWP::Server can also create one or more LWP::Conn::HTTP objects that each represent a network connection to the server. The connection objects are where all the action takes place. They will fetch work (requests) from the server queue, talk the network protocol and create response objects.

 LWP::UA
 LWP::Server
 LWP::Conn::XXX

 URI::Attr
 LWP::StdSched

LWP::Conn interface

The LWP::Conn objects represent a connection to some server where one or more request/response exchanges can take place. There are different subclasses for various types of the underlying (network) protocols. (Talking about 'subclasses' is kind of a lie, since the base-class does not really manifest itself as any real code.)

LWP::Conn objects conform to the following interfaces when interacting with their manager object (passed in as a parameter during creation). In the normal setup, the manager will be an LWP::Server object.

An LWP::Conn object is constructed with the new() method. It takes hash-style arguments and the 'ManagedBy' parameter is really the only mandatory one. It should be a reference to the manager object that will get method callbacks when various events happen.

  $conn = LWP::Conn::XXX->new(ManagedBy => $mgr,
                              Host => $host,
                              Port => $port,
                              ...);

The constructor will return a reference to the LWP::Conn object or undef. If a connection object is returned, then the manager should wait for callback methods to be invoked on itself. A return of undef will either indicate that we can't connect to the specified server or that all requests have already been processed. A manager can know the difference based on whether get_request() has been invoked on it or not.

The following methods are invoked by the created LWP::Conn object on its manager. The first two manage the request queue. The last three let the manager be made aware of the state of the connection.

  $mgr->get_request($conn);
  $mgr->pushback_request($conn, @requests);

  $mgr->connection_active($conn);
  $mgr->connection_idle($conn);
  $mgr->connection_closed($conn);

The get_request() method should return a single LWP::Request object or undef if there are no more requests to process. It is passed a reference to the connection object as argument. If the connection object discovers that it has been too greedy (calling get_request() too much), then it might want to return unprocessed requests back to the manager. It does so by calling the pushback_request() method with a reference to itself and one or more request objects as arguments. The first request obtained by get_request() should never be pushed back.

The following two methods can be invoked (usually by the manager) on a living $conn object. The activate() method can be invoked on an 'idle' connection to make it start calling get_request() again. The stop() method kills the connection (whatever state it is in).
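
A manager that satisfies this interface can be very small. The Sketch::Manager package below is hypothetical (LWP::Server plays this role in LWPng), requests and connections are plain strings, and putting pushed back requests at the front of the queue is an assumption about the intended ordering.

```perl
use strict;
use warnings;

package Sketch::Manager;     # hypothetical; LWP::Server plays this role

sub new {
    my ($class, @requests) = @_;
    return bless { queue => [@requests], state => {} }, $class;
}

# Hand out the next request, or undef when the queue is empty.
sub get_request {
    my ($self, $conn) = @_;
    return shift @{ $self->{queue} };
}

# Take back requests the connection fetched but could not process.
sub pushback_request {
    my ($self, $conn, @requests) = @_;
    unshift @{ $self->{queue} }, @requests;       # keep the original order
}

# Track the connection state reported through the callbacks.
sub connection_active { my ($self, $conn) = @_; $self->{state}{$conn} = 'active' }
sub connection_idle   { my ($self, $conn) = @_; $self->{state}{$conn} = 'idle' }
sub connection_closed { my ($self, $conn) = @_; delete $self->{state}{$conn} }

package main;

my $mgr  = Sketch::Manager->new(qw(req1 req2 req3));
my $conn = 'conn-A';                              # stand-in for a connection

$mgr->connection_active($conn);
my @taken = map { $mgr->get_request($conn) } 1 .. 3;   # a greedy connection
$mgr->pushback_request($conn, @taken[1, 2]);           # never the first one

print "processing $taken[0], queue is now @{ $mgr->{queue} }\n";
```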

  $conn->activate;
  $conn->stop;

When a connection has received a response, then it will invoke the following two methods on the request object (obtained using get_request()).

  $req->response_data($data, $res);
  $req->response_done($res);

The response_data() method is invoked repeatedly as the body content of the response is received from the network. Invocation of this method is optional and depends on the kind of connection object this is. The response_done() method is always invoked once for each request obtained. It is called when the complete response has been received.

LWP::Conn::HTTP

Well, this is a very special module. You don't usually have designs where the objects change their class all the time.

You should just know about 3 basic states that a connection object can be in and then think of writable(), readable() and inactive() as the three kinds of events that we should be prepared to handle at any time.

1) Connecting (a non-blocking connect() has been called)

We are waiting for the socket to become writable (which means that the connect was successful). readable() can't happen. inactive() means failure to connect.

2) Idle (No work to do)

Either work arrives from the application or the socket will become readable. When we read we would expect to get 0 bytes as a signal that the server has closed the connection. We don't ask for the writable event in this state, because we have nothing to write.

3) Active (sending request(s), receiving header of first request)

If the socket becomes writable we send more request data until we are done. If the socket becomes readable we read data until we have seen a whole HTTP header and then switch to a sub-state depending on the kind of response we are reading. When a response has been completely received we go back to idle if there are no more requests to send.

The following is an attempt at a picture of the state transitions going on.

          START  ------> Connecting
            |                 |
            |                 |
            V                 |
                              V
          Idle  <-------------+---------<-----\
           /\(work?)                           \
          /  \                                  \
         /    |                                  \
        /     V                                   \
       /                                           \
      /   Active (sending request)     <-----------+ (more work?)
     |      |     \       \      -----             |
     |      |      \       \          \            |
     |      |       \       \          \           |
     |      |        |       |          |          |
     |      V        V       V          V          |
     |                                             |
     |    ConnLen  Chunked  Multipart  UntilEOF    |   (reading response)
     |      |         |        |          |        |
     |      |         |        |          |        |
     |      +-------->+------->+--------- | ------>+
     V                                    |
                                          |
   Closed  <------------------------------+
  (THE END)

This design was inspired by the "State" pattern described in "Design Patterns: Elements of Reusable Object-Oriented Software" (Gamma et al.). The description of the "State" pattern in this book says:

Intent

Allow an object to alter its behavior when its internal state changes. The object will appear to change its class.

Motivation

Consider a class TCPConnection that represents a network connection. A TCPConnection object can be in one of several different states: Established, Listening, Closed. When a TCPConnection object receives requests from other objects, it responds differently depending on its current state. For example, the effect of an Open request depends on whether the connection is in its Closed state or its Established state. The State pattern describes how TCPConnection can exhibit different behavior in each state.

Lucky for me, Perl is a language that allows me to change the class of a living object. That came in handy in this situation.

Using classes to describe states also allows a natural description (and implementation) of substates that behave like their base state for some events but modify the behavior for others.
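
The reblessing trick can be shown with a toy connection that has no socket at all. The Sketch:: packages and the log messages are invented for this example and are much simpler than LWP::Conn::HTTP, but the mechanism is the same: each state is a class, and a state transition is a call to bless() on the living object.

```perl
use strict;
use warnings;

# All package names are hypothetical; each package is one connection state.
package Sketch::Connecting;
sub new      { my $class = shift; return bless { log => [] }, $class }
sub writable { my $self = shift; push @{ $self->{log} }, 'connected';
               return bless $self, 'Sketch::Idle' }
sub inactive { my $self = shift; push @{ $self->{log} }, 'connect failed';
               return bless $self, 'Sketch::Closed' }

package Sketch::Idle;
sub activate { my $self = shift; return bless $self, 'Sketch::Active' }
sub readable { my $self = shift; push @{ $self->{log} }, 'server closed';
               return bless $self, 'Sketch::Closed' }

package Sketch::Active;
sub writable { my $self = shift; push @{ $self->{log} }, 'sent request';
               return $self }
sub readable { my $self = shift; push @{ $self->{log} }, 'got header';
               return bless $self, 'Sketch::Active::Chunked' }
sub inactive { my $self = shift; push @{ $self->{log} }, 'timed out';
               return bless $self, 'Sketch::Closed' }

# A substate: inherits writable() and inactive(), overrides readable().
package Sketch::Active::Chunked;
our @ISA = ('Sketch::Active');
sub readable { my $self = shift; push @{ $self->{log} }, 'got last chunk';
               return bless $self, 'Sketch::Idle' }

package Sketch::Closed;      # terminal state, handles no events

package main;

my $conn = Sketch::Connecting->new;
$conn->writable;             # Connecting -> Idle
$conn->activate;             # Idle -> Active
$conn->writable;             # stays Active
$conn->readable;             # Active -> Active::Chunked
$conn->readable;             # Active::Chunked -> Idle

print ref($conn), ': ', join(', ', @{ $conn->{log} }), "\n";
```

The same method name does different things at each step because method lookup follows the object's current class. The Chunked substate only has to spell out what differs from Active.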

AUTHOR

Copyright 1997-1998 Gisle Aas.

$Id: News,v 1.8 1998/05/02 06:52:08 aas Exp $