
NAME

News::Archive - archive news articles for later use

SYNOPSIS

  use News::Archive;
  use News::Article;
  my $archive = News::Archive->new( 'basedir' => '/home/tskirvin/kiboze' );
 
  # Get a news article
  my $article = News::Article->new(\*STDIN);
  my $msgid = $article->header('message-id');

  die "Already processed '$msgid'\n" 
                if ($archive->article( $msgid ));

  # Get the list of groups we're supposed to be saving the article into
  my @groups = split(/\s*,\s*/, $article->header('newsgroups') );
  s/\s+//g foreach @groups;

  # Make sure we're subscribed to these groups
  foreach (@groups) { $archive->subscribe($_) }

  # Actually save the article.
  my $ret = $archive->save_article( 
        [ @{$article->rawheaders}, '', @{$article->body} ], @groups );
  print $ret ? "Accepted article $msgid\n"
             : "Couldn't save article $msgid\n";

See below for more options.

DESCRIPTION

News::Archive is a package for storing news articles in an accessible form. Articles are stored one per file, and can be looked up by either Message-ID or overview information. The archive is then accessible through a Net::NNTP-compatible interface, for easy access by other packages.

News::Archive keeps several files to keep track of its archives:

active file

Keeps track of all newsgroups we are "subscribed" to and all of the information that changes regularly - the number of articles we have archived, the current first and last article numbers, etc.

Watched over with News::Active.

history database

A simple database keeping track of articles by Message-ID. Makes access by ID easy, and ensures that we don't save the same article twice. The database chosen to maintain these is user-determined.
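
The duplicate check described here can be sketched with a tied hash. This is an illustration only, not the module's code: it uses SDBM_File (bundled with Perl) as a stand-in for whichever database type is configured, a scratch directory instead of a real history file, and a hypothetical seen_before() helper. The value stored per Message-ID is arbitrary.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;

# Stand-in history database in a scratch directory (assumption: the real
# module ties its own history file using the configured database type).
my $dir = tempdir( CLEANUP => 1 );
tie my %history, 'SDBM_File', "$dir/history", O_RDWR | O_CREAT, 0644
    or die "Cannot tie history database: $!";

# Hypothetical helper: returns 1 if the Message-ID was already recorded;
# otherwise records it and returns 0.
sub seen_before {
    my ($msgid) = @_;
    return 1 if exists $history{$msgid};
    $history{$msgid} = time;
    return 0;
}

print seen_before('<abc@example.com>') ? "duplicate\n" : "new\n";   # new
print seen_before('<abc@example.com>') ? "duplicate\n" : "new\n";   # duplicate
```

Keying on the Message-ID is what makes both lookup by ID and duplicate detection a single hash access.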

newsgroup file

Keeps track of more static information about the newsgroups we are subscribed to - descriptions, creation dates, etc.

Watched over with News::GroupInfo.

archive directory

Directory structure of all articles, with each article saved as a single textfile within a directory structure laid out at one section of the group name per directory, such as "rec/games/mecha". Crossposts are hardlinked to other directory structures.

Articles are actually divided into sub-directories containing up to 500 articles each, to avoid Unix directory size performance limitations. An individual article is thus stored in a file such as "rec/games/mecha/1.500/1".
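
The layout can be sketched as a small path computation. This is a minimal sketch rather than the module's actual code: the helper name article_path() is hypothetical, and the bucket naming (first.last, 500 articles wide, starting at article 1) is an assumption inferred from the "1.500" example.

```perl
use strict;
use warnings;

# Hypothetical helper: map a group name and article number to a relative
# path, assuming buckets named "<first>.<last>" that each hold $hash articles.
sub article_path {
    my ($group, $number, $hash) = @_;
    $hash ||= 500;                          # documented default bucket size
    (my $dir = $group) =~ tr{.}{/};         # rec.games.mecha -> rec/games/mecha
    my $first = int( ($number - 1) / $hash ) * $hash + 1;
    my $last  = $first + $hash - 1;
    return "$dir/$first.$last/$number";
}

print article_path('rec.games.mecha', 1),   "\n";   # rec/games/mecha/1.500/1
print article_path('rec.games.mecha', 501), "\n";   # rec/games/mecha/501.1000/501
```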

Each newsgroup also contains overview information, watched over with News::Overview. This overview file goes in the top of the structure, such as "rec/games/mecha/.overview".

You may note that these files are very similar to how INN does its work. This is intentional - this package is meant to act in many ways like a lighter-weight INN.

USAGE

Global Variables

The following variables are set within News::Archive, and are global throughout all invocations.

$News::Archive::DEBUG

Default value for debug() in new objects.

$News::Archive::HOSTNAME

Default value for hostname() in new objects. Obtained using Sys::Hostname::hostname().

$News::Archive::HASH

The number of articles to keep in each directory. Default is 500; change this at your own peril, since things may get screwed up later if you change it after archiving any articles!

Basic Functions

These functions create and deal with the object itself.

new ( HASHREF )

Creates the News::Archive object. HASHREF contains initialization information for this object; currently supported options:

  basedir       Base directory for this object to work with.  
                Required; we will fail without this.
  archives      Location of the post archives.  Defaults to 
                $basedir/archives
  historyfile   Location of the history database.  Defaults to
                $basedir/historyfile
  activefile    Location of the active file.  Defaults to
                $basedir/active
  overfilename  File name for the overview database files in each
                newsgroup hierarchy.  Defaults to ".overview".
  db_type       The type of perl database we will use to store 
                files that need that level of service.  Defaults
                to 'DB_File' 
  groupinfofile Location of the groupinfo file.  Defaults to
                $basedir/newsgroups.
  hostname      String to use when a local hostname is required.  
                Defaults to $News::Archive::HOSTNAME.
  debug         Should we print debugging information?  Defaults to
                $News::Archive::DEBUG.
  readonly      Should we open this read-only?  

Returns the blessed object on success, or undef on failure.
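
How the path defaults in the table derive from basedir can be sketched as follows. resolve_options() is a hypothetical helper written for illustration and is not part of News::Archive; only the default values themselves come from the table above. The hostname, debug, and readonly options are left out because their defaults come from package globals.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented defaults; returns undef
# (in scalar context) when the required 'basedir' option is missing.
sub resolve_options {
    my (%opt) = @_;
    my $base = $opt{basedir};
    return unless defined $base;
    my %default = (
        archives      => "$base/archives",
        historyfile   => "$base/historyfile",
        activefile    => "$base/active",
        groupinfofile => "$base/newsgroups",
        overfilename  => '.overview',
        db_type       => 'DB_File',
    );
    return { %default, %opt };              # caller-supplied values win
}

my $opt = resolve_options( basedir => '/home/tskirvin/kiboze',
                           db_type => 'SDBM_File' );
print "$_ = $opt->{$_}\n" for sort keys %$opt;
```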

activefile ()

Returns the News::Active object based on activefile, set in new(). If this object has not already been opened and created, creates it; otherwise, just returns the existing object. Passes on the 'readonly' flag.

activeclose ()

Writes out and closes the News::Active object.

groupinfo ()

Returns the News::GroupInfo object based on groupinfofile, set in new(). If this object has not already been opened and created, creates it; otherwise, just returns the existing object. Passes on the 'readonly' flag.

groupclose ()

Writes out and closes the News::GroupInfo object.

history ()

Returns a tied hashref based on historyfile, set in new(). If this object has not already been opened and created, creates it; otherwise, just returns the existing object.

debug ()

Returns true if we want to print debugging information, false otherwise. Used a lot internally, may also be used externally.

activeentry ( GROUP )

Returns the News::Active::Entry information for the given GROUP.

groupentry ( GROUP )

Returns the News::GroupInfo::Entry information for the given GROUP.

close ()

Close all open files.

Error Functions

These functions deal with the global error variable, which is currently not being used very effectively.

error ( [ERROR] )

Returns the text (a scalar) describing the last error message. If ERROR is offered, then it sets the error message to this first.

clear_error ()

Clears the error message.

Net::NNTP Equivalents

The following functions are the equivalent of the Net::NNTP commands; they are provided for compatibility with News::Web and other news functions. More information on their use is available in those manual pages.

article ( [ MSGID|MSGNUM ], [FH] )

Retrieves the article indicated by MSGID or MSGNUM (as in Net::NNTP) as the headers, a blank line, and then the body of the article. Either prints it to FH (if offered) or returns an array reference containing the text.

Returns undef if the article is not found.

head ( [ MSGID|MSGNUM ], [FH] )

As with article(), but only returns the header of the article.

body ( [ MSGID|MSGNUM ], [FH] )

As with article(), but only returns the body of the article.

nntpstat ( [ MSGID|MSGNUM ] )

As with article(), but only returns the article's message-id. Returns undef if not set or the article didn't exist.

group ( [GROUP] )

Sets the current group pointer; necessary if we want to use article() or its ilk by message number and not message-ID. In array context, returns the active information of the group as a list (number of articles, first article number, last article number, group name). In scalar context, just returns the group name.
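
The context-sensitive return value follows the usual wantarray pattern. The sketch below is not the module's implementation: group_info() and the %active table with its sample numbers are made up for illustration.

```perl
use strict;
use warnings;

# Made-up active data: group name => [ count, first, last ].
my %active = ( 'rec.games.mecha' => [ 42, 1, 42 ] );

# Hypothetical stand-in for group(): a list in list context, just the
# group name in scalar context, and nothing for an unknown group.
sub group_info {
    my ($group) = @_;
    my $entry = $active{$group} or return;
    return wantarray ? ( @$entry, $group ) : $group;
}

my ($count, $first, $last, $name) = group_info('rec.games.mecha');
my $current = group_info('rec.games.mecha');
print "$name: $count articles ($first-$last)\n";   # rec.games.mecha: 42 articles (1-42)
print "current group: $current\n";                 # current group: rec.games.mecha
```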

ihave ( MSGID, MESSAGE )

Writes an article to the archive with Message-ID MSGID. MESSAGE is the actual message. Invokes save_article().

(Note that this is preferred to post(), at least here, because it lets us tell much earlier if we don't want the article.)

last ()

Unimplemented.

date ()

Returns the local time (in seconds since the epoch).

postok ()

Returns 0; we don't want anything to get the idea that it can post.

authinfo ()

Unimplemented.

list ()

Same as active('*'), listing all active groups.

newgroups ()

Unimplemented.

newnews ()

Unimplemented.

post ( MESSAGE )

Writes an article to the archive. MESSAGE is the actual message. Invokes save_article().

slave ()

Unimplemented.

quit ()

Close the current connection; clear the current group, and reset the pointer. Returns 1.

newsgroups ( [PATTERN] )

Returns a hashref where the keys are the newsgroups that match the pattern PATTERN (uses active()), and the values are the description text for each newsgroup.

distributions

Not implemented.

subscriptions ()

Returns a listref to all groups that we are subscribed to. This is not ideal; we may only want the ones that we have descriptions for, or a specific flag set in News::GroupInfo, or something. It works for now, though.

overview_fmt ()

Returns the overview format information from News::Overview, since that's what we're currently using.

active_times ( [PATTERN] )

Returns a hashref where the keys are the group names, and the values are the results from News::GroupInfo::Entry->arrayref().

active ( [PATTERN] )

Returns a hashref where the keys are the group names, and the values are the results from News::Active::Entry->arrayref().

xgtitle ( [PATTERN] )

Same as newsgroups().

xhdr ( HEADER, SPEC [, PATTERN] )
xover ( MATCH, HDR )

Gets information from the stored overview database. See News::Overview for more information on how this works.

xpath ( MID )

Returns the full path name on the server of the location of the given article.

xpat ( HEADER, SPEC [, PATTERN] )

Same as xhdr().

xrover ( SPEC )

Same as $self->xhdr('References', SPEC)

listgroup

Unimplemented.

reader ()

Unimplemented.

Archive Functions

The following functions actually deal with the archive itself.

save_article ( LINES [, GROUPS] )

Saves an article into the archive. LINES is an arrayref of article lines that is passed to News::Article; GROUPS is a list of groups that we want to save the article to, if they differ from those listed in the Newsgroups: header.

The article is modified by adding hostname() onto the Path: header and creating a new Xref: header to match where we will save the article. The file is primarily linked to a single location, and hardlinks are made to the other locations. Overview information is generated for each group, history information is saved to ensure that we don't save the same article twice, and directories are created as needed.
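
The header changes can be illustrated in isolation. This sketch assumes the conventional Usenet formats (hosts in Path: separated by "!", and Xref: as the hostname followed by group:number pairs); whether save_article() produces exactly these strings is not confirmed here, and both helper names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend our hostname to an existing Path: value.
sub add_to_path {
    my ($hostname, $path) = @_;
    return ( defined $path && length $path ) ? "$hostname!$path" : $hostname;
}

# Hypothetical helper: build an Xref: value from group => article number
# pairs, keeping the order in which the groups were given.
sub make_xref {
    my ($hostname, @pairs) = @_;
    my @locations;
    while ( my ($group, $number) = splice @pairs, 0, 2 ) {
        push @locations, "$group:$number";
    }
    return join ' ', $hostname, @locations;
}

print add_to_path('news.example.com', 'feed.example.net!not-for-mail'), "\n";
print make_xref('news.example.com',
                'rec.games.mecha' => 17, 'rec.games.misc' => 204), "\n";
```

Each group:number pair in the Xref: value corresponds to one location in the archive tree, which is why a crossposted article needs one link per group.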

Note that there are currently some race conditions possible with this function, which should be partially solved by adding file and directory locking.

subscribe ( GROUP )

Subscribe to the given GROUP, by adding information about the group to the active and groupinfo files and starting the directory tree.

unsubscribe ( GROUP )

Unsubscribe from GROUP, by removing information about it from the active and groupinfo files.

subscribed ( GROUP )

Returns 1 if we are subscribed to GROUP, 0 otherwise.

overview_add ( NUMBER, GROUP, ARTICLE )

Add information to GROUP's overview information regarding article NUMBER, which is ARTICLE. Just appends the information to the overview database; we don't need to do anything more at this point.

overview_read ( GROUP, MESSAGE-SPEC [, HDR ] )

Get the overview information from GROUP for the articles specified by MESSAGE-SPEC (see Net::NNTP). If HDR is offered, only return that header information. Mostly invokes xover().

NOTES

This module has grown out of my original kiboze.pl scripts, which accomplished essentially the same writing functions but none of the reading ones. While a write-only interface has been somewhat beneficial, this should be much more helpful.

TODO

Start using the AutoLoader (or something like it)

File locking across the board, along with read-only opens.

Close and re-open the databases periodically, to write stuff out while in the middle of an operation.

While we currently have basic hashing taking place on the newsgroups to prevent the directories from getting too large, it would be nice if this were instead done as a time-hash - that is, if the article was from 28 Apr 2004, we could make directories that looked like 2004.01.01 (yearly hashing), 2004.04.01 (monthly), or 2004.04.28 (daily).
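
The proposed time-hash naming could look like the sketch below. Nothing here exists in the module: time_bucket() and its granularity names are hypothetical, and the date is passed as plain year, month, and day numbers to keep the example independent of time zones.

```perl
use strict;
use warnings;

# Hypothetical helper: directory name for an article date at a given
# granularity ('yearly', 'monthly', or 'daily').
sub time_bucket {
    my ($year, $month, $day, $granularity) = @_;
    ($month, $day) = (1, 1) if $granularity eq 'yearly';
    $day = 1                if $granularity eq 'monthly';
    return sprintf '%04d.%02d.%02d', $year, $month, $day;
}

print time_bucket(2004, 4, 28, $_), "\n" for qw(yearly monthly daily);
# 2004.01.01
# 2004.04.01
# 2004.04.28
```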

More News::Web changes to better connect with News::Archive would be nice.

Using a different Overview format may make sense.

Offer some functions to rebuild overview information later.

Offer something to make default ~/.kibozerc files.

REQUIREMENTS

Net::NNTP::Functions, News::Article, News::Overview, News::Active, News::GroupInfo, DB_File

SEE ALSO

Modules: News::Active, News::GroupInfo, News::Article, News::Web, newslib, newsrecurse.pl

Scripts: kiboze.pl, newsarchive.pl, mbox2news.pl

AUTHOR

Tim Skirvin <tskirvin@killfile.org>

HOMEPAGE

http://www.killfile.org/~tskirvin/software/news-archive/

LICENSE

This code may be redistributed under the same terms as Perl itself.

COPYRIGHT

Copyright 2003-2004, Tim Skirvin.