News::Archive - archive news articles for later use
    use News::Archive;
    use News::Article;

    my $archive = News::Archive->new( 'basedir' => '/home/tskirvin/kiboze' );

    # Get a news article
    my $article = News::Article->new(\*STDIN);
    my $msgid = $article->header('message-id');
    die "Already processed '$msgid'\n" if ($archive->article( $msgid ));

    # Get the list of groups we're supposed to be saving the article into
    my @groups = split('\s*,\s*', $article->header('newsgroups') );
    s/\s+//g foreach @groups;

    # Make sure we're subscribed to these groups
    foreach (@groups) { $archive->subscribe($_) }

    # Actually save the article.
    my $ret = $archive->save_article(
        [ @{$article->rawheaders}, '', @{$article->body} ], @groups );
    print $ret ? "Accepted article $msgid\n"
               : "Couldn't save article $msgid\n";
See below for more options.
News::Archive is a package for storing news articles in an accessible form. Articles are stored one-per-file, and are accessible by either message-ID or overview information. The files are then accessible with a Net::NNTP compatible interface, for easy access by other packages.
News::Archive keeps several files to keep track of its archives:
Keeps track of all newsgroups we are "subscribed" to and all of the information that changes regularly - the number of articles we have archived, the current first and last article numbers, etc.
Watched over with News::Active.
A simple database keeping track of articles by Message-ID. Makes access by ID easy, and ensures that we don't save the same article twice. The database chosen to maintain these is user-determined.
Keeps track of more static information about the newsgroups we are subscribed to - descriptions, creation dates, etc.
Watched over with News::GroupInfo.
Directory structure of all articles, with each article saved as a single textfile within a directory structure laid out at one section of the group name per directory, such as "rec/games/mecha". Crossposts are hardlinked to other directory structures.
Articles are actually divided into sub-directories containing up to 500 articles, to avoid Unix directory size performance limitations. Individual files are thus stored in a file such as "rec/games/mecha/1.500/1".
Each newsgroup also contains overview information, watched over with News::Overview. This overview file goes in the top of the structure, such as "rec/games/mecha/.overview".
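The layout described above can be sketched in a few lines of Perl. This is only an illustration, assuming buckets are named "FIRST.LAST" as in the "1.500" example; article_path() and overview_path() are hypothetical helpers, not functions provided by News::Archive.

```perl
use strict;
use warnings;

# Assumed bucket size; the documentation gives 500 as the default.
my $PER_DIR = 500;

# Hypothetical helper: map a group name and article number to a relative path.
sub article_path {
    my ($group, $number) = @_;
    (my $dir = $group) =~ tr{.}{/};      # rec.games.mecha -> rec/games/mecha
    my $first = int( ($number - 1) / $PER_DIR ) * $PER_DIR + 1;
    my $last  = $first + $PER_DIR - 1;
    return "$dir/$first.$last/$number";
}

# Hypothetical helper: the overview file sits at the top of the group's tree.
sub overview_path {
    my ($group, $name) = @_;
    $name = '.overview' unless defined $name;
    (my $dir = $group) =~ tr{.}{/};
    return "$dir/$name";
}

print article_path('rec.games.mecha', 1),   "\n";
print article_path('rec.games.mecha', 501), "\n";
print overview_path('rec.games.mecha'),     "\n";
```

Under these assumptions, article 500 still lands in "1.500" and article 501 starts the next bucket.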
You may note that these files are very similar to how INN does its work. This is intentional - this package is meant to act in many ways like a lighter-weight INN.
The following variables are set within News::Archive, and are global throughout all invocations.
Default value for debug() in new objects.
Default value for hostname() in new objects. Obtained using Sys::Hostname::hostname().
The number of articles to keep in each directory. Default is 500; change this at your own peril, since things may get screwed up later if you change it after archiving any articles!
These functions create and deal with the object itself.
Creates the News::Archive object. HASHREF contains initialization information for this object; currently supported options:
    basedir        Base directory for this object to work with. Required; we will fail without this.
    archives       Location of the post archives. Defaults to $basedir/archives.
    historyfile    Location of the history database. Defaults to $basedir/historyfile.
    activefile     Location of the active file. Defaults to $basedir/active.
    overfilename   File name for the overview database files in each newsgroup hierarchy. Defaults to ".overview".
    db_type        The type of perl database we will use to store files that need that level of service. Defaults to 'DB_File'.
    groupinfofile  Location of the groupinfo file. Defaults to $basedir/newsgroups.
    hostname       String to use when a local hostname is required. Defaults to $News::Archive::HOSTNAME.
    debug          Should we print debugging information? Defaults to $News::Archive::DEBUG.
    readonly       Should we open this read-only?
Returns the blessed object on success, or undef on failure.
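To make the defaults above concrete, here is a rough sketch of how they might be derived from basedir. archive_defaults() is a hypothetical helper written for this illustration; it is not how News::Archive itself is implemented, and it covers only the file-location options.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of News::Archive: fill in the documented
# defaults for any option the caller did not supply.
sub archive_defaults {
    my (%opts) = @_;
    my $base = $opts{basedir};
    return unless defined $base;          # basedir is required

    my %defaults = (
        archives      => "$base/archives",
        historyfile   => "$base/historyfile",
        activefile    => "$base/active",
        groupinfofile => "$base/newsgroups",
        overfilename  => '.overview',
        db_type       => 'DB_File',
    );
    return { %defaults, %opts };          # caller-supplied values win
}

my $conf = archive_defaults(
    basedir      => '/home/tskirvin/kiboze',
    overfilename => '.over',
);
print "$_ => $conf->{$_}\n" for sort keys %$conf;
```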
Returns the News::Active object based on activefile, set in new(). If this object has not already been opened and created, creates it; otherwise, just returns the existing object. Passes on the 'readonly' flag.
Writes out and closes the News::GroupInfo object.
Returns the News::GroupInfo object based on groupinfofile, set in new(). If this object has not already been opened and created, creates it; otherwise, just returns the existing object. Passes on the 'readonly' flag.
Returns a tied hashref based on historyfile, set in new(). If this object has not already been opened and created, creates it; otherwise, just returns the existing object.
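As an illustration of the tied-hash idea, the sketch below keeps a Message-ID history in an on-disk database and uses it to reject duplicates. It swaps in SDBM_File (always shipped with Perl) rather than the default DB_File, writes to a temporary directory, and stores the time of arrival as the value; all three choices, and the seen_before() helper, are assumptions made for this example.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

# Hypothetical helper: returns 1 if the Message-ID was already recorded,
# otherwise records it and returns 0.
sub seen_before {
    my ($hist, $msgid) = @_;
    return 1 if exists $hist->{$msgid};
    $hist->{$msgid} = time;     # stored value is an assumption
    return 0;
}

my $dir = tempdir( CLEANUP => 1 );
my %history;
tie( %history, 'SDBM_File', "$dir/historyfile", O_RDWR | O_CREAT, 0644 )
    or die "Cannot open history database: $!\n";

for my $id ( '<1@example.invalid>', '<1@example.invalid>' ) {
    print seen_before( \%history, $id ) ? "duplicate $id\n" : "new $id\n";
}

untie %history;
```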
Returns true if we want to print debugging information, false otherwise. Used a lot internally, may also be used externally.
Returns the News::Active::Entry information for the given GROUP.
Returns the News::GroupInfo::Entry information for the given GROUP.
Close all open files.
These functions deal with the global error variable, which is currently not being used very effectively.
Returns the text (a scalar) describing the last error message. If ERROR is offered, then it sets the error message to this first.
Clears the error message.
The following functions are the equivalent of the Net::NNTP commands; they are provided for compatibility with News::Web and other news functions. More information on their use is available in those manual pages.
Retrieves the article indicated by MSGID or MSGNUM (as in Net::NNTP) as the headers, a blank line, and then the body of the article. Either prints it to FH (if offered) or returns an array reference containing the text.
Returns undef if the article is not found.
As with article(), but only returns the header of the article.
As with article(), but only returns the body of the article.
As with article(), but only returns the article's message-id. Returns undef if not set or the article didn't exist.
Sets the current group pointer; necessary if we want to use article() or its ilk by message number and not message-ID. In array context, returns the active information of the group as a list (number of articles, first article number, last article number, group name). In scalar context, just returns the group name.
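The context-sensitive return value can be sketched with wantarray. The group_info() function and the %ACTIVE table below are hypothetical stand-ins with made-up numbers; they only illustrate the list-versus-scalar behaviour described above, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the active-file data; values are illustrative.
my %ACTIVE = (
    'rec.games.mecha' => { count => 3, first => 1, last => 3 },
);

sub group_info {
    my ($group) = @_;
    my $entry = $ACTIVE{$group} or return;
    return wantarray
        ? ( $entry->{count}, $entry->{first}, $entry->{last}, $group )
        : $group;
}

# List context: count, first, last, name
my ( $count, $first, $last, $name ) = group_info('rec.games.mecha');
print "$name: $count articles ($first-$last)\n";

# Scalar context: just the group name
my $current = group_info('rec.games.mecha');
print "$current\n";
```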
Writes an article to the archive with Message-ID MSGID. MESSAGE is the actual message. Invokes save_article().
(Note that this is preferred to post(), at least here, because it lets us tell much earlier if we don't want the article.)
Unimplemented.
Returns the local time (in seconds since the epoch).
Returns 0; we don't want anything to get the idea that it can post.
Same as active('*'), listing all active groups.
Writes an article to the archive. MESSAGE is the actual message. Invokes save_article().
Close the current connection; clear the current group, and reset the pointer. Returns 1.
Returns a hashref where the keys are the newsgroups that match the pattern PATTERN (uses active()), and the values are the description text for each newsgroup.
Not implemented.
Returns a listref to all groups that we are subscribed to. This is not ideal; we may only want the ones that we have descriptions for, or a specific flag set in News::GroupInfo, or something. It works for now, though.
Returns the overview format information from News::Overview, since that's what we're currently using.
Returns a hashref where the keys are the group names, and the values are the results from News::GroupInfo::Entry->arrayref().
Returns a hashref where the keys are the group names, and the values are the results from News::Active::Entry->arrayref().
Same as newsgroups()
Gets information from the stored overview database. See News::Overview for more information on how this works.
Returns the full path name on the server of the location of the given article.
Same as xhdr().
Same as $self->xhdr('References', SPEC)
The following functions actually deal with the archive itself.
Saves an article into the archive. LINEREF is an arrayref that is passed to News::Article; GROUPS is an array of groups that we want to save the article to, if not those listed in the Newsgroups: header.
The article is modified by adding hostname() onto the Path: header and creating a new Xref: header to match where we will save the article. The file is primarily linked to a single location, and hardlinks are made to the other locations. Overview information is generated for each group, history information is saved to ensure that we don't save the same article twice, and directories are created as needed.
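The header changes described above might look roughly like this. The sketch assumes the conventional Usenet forms ("host!oldpath" for Path: and "host group:number ..." for Xref:); add_to_path() and make_xref() are hypothetical helpers, and the exact strings News::Archive writes may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend our hostname to an existing Path: value.
sub add_to_path {
    my ($hostname, $path) = @_;
    return ( defined($path) && length($path) )
        ? "$hostname!$path"
        : $hostname;
}

# Hypothetical helper: build an Xref: value from [ group, number ] pairs,
# one pair per group the article is being saved into.
sub make_xref {
    my ($hostname, @pairs) = @_;
    return join ' ', $hostname, map { "$_->[0]:$_->[1]" } @pairs;
}

my $host = 'archive.example.invalid';   # stands in for hostname()
print 'Path: ', add_to_path( $host, 'news.example.invalid!not-for-mail' ), "\n";
print 'Xref: ', make_xref( $host, [ 'rec.games.mecha', 12 ], [ 'rec.games.misc', 7 ] ), "\n";
```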
Note that there are currently some race conditions possible with this function, which should be partially solved by adding file and directory locking.
Subscribe to the given GROUP, by adding information about the group to the active and groupinfo files and starting the directory tree.
Unsubscribe from GROUP, by removing information about it from the active and groupinfo files.
Returns 1 if we are subscribed to GROUP, 0 otherwise.
Add information to GROUP's overview information regarding article NUMBER, which is ARTICLE. Just appends the information to the overview database; we don't need to do anything more at this point.
Get the overview information from GROUP for the articles specified by MESSAGE-SPEC (see Net::NNTP). If HDR is offered, only return that header information. Mostly invokes xover().
This module has grown out of my original kiboze.pl scripts, which accomplished essentially the same writing functions but none of the reading ones. While a write-only interface has been somewhat beneficial, this should be much more helpful.
Start using the AutoLoader (or something like it)
File locking across the board, along with read-only opens.
Close and re-open the databases periodically, to write stuff out while in the middle of an operation.
While we currently have basic hashing taking place on the newsgroups to prevent the directories from getting too large, it would be nice if this were instead done as a time-hash - that is, if the article was from 28 Apr 2004, we could make directories that looked like 2004.01.01 (yearly hashing), 2004.04.01 (monthly), or 2004.04.28 (daily).
More News::Web changes to better connect with News::Archive would be nice.
Using a different Overview format may make sense.
Offer some functions to rebuild overview information later.
Offer something to make default ~/.kibozerc files.
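The time-hash idea from the list above could be prototyped as follows. This is a sketch of the proposal only, not existing behaviour; time_bucket() and its granularity names are invented for the example, and dates are treated as UTC.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Hypothetical helper: name the directory an article would go in, given
# its date (epoch seconds) and the desired hashing granularity.
sub time_bucket {
    my ($epoch, $granularity) = @_;
    my %format = (
        yearly  => '%Y.01.01',
        monthly => '%Y.%m.01',
        daily   => '%Y.%m.%d',
    );
    my $fmt = $format{$granularity}
        or die "Unknown granularity '$granularity'\n";
    return strftime( $fmt, gmtime($epoch) );
}

my $when = timegm( 0, 0, 0, 28, 3, 2004 );   # 28 Apr 2004, 00:00:00 UTC
print "$_: ", time_bucket( $when, $_ ), "\n" for qw(yearly monthly daily);
```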
Net::NNTP::Functions, News::Article, News::Overview, News::Active, News::GroupInfo, DB_File
Modules: News::Active, News::GroupInfo, News::Article, News::Web, newslib, newsrecurse.pl
Scripts: kiboze.pl, newsarchive.pl, mbox2news.pl
Tim Skirvin <tskirvin@killfile.org>
http://www.killfile.org/~tskirvin/software/news-archive/
This code may be redistributed under the same terms as Perl itself.
Copyright 2003-2004, Tim Skirvin.