

News::Article::Clean - subroutines to clean news article headers


  use News::Article::Clean;
  my $article = News::Article->new;
  $article->clean_header( 'References' );

  my $references = News::Article->clean_references($refstring);

See below for more subroutines.


News::Article::Clean is a package that helps clean up news articles for future posting. It can be used as part of a pre-posting script for local users, or as part of a moderation suite. It is intended as an add-on for News::Article.



This package offers the following subroutines within News::Article:

clean_newsgroups ( STRING [, STRING [, STRING [...]]] )

Takes an array of strings containing newsgroup names (separated by commas, as per the standard Newsgroups: format), and returns either an array of valid newsgroup names or (in scalar context) a string of these names joined with ',' (i.e., a proper Newsgroups: header).

clean_followupto ( STRING [, STRING [, STRING [...]]] )

Same as clean_newsgroups, except that, if any of the strings are "poster", then it just returns "poster".
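As a rough illustration of the two routines above (not the module's own code), the sketch below splits comma-separated strings, keeps the names that pass a validity check, and returns a list or a joined string depending on context. The names sketch_clean_newsgroups and sketch_clean_followupto and the $GROUP_RE pattern are assumptions made for this example; the module's real rules come from its RFC1036bis character classes.

```perl
use strict;
use warnings;

# ASSUMPTION: a simplified validity check (dot-separated components of
# lowercase letters, digits, '+', '-' and '_'). Not the module's real rule.
my $GROUP_RE = qr/^[a-z0-9+_-]+(?:\.[a-z0-9+_-]+)*$/;

sub sketch_clean_newsgroups {
    my @groups;
    for my $string (@_) {
        for my $name (split /,/, $string) {
            $name =~ s/^\s+|\s+$//g;              # trim surrounding whitespace
            push @groups, $name if $name =~ $GROUP_RE;
        }
    }
    # List context: the names; scalar context: a Newsgroups:-style string.
    return wantarray ? @groups : join(',', @groups);
}

sub sketch_clean_followupto {
    # "poster" anywhere in the input wins outright.
    for my $string (@_) {
        for my $name (split /,/, $string) {
            $name =~ s/^\s+|\s+$//g;
            return 'poster' if $name eq 'poster';
        }
    }
    return sketch_clean_newsgroups(@_);           # caller's context is passed on
}

my $header = sketch_clean_newsgroups('news.admin.misc, bad group!', 'rec.humor');
print "Newsgroups: $header\n";
```

The wantarray check is what lets one routine serve both the array and the header-string use described above.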

clean_references ( MAXREF, STRING [, STRING [, STRING [...]]] )

Takes an array of strings containing message-IDs, and tries to manage them into a reasonable References: line. Message-IDs that don't match RFC standards are trimmed. Only MAXREF references are kept (defaults to $News::Article::MAX_REFERENCES); the extras are collapsed into a single ID of the format <trimmed-COUNT@HOSTNAME> (COUNT is the total number of trimmed messages, and HOSTNAME is taken from $News::Article::MY_DOMAIN). Returns an array of complete References or (in scalar context) a string formatted for 80 columns, useful in the References: header.
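The trimming described above can be sketched as follows. This is a minimal illustration, not the module's implementation: sketch_clean_references, $MSGID_RE, and the local $MAX_REFERENCES and $MY_DOMAIN are stand-ins invented for the example, and the choice of which IDs survive (the first one plus the most recent ones) is an assumption, since the text does not say.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Stand-ins for $News::Article::MAX_REFERENCES and $News::Article::MY_DOMAIN.
my $MAX_REFERENCES = 10;
my $MY_DOMAIN      = 'news.example.com';

# Loose shape check (<local@domain>, no whitespace); simpler than the RFC grammar.
my $MSGID_RE = qr/^<[^\s<>\@]+\@[^\s<>\@]+>$/;

sub sketch_clean_references {
    my ($maxref, @strings) = @_;
    $maxref ||= $MAX_REFERENCES;
    $maxref = 2 if $maxref < 2;    # room for the first ID and the placeholder

    # Split on whitespace and drop anything that does not look like a message-ID.
    my @ids;
    for my $string (@strings) {
        push @ids, grep { $_ =~ $MSGID_RE } split ' ', $string;
    }

    if (@ids > $maxref) {
        # ASSUMPTION: keep the first ID and the most recent ones, with the
        # placeholder in between. The module may choose differently.
        my $keep_tail = $maxref - 2;
        my $count     = @ids - 1 - $keep_tail;
        my @tail      = $keep_tail ? @ids[ -$keep_tail .. -1 ] : ();
        @ids = ($ids[0], "<trimmed-$count\@$MY_DOMAIN>", @tail);
    }

    return @ids if wantarray;

    # Scalar context: fold onto continuation lines without splitting an ID.
    local $Text::Wrap::columns = 80;
    local $Text::Wrap::huge    = 'overflow';
    return wrap('', ' ', join ' ', @ids);
}
```

With twelve valid IDs and a MAXREF of 5, this sketch returns the first ID, <trimmed-8@news.example.com>, and the last three IDs.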

clean_messageid ( STRING )

Takes a Message-ID in STRING; if the ID is not formatted correctly, it will make a new one using the same algorithm as News::Article's add_messageid(), with the prefix $News::Article::PREFIX and the domain $News::Article::MY_DOMAIN. It does not yet try to repair a malformed ID.

clean_date ( STRING )

Takes a Date string in just about any known format and converts it to the standard RFC 1036 date format. Returns undef if it can't parse the string; but given that we're using Date::Parse, this shouldn't be much of a problem.
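For illustration only, here is the formatting half of such a conversion using core Perl. It assumes the parsing step has already produced epoch seconds (Date::Parse's str2time() returns epoch seconds, or undef on failure). The function name sketch_format_news_date, the four-digit year, and the fixed GMT zone are assumptions for this example; the module's exact output format is not stated here.

```perl
use strict;
use warnings;

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format epoch seconds as "Wdy, DD Mon YYYY HH:MM:SS GMT".
# Returns undef when given undef, mirroring a failed parse.
sub sketch_format_news_date {
    my ($epoch) = @_;
    return unless defined $epoch;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf '%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAYS[$wday], $mday, $MONTHS[$mon], $year + 1900, $hour, $min, $sec;
}

print sketch_format_news_date(0), "\n";    # Thu, 01 Jan 1970 00:00:00 GMT
```

Day and month names are taken from fixed arrays rather than strftime() so the result does not depend on the locale.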

clean_subject ( STRING )

Reformats a Subject: string to have a standardized Re: format. It should probably get rid of REPOSTs (from Dave the Resurrector) too, but it doesn't yet.

This currently makes "Re: Rejection threshold" into "Re: jection threshold". This oughta be fixed. D'oh.
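One plausible cause of that bug is a pattern that strips a leading "Re" without insisting on the colon. The sketch below is not the module's code (sketch_normalize_re is a name made up for this example); it shows a normalization that only removes "Re" when a colon follows, so "Rejection" is left alone.

```perl
use strict;
use warnings;

# Collapse any run of leading "Re:" markers (any case, optional spaces)
# into a single "Re: ". The colon is required, so words that merely
# start with "Re" are not touched.
sub sketch_normalize_re {
    my ($subject) = @_;
    $subject =~ s/^\s+|\s+$//g;
    my $is_reply = ($subject =~ s/^(?:re\s*:\s*)+//i) ? 1 : 0;
    return $is_reply ? "Re: $subject" : $subject;
}

print sketch_normalize_re('RE: re: Rejection threshold'), "\n";
```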

clean_from ( ADDRESS )

Takes a From: string and reformats it. If the email address is unqualified, it either adds $MY_DOMAIN (if it's a user of the system) or "unknown.invalid" to the address; if it can't find an address at all, it sets the address to "unknown@unknown.invalid". This obviously doesn't demunge addresses, but it's a start.
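The qualification rules above, reduced to the bare address, might look like the following sketch. It ignores the display-name part of a From: line entirely. The names sketch_qualify_address and $MY_DOMAIN, and the callback used to decide whether a name is a local user, are assumptions for this example; how the module itself checks for local users is not stated here.

```perl
use strict;
use warnings;

my $MY_DOMAIN = 'news.example.com';    # stand-in for $News::Article::MY_DOMAIN

# $is_local_user is a code reference that returns true for known local users.
sub sketch_qualify_address {
    my ($address, $is_local_user) = @_;

    # No address at all.
    return 'unknown@unknown.invalid'
        unless defined $address && $address =~ /\S/;

    $address =~ s/^\s+|\s+$//g;
    return $address if $address =~ /\@/;    # already qualified

    return $is_local_user->($address)
        ? "$address\@$MY_DOMAIN"
        : "$address\@unknown.invalid";
}

# Hypothetical user table; on a Unix system a getpwnam() lookup could serve instead.
my %users = (alice => 1);
print sketch_qualify_address('alice', sub { exists $users{ $_[0] } }), "\n";
```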

clean_control ( STRING )

Checks over the given STRING to see if it's a valid Control: string. Returns undef if not.

Not currently very well done.

clean_distibution ( ARRAY )
clean_keywords ( ARRAY )

Takes an array of strings and returns a properly formatted array of their contents. And yes, these are the same function.

clean_header ( HEADER, ARGHASHREF, VALUE )

Basically a giant switch statement between all of the above. If VALUE is given, it is passed to the matching function; otherwise the current value is fetched with header(). Arguments come in ARGHASHREF.

Additional headers that can be cleaned with this:

  See-Also      Parsed with clean_references()
  Reply-To      Parsed with clean_from()
  Also-Control  Parsed with clean_control()
  Supersedes    Cleared unless clean_messageid() leaves it unchanged.

Headers that are known to be clear text (X-*, NNTP-*, Organization, Summary, Lines) have their leading and trailing whitespace trimmed. Other headers are not changed at all.

Returns the updated information.
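In Perl, a "giant switch statement" of this kind is often written as a dispatch table. The sketch below is only an illustration of that shape: the stub_* routines stand in for the real clean_* functions, the ARGHASHREF argument is left out, and sketch_clean_header, %CLEANER and $CLEAR_TEXT are names invented for the example.

```perl
use strict;
use warnings;

# Stub cleaners that only tag their input, so the dispatch is visible.
sub stub_clean_references { return "[clean_references] $_[0]" }
sub stub_clean_from       { return "[clean_from] $_[0]" }

# Lower-cased header name => cleaner. Aliases share one routine.
my %CLEANER = (
    'references' => \&stub_clean_references,
    'see-also'   => \&stub_clean_references,
    'from'       => \&stub_clean_from,
    'reply-to'   => \&stub_clean_from,
);

# Headers treated as clear text: only whitespace trimming applies.
my $CLEAR_TEXT = qr/^(?:x-.*|nntp-.*|organization|summary|lines)$/i;

sub sketch_clean_header {
    my ($header, $value) = @_;
    my $cleaner = $CLEANER{ lc($header) };
    return $cleaner->($value) if $cleaner;
    $value =~ s/^\s+|\s+$//g if $header =~ $CLEAR_TEXT;
    return $value;                 # anything else passes through unchanged
}

print sketch_clean_header('Reply-To', 'someone@example.org'), "\n";
```

A table like this keeps the alias headers (See-Also, Reply-To) next to the routines they reuse, which is the relationship the list above describes.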

clean_head ( HEADER, ARGS )

Sets the value of HEADER to the result of clean_header( HEADER, ARGS ). Basically a one-step helper function.

clean_head_all ()

Runs clean_head() on all headers in the message.

clean_body ()

Not yet done. Or really all that close. It does currently do its modifications in place...

clean_article ()

Runs clean_head_all() and clean_body() on the article.


The following global variables are added to News::Article when this package is loaded.


$News::Article::MAX_REFERENCES

Used by clean_references() to determine what the maximum number of entries in the References: header should be. Defaults to 10, can be set within clean_references().


$News::Article::MY_DOMAIN

Used by clean_messageid(), clean_references(), clean_from(), and clean_control() as the default domain for IDs (see News::Article) and From: lines. Defaults to $ENV{'HOSTNAME'}, hostfqdn(), or 'broken-configuration'; this is something that you may want to set on your own.
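The fallback chain described above could be written roughly as below. This is a sketch, not the module's code: sketch_default_domain is a made-up name, and the order of the three sources is assumed to follow the order in which they are listed. hostfqdn() comes from the core Net::Domain module.

```perl
use strict;
use warnings;
use Net::Domain qw(hostfqdn);

# First non-empty value wins: environment, then the resolver's idea of
# the fully qualified host name, then an obviously wrong marker.
sub sketch_default_domain {
    return $ENV{'HOSTNAME'} || hostfqdn() || 'broken-configuration';
}
```

Since many shells do not export HOSTNAME and hostfqdn() depends on local resolver setup, assigning the variable explicitly after loading the module is usually the more predictable choice.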


$News::Article::PREFIX

Used by clean_messageid() as the default prefix for new message-IDs (see News::Article). Defaults to $ENV{'HOSTNAME'}, hostfqdn(), or 'broken-configuration'; this is something that you may want to set on your own.

$News::Article::GROUP_CHARS, $News::Article::TAG_CHAR, [...]

Defined in RFC1036bis and used here to decide what header text is valid and what is not. The full list of variables:



The RFC1036bis character formatting rules are fairly old, but still seem to be widely used across Usenet. They may well be replaced sometime in the near future, though.


Date::Parse, News::Article




Finish off clean_body(). Put into NewsLib. Set version to 'v1.0'. Use the newer RFC when it's put out.


Tim Skirvin <>


This code may be redistributed under the same terms as Perl itself.


Copyright 1996-2004, Tim Skirvin.
