James E Keenan > Mail-Digest-Tools > Mail::Digest::Tools

NAME

Mail::Digest::Tools - Tools for digest versions of mailing lists

VERSION

This document refers to version 2.12 of digest.pl, released May 14, 2011.

SYNOPSIS

    use Mail::Digest::Tools qw( 
        process_new_digests
        reprocess_ALL_digests
        reply_to_digest_message
        repair_message_order
        consolidate_threads_multiple
        consolidate_threads_single
        delete_deletables
    );

%config_in and %config_out are two configuration hashes whose setup is discussed in detail below.

    process_new_digests(\%config_in, \%config_out);

    reprocess_ALL_digests(\%config_in, \%config_out);

    $full_reply_file = reply_to_digest_message(
        \%config_in, 
        \%config_out, 
        $digest_number, 
        $digest_entry, 
        $directory_for_reply,
    );

    repair_message_order(
        \%config_in, 
        \%config_out,
        {
            year   => 2004,
            month  => 1,     # not 01: a leading zero makes an octal literal
            day    => 27,
        }
    );

    consolidate_threads_multiple(
        \%config_in,
        \%config_out,
        $first_common_letters,  # optional integer argument; defaults to 20
    );

    consolidate_threads_single(
        \%config_in, 
        \%config_out, 
        [
            'first_dummy_file_for_consolidation.thr.txt',
            'second_dummy_file_for_consolidation.thr.txt',
        ],
    );

    delete_deletables(\%config_out);

DESCRIPTION

Mail::Digest::Tools provides useful tools for processing mail which an individual receives in a 'daily digest' version from a mailing list. Digest versions of mailing lists are provided by a variety of mail processing programs and by a variety of list hosts. Within the Perl community, digest versions of mailing lists are offered by such sponsors as ActiveState, SourceForge, Yahoo! Groups and London.pm. However, you do not have to be interested in Perl to make use of Mail::Digest::Tools. Mail from any of the thousands of Yahoo! Groups, for example, may be processed with this module.

If, when you receive e-mail from the digest version of a mailing list, you simply read the digest in an e-mail client and then discard it, you may stop reading here. If, however, you wish to read or store such mail by subject, read on. As printed in a normal web browser, this document contains 40 pages of documentation. You are urged to print this documentation out and study it before using this module.

To understand how to use Mail::Digest::Tools, we will first take a look at a typical mailing list digest. We will then sketch how that digest looks once processed by Mail::Digest::Tools. We will then discuss Mail::Digest::Tools' exportable functions. Next, we will study how to prepare the two configuration hashes which hold the configuration data. Finally, we will provide some tips for everyday use of Mail::Digest::Tools.

A TYPICAL MAILING LIST DIGEST

Here is a dummied-up version of a typical mailing list digest as it appears once saved to a plain-text file. For illustrative purposes, let us suppose that the file is named: 'Perl-Win32-Users Digest, Vol 1 Issue 9999.txt'

    Send Perl-Win32-Users mailing list submissions to
    perl-win32-users@listserv.ActiveState.com

    When replying, please edit your Subject line so it is more specific
    than "Re: Contents of Perl-Win32-Users digest..."

    Today's Topics:

      1. Introducing Mail::Digest::Tools (James E Keenan)
      2. A Different Discussion (steve)
      3. Re:  Introducing Mail::Digest::Tools (David H Adler)

    ----------------------------------------------------------------------

    Message: 1
    From: "James E Keenan" <jkeen@some.web.address.com>
    To: <Perl-Win32-Users@listserv.activestate.com>
    Subject: Introducing Mail::Digest::Tools
    Date: Sat, 31 Jan 2004 14:10:20 -0600

    Mail::Digest::Tools is the greatest thing since sliced bread.
    Go download it now!

    ------------------------------

    Message: 2
    From: "steve" <steve@some.web.address.com>
    To: <Perl-Win32-Users@listserv.activestate.com>
    Subject: A Different Discussion
    Date: Sat, 31 Jan 2004 14:40:20 -0600

    This is a new topic.  I am not discussing Mail::Digest::Tools in this 
    submission.

    ------------------------------

    Message: 3
    From: "David H Adler" <dha@some.web.address.com>
    To: <Perl-Win32-Users@listserv.activestate.com>
    Subject: Re: Introducing Mail::Digest::Tools
    Date: Sat, 31 Jan 2004 14:50:20 -0600

    Jim, what's this nonsense about sliced bread.  Weren't you on the Atkins 
    diet?  Unlike beer, sliced bread is Off Topic.

    ------------------------------

    _______________________________________________
    Perl-Win32-Users mailing list
    Perl-Win32-Users@listserv.ActiveState.com
    To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

    End of Perl-Win32-Users Digest

Note that the digest has an overall structure, while each message within the digest has its own structure.

The digest's overall structure consists of:

The Typical Digest After Processing with Mail::Digest::Tools

Using the dummy messages provided above, typical use of Mail::Digest::Tools would produce (in a bare-bones configuration) the following results:

FUNCTIONS

Mail::Digest::Tools exports no functions by default. Each of its current seven functions is imported only on request by your script.

In everyday use, you will probably call just one of Mail::Digest::Tools' exportable functions in a particular Perl script. Typically, you will import the function as described in the SYNOPSIS above, populate two configuration hashes, and finally call the one function you have imported.

As will become evident, the most challenging part of using Mail::Digest::Tools is not calling the functions. Rather, it is the initial setup and testing of configuration files from which the two configuration hashes passed as arguments to the various Mail::Digest::Tools functions are drawn.

More on those configuration hashes later. For now, let's look at the exportable functions.

process_new_digests

    process_new_digests(\%config_in, \%config_out);

process_new_digests() is the Mail::Digest::Tools function which you will use most frequently on a daily basis. Based on information supplied in the two configuration hashes passed to it as arguments, process_new_digests() does the following:

reprocess_ALL_digests

    reprocess_ALL_digests(\%config_in, \%config_out);

reprocess_ALL_digests() is the Mail::Digest::Tools function which you should use ONLY when you are setting up and fine-tuning Mail::Digest::Tools to process a given digest -- and you should NEVER use it thereafter!

Why? Read on!

reprocess_ALL_digests() does almost exactly the same things as process_new_digests(), but it does them on ALL digest files found in the directory in which you store such digests -- not just on those which have not previously been processed. In the process it does not merely append new messages to already existing thread files, leaving older thread files untouched. Instead, reprocess_ALL_digests() WIPES OUT your entire directory of thread files and rebuilds it from scratch.

That's fine if you have retained every instance of a given digest which you wish to process into thread files. But if you have thrown out older instances of a given digest and then call reprocess_ALL_digests(), the messages contained in those discarded digests cannot be rebuilt into thread files: their sources are gone. Discarding old digests is therefore safe only once you are certain that you have a given digest configured just the way you want it -- and not before.

The ALL CAPS in reprocess_ALL_digests() is a little warning that this Mail::Digest::Tools function is very powerful, but potentially very dangerous. You are also alerted to this danger by this screen prompt which appears when you call this function:

     By default, this program processes only NEWLY ARRIVED
     [London.pm/other digest] files found in this directory.  Messages in
     these new digests are sorted and appended to the appropriate
     '.thr.txt' files in the 'Threads' subdirectory.

     However, by choosing method 'reprocess_ALL_digests()' you have
     indicated that you wish to process ALL digest files found in this     
     directory -- regardless of whether or not they have previously been
     processed.  This is recommended ONLY for initialization and testing 
     of this program.

     Since this will wipe out all threads files ('.thr.txt') as well -- 
     including threads files for which you no longer have their source 
     digest files -- please confirm that this is your intent by typing 
     ALL at the prompt.


                               GOT IT?

To proceed, you must type ALL in ALL CAPS, hit [Enter], then respond to yet another prompt:

     You have chosen to WIPE OUT all '.thr.txt' files currently
     existing in the 'Threads' subdirectory and reprocess all
     [London.pm/other digest] digest files from scratch.

     Please re-confirm your choice by once again typing 'ALL'
         and hitting [Enter]:

You must again type ALL in ALL CAPS and hit [Enter] to reprocess all digests. Should you fail to type ALL at both of these prompts, your script will default to process_new_digests() and only process newly arrived digest files.
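
The ask-twice-then-fall-back logic described above can be sketched in a few lines of core Perl. This is only an illustration of the idea, not the module's own code: the helper name confirmed_all() and the choice to read answers from a filehandle (so the logic can be exercised without a terminal) are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if BOTH answers read from the given
# filehandle are exactly 'ALL' in upper case.
sub confirmed_all {
    my ($fh) = @_;
    for (1 .. 2) {
        my $answer = <$fh>;
        return 0 unless defined $answer;   # input ended early
        chomp $answer;
        return 0 unless $answer eq 'ALL';  # anything else means "no"
    }
    return 1;
}

# Simulate two users with in-memory filehandles.
my $typed_twice = "ALL\nALL\n";
my $typed_wrong = "ALL\nall\n";
open my $yes_fh, '<', \$typed_twice or die "Cannot open in-memory input: $!";
open my $no_fh,  '<', \$typed_wrong or die "Cannot open in-memory input: $!";

my $mode_yes = confirmed_all($yes_fh) ? 'reprocess ALL digests' : 'process new digests only';
my $mode_no  = confirmed_all($no_fh)  ? 'reprocess ALL digests' : 'process new digests only';

print "$mode_yes\n";
print "$mode_no\n";
```

The second simulated user typed 'all' in lower case at the second prompt, so the sketch falls back to the safe choice, mirroring the behaviour described for the real prompts.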

reply_to_digest_message

    $full_reply_file = reply_to_digest_message(
        \%config_in, 
        \%config_out, 
        $digest_number, 
        $digest_entry, 
        $directory_for_reply,
    );

Once you have begun to follow discussion threads on a mailing list with the aid of Mail::Digest::Tools, you may wish to join the discussion and reply to a message.

If you tried to do this by hitting the 'Reply' button in your e-mail client, you would probably end up with a 'Subject' line in your e-mail that looked like this:

    Re: london.pm digest, Vol 1 #1814 - 2 msgs

Needless to say, this is tacky. So tacky that many mailing list digest programs insert this message into each digest's headers:

    When replying, please edit your Subject line so it is more specific
    than "Re: Contents of london.pm digest, Vol 1, #xxxx..."

You don't want to be tacky; you want to be lazy. You want Perl to do the work of initiating an e-mail with a meaningful subject header for you. Mail::Digest::Tools' reply_to_digest_message() does just this. It creates a plain-text file for you that has a meaningful subject line and prefixes each line of the body of the message with '> '. You then open this plain-text file, edit it to reply to its contents, copy-and-paste it into your e-mail client, and send it.
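
To make the idea concrete, here is a minimal sketch of how a reply file of this kind could be assembled by hand. It is not the module's implementation: the %message hash, its contents, and the use of '> ' as the quote marker are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical message data; in real use this would come from a digest.
my %message = (
    subject => 'Re: Introducing Mail::Digest::Tools',
    body    => "Jim, what's this nonsense about sliced bread.\nWeren't you on the Atkins diet?\n",
);

# Build a subject line that starts with exactly one 'Re: '.
(my $subject = $message{subject}) =~ s/^(?:Re:\s*)+//i;
my $reply_subject = "Re: $subject";

# Prefix every line of the body with the conventional quote marker.
my $quoted = join '', map { "> $_\n" } split /\n/, $message{body};

my $reply = "Subject: $reply_subject\n\n$quoted";
print $reply;
```

The resulting text has a subject naming the thread rather than the digest, and a quoted body ready to be edited.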

The arguments passed to reply_to_digest_message() are:

repair_message_order

    repair_message_order(
        \%config_in, 
        \%config_out,
        {
            year   => 2004,
            month  => 1,     # not 01: a leading zero makes an octal literal
            day    => 27,
        }
    );

From time to time you may receive digest versions of mailing lists out of chronological/numerical sequence. This is especially true when e-mail traffic is being disrupted by worms or viruses. You may discover that you have received and processed

    london.pm digest, Vol 1 #1856 - 7 msgs
    london.pm digest, Vol 1 #1858 - 15 msgs

before realizing that you were missing

    london.pm digest, Vol 1 #1857 - 18 msgs

If you were to now process digest 1857 with process_new_digests(), messages from that digest would be appended to their respective thread files after messages from digest 1858. Since the whole point of Mail::Digest::Tools is to be able to read a discussion thread in chronological order, this would not be desirable.
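
The ordering problem can be made concrete with a small sketch. The data structure below is hypothetical (Mail::Digest::Tools stores threads as plain-text files, not as Perl arrays); it only shows what "appended out of order" means and what the desired reading order looks like.

```perl
use strict;
use warnings;

# Hypothetical thread entries in the order they were appended:
# digest 1857 arrived late, so it landed after digest 1858.
my @thread = (
    { digest => 1856, msg => 3, text => 'original question' },
    { digest => 1858, msg => 1, text => 'second reply' },
    { digest => 1857, msg => 7, text => 'first reply' },
);

# Desired reading order: by digest number, then by message number.
my @repaired = sort {
    $a->{digest} <=> $b->{digest} or $a->{msg} <=> $b->{msg}
} @thread;

print join(', ', map { "$_->{digest}/$_->{msg}" } @repaired), "\n";
```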

Fortunately, you can fix this problem as follows:

consolidate_threads_multiple

    consolidate_threads_multiple(
        \%config_in,
        \%config_out,
    );

or

    consolidate_threads_multiple(
        \%config_in,
        \%config_out,
        $first_common_letters,  # optional integer argument
    );

As described above, Mail::Digest::Tools' process_new_digests() function will, to the greatest extent possible, delete extraneous words such as 'Re:' or 'Fwd:' from a message's subject so that all relevant postings on a given subject can be included in a single thread file. What happens when this is not sufficient? For example, suppose someone posts a message to a list with a slightly misspelled or altered subject line:

Mail::Digest::Tools offers two functions to address this problem. consolidate_threads_multiple() is the easier to use and will be discussed first. This function presumes that people who re-type e-mail subject lines when replying tend to type the first several words correctly, then make errors or alterations toward the end of the subject line. If the first n letters of the subject line of two or more messages are identical, there is a strong chance that the messages are discussing the same topic and should be posted to the same discussion thread. Mail::Digest::Tools' default value for n is 20, but you can set a different value for a particular digest by passing an optional third argument as shown above. consolidate_threads_multiple() accordingly:

consolidate_threads_single

    consolidate_threads_single(
        \%config_in, 
        \%config_out, 
        [
            'first_dummy_file_for_consolidation.thr.txt',
            'second_dummy_file_for_consolidation.thr.txt',
        ],
    );

Suppose that the thread files which you wish to consolidate have names whose spelling diverges before the 21st letter. The algorithm which consolidate_threads_multiple() applies would not detect the potential rationale for consolidation. This could happen when someone tries to change the subject of discussion from:

    Best book for extreme Newbie to programming

to:

    De incunabula nostra (Was Best book for extreme Newbie to programming)

Solution: Hard-code the files to be consolidated as elements of an anonymous array. Pass a reference to that anonymous array as the third argument to consolidate_threads_single() as shown above.

As with consolidate_threads_multiple(), the resulting consolidated file will bear the name of the source file containing the very first posting to the discussion thread. The files so consolidated will not automatically be deleted. Rather, they will be renamed with the extension .DELETABLE as a safety precaution and left for you to delete with delete_deletables().
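
The difference between the two consolidation functions comes down to whether a common prefix exists. The following sketch groups thread titles by their first 20 characters; it illustrates the heuristic only and is not taken from the module's source. The second title in the list is invented for the example.

```perl
use strict;
use warnings;

# Thread titles; the second one is a hypothetical re-typed variant.
my @subjects = (
    'Best book for extreme Newbie to programming',
    'Best book for extreme newbies',
    'De incunabula nostra (Was Best book for extreme Newbie to programming)',
);

my $first_common_letters = 20;

# Group titles by their first N characters.
my %group;
for my $subject (@subjects) {
    my $prefix = substr $subject, 0, $first_common_letters;
    push @{ $group{$prefix} }, $subject;
}

# Only prefixes shared by two or more titles are consolidation candidates.
my @candidates = grep { @{ $group{$_} } > 1 } sort keys %group;

for my $prefix (@candidates) {
    printf "%s => %d thread files\n", $prefix, scalar @{ $group{$prefix} };
}
```

The first two titles share their first 20 characters and are detected together. The 'De incunabula nostra' title shares no prefix with them, which is exactly the case where the files must be named explicitly.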

delete_deletables

    delete_deletables(\%config_out);

Mail::Digest::Tools function delete_deletables() tidies up after use of either consolidate_threads_multiple() or consolidate_threads_single(). Unlike all other public functions provided by Mail::Digest::Tools, delete_deletables() needs to be passed a reference to only one of the two configuration hashes, viz., the 'out' configuration hash. The function simply changes to the directory where thread files for a given digest are stored and deletes all files with the extension .DELETABLE.
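
The clean-up step amounts to removing files by extension. The sketch below shows that operation with core modules in a temporary directory; the file names are hypothetical and the code is an illustration under those assumptions, not the body of delete_deletables().

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Work in a throwaway directory so that nothing real is touched.
my $dir_threads = tempdir(CLEANUP => 1);

# Create two hypothetical leftovers and one live thread file.
for my $name ('old_one.DELETABLE', 'old_two.DELETABLE', 'live.thr.txt') {
    my $path = File::Spec->catfile($dir_threads, $name);
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh or die "Cannot close $path: $!";
}

# Collect every file whose name ends in .DELETABLE and remove it.
opendir my $dh, $dir_threads or die "Cannot open $dir_threads: $!";
my @deletables = grep { /\.DELETABLE\z/ } readdir $dh;
closedir $dh;

my @paths   = map { File::Spec->catfile($dir_threads, $_) } @deletables;
my $removed = unlink @paths;

# List what is left, skipping the '.' and '..' entries.
opendir $dh, $dir_threads or die "Cannot open $dir_threads: $!";
my @remaining = sort grep { !/\A\.\.?\z/ } readdir $dh;
closedir $dh;

print "Removed $removed file(s); remaining: @remaining\n";
```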

CONFIGURATION SETUP OVERVIEW

To use a Mail::Digest::Tools function, you need to answer two fundamental questions:

  1. What internal structure has the mailing list sponsor provided for a given digest?
  2. How do I want to structure the results of applying Mail::Digest::Tools to a particular digest on my system?

Each of these two questions breaks down into sub-parts. Their answers supply you with the information with which you will construct the two configuration hashes passed to most Mail::Digest::Tools functions. Let us take each in turn.

%config_in: THE INTERNAL STRUCTURE OF A DIGEST

The best way to learn about the internal structure of a mailing list digest (other than to study the application which created the digest in the first place) is to accumulate several instances of the digest on your system in a directory devoted to that purpose. Examine the way the digest's filename is formed. Then examine the digest file itself. You will soon pick up a feel for the structure of the digest, which will guide you in configuring Mail::Digest::Tools for your system. That configuration will take the form of a Perl hash which, for illustrative purposes, we shall here call %xxx_config_in where xxx is a short-hand title for a particular digest.

For heuristic purposes we will examine the characteristics of two mailing list digests which the author has been following and archiving for several years: ActiveState's 'Perl-Win32-Users' digest and Yahoo! Groups' Perl Beginners group digest.

Analysis of Digest's File Name

We must study a digest's file name in order to be able to write a pattern with which we will be able to distinguish a digest file from any non-digest file sitting in the same directory, as well as to be able to extract the digest number from that file name.

Once saved as plain-text files, Perl-Win32-Users digest files typically look like this in a directory:

    Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
    Perl-Win32-Users Digest, Vol 1 Issue 1772.txt

Similarly, the Perl Beginner digest files look like this:

    [PBML] Digest Number 1491.txt
    [PBML] Digest Number 1492.txt

To correctly identify Perl-Win32-Users digest files from any other files in the same directory, we compose a string which would form the core of a Perl regular expression, i.e., everything in a pattern except the outer delimiters. Internally, Mail::Digest::Tools passes the file name through a grep { /regexp/ } pattern, so the first key is called grep_formula.

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        ...
    );

The equivalent pattern for the Perl Beginners digest would be:

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        ...
    );

Note that the [ and ] characters have to be escaped with a \ backslash because they are normally metacharacters inside Perl regular expressions.
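
A grep_formula value can be tried out on a list of file names before it goes into a configuration hash. The sketch below does this with plain grep; the non-digest file names and the %grep_formula hash are hypothetical, introduced only for this test.

```perl
use strict;
use warnings;

# Hypothetical directory listing: two digests and two unrelated files.
my @files = (
    'Perl-Win32-Users Digest, Vol 1 Issue 1771.txt',
    '[PBML] Digest Number 1491.txt',
    'notes.txt',
    'PBML-archive.zip',
);

my %grep_formula = (
    pw32u => 'Perl-Win32-Users Digest',
    pbml  => '\[PBML\]',
);

my %found;
for my $list (sort keys %grep_formula) {
    my $pattern = $grep_formula{$list};
    $found{$list} = [ grep { /$pattern/ } @files ];
    print "$list: @{ $found{$list} }\n";
}
```

Because the brackets are escaped, 'PBML-archive.zip' is not mistaken for a digest file.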

We next have to extract the digest number from the digest's file name. Certain mailing list programs give individual digests both a 'Volume' number as well as an individual digest number. Perl-Win32-Users typifies this. In the example above we need to capture both the 1 as volume number and 1771 as digest number. The next key in our configuration hash is called pattern_target:

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        pattern_target          => '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt',
        ...
    );

Note the two sets of capturing parentheses.

Other digests, such as those at Yahoo! Groups, dispense with a volume number and simply increment each digest number:

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        pattern_target          => '.*\s(\d+)\.txt$',
        ...
    );

Note that this pattern_target contains only one pair of capturing parentheses.
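
Both kinds of pattern_target can be checked against sample file names in a short script. Note one assumption in this sketch: the comma after the volume number is written as optional (,?) so that the pattern matches the sample file names shown earlier, which have no comma after 'Vol 1', as well as names that do have one.

```perl
use strict;
use warnings;

# Patterns modelled on the pattern_target values discussed above.
# The optional comma (,?) is an assumption made for this sketch.
my $pw32u_target = '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt';
my $pbml_target  = '.*\s(\d+)\.txt$';

# Two pairs of capturing parentheses yield volume and digest number.
my @pw32u = 'Perl-Win32-Users Digest, Vol 1 Issue 1771.txt' =~ /$pw32u_target/;

# One pair of capturing parentheses yields the digest number only.
my @pbml = '[PBML] Digest Number 1491.txt' =~ /$pbml_target/;

print "pw32u: volume $pw32u[0], digest $pw32u[1]\n";
print "pbml: digest $pbml[0]\n";
```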

Analysis of Digest's Internal Structure

A digest's internal structure is discussed in detail above (see 'A TYPICAL MAILING LIST DIGEST'). Here we need to identify two characteristics: the way the digest introduces its list of today's topics and the string it uses to delimit the list of today's topics from the first individual message in the digest and all subsequent messages from one another. Continuing with our two examples from above, we provide values for keys topics_intro and source_msg_delimiter:

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        pattern_target          => '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt',
        topics_intro            => 'Today\'s Topics:',
        source_msg_delimiter    => "--__--__--\n\n",
        ...
    );

Note the escaped ' apostrophe character in the value for key topics_intro.

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        pattern_target          => '.*\s(\d+)\.txt$',
        topics_intro            => 'Topics in this digest:',
        source_msg_delimiter    => "________________________________________________________________________\n________________________________________________________________________\n\n",
        ...
    );

Note that the values provided for the respective source_msg_delimiter keys had to be double-quoted strings. That's because all such delimiters include two or more \n newline characters so that they form paragraphs unto themselves. Unless indicated otherwise, the values for all other keys in the configuration hash are single-quoted strings.
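
The reason for the double quotes is easy to verify: inside single quotes Perl leaves \n as a backslash followed by the letter n, while inside double quotes it becomes a single newline character. A minimal check, using only the delimiter string from the example above:

```perl
use strict;
use warnings;

my $single = '--__--__--\n\n';   # each \n stays as two literal characters
my $double = "--__--__--\n\n";   # each \n becomes one newline character

printf "single-quoted length: %d\n", length $single;   # 14
printf "double-quoted length: %d\n", length $double;   # 12
```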

Note: In early 2004, while Mail::Digest::Tools was being prepared for its initial distribution on CPAN, ActiveState changed certain features in the daily digest versions of its mailing lists. Hence, the code example presented above should not be 'copied-and-pasted' into a configuration hash with which you, the user, might follow the current Perl-Win32-Users digest. In particular, the source message delimiter was changed to a string of 30 hyphens followed by 2 \n newline characters:

    "------------------------------\n\n"

However, since it is not unheard of for contributors to a mailing list to use such a string of hyphens within their postings or signatures, using a string of hyphens is not a particularly apt choice for a source message delimiter. In this particular case, the author is getting better (but not fully tested) results by including an additional newline before the hyphen string in order to more uniquely identify the source message delimiter:

    "\n------------------------------\n\n"
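
The effect of the extra leading newline can be demonstrated on a small, made-up digest fragment. In the sketch below the first message contains a rule of 40 hyphens followed by a blank line; this input is hypothetical and chosen to trigger the problem, and split is used here only as a stand-in for whatever the module does internally.

```perl
use strict;
use warnings;

my $hyphens = '-' x 30;
my $loose   = "$hyphens\n\n";     # delimiter as the digest supplies it
my $strict  = "\n$hyphens\n\n";   # same, anchored by a preceding newline

# Hypothetical digest fragment: message 1 contains a 40-hyphen rule
# followed by a blank line, which the loose delimiter also matches.
my $digest =
      "Message: 1\nSubject: First\n\nSee the table.\n"
    . ('-' x 40) . "\n\nEnd of table.\n"
    . $strict
    . "Message: 2\nSubject: Second\n\nA reply.\n"
    . $strict;

my @loose_parts  = split /\Q$loose\E/,  $digest;
my @strict_parts = split /\Q$strict\E/, $digest;

printf "loose delimiter:  %d pieces\n", scalar @loose_parts;
printf "strict delimiter: %d pieces\n", scalar @strict_parts;
```

With the loose delimiter the first message is cut in two at its own rule of hyphens; with the leading newline the fragment splits into the two messages it actually contains.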

Analysis of Individual Messages

The internal structure of an individual message within a digest is also discussed in detail above. Here we need to identify patterns with which we can extract the content of the message's headers.

Certain mailing list digest programs allow a wide variety of headers to appear in digested messages. The Perl-Win32-Users digest typifies this. Each message in a Perl-Win32-Users digest must have a message number and headers for the message's author, recipients, subject and date.

    Message: 1
    From: Chris Smithson <ChrisSmithson@some.web.address.com>
    To: "'Carter Kraus'" <carter@some.web.address.com>,
           "Perl-Win32-Users (E-mail)" <perl-win32-users@activestate.com>
    Subject: RE: OO Perl Issue.
    Date: Wed, 4 Feb 2004 14:17:24 -0600 

But a message in this digest may have additional headers for the author's organization, reply address and/or carbon-copy recipients.

    Message: 5
    Date: Wed, 4 Feb 2004 15:15:44 -0800
    From: Sam Spade <sspade@some.web.address.com>
    Organization: Some Web Address
    Reply-To: Sam Spade <sspade@some.web.address.com>
    To: "Time" <summers@some.web.address.com>
    CC: "Perl List" <perl-win32-users@listserv.activestate.com>
    Subject: Re: New IE Update causes script problems

Patterns are easily developed to capture this information and store it in the configuration hash:

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        pattern_target          => '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt',
        topics_intro            => 'Today\'s Topics:',
        source_msg_delimiter    => "--__--__--\n\n",
        message_style_flag      => '^Message:\s+(\d+)$',
        from_style_flag         => '^From:\s+(.+)$',
        org_style_flag          => '^Organization:\s+(.+)$',
        to_style_flag           => '^To:\s+(.+)$',
        cc_style_flag           => '^CC:\s+(.+)$',
        subject_style_flag      => '^Subject:\s+(.+)$',
        date_style_flag         => '^Date:\s+(.+)$',
        reply_to_style_flag     => '^Reply-To:\s+(.+)$',
        ...
    );
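
Header patterns like these are easy to try out line by line before committing them to a configuration hash. The following sketch applies them to the headers of the 'Message: 5' example above; the shortened key names in %flag and the line-by-line loop are choices made for this illustration, not a description of how the module applies the patterns.

```perl
use strict;
use warnings;

# The header patterns from the configuration hash, keyed by short names.
my %flag = (
    message  => '^Message:\s+(\d+)$',
    from     => '^From:\s+(.+)$',
    org      => '^Organization:\s+(.+)$',
    to       => '^To:\s+(.+)$',
    cc       => '^CC:\s+(.+)$',
    subject  => '^Subject:\s+(.+)$',
    date     => '^Date:\s+(.+)$',
    reply_to => '^Reply-To:\s+(.+)$',
);

my @header_lines = (
    'Message: 5',
    'Date: Wed, 4 Feb 2004 15:15:44 -0800',
    'From: Sam Spade <sspade@some.web.address.com>',
    'Organization: Some Web Address',
    'Reply-To: Sam Spade <sspade@some.web.address.com>',
    'To: "Time" <summers@some.web.address.com>',
    'CC: "Perl List" <perl-win32-users@listserv.activestate.com>',
    'Subject: Re: New IE Update causes script problems',
);

# Try each pattern on each line; store the first capture that matches.
my %header;
for my $line (@header_lines) {
    for my $name (sort keys %flag) {
        my $pattern = $flag{$name};
        if ($line =~ /$pattern/) {
            $header{$name} = $1;
            last;
        }
    }
}

print "$_: $header{$_}\n" for sort keys %header;
```

Because every pattern is anchored with ^, the 'Reply-To:' line is not mistaken for a 'To:' line.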

Other mailing list digest programs allow far fewer headers in digested messages. The Yahoo! Groups digests such as Perl Beginner typify this.

    Message: 4
       Date: Sun, 7 Dec 2003 19:24:03 +1100
       From: Philip Streets <phil@some.web.address.com.au>
    Subject: RH9.0, perl 5.8.2 and qmail-localfilter question

The patterns developed to capture this information and store it in the configuration hash would be as follows:

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        pattern_target          => '.*\s(\d+)\.txt$',
        topics_intro            => 'Topics in this digest:',
        source_msg_delimiter    => "________________________________________________________________________\n________________________________________________________________________\n\n",
        message_style_flag      => '^Message:\s+(\d+)$',
        from_style_flag         => '^\s+From:\s+(.+)$',
        subject_style_flag      => '^Subject:\s+(.+)$',
        date_style_flag         => '^\s+Date:\s+(.+)$',
        ...
    );

Note that the patterns for from_style_flag and date_style_flag are written to expect one or more whitespace characters at the beginning of the line, because this digest indents its 'From:' and 'Date:' headers.
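
The need for the leading \s+ can be confirmed by testing an indented header line against both forms of the pattern. The variable names below are hypothetical; the header line is modelled on the example above.

```perl
use strict;
use warnings;

my $anchored = '^Date:\s+(.+)$';      # expects 'Date:' in column one
my $indented = '^\s+Date:\s+(.+)$';   # expects leading whitespace

my $line = '   Date: Sun, 7 Dec 2003 19:24:03 +1100';

my ($anchored_hit) = $line =~ /$anchored/;
my ($indented_hit) = $line =~ /$indented/;

print 'anchored pattern: ', (defined $anchored_hit ? $anchored_hit : 'no match'), "\n";
print 'indented pattern: ', (defined $indented_hit ? $indented_hit : 'no match'), "\n";
```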

We could -- but do not need to -- add the following key-value pairs to the %pbml_config_in hash.

        org_style_flag          => undef,
        to_style_flag           => undef,
        cc_style_flag           => undef,
        reply_to_style_flag     => undef,

Inspection of Messages for Multipart MIME Content

Certain mailing lists allow subscribers to post messages in either plain-text or HTML. Certain lists allow subscribers to post attachments; others do not. When it comes to preparing digests of these messages, the approaches which different lists take lead to different results. The most annoying situation occurs when a list allows a subscriber to post in 'multipart MIME format' and then fails to strip out the redundant HTML part after printing the needed plain-text part.

Example: An all too typical example from an older version of an ActiveState list digest. (ActiveState changed the format of its digests in early 2004 to strip out HTML attachments. Hence, the following code no longer accurately represents what a subscriber to an ActiveState digest will see. Other mailing lists still suffer from MIME bloat, however, so treat the following code as illustrative.) The message begins:

    Message: 1
    To: Perl-Win32-Users@activestate.com
    Subject: Can not tie STDOUT to scolled Tk widget
    From: John_Wonderman@some.web.address.ca
    Date: Thu, 15 Jan 2004 16:25:17 -0500
    This is a multipart message in MIME format.
    --=_alternative 00750F0485256E1C_=
    Content-Type: text/plain; charset="US-ASCII"
    Hi;
    I am trying to implement a scrolling text widget to capture output for for 
    at tk app. Without scrolling:
    my $text = $mw->Text(-width => 78,
           -height => 32,
           -wrap => 'word',
           -font => ['Courier New','11']
    )->pack(-side => 'bottom',
           -expand => 1,
           -fill => 'both',
    );
    ...

When the plain-text part of the message is finished, it is then repeated in HTML:

    --=_alternative 00750F0485256E1C_=
    Content-Type: text/html; charset="US-ASCII"
    <br><font size=2 face="Tahoma">Hi;</font>
    <p><font size=2 face="Tahoma">I am trying to implement a scrolling text
    widget to capture output for for at tk app. Without scrolling:</font>
    <p><font size=2 face="Bitstream Vera Sans Mono">my $text = $mw-&gt;Text(-width
    =&gt; 78,</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -height =&gt; 32,</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -wrap =&gt; 'word',</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -font =&gt; ['Courier New','11']</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">)-&gt;pack(-side =&gt;
    'bottom',</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -expand =&gt; 1,</font>
    <br><font size=2 face="Bitstream Vera Sans Mono">&nbsp; &nbsp; &nbsp; &nbsp;
    -fill =&gt; 'both',</font>

There is no reason to retain this bloat in your thread file. The digest providers should have stripped it out, but the program they were using failed to do so. Other digests, such as those at Yahoo! Groups, eliminate all this blather.

Now, with Mail::Digest::Tools, you can eliminate much of the bloat yourself. After examining 6-10 instances of a particular mailing list digest, you should be able to determine whether the digest needs a dose of digital castor oil or not, and you set key MIME_cleanup_flag accordingly. If the digest contains unnecessary multipart MIME content, you set this flag to 1; otherwise, to 0.
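
What such a clean-up involves can be sketched with a simple heuristic: split the body on the MIME boundary line and keep only the text/plain part. This is an illustration under stated assumptions, not the algorithm the module uses when MIME_cleanup_flag is set; the message body is a shortened, hypothetical version of the example above.

```perl
use strict;
use warnings;

# Hypothetical multipart body, shortened from the example above.
my $boundary = '--=_alternative 00750F0485256E1C_=';
my $body = join "\n",
    'This is a multipart message in MIME format.',
    $boundary,
    'Content-Type: text/plain; charset="US-ASCII"',
    'Hi;',
    'I am trying to implement a scrolling text widget.',
    $boundary,
    'Content-Type: text/html; charset="US-ASCII"',
    '<br><font size=2 face="Tahoma">Hi;</font>',
    '<p><font size=2 face="Tahoma">I am trying to implement a scrolling text widget.</font>',
    '';

# Split on the boundary line and keep only the text/plain part.
my @parts = split /^\Q$boundary\E\n/m, $body;
my ($plain) = grep { /\AContent-Type:\s*text\/plain/i } @parts;

# Drop the Content-Type line itself, leaving just the message text.
$plain =~ s/\AContent-Type:[^\n]*\n//;

print $plain;
```

Real digests vary in how they mark boundaries, so a heuristic like this would need testing against several instances of a given digest.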

And with that you have completed your analysis of the internal structure of a given digest and entered the relevant information into the first configuration hash:

    %pw32u_config_in = (
        grep_formula            => 'Perl-Win32-Users Digest',
        pattern_target          => '.*Vol\s(\d+),?\sIssue\s(\d+)\.txt',
        topics_intro            => 'Today\'s Topics:',
        source_msg_delimiter    => "--__--__--\n\n",
        message_style_flag      => '^Message:\s+(\d+)$',
        from_style_flag         => '^From:\s+(.+)$',
        org_style_flag          => '^Organization:\s+(.+)$',
        to_style_flag           => '^To:\s+(.+)$',
        cc_style_flag           => '^CC:\s+(.+)$',
        subject_style_flag      => '^Subject:\s+(.+)$',
        date_style_flag         => '^Date:\s+(.+)$',
        reply_to_style_flag     => '^Reply-To:\s+(.+)$',
        MIME_cleanup_flag       => 1,
    );

    %pbml_config_in = (
        grep_formula            => '\[PBML\]',
        pattern_target          => '.*\s(\d+)\.txt$',
        topics_intro            => 'Topics in this digest:',
        source_msg_delimiter    => "________________________________________________________________________\n________________________________________________________________________\n\n",
        message_style_flag      => '^Message:\s+(\d+)$',
        from_style_flag         => '^\s+From:\s+(.+)$',
        subject_style_flag      => '^Subject:\s+(.+)$',
        date_style_flag         => '^\s+Date:\s+(.+)$',
        MIME_cleanup_flag       => 0,
    );
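
Since most of the values in this hash end up inside regular expressions, it can be worth checking that each one compiles before processing any digests. The check below is a hypothetical helper script, not part of Mail::Digest::Tools; the source_msg_delimiter key is omitted here because it is a literal string rather than a pattern.

```perl
use strict;
use warnings;

my %pbml_config_in = (
    grep_formula       => '\[PBML\]',
    pattern_target     => '.*\s(\d+)\.txt$',
    topics_intro       => 'Topics in this digest:',
    message_style_flag => '^Message:\s+(\d+)$',
    from_style_flag    => '^\s+From:\s+(.+)$',
    subject_style_flag => '^Subject:\s+(.+)$',
    date_style_flag    => '^\s+Date:\s+(.+)$',
    MIME_cleanup_flag  => 0,
);

# Select the keys whose values are meant to be used as patterns.
my @pattern_keys = grep {
    $_ eq 'grep_formula' || $_ eq 'pattern_target' || /_style_flag\z/
} sort keys %pbml_config_in;

# Try to compile each pattern; remember the keys that fail.
my @broken;
for my $key (@pattern_keys) {
    my $pattern = $pbml_config_in{$key};
    my $ok = eval { my $re = qr/$pattern/; 1 };
    push @broken, $key unless $ok;
}

if (@broken) {
    print "Broken patterns: @broken\n";
}
else {
    printf "All %d patterns compile\n", scalar @pattern_keys;
}
```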

%config_out: HOW TO PROCESS A DIGEST ON YOUR SYSTEM

%config_in holds the answers to the question: What internal structure has the mailing list sponsor provided for a given digest? In contrast, %config_out will hold the answer to this question: How do I want to structure the results of applying Mail::Digest::Tools to a particular digest on my system?

For purpose of illustration, we will continue to assume that we are processing digest files received from the Perl-Win32-Users and Perl Beginner lists. We will make slightly different choices as to how we process those digest files so as to illustrate different options available from Mail::Digest::Tools.

We shall also assume that we are going to place the scripts from which we call Mail::Digest::Tools functions in the directory above the directories in which we store the digest files once they have been saved as plain-text files. If we call this directory digest and place the scripts in that directory, then we will have a directory structure that starts out like this:

    digest/
        process_new.pl
        process_ALL.pl
        reply_digest_message.pl
        repair_digest_order.pl
        consolidate_threads.pl
        deletables.pl
        pw32u/
            Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
            Perl-Win32-Users Digest, Vol 1 Issue 1772.txt
        pbml/
            [PBML] Digest Number 1491.txt
            [PBML] Digest Number 1492.txt

Required %config_out Keys

There are 9 keys which are required in %config_out in order for Mail::Digest::Tools to function properly. They correspond to 9 decisions which you must make in setting up a Mail::Digest::Tools configuration on your system.

1 Title

Each digest must be given a title which is used whenever Mail::Digest::Tools needs to prompt or warn you on standard output. The key which holds this information in %config_out must be called title; the value for this element should be sensible.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        ...
    );

2 Digest Directory

For each digest a directory must be designated where individual digest files are stored in plain-text format. The key which holds this information in %config_out must be called dir_digest. In the examples below directories are named relative to the 'current' directory (..), i.e., the directory where the script invoking a Mail::Digest::Tools function is stored.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        ...
    );

3 Threads Directory

For each digest a directory must be designated where the thread files created by use of Mail::Digest::Tools functions are stored. The key which holds this information in %config_out must be called dir_threads. In the examples below the threads directory is a subdirectory of the digest directory, but you may make other choices.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        ...
    );
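
Whether Mail::Digest::Tools creates these directories for you is not stated here, so it may be prudent to make sure they exist before the first run. The following is a minimal sketch under that assumption; the temporary directory merely stands in for your own top-level directory, and nothing in it is part of the module itself.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# A temporary directory stands in for the top-level 'digest' directory.
my $top = tempdir(CLEANUP => 1);

my %pw32u_config_out = (
    title       => 'Perl-Win32-Users',
    dir_digest  => "$top/pw32u",
    dir_threads => "$top/pw32u/Threads",
);

# Create each configured directory unless it is already there.
for my $key (qw(dir_digest dir_threads)) {
    my $dir = $pw32u_config_out{$key};
    make_path($dir) unless -d $dir;
    print "$key is ready: $dir\n";
}
```

Because make_path() creates intermediate directories as needed, the order in which the two keys are handled does not matter.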

4 Digests Log File

For each digest a file must be kept which logs whether a given digest file has already been processed or not and, if so, when. The key which holds this information in %config_out must be called digests_log. It has been found convenient to keep this file in the digests directory, but you may make other choices.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        ...
    );

5 Today's Topics

For each digest a file must be kept which holds an ongoing record of the list of topics found in each individual digest file. The key which holds this information in %config_out must be called todays_topics. It has been found convenient to keep this file in the digests directory, but you may make other choices.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        ...
    );

6 Format for Identifying Digest Number in Output

For each digest you must choose how to format the number(s) of the individual digest file being processed when messages from that file are written to a threads file. What you are doing here is formatting the information captured by the pattern_target key in a given digest's %config_in (see above). You express this choice as a single-quoted string which formats the data captured by the Perl regular expression in pattern_target. This formatting is done via the Perl sprintf function. The resulting string is assigned as the value of %config_out key id_format.

We saw above that digests from the Perl-Win32-Users list carried both a volume number and an individual digest number.

    Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
    Perl-Win32-Users Digest, Vol 1 Issue 1772.txt

Both numbers were captured by the Perl regular expression in %pw32u_config_in key pattern_target.

    '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',

Here we have chosen to format the volume number as a 3-digit, 0-padded number and the individual digest number as a 4-digit, 0-padded number. We then join these two data with an underscore.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        id_format                  => 'sprintf("%03d",$1) . \'_\' . sprintf("%04d",$2)',
        ...
    );

We saw above that digests from the Perl Beginners list carried only a digest number -- no volume number.

    [PBML] Digest Number 1491.txt
    [PBML] Digest Number 1492.txt

This number was captured by the Perl regular expression in %pbml_config_in key pattern_target.

    '.*\s(\d+)\.txt$'

Here we have chosen to format the digest number as a 5-digit, 0-padded number.

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        id_format                  => 'sprintf("%05d",$1)',
        ...
    );

Note that if you allow for a 4-digit number, the highest numbered digest you can process off a given mailing list will be 9999. If you allow for a 5-digit number, the upper limit will be 99999. The latter should be sufficient for a lifetime even for a mailing list (e.g., London.pm) which generates 3 or 4 digest files per day or over 1000 per year.
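
To see how a pattern_target regex and an id_format string might fit together, here is a small sketch. It assumes that the format string is applied with a string eval while the regex captures are still in scope; that is a guess about the mechanism, not a description of the module's internals. The helper format_digest_id() is hypothetical, and the Perl-Win32-Users file name used here is an assumed form with a comma after the volume number so that it matches the pattern shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mail::Digest::Tools): match a digest
# file name against a pattern_target-style regex, then evaluate an
# id_format-style string while $1, $2 ... still hold the captures.
sub format_digest_id {
    my ($filename, $pattern_target, $id_format) = @_;
    return unless $filename =~ /$pattern_target/;
    # String eval is an assumption about how such a format could be applied.
    my $id = eval $id_format;
    die "Could not evaluate id_format: $@" if $@;
    return $id;
}

my $pbml_id = format_digest_id(
    '[PBML] Digest Number 1491.txt',
    '.*\s(\d+)\.txt$',
    'sprintf("%05d",$1)',
);

my $pw32u_id = format_digest_id(
    'Perl-Win32-Users Digest, Vol 1, Issue 1771.txt',   # assumed file name
    '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
    'sprintf("%03d",$1) . \'_\' . sprintf("%04d",$2)',
);

print "$pbml_id\n";     # 01491
print "$pw32u_id\n";    # 001_1771
```

Trying a format string against a few real file names in this way is a cheap check that the regex captures what you expect before any digests are processed.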

7 Format for Numbering Individual Messages in Output

For each digest you must choose how to format the number which the digest assigns to its individual messages. Experience suggests that 2 digits should be more than sufficient to format this number, as all digests which the author has observed have fewer than 100 entries. However, below we have arbitrarily decided to allow for up to 9999 entries in a given digest. As with the digest number, the formatting is accomplished via the Perl sprintf function. The result is stored in a %config_out key which must be called output_id_format.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        id_format                  => 'sprintf("%03d",$1) . 
                                           \'_\' . sprintf("%04d",$2)',
        output_id_format           => 'sprintf("%04d",$1)',
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        id_format                  => 'sprintf("%05d",$1)',
        output_id_format           => 'sprintf("%04d",$1)',
        ...
    );

8 Thread Message Delimiter

For each digest you must compose a string which will separate one message in a threads file from its successor. This string must be double-quoted and assigned to %config_out key thread_msg_delimiter. For readability, this string should terminate in two or more newline characters (\n\n) so that the delimiter is always a paragraph unto itself.

This delimiter may -- or may not -- be the same string which the mailing list provider uses to separate messages in the digest files themselves. In other words, you may choose to use the same string for thread_msg_delimiter in %config_out as you reported the list provider used in %config_in key source_msg_delimiter.

In the example below we make the thread_msg_delimiter for the output from Perl-Win32-Users to be the same as its source_msg_delimiter.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        id_format                  => 'sprintf("%03d",$1) . 
                                           \'_\' . sprintf("%04d",$2)',
        output_id_format           => 'sprintf("%04d",$1)',
        thread_msg_delimiter       => "--__--__--\n\n",
        ...
    );

Note: In light of the earlier discussion of the changes ActiveState made to its mailing list digests in early 2004, the reader is cautioned that the code above should not be directly 'copied-and-pasted' into a configuration hash with which you might follow an ActiveState mailing list. Treat it as educational. In particular, the author is now testing the following as a setting for $pw32u_config_out{'thread_msg_delimiter'}:

    "\n--__--__--\n\n",

For threads generated by applying Mail::Digest::Tools to the Perl Beginners list, we choose an output message delimiter which differs from the source message delimiter.

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        id_format                  => 'sprintf("%05d",$1)',
        output_id_format           => 'sprintf("%04d",$1)',
        thread_msg_delimiter       => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
        ...
    );

Whatever choice you make for the thread_msg_delimiter it should be a string unlikely to occur within the text of a message and should terminate in two or more newlines.
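
A candidate delimiter can be checked against these guidelines before it goes into %config_out. The sketch below is only an illustration: the two checks and the made-up message bodies are assumptions, and the exact way the module writes delimiters into a thread file is not specified here. The backslash check reflects the restriction recorded in the history notes for v2.01.

```perl
use strict;
use warnings;

my $thread_msg_delimiter = "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n";

# Hypothetical sanity checks, not part of Mail::Digest::Tools
my $ends_in_two_newlines = ($thread_msg_delimiter =~ /\n\n\z/)       ? 1 : 0;
my $has_backslash        = (index($thread_msg_delimiter, '\\') >= 0) ? 1 : 0;

# Two made-up messages, each followed by the delimiter
my @messages    = ("First message body.\n", "Second message body.\n");
my $thread_text = join '', map { $_ . $thread_msg_delimiter } @messages;

# \Q...\E quotes regex metacharacters such as '*' inside the delimiter
my @recovered = split /\Q$thread_msg_delimiter\E/, $thread_text;

printf "ends in two newlines: %d, contains backslash: %d, messages: %d\n",
    $ends_in_two_newlines, $has_backslash, scalar @recovered;
```

If the split returns more pieces than there are messages, the delimiter probably also occurs inside a message body and a more distinctive string is called for.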

9 Archive or Delete Threads?

For each digest you process with Mail::Digest::Tools, you must decide whether to archive the resulting thread files in a separate directory after a specified period of time, to delete them from disk after a specified period of time, or to do neither and allow them to accumulate indefinitely in the threads directory. Your decision is represented as the value of %config_out key archive_kill_trigger. This value must be one of three numerical values:

     0    Thread files are neither archived nor deleted

     1    Thread files are archived in a separate directory (or directories) 
          after the number of days specified by key 'archive_kill_days' 
          (see below)

    -1    Thread files are deleted after n days as specified by key 
          'archive_kill_days' 
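
The three settings can be read as a small decision table. The sketch below restates it in code. The helper thread_action() is hypothetical, and it assumes that a thread becomes eligible once its age in days exceeds archive_kill_days; how the module actually measures a thread's age is not described here.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what happens to one thread file.
#   $trigger     - value of archive_kill_trigger (0, 1 or -1)
#   $kill_days   - value of archive_kill_days
#   $age_in_days - age of the thread file, however it is measured
sub thread_action {
    my ($trigger, $kill_days, $age_in_days) = @_;
    return 'keep' if $trigger == 0;
    die "archive_kill_trigger must be 0, 1 or -1\n"
        unless $trigger == 1 || $trigger == -1;
    # Assumption: nothing happens until the thread is older than the limit
    return 'keep' if $age_in_days <= $kill_days;
    return $trigger == 1 ? 'archive' : 'delete';
}

for my $trigger (0, 1, -1) {
    printf "trigger %2d, 20-day-old thread, limit 14: %s\n",
        $trigger, thread_action($trigger, 14, 20);
}
```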

In the examples below we have chosen to archive all threads generated by the Perl-Win32-Users list but to kill all threads generated by the Perl Beginner list after a number of days whose specification we shall come to shortly.

    %pw32u_config_out = (
        title                      => 'Perl-Win32-Users',
        dir_digest                 => "../pw32u",
        dir_threads                => "../pw32u/Threads",
        digests_log                => "../pw32u/digests_log.txt",
        todays_topics              => "../pw32u/todays_topics.txt",
        id_format                  => 'sprintf("%03d",$1) . \'_\' . 
                                           sprintf("%04d",$2)',
        output_id_format           => 'sprintf("%04d",$1)',
        thread_msg_delimiter       => "--__--__--\n\n",
        archive_kill_trigger       => 1,
        ...
    );

    %pbml_config_out = (
        title                      => 'Perl Beginner',
        dir_digest                 => "../pbml",
        dir_threads                => "../pbml/Threads",
        digests_log                => "../pbml/digests_log.txt",
        todays_topics              => "../pbml/todays_topics.txt",
        id_format                  => 'sprintf("%05d",$1)',
        output_id_format           => 'sprintf("%04d",$1)',
        thread_msg_delimiter       => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
        archive_kill_trigger       => -1,
        ...
    );
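
Since all nine keys are required, it can be worth confirming that a configuration hash defines each of them before calling any function. The following sketch does that with a hypothetical helper, missing_required_keys(); the key names are the ones introduced in the sections above, but the check itself is not something the module is documented to provide.

```perl
use strict;
use warnings;

# The nine required %config_out keys described above
my @required = qw(
    title dir_digest dir_threads digests_log todays_topics
    id_format output_id_format thread_msg_delimiter archive_kill_trigger
);

# Hypothetical helper: list the required keys a hash does not define
sub missing_required_keys {
    my ($config_out_ref) = @_;
    return grep { !defined $config_out_ref->{$_} } @required;
}

my %pbml_config_out = (
    title                => 'Perl Beginner',
    dir_digest           => "../pbml",
    dir_threads          => "../pbml/Threads",
    digests_log          => "../pbml/digests_log.txt",
    todays_topics        => "../pbml/todays_topics.txt",
    id_format            => 'sprintf("%05d",$1)',
    output_id_format     => 'sprintf("%04d",$1)',
    thread_msg_delimiter => "_*_*_*_*_*_\n_*_*_*_*_*_\n\n\n",
    archive_kill_trigger => -1,
);

my @missing_complete = missing_required_keys(\%pbml_config_out);

# The same hash with one key removed, to show a failing check
my %incomplete = %pbml_config_out;
delete $incomplete{archive_kill_trigger};
my @missing_incomplete = missing_required_keys(\%incomplete);

print "Complete hash is missing: ",   scalar(@missing_complete), " key(s)\n";
print "Incomplete hash is missing: @missing_incomplete\n";
```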

This completes the 9 required keys for %config_out. We now turn to keys which are either optional or which are required if you have assigned a value of 1 or -1 to key archive_kill_trigger.

Optional %config_out Keys

HELPFUL HINTS ^

... in which the module author shares what he has learned using Mail::Digest::Tools and its predecessors since August 2000.

Initial Configuration and Testing

As mentioned above, if you are considering creating a local archive of threads originating in daily digest versions of a mailing list, you should first accumulate 6-10 instances of such digests and both:

  1. study the internal structure of the digest -- needed to develop a %config_in for the digest; and
  2. carefully consider how you wish to structure the output from the module's use on your system -- needed to develop %config_out for the digest

Once you have developed the initial configuration, you should call reprocess_ALL_digests() on the digests, then open the files created to see if the results are what you want. If they are not what you want, then you need to think about what you should change in %config_in and/or %config_out. Make those changes, then call reprocess_ALL_digests() again. Repeat as needed, making sure not to delete any of the digest files you are using as sources until you are completely satisfied with your configuration.

Once, however, you are satisfied with your configuration, you should call process_new_digests() on new instances of digests and never call reprocess_ALL_digests() for that digest again (lest you not be able to regenerate threads containing messages from digests you have deleted over time).

Where to Store the Configuration Hashes

As mentioned above, you will probably find it convenient to write separate Perl scripts to call each one of Mail::Digest::Tools' public functions. You could code %config_in and %config_out in each of those scripts just before the respective function calls. But that would violate the principle of 'Repeated Code Is a Mistake' and multiply maintenance problems. It's far better to code the two configuration hashes in a separate plain-text file and 'require' that file into your script. That way, any changes you make in the configuration will be automatically picked up by each script that calls a Mail::Digest::Tools function.

Here is an example of such a file holding the configuration hashes governing use of the Perl-Win32-Users digest, along with a script making use of that file.

    # file:  pw32u.digest.data
    $topdir = "E:/Digest/pw32u";
    %config_in =  (
         grep_formula           => 'Perl-Win32-Users digest',
         pattern_target         => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
         # next element's value must be double-quoted
         source_msg_delimiter   => "--__--__--\n\n",
         topics_intro           => 'Today\'s Topics:',
         message_style_flag     => '^Message:\s+(\d+)$',
         from_style_flag        => '^From:\s+(.+)$',
         org_style_flag         => '^Organization:\s+(.+)$',
         to_style_flag          => '^To:\s+(.+)$',
         cc_style_flag          => '^CC:\s+(.+)$',
         subject_style_flag     => '^Subject:\s+(.+)$',
         date_style_flag        => '^Date:\s+(.+)$',
         reply_to_style_flag    => '^Reply-To:\s+(.+)$',
         MIME_cleanup_flag      => 1,
    );

    %config_out =  (
         title                  => 'Perl-Win32-Users',
         dir_digest             => $topdir,
         dir_threads            => "$topdir/Threads",
         dir_archive_top        => "$topdir/Threads/archive",
         archived_today         => "$topdir/archived_today.txt",
         de_archived_today      => "$topdir/de_archived_today.txt",
         deleted_today          => "$topdir/deleted_today.txt",
         digests_log            => "$topdir/digests_log.txt",
         digests_read           => "$topdir/digests_read.txt",
         todays_topics          => "$topdir/todays_topics.txt",
         mimelog                => "$topdir/mimelog.txt",
         id_format              => 'sprintf("%03d",$1) . \'_\' . 
                                        sprintf("%04d",$2)',
         output_id_format       => 'sprintf("%04d",$1)',
         MIME_cleanup_log_flag  => 1,
         # next element's value must be double-quoted
         thread_msg_delimiter   => "--__--__--\n\n",
         archive_kill_trigger   => 1,
         archive_kill_days      => 14,
         digests_read_flag      => 1,
         archive_config         => 0,
    );

    # script:  dig.pl
    # USAGE:  perl dig.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Mail::Digest::Tools qw( process_new_digests );

    our (%config_in, %config_out);
    # leading ./ lets require find the file on Perl 5.26+, where '.' is not in @INC
    my $data_file = './pw32u.digest.data';
    require $data_file;

    process_new_digests(\%config_in, \%config_out);

    print "\nFinished\n";
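
Two details of require are easy to trip over with this arrangement. On Perl 5.26 and later the current directory is no longer in @INC, so a bare file name may not be found unless it starts with ./ or is an absolute path. The required file must also return a true value, for which a final 1; is the usual convention. The sketch below demonstrates the mechanism on its own, without calling the module; the file name demo.digest.data and its contents are made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Package variables which the required file will populate
our (%config_in, %config_out);

my $dir       = tempdir(CLEANUP => 1);
my $data_file = File::Spec->catfile($dir, 'demo.digest.data');

# Write a tiny stand-in for a configuration file
open my $fh, '>', $data_file or die "Couldn't open $data_file: $!";
print {$fh} <<'END_DATA';
%config_in  = ( grep_formula => 'Perl-Win32-Users digest' );
%config_out = ( title        => 'Perl-Win32-Users' );
1;
END_DATA
close $fh or die "Couldn't close $data_file: $!";

# An absolute path is loaded directly, without a search of @INC
require $data_file;

print "Loaded configuration for $config_out{title}\n";
```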

Maintaining Local Archives of More than One Digest

The module author has maintained local archives of more than a half dozen different mailing list digests over the past several years. He has found it convenient to maintain the configuration information for all the digests he is following at a given time in a single configuration file. The advantage to this approach is that if two digests share a similar internal structure (perhaps due to being generated by the same mailing list program or list provider) and if the user chooses to structure the output from the two digests in similar or identical ways, then setting up the configuration hashes becomes much easier and the potential for error is reduced.

Here is a sample directory and file structure for maintaining archives of two different digests on a Win32 system:

    digest/
        digest.data
        process_new.pl
        process_ALL.pl
        reply_digest_message.pl
        repair_digest_order.pl
        consolidate_threads.pl
        deletables.pl
        pw32u/
            Perl-Win32-Users Digest, Vol 1 Issue 1771.txt
            Perl-Win32-Users Digest, Vol 1 Issue 1772.txt
            digest_log.txt
            digest_read.txt
            mimelog.txt
            Threads/
        pbml/
            [PBML] Digest Number 1491.txt
            [PBML] Digest Number 1492.txt
            digest_log.txt
            Threads/

File digest.data would look like this:

    # digest.data
    $topdir = "E:/Digest";
    %digest_structure = (
        pbml =>    {
             grep_formula   => '\[PBML\]',
             pattern_target => '.*\s(\d+)\.txt$',
             ...
           },
        pw32u =>   {
             grep_formula   => 'Perl-Win32-Users digest',
             pattern_target => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt',
             ...
           },
    );
    %digest_output_format = (
        pbml =>    {
             title          => 'Perl Beginner',
             dir_digest     => "$topdir/pbml",
             dir_threads    => "$topdir/pbml/Threads",
             ...
           },
        pw32u =>   {
             title          => 'Perl-Win32-Users',
             dir_digest     => "$topdir/pw32u",
             dir_threads    => "$topdir/pw32u/Threads",
             ...
           },
    );

To accommodate this slightly more complex structure in the configuration file, the calling script might be modified as follows:

    # script:  dig.pl
    # USAGE:  perl dig.pl [short-name for digest]
    #!/usr/bin/perl
    use Mail::Digest::Tools qw( process_new_digests );

    my ($this_key, %config_in, %config_out);
    # variables imported from $data_file
    our (%digest_structure, %digest_output_format);    

    # leading ./ lets require find the file on Perl 5.26+, where '.' is not in @INC
    my $data_file = './digest.data';
    require $data_file;

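    $this_key = shift @ARGV;
    # Stop early if no digest name was supplied on the command line
    die "\n     USAGE:  perl dig.pl [short-name for digest]\n"
        unless defined $this_key;
    die "\n     The command-line argument you typed:  $this_key\n     does not call an accessible digest\n" 
        unless (defined $digest_structure{$this_key}
            and defined $digest_output_format{$this_key});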

    my ($k,$v);
    while ( ($k, $v) = each %{$digest_structure{$this_key}} ) {
        $config_in{$k} = $v;
    }
    while ( ($k, $v) = each %{$digest_output_format{$this_key}} ) {
        $config_out{$k} = $v;
    }

    process_new_digests(\%config_in, \%config_out);

    print "\nFinished\n";
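
The two while/each loops above copy one digest's settings into %config_in and %config_out. The same selection can be written more compactly by dereferencing the inner hashes, as in the sketch below. The helper select_digest_config() is hypothetical, and the sample data is a cut-down stand-in for digest.data.

```perl
use strict;
use warnings;

# Cut-down stand-ins for the hashes defined in digest.data
my %digest_structure = (
    pbml  => { grep_formula   => '\[PBML\]',
               pattern_target => '.*\s(\d+)\.txt$' },
    pw32u => { grep_formula   => 'Perl-Win32-Users digest',
               pattern_target => '.*Vol\s(\d+),\sIssue\s(\d+)\.txt' },
);
my %digest_output_format = (
    pbml  => { title => 'Perl Beginner',    dir_digest => 'E:/Digest/pbml'  },
    pw32u => { title => 'Perl-Win32-Users', dir_digest => 'E:/Digest/pw32u' },
);

# Hypothetical helper: return copies of both hashes for one digest,
# or an empty list if the short name is unknown
sub select_digest_config {
    my ($key) = @_;
    return unless defined $key
        and exists $digest_structure{$key}
        and exists $digest_output_format{$key};
    my %config_in  = %{ $digest_structure{$key} };
    my %config_out = %{ $digest_output_format{$key} };
    return (\%config_in, \%config_out);
}

my ($in_ref, $out_ref) = select_digest_config('pbml');
print "Selected: $out_ref->{title}\n";

my @unknown = select_digest_config('nosuchlist');
print "Unknown digest name rejected\n" unless @unknown;
```

The references returned can be passed straight to a Mail::Digest::Tools function in place of \%config_in and \%config_out. Because they point to copies, changes made through them do not alter the master hashes.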

Getting Your Mail to the Right Place on Your System

For several years the module author used the scripts which were predecessors to Mail::Digest::Tools on a Win32 system where mail was read with Microsoft Outlook Express. He would do a "File/Save as.." on an instance of a digest, select text format (*.txt) and save it to an appropriate directory. Later, the author used the shareware e-mail client Poco, in which the same operation was accomplished by highlighting a file and keying "Ctrl+S".

But as the number of digests the author was tracking grew, this procedure became more and more tedious. Fortunately, about that time the author was assigned to write a review of the second edition of the Perl Cookbook, and he learned how to use the Net::POP3 module to receive his e-mail directly. So now he uses a Perl script to get all his digests and save them as text files to appropriate directories -- and then lets a GUI e-mail client take care of the rest.

Here is a script which more or less accomplishes this:

    # script:  get_digests.pl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::POP3;
    use Term::ReadKey;

    my ($site, $username, $password);
    my ($verref, $pop3, $messagesref, $undeleted, $msgnum, $message);
    my ($k,$v);
    my ($oldfh, $output);

    my %digests = (
        'pbml'   => "E:/Digest/pbml",
        'pw32u'  => "E:/Digest/pw32u",
        'london' => "E:/Digest/london",
    );

    $site = 'pop3.someISP.com';
    $username = 'myuserid';

    $pop3 = Net::POP3->new($site)
            or die "Couldn't open connection to $site: $!";

    print "Enter password for $username at $site:  ";
    ReadMode('noecho');
    $password = ReadLine(0);
    chomp $password;
    ReadMode(0);
    print "\n";

    defined ($pop3->login($username, $password))
        or die "Can't authenticate: $!";

    $messagesref = $pop3->list 
        or die "Can't get list of undeleted messages: $!";

    while ( ($k,$v) = each %$messagesref ) {
        my ($messageref, $line, %headers);
        print "$k:\t$v\n";
        $messageref = $pop3->top($k);
        local $_;
        foreach (@$messageref) {
            chomp;
            last if (/^\s*$/);
            next unless (/^\s*(Date:|From:|Subject:|To:)/);
            if (/^\s*Date:\s*(.*)/) {
                $headers{'Date'} = $1;
            }
            if (/^\s*From:\s*(.*)/) {
                $headers{'From'} = $1;
            }
            if (/^\s*Subject:\s*(.*)/) {
                $headers{'Subject'} = $1;
            }
            if (/^\s*To:\s*(.*)/) {
                $headers{'To'} = $1;
            }
        }
        if ($headers{'Subject'} =~ /^\[PBML\]/) {
            get_digest($pop3, $k, 'pbml', $headers{'Subject'});
        }
        if ($headers{'Subject'} =~ /^Perl-Win32-Users/) {
            get_digest($pop3, $k, 'pw32u', $headers{'Subject'});
        }
        if ($headers{'Subject'} =~ /^london\.pm/) {
            get_digest($pop3, $k, 'london', $headers{'Subject'});
        }
    }

    $pop3->quit() or die "Couldn't quit cleanly: $!";

    print "Finished!\n";

    sub get_digest {
        my ($pop3, $msgnum, $digest, $subj) = @_;
        print "Retrieving $msgnum: $subj";
        my $message = 
            $pop3->get($msgnum) or die "Couldn't get message $msgnum: $!";
        if ($message) {
            print "\n";
            my $digestfile = "$digests{$digest}/$subj.txt";
            _print_message($digestfile, $message);
            print "Marking $msgnum for deletion\n";
            $pop3->delete($msgnum) or die "Couldn't delete message $msgnum: $!";
        } else {
            print "Failed:  $!\n";
        }
    }

    sub _print_message {
        my ($digestfile, $message) = @_;
        my @lines = @{$message};
        my $counter = 0;
        open(FH, ">$digestfile") 
            or die "Couldn't open $digestfile for writing: $!";
        for (my $i = 0; $i<=$#lines; $i++) {
            chomp($lines[$i]);
            # Identify the first blank line in the digest,
            # i.e., the end of the headers
            if ($lines[$i] =~ /^$/) {
                $counter = $i;
                last;
            }
        };
        # Transfer digest to appropriate directory, skipping over digest header
        # so as to start just above Today's Topics
        foreach my $line (@lines[$counter+1 .. $#lines]) {
            chomp($line);
            # For some reason the $pop3->get() puts a single whitespace at the 
            # start of most (all but the first?) lines
            # That has to be cleaned up so digest.pl can correctly process 
            # header info and identify beginning of Today's Topics
            if ($line =~ /^\s(.*)/) {
                print FH $1, "\n";
            } else {
                print FH $line, "\n";
            }
        }
        close FH or die "Couldn't close after writing: $!";
    }

No promise is made that this script or any script contained in this documentation will work correctly on your system. Hack it up to get it to work the way you want it to.
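
One thing you may want to hack in: the script above builds the file name directly from the message's Subject header, and a subject can contain characters which some file systems reject. The sketch below shows one way to guard against that. The helper subject_to_filename() is hypothetical, and the character list is the set commonly forbidden on Win32, which is an assumption you should adjust for your own platform.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a Subject header into a usable file name.
sub subject_to_filename {
    my ($subject) = @_;
    # Replace characters commonly forbidden in Win32 file names
    (my $name = $subject) =~ s{[\\/:*?"<>|]}{_}g;
    # Trim leading and trailing whitespace
    $name =~ s/^\s+|\s+$//g;
    return "$name.txt";
}

print subject_to_filename('[PBML] Digest Number 1491'), "\n";
print subject_to_filename('Re: a/b?'), "\n";
```

Square brackets, spaces and commas are left alone, so the digest subjects used as examples in this document keep the form that their pattern_target expressions expect.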

ASSUMPTIONS AND QUALIFICATIONS ^

1 No Change in Mailing List Digest Software

The main assumption on which Mail::Digest::Tools depends for its success is that the provider of a particular digest continues to use the same mailing list software to produce the digest. If the provider changes his/her software, you must modify Mail::Digest::Tools' configuration data accordingly.

2 Digest Must Be One E-mail Without Attachments

At its current stage of development Mail::Digest::Tools is only applicable to mailing list digests which arrive as one continuous file. It is not applicable to digests (e.g., Cygwin, module-authors@perl.org) which are supplied in a format consisting of (a) one file with instructions and a table of contents and (b) all the individual messages provided as e-mail attachments.

3 Perl 5.6+ Only

The program was created with Perl 5.6. Certain features, such as the use of the our modifier, were not available prior to 5.6. Modifications to account for pre-5.6 features are left as an exercise for the user.

4 Time::Local

Mail::Digest::Tools internally uses Perl core extension Time::Local. If at some future point this module is not included as part of a Perl core distribution, you would have to install it manually from CPAN.
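
For readers who have not used Time::Local, the sketch below shows the kind of date arithmetic it makes possible, here counting the days between two calendar dates. How Mail::Digest::Tools itself uses Time::Local is not described in this section, so the helper days_between() is purely illustrative.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Hypothetical helper: whole days between two calendar dates.
# timegm() works in UTC, which keeps daylight-saving shifts out of the result.
sub days_between {
    my ($from, $to) = @_;    # hash refs with year, month and day
    my @epoch = map {
        timegm(0, 0, 0, $_->{day}, $_->{month} - 1, $_->{year})
    } ($from, $to);
    return int( ($epoch[1] - $epoch[0]) / 86_400 );
}

my $age = days_between(
    { year => 2004, month => 1, day => 27 },
    { year => 2004, month => 2, day => 10 },
);
print "Days elapsed: $age\n";    # 14
```

Note that timegm() expects the month as a number from 0 to 11, hence the subtraction.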

HISTORY AND FUTURE DEVELOPMENT ^

PRE-CPAN HISTORY

ActiveState maintains Perl for Windows-based platforms and also maintains a variety of mailing lists for users of its Windows-compatible versions of Perl. Subscribers to these lists can receive messages either as individual e-mails or as part of a daily digest which contains a listing of the day's topics and the complete text of each message. The messages are often best followed as discussion 'threads' which may extend over several days' worth of digests.

In June of 2000, however, ActiveState had to temporarily take its mailing lists off-line for technical reasons. When these lists were restored to service, their archive capacities were not immediately restored. I had just begun my study of Perl and had come to enjoy reading the Perl-Win32-Users digest. As I set off for the Yet Another Perl Conference in Pittsburgh, I shouted out, 'I want my Perl-Win32-Users digest!' I wrote a Perl script called digest.pl to fill that gap.

ActiveState has since restored archiving capacity to their lists. For reasons that would perhaps best be explored in a psychotherapeutic context, however, I had become attached to my local archive of the 'pw32u' list, so I continued to maintain this program and fine-tune its coding.

In early 2001 it became apparent that this program could be applied to a wide variety of mailing list digests -- not just those provided by ActiveState. In particular, valuable digests provided by Yahoo Groups (formerly E-groups) such as NT Emacs Users, Perl 5 Porters and Perl Beginners could also be archived if digest.pl were modified appropriately. I made those modifications and began to track several other digests. I was able to use the archive I had developed as a window into one part of the Perl community in a Lightning Talk I gave at YAPC::North America in Montreal in June 2001, "An Index of Incivility in the Perl Community."

Maintaining digest.pl was, to a considerable extent, the way I taught myself Perl. Along the way I incorporated my first profiler into the script -- and then discarded it. Some of the subroutines I had written for early versions of the program had applicability to other scripts -- and thus was born my first module -- also since discarded. By July 2003 I was up to version 1.3. Following a suggestion by Uri Guttman at the YAPC::EU conference held in Paris in July 2003, wherever possible the use of separate print statements for each line to be printed was eliminated in favor of concatenating strings to be printed into much larger strings which could be printed all at once. This revision reduced the number of times filehandles had to be opened for writing. A given thread file was now opened only once per call of this program, rather than once for each message in each digest processed per call of the program.

Various other improvements, such as the possibility of stripping out unnecessary multipart MIME content and the introduction of subdirectories for archiving, were made in late 2003. At that point I decided to transform the script into a full-fledged Perl module. At first I tried out an object-oriented structure (with which I was familiar from my first two CPAN modules, List::Compare and Data::Presenter). That OO structure necessitated one constructor and one method call per typical script, but since the constructor did nothing but some cursory validation of the configuration data, it was mostly superfluous. Hence, I jettisoned the OO structure in favor of a functional approach. The result: Mail::Digest::Tools.

CPAN

After these revisions, I was up to version 1.96. Why revert to a lower version number at this point? That is why Mail::Digest::Tools makes its CPAN debut in version 2.04.

v1.97 (2/18/2004): Dealing with problem that Win32 and Unix/Linux may create different thread names for the same set of source messages because they have different lists of characters forbidden in file names. This became a problem while writing tests for process_new_digests() because it made the names of thread files created via that function more difficult to predict. Tests adjusted appropriately.

v1.98 (2/19/2004): Eliminated suspect uses of /o modifier on regexes. This was causing problems when I called process_new_digests() on two different types of digests in the same script. Also, eliminated code referring to DOS (e.g., code eliminating characters unacceptable in DOS filenames) as I have no way to test this module on a DOS box.

v1.99 (2/22/2004): ActiveState introduced a new format for its Perl-Win32-Users digest -- the digest which originally inspired the creation of this module's predecessor in 2000. One aspect of this new format was a clear improvement: HTML attachments are now stripped before messages are posted to the digest, so multipart MIME content has either been reduced considerably or eliminated altogether. But another aspect of this new format upset code going back four years: The delimiter immediately following Today's Topics is now different from the delimiters separating each message in the digest. Working around this appeared to be surprisingly difficult, especially since this revision had to be done in the middle of writing a test suite for CPAN distribution. A new key has been added to the %config_in hash for each digest:

    $config_in{'post_topics_delimiter'}

v2.00 (2/23/2004): Testing conducted after the last revision revealed a bug going back several versions in the internal subroutine stripping multipart MIME content. The last paragraph of each message which did not have MIME content was being stripped off. The offending code was found within _analyze_message_body(). (The author recently learned of the CPAN module Email::StripMime. This looks promising as a replacement for the hand-rolled subroutine used within Mail::Digest::Tools, but a full study of its possibilities will be deferred to a later version.) Also in this version, POD was rewritten to reflect the introduction of the post-topics delimiter.

v2.01 (2/24/2004): Backslashes (except as part of \n newline characters) are prohibited in %config_out key thread_msg_delimiter. This is because in the test suite that key's value is used as a variable inside a regular expression which in turn is used as an argument to split(). Preliminary investigation suggests that to work around the backslash metacharacter in that situation would be very time-consuming.
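
To illustrate why a backslash in a delimiter is troublesome once that delimiter is interpolated into a regular expression, here is a small sketch. It is not taken from the module or its test suite: the delimiter value and the message texts are hypothetical. It also shows Perl's general-purpose \Q...\E escaping. Whether that idiom would have suited the test suite described above is not something this document settles, so the prohibition on backslashes stands.

```perl
use strict;
use warnings;

# Hypothetical delimiter containing a literal backslash followed by 'd'.
my $delimiter = "\n--\\d--\n";
my $thread    = join $delimiter,
    'first message', 'second message', 'third message';

# Interpolated as-is, the "\d" in the delimiter is read by the regex engine
# as "any digit", so the literal backslash in the text is never matched.
my @naive = split /$delimiter/, $thread;

# \Q...\E escapes every regex metacharacter in the interpolated value,
# so the delimiter is matched literally.
my @quoted = split /\Q$delimiter\E/, $thread;

printf "naive: %d piece(s); quoted: %d piece(s)\n",
    scalar @naive, scalar @quoted;
# prints "naive: 1 piece(s); quoted: 3 piece(s)"
```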

v2.02 (2/26/2004): Revised reply_to_digest_message() internal subroutine _strip_down_for_reply to reflect distinction between post-topics delimiter and source message delimiter.

v2.03 (3/04/2004): Fixed bug in readdir call in repair_message_order(). Extensive reworking of test suite.

v2.04 (3/05/2004): No changes in module. Refinement of test suite only.

v2.05 (3/07/2004): Fixed accidental deletion of incrementation of $message_count in _strip_down().

v2.06 (3/10/2004): Correction of errors in test suite. Elimination of use of List::Compare in test suite.

v2.07 (3/11/2004): Correction of error in t/03.t

v2.08 (3/11/2004): Correction in _clean_up_thread_title and in tests.

v2.10 (3/15/2004): Corrections to README and documentation only.

v2.11 (10/23/2004): Fixed several errors which resulted in "Bizarre copy of hash in leave" error when running test suite under Devel::Cover.

v2.12 (05/14/2011): Added 'mirbsd' to list of Unixish-OSes.

AUTHOR ^

James E. Keenan (jkeenan@cpan.org).

Creation date: August 21, 2000. Last modification date: May 14, 2011. Copyright (c) 2000-2011 James E. Keenan. United States. All rights reserved.

This software is distributed with absolutely no warranty, express or implied. Use it at your own risk. This is free software which you may distribute under the same terms as Perl itself.
