Text::Summarizer - Summarize Bodies of Text
    use Text::Summarizer;

    # all constructor arguments shown are OPTIONAL and reflect the DEFAULT VALUES of each attribute
    $summarizer = Text::Summarizer->new(
        articles_path  => 'subdirectory/to/summarize/*',
        permanent_path => 'data/permanent.stop',
        stopwords_path => 'data/stopwords.stop',
        store_working  => 0,
        print_scanner  => 0,
        print_summary  => 0,
        print_graphs   => 0,
        print_typifier => 0,
        return_count   => 20,
        phrase_thresh  => 2,
        phrase_radius  => 5,
        freq_constant  => 0.004,
    );

    $summarizer = Text::Summarizer->new();

    # to summarize a string
    $stopwords = $summarizer->scan_text( 'this is a sample text' );
    $summary   = $summarizer->summ_text( 'this is a sample text' );

    # or to summarize an entire file
    $stopwords = $summarizer->scan_file("some/file.txt");
    $summary   = $summarizer->summ_file("some/file.txt");

    # or to summarize in bulk
    # (if no argument provided, uses the 'articles_path' attribute)
    @stopwords = $summarizer->scan_each("/directory/glob/*");
    @summaries = $summarizer->summ_each("/directory/glob/*");
This module allows you to summarize bodies of text into a scored hash of sentences, phrase-fragments, and individual words from the provided text.
These scores reflect the weight (or precedence) of the respective text-fragments, i.e. how well they summarize or reflect the overall nature of the text.
All of the sentences and phrase-fragments are drawn from within the existing text, and are NOT procedurally generated.
The following constructor attributes are available to the user, and can be accessed/modified at any time via $summarizer->_set_[attribute] :
articles_path
folder containing some text-files you wish to summarize
permanent_path
file containing a base set of universal stopwords (defaults to English stopwords)
stopwords_path
file containing a list of new stopwords identified by the scan function
store_scanner
flag for storing new stopwords in the file indicated by stopwords_path
print_scanner
flag that enables visual graphing of scanner activity (prints to STDOUT)
print_summary
flag that enables visual charting of summary activity (prints to STDOUT)
return_count
number of items to list when printing summary list
phrase_thresh
minimum number of word tokens allowed in a phrase
phrase_radius
distance iterated backward and forward from a given word when establishing a phrase (i.e. maximum length of phrase divided by 2)
freq_constant
mathematical constant for establishing the minimum threshold of occurrence for frequently occurring words (defaults to 0.004)
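As a rough illustration of how the two numeric attributes scale, the sketch below assumes that the frequency cut-off is the number of tokens multiplied by freq_constant, and that a phrase spans phrase_radius tokens on either side of its central word (both readings are taken from the description of the fragment process further down). The subroutine names are hypothetical and are not part of the module's API.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Text::Summarizer.

# Minimum frequency a word must exceed to count as "frequent",
# ASSUMING threshold = number of tokens * freq_constant.
sub min_frequency {
    my ($token_count, $freq_constant) = @_;
    return $token_count * $freq_constant;
}

# Maximum number of tokens in a phrase: the central word plus
# phrase_radius tokens on each side.
sub max_phrase_length {
    my ($phrase_radius) = @_;
    return 2 * $phrase_radius + 1;
}

printf "threshold for 5000 tokens: %s\n", min_frequency(5000, 0.004);
printf "longest phrase at radius 5: %d\n", max_phrase_length(5);
```

Under these assumptions, raising freq_constant shrinks the frequency table, while raising phrase_radius lengthens the phrases gathered around each frequent word.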
These attributes are read-only, and can be accessed via $summarizer->[attribute] :
full_text
all the lines of the provided text, joined together
sentences
list of each sentence found in the provided text
sen_words
for each sentence, contains an array of each word in order
word_list
each individual word of the entire text, in order (token stream)
freq_hash
all words that occur more often than a specified threshold, paired with their frequency of occurrence
clst_hash
for each word in the text, specifies the position of each occurrence of the word, both relative to the sentence it occurs in and absolute within the text
phrs_hash
for each word in the text, contains a phrase of radius r centered around the given word, and references the sentence from which the phrase was gathered
sigma_hash
gives the population standard deviation of the clustering of each word in the text
inter_hash
list of each chosen phrase-fragment-scrap, paired with its score
score_hash
list of each word in the text, paired with its score
phrs_list
list of complete sentences that each scrap was drawn from, paired with its score
frag_list
for each chosen scrap, contains a hash of: the pivot word of the scrap; the sentence containing the scrap; the number of occurrences of each word in the sentence; an ordered list of the words in the phrase from which the scrap was derived
file_name
the filename of the current text-source (if text was extracted from a file)
text_hint
brief snippet of text containing the first 50 and the final 30 characters of the current text
summary
scored lists of each summary sentence, each chosen scrap, and each frequently occurring word
stopwords
list of all stopwords, both permanent and procedural
watchlist
list of procedurally generated stopwords, derived by the `scan` function
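To make the clst_hash and sigma_hash descriptions more concrete, here is a minimal sketch that records the absolute position of every occurrence of each word and then takes the population standard deviation of those positions. It assumes that "clustering" refers to absolute token positions; the module may well measure it differently, and both helper names are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: absolute position of every occurrence of each word
# in a token stream (loosely mirroring the description of clst_hash).
sub word_positions {
    my (@tokens) = @_;
    my %positions;
    for my $i (0 .. $#tokens) {
        push @{ $positions{ lc $tokens[$i] } }, $i;
    }
    return \%positions;
}

# Population standard deviation: sqrt( sum((x - mean)^2) / N ).
sub population_sigma {
    my (@values) = @_;
    return 0 unless @values;
    my $n        = scalar @values;
    my $mean     = sum(@values) / $n;
    my $variance = sum( map { ($_ - $mean) ** 2 } @values ) / $n;
    return sqrt $variance;
}

my @tokens    = qw(the cat saw the dog and the cat ran);
my $positions = word_positions(@tokens);

for my $word (sort keys %$positions) {
    my @where = @{ $positions->{$word} };
    printf "%-4s positions=[%s] sigma=%.3f\n",
        $word, join(',', @where), population_sigma(@where);
}
```

A word that occurs only once has a sigma of 0 in this sketch, while a word spread evenly through the text has a large one.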
Scan is a utility that allows the Text::Summarizer to parse through a body of text to find words that occur with unusually high frequency. These words are then stored as new stopwords via the provided stopwords_path. Additionally, calling any of the three scan_[...] subroutines will return a reference (or array of references) to an unordered list containing the new stopwords.

    $stopwords = $summarizer->scan_text( 'this is a sample text' );
    $stopwords = $summarizer->scan_file( 'some/file/path.txt' );
    @stopwords = $summarizer->scan_each( 'some/directory/*' ); # if no argument provided, uses the 'articles_path' attribute
summarize
Summarizing is, not surprisingly, the heart of the Text::Summarizer. Summarizing a body of text provides three distinct categories of information drawn from the existing text and ordered by relevance to the summary: full sentences, phrase-fragments / context-free token streams, and a list of frequently occurring words.
There are three provided functions for summarizing text documents.

    $summary   = $summarizer->summarize_text( 'this is a sample text' );
    $summary   = $summarizer->summarize_file( 'some/file/path.txt' );
    @summaries = $summarizer->summarize_each( 'some/directory/*' ); # if no argument provided, defaults to the 'articles_path' attribute

    # or their short forms
    $summary   = $summarizer->summ_text('...');
    $summary   = $summarizer->summ_file('...');
    @summaries = $summarizer->summ_each('...'); # if no argument provided, defaults to the 'articles_path' attribute
summarize_text and summarize_file each return a summary hash-ref containing three array-refs, while summarize_each returns a list of these hash-refs. These summary hashes take the following form:
sentences => a list of full sentences from the given text, with composite scores of the words contained therein
fragments => a list of phrase fragments from the given text, scored similarly to sentences
words => a list of all words in the text, scored by a three-factor system consisting of frequency of appearance, population standard deviation, and use in important phrase fragments.
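The following sketch shows one way to walk such a summary hash. It uses a hand-written mock in place of a real return value, and it assumes that every entry in the three array-refs is a [ text, score ] pair; the actual layout of the entries is not specified here and may differ. The top_entries helper is hypothetical.

```perl
use strict;
use warnings;

# Mock summary, standing in for the return value of summarize_text.
# ASSUMPTION: each array-ref holds [ text, score ] pairs; the real
# layout of the entries may differ.
my $summary = {
    sentences => [ [ 'The cat sat on the mat.', 12 ] ],
    fragments => [ [ 'the mat', 3 ], [ 'cat sat', 5 ] ],
    words     => [ [ 'mat', 2 ], [ 'cat', 4 ] ],
};

# Hypothetical helper: the $count highest-scoring entries of one category.
sub top_entries {
    my ($summary, $category, $count) = @_;
    my @sorted = sort { $b->[1] <=> $a->[1] } @{ $summary->{$category} || [] };
    $count = @sorted if $count > @sorted;
    return @sorted[ 0 .. $count - 1 ];
}

for my $category (qw(sentences fragments words)) {
    print "$category:\n";
    printf "  %5.1f  %s\n", $_->[1], $_->[0]
        for top_entries($summary, $category, 2);
}
```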
Phrase fragments are in actuality short "scraps" of text (usually only two or three words) that are derived from the text via the following process:
the entirety of the text is tokenized and scored into a frequency table, which keeps only the words whose frequency exceeds a high-pass threshold of (# of tokens * user-defined scaling factor)
each sentence is tokenized and stored in an array
for each word within the frequency table, a table of phrase-fragments is derived by finding each occurrence of said word and tracking forward and backward by a user-defined "radius" of tokens (defaults to radius = 5, does not include the central key-word); each phrase-fragment is thus composed of (by default) an 11-token string
all fragments for a given key-word are then compared to each other, and each word is deleted if it appears only once amongst all of the fragments (leaving only the words shared between fragments, i.e. (A ∩ B) ∪ (A ∩ C) ∪ ... ∪ (R ∩ S) where A, B, ..., S are the phrase-fragments)
what remains of each fragment is a list of "scraps" — strings of consecutive tokens — from which the longest scrap is chosen as a representation of the given phrase-fragment
when a shorter fragment-scrap (A) is included in the text of a longer scrap (B) such that A ⊂ B, the shorter is deleted and its score is added to that of the longer
when multiple fragments are equivalent (i.e. they consist of the same list of tokens when stopwords are excluded), they are condensed into a single scrap in the form of "(some|word|tokens)" such that the fragment now represents the tokens of the scrap (excluding stopwords) regardless of order (referred to as a "context-free token stream")
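The first steps of this process can be sketched in plain Perl as follows. This is only an illustration under simplifying assumptions: whitespace tokenization, no stopword handling, phrase windows clipped at sentence boundaries, and no merging of contained or equivalent scraps. All subroutine names are hypothetical and none belong to Text::Summarizer.

```perl
use strict;
use warnings;

# Frequency table, keeping only words whose count exceeds
# (number of tokens * scaling factor).
sub frequent_words {
    my ($tokens, $scaling_factor) = @_;
    my %count;
    $count{$_}++ for @$tokens;
    my $threshold = scalar(@$tokens) * $scaling_factor;
    my %frequent;
    for my $word (keys %count) {
        $frequent{$word} = $count{$word} if $count{$word} > $threshold;
    }
    return \%frequent;
}

# For every occurrence of $word, collect up to $radius tokens on each side
# (clipped at the sentence boundaries).
sub phrase_windows {
    my ($sentences, $word, $radius) = @_;
    my @windows;
    for my $tokens (@$sentences) {
        my $last = $#{$tokens};
        for my $i (grep { $tokens->[$_] eq $word } 0 .. $last) {
            my $start = $i - $radius < 0     ? 0     : $i - $radius;
            my $end   = $i + $radius > $last ? $last : $i + $radius;
            push @windows, [ @{$tokens}[ $start .. $end ] ];
        }
    }
    return @windows;
}

# Drop every word that occurs only once across all windows, then keep the
# longest run of consecutive surviving tokens from each window.
sub longest_scraps {
    my (@windows) = @_;
    my %seen;
    $seen{$_}++ for map { @$_ } @windows;

    my @scraps;
    for my $window (@windows) {
        my (@best, @run);
        for my $token (@$window, undef) {    # trailing undef flushes the final run
            if (defined $token && $seen{$token} > 1) {
                push @run, $token;
                next;
            }
            @best = @run if @run > @best;
            @run  = ();
        }
        push @scraps, join ' ', @best;
    }
    return @scraps;
}

my @sentences = map { [ split ' ', lc $_ ] } (
    'The quick brown fox jumps over the lazy dog',
    'A quick brown fox is a clever animal',
);
my @all_tokens = map { @$_ } @sentences;

# A scaling factor of 0.1 is used only because the sample text is tiny.
my $frequent = frequent_words(\@all_tokens, 0.1);
for my $word (sort keys %$frequent) {
    my @scraps = longest_scraps( phrase_windows(\@sentences, $word, 2) );
    print "$word: $_\n" for @scraps;
}
```

Folding a shorter scrap into a longer one that contains it, and condensing equivalent fragments into a "(some|word|tokens)" stream, would follow as further passes over the list of scraps.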
"(some|word|tokens)"
Bugs should always be submitted via the project hosting bug tracker:
https://github.com/faelin/text-summarizer/issues
For other issues, contact the maintainer.
Faelin Landy <faelin.landy@gmail.com> (current maintainer)
* Michael McClennen <michaelm@umich.edu>
Copyright (c) 2018 by the AUTHOR as listed above
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
To install Text::Summarizer, copy and paste the appropriate command into your terminal.
cpanm
cpanm Text::Summarizer
CPAN shell
perl -MCPAN -e shell
install Text::Summarizer
For more information on module installation, please visit the detailed CPAN module installation guide.