NAME

Lingua::EN::StopWordList - A sorted list of English stop words

Synopsis

        use Lingua::EN::StopWordList;

        my($ara_ref) = Lingua::EN::StopWordList -> new -> words;

Here's a complete program:

        use strict;
        use warnings;
        use Lingua::EN::StopWordList;

        my($count) = 0;

        print map{"@{[++$count]}: $_\n"} @{Lingua::EN::StopWordList -> new -> words};

Description

Lingua::EN::StopWordList is a pure Perl module.

It returns a sorted arrayref of 659 English stop words.

Constructor and initialization

new(...) returns an object of type Lingua::EN::StopWordList.

This is the class's constructor.

Usage: Lingua::EN::StopWordList -> new.

Distributions

This module is available as a Unix-style distro (*.tgz).

Install Lingua::EN::StopWordList as you would for any Perl module:

Run:

        cpanm Lingua::EN::StopWordList

or run:

        sudo cpan Lingua::EN::StopWordList

or unpack the distro, and then run one of:

        perl Build.PL
        ./Build
        ./Build test
        ./Build install

or

        perl Makefile.PL
        make (or dmake)
        make test
        make install

See http://savage.net.au/Perl-modules.html for details.

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.

Methods

new()

See "Constructor and initialization".

words()

Returns the sorted arrayref of English stop words.
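
A common use of the list is stripping stop words from text. The following is a minimal sketch, not part of the module: the helper names make_lookup() and remove_stop_words() are hypothetical, and the tokenising regexp is a deliberately simple assumption. The module is loaded at run time only so that the sketch still compiles where it is not installed.

```perl
use strict;
use warnings;

# Build a hash from an arrayref of stop words, for fast membership tests.
# (Hypothetical helper, not provided by Lingua::EN::StopWordList.)
sub make_lookup {
    my ($words) = @_;
    my %is_stop = map { lc($_) => 1 } @$words;
    return \%is_stop;
}

# Return the words of $text which are not stop words, in their original order.
# Tokenising on letters and apostrophes is a simplifying assumption.
sub remove_stop_words {
    my ($text, $is_stop) = @_;
    return grep { !$is_stop->{lc($_)} } ($text =~ /[A-Za-z']+/g);
}

# Only run the demonstration if the module is actually installed.
if (eval { require Lingua::EN::StopWordList; 1 }) {
    my $is_stop = make_lookup(Lingua::EN::StopWordList->new->words);

    print join(' ', remove_stop_words('The cat sat on the mat', $is_stop)), "\n";
}
```

A hash is used because each word of the text then costs a single lookup, rather than a scan of the whole list.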

FAQ

Is there a definitive list of stop words?

No, there is no such thing as a definitive list. For an important discussion, covering topics such as 'phrase search', see the Wikipedia discussion of word lists.

Where does the list come from?

I downloaded it from the bottom of this page: http://www.translatum.gr/forum/index.php?topic=2476.0. It contains 659 words.

Are there other lists available?

Sure. Try http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop. This list contains 570 words.

Another good place to look is http://www.ranks.nl/resources/stopwords.html, but its English list only contains 174 words. Since Lingua::StopWords (below) also has 174 words in its English list, perhaps this is where that module got its words from. Lastly, that site has stop word lists for a whole range of languages.

Alternatively, just Google for references to various lists. Note, however, that these lists are normally very short.
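
If you download one of these other lists, you may want to see how it overlaps with this module's list. The following is a rough sketch under stated assumptions: compare_lists() and read_word_file() are hypothetical helpers, the other list is assumed to be a local file with one word per line, and the file name is passed on the command line.

```perl
use strict;
use warnings;

# Split two word lists into the words found only in the first,
# only in the second, and in both. Comparison ignores case.
sub compare_lists {
    my ($first, $second) = @_;
    my %in_first  = map { lc($_) => 1 } @$first;
    my %in_second = map { lc($_) => 1 } @$second;

    return {
        only_first  => [ sort grep { !$in_second{$_} } keys %in_first ],
        only_second => [ sort grep { !$in_first{$_} } keys %in_second ],
        both        => [ sort grep { $in_second{$_} } keys %in_first ],
    };
}

# Read a file assumed to hold one word per line,
# skipping blank lines and lines starting with '#'.
sub read_word_file {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    my @words;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '' || $line =~ /^#/;
        push @words, $line;
    }
    close $fh;
    return \@words;
}

# Usage (file name is an example only): perl compare.pl english.stop
if (@ARGV && eval { require Lingua::EN::StopWordList; 1 }) {
    my $result = compare_lists(Lingua::EN::StopWordList->new->words, read_word_file($ARGV[0]));

    printf "%-12s %d\n", $_, scalar @{ $result->{$_} } for qw(only_first only_second both);
}
```

For a ready-made comparison across several modules, see Benchmark::Featureset::StopwordLists under "See Also".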

Why another Perl module for stop words?

Lingua::StopWords only has a short list of words (174). And its bug list goes back 3 years.

Lingua::EN::StopWords only has a short list of words (227). Also, this module is part of Lingua::EN::Segmenter, whose documentation is poor. Even the exact basis of how it splits text is not documented. Lastly, its bug list goes back 6 years.

I could have offered to take over maintenance of either or both of those modules, but there are problems:

o Lingua::StopWords

It ships with a set of sub-modules, with names like Lingua::StopWords::EN, but I'm not in a position to support its other languages if I put my module's English list into it.

Nevertheless, the fact that it supports 13 languages is definitely something in favour of Lingua::StopWords.

o Lingua::EN::StopWords

This is part of text processing stuff which I don't want to get involved with. Also, it has a long list of pre-reqs (not listed on MetaCPAN until you view the makefile), which may well suit the purposes of Lingua::EN::Segmenter, but is overkill for just a stop word list.

Several other Perl modules, written for various purposes, either use one of the above, or have their own very short (as always) lists.

How can I help?

If you translate the list of stop words in this module into your favourite language and email it to me, I will include your words in the next release.

It all depends on whether you think this new list is somehow 'better' than the lists in pre-existing modules. I cannot make that decision on your behalf.

See Also

Benchmark::Featureset::StopwordLists.

This module includes a comparison of various stopword list modules.

See http://savage.net.au/Perl-modules/html/stopwordlists.report.html.

Lingua::EN::StopWords.

Lingua::StopWords.

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=Lingua::EN::StopWordList.

Repository

https://github.com/ronsavage/Lingua-EN-StopWordList.git.

Author

Lingua::EN::StopWordList was written by Ron Savage <ron@savage.net.au> in 2012.

Homepage: http://savage.net.au/index.html.

Copyright

Australian copyright (c) 2012 Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html