Module Version: 0.15

NAME

WebService::GoogleHack - Perl package that ties together all GoogleHack modules (WebService::GoogleHack::Search, WebService::GoogleHack::Spelling, WebService::GoogleHack::Rate, and WebService::GoogleHack::Text) to implement Natural Language Processing techniques that use the World Wide Web as a source of information. Use this package to access all the functionality of GoogleHack.

SYNOPSIS

    use WebService::GoogleHack;

    my $google = WebService::GoogleHack->new;

    #Initializing the object to the contents of the configuration file
    # API Key, GoogleSearch.wsdl file location.

    $google->initConfig("initconfig.txt");

    #Printing the contents of the configuration file
    $google->printConfig();

    #Measure the semantic relatedness between the words "white house" and 
    #"president".

    my $measure=$google->measureSemanticRelatedness1("white house","president");

    print "\nRelatedness measure between white house and president is: ";
    print $measure."\n";

    #Going to search for words that are related to "toyota" and "ford" 
    my @terms=();
    push(@terms,"toyota");
    push(@terms,"ford");

    #The parameters are the search terms, the number of web page results to
    #look at, the frequency cutoff, the number of iterations, the output file,
    #and "true", which indicates that the diagnostic data should be stored in
    #the file "results.txt"

    my $results=$google->Algorithm1(\@terms,10,25,1,"results.txt","true");

    print $results;

DESCRIPTION

WebService::GoogleHack is a Perl package that interacts with the Google API and implements basic functions that let the user query Google and retrieve results in an easy-to-use format. GoogleHack also implements and extends a number of Natural Language Processing techniques by using the World Wide Web as a source of information.

Some of the features are:

    * Issue queries to Google (WebService::GoogleHack, WebService::GoogleHack::Search)

    * Retrieve Spelling suggestions from Google (WebService::GoogleHack, WebService::GoogleHack::Spelling)

    * Find the Pointwise Mutual Information (PMI) measure between two words (WebService::GoogleHack, WebService::GoogleHack::Rate)

    * Given a paragraph, determine whether it has a positive or negative semantic orientation. (WebService::GoogleHack, WebService::GoogleHack::Rate)

    * Given a set of words along with a positively oriented word such as "excellent" and a negatively oriented
      word such as "poor", determine whether each word has a positive or negative semantic orientation. (WebService::GoogleHack, WebService::GoogleHack::Rate)

    * Given a set of phrases along with a positively oriented word such as "excellent" and a negatively oriented word
      such as "poor", predict whether the given phrases are positive or negative in sentiment. (WebService::GoogleHack, WebService::GoogleHack::Rate)

    * Given two or more words, find a set of related words. (WebService::GoogleHack)

Related Modules: GoogleHack uses four sub-modules to interact with Google and process text. Though the functions in these modules can be accessed directly, it is advised to use the GoogleHack module's interface to access the functions in the sub-modules.

WebService::GoogleHack::Text - GoogleHack uses this module to manipulate text retrieved from the web (get n-word sentences and words, parse HTML, etc.).

WebService::GoogleHack::Search - GoogleHack uses this module to query Google.

WebService::GoogleHack::Rate - GoogleHack uses this module to implement some of the Sentiment Classification algorithms.

WebService::GoogleHack::Spelling - GoogleHack uses this module to query Google for spelling suggestions.

REQUIRED PACKAGES

1) Google API (http://www.google.com/apis/)

2) Brill Tagger (only needed for the sentiment classification functions)

    Installation file and instructions:

    http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z

    Instructions are also available in the GoogleHack INSTALL file.

3) Required Perl modules

    SOAP::Lite;

    Set::Scalar;

    Text::English;

    LWP::Simple;

    URI::URL;

    LWP::UserAgent;

    HTML::LinkExtor;
 
    Data::Dumper;

FUNCTIONS

GENERAL FUNCTIONALITY

These are the GoogleHack functions that are common to all sorts of operations. They are used to create and initialize GoogleHack objects.

__METHOD__->new()

Purpose: This function creates an object of type GoogleHack and returns a blessed reference.

returns: A blessed reference to a GoogleHack object.

__METHOD__->initConfig(configLocation)

Purpose: This function is used to read the configuration file containing information such as the Google-API key, the base directory path, and the path to the Brill Tagger. The configuration file is in the WebService/GoogleHack/Datafiles directory.

This function must be called in order to initialize the GoogleHack object.

Valid arguments are :

returns : Returns an object which contains the parsed information.

__METHOD__->printConfig()

Purpose: This function is used to print the information read from the configuration file.

No arguments.

SETS OF RELATED WORDS

This set of functions deals with the problem of finding sets of related words by using the World Wide Web as a source of information.

__METHOD__->measureSemanticRelatedness1(searchString1,searchString2)

Purpose: This function is used to measure the relatedness between two words.

Formula used: log(hits(w1)) + log(hits(w2)) - 2 * log(hits(w1w2))

Valid arguments are :

Returns: Returns the object containing the relatedness measure.
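
The formula above can be evaluated directly once the three hit counts are known. The following is a minimal sketch, not the module's implementation: the helper name relatedness1 and the sample counts are hypothetical, the natural logarithm is assumed because the documentation does not state a base, and returning nothing for a zero count is a choice made only for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WebService::GoogleHack.
# Takes the hit counts for w1, for w2, and for the joint query "w1 w2".
sub relatedness1 {
    my ($hits_w1, $hits_w2, $hits_both) = @_;

    # log(0) is undefined, so give up when any count is zero or negative.
    return if grep { $_ <= 0 } ($hits_w1, $hits_w2, $hits_both);

    # Perl's log() is the natural logarithm (the base is an assumption).
    return log($hits_w1) + log($hits_w2) - 2 * log($hits_both);
}

# Made-up hit counts, for illustration only.
my $score = relatedness1(1000, 2000, 500);
printf "relatedness1 = %.4f\n", $score;
```

Under this formula, two words that always occur together (all three counts equal) score 0, and the score grows as joint occurrences become rarer relative to the individual counts.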

__METHOD__->measureSemanticRelatedness2(searchString1,searchString2)

Purpose: This function is used to measure the relatedness between two words.

Formula used: log(hits(w1w2) / (hits(w1) + hits(w2)))

Valid arguments are :

Returns: Returns the object containing the relatedness measure.

__METHOD__->measureSemanticRelatedness3(searchString1,searchString2)

Purpose: This function is used to measure the relatedness between two words.

Formula used: log( hits(w1w2) / (hits(w1) * hits(w2)))

Valid arguments are :

Returns: Returns the object containing the relatedness measure.
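
The second and third formulas differ only in the denominator: a sum of the individual hit counts versus their product. The sketch below computes both from the same counts so the difference is easy to see. It reads w1, w2 and w1w2 in the second formula as hit counts, assumes the natural logarithm, and uses hypothetical helper names and made-up counts; it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of WebService::GoogleHack.

# Second formula: log( hits(w1w2) / (hits(w1) + hits(w2)) )
sub relatedness2 {
    my ($hits_w1, $hits_w2, $hits_both) = @_;
    return if $hits_both <= 0 || $hits_w1 + $hits_w2 <= 0;
    return log($hits_both / ($hits_w1 + $hits_w2));
}

# Third formula: log( hits(w1w2) / (hits(w1) * hits(w2)) )
sub relatedness3 {
    my ($hits_w1, $hits_w2, $hits_both) = @_;
    return if $hits_both <= 0 || $hits_w1 * $hits_w2 <= 0;
    return log($hits_both / ($hits_w1 * $hits_w2));
}

# Made-up hit counts, for illustration only.
my @counts = (1000, 2000, 500);
printf "formula 2: %.4f\n", relatedness2(@counts);
printf "formula 3: %.4f\n", relatedness3(@counts);
```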

__METHOD__->Algorithm1(searchTerms,N,C,I,trace, html)

Purpose: Given two or more words, this function tries to find a set of related words. This is the Google-Hack baseline algorithm 1. For example, given the two words gun and pistol, an expanded set of related words might be

{laser, paintball, case, bullet, machine gun, rifle} etc.

  The features of the Initial Approach (Algorithm 1) are given below:

                  - Frequency Based
 
                  - Accepts only 2 terms

                  - Results also contain only unigrams
                
                  - A frequency cutoff is used
                
                  - Stop words and web stop words are removed.

returns : Returns an html or text version of the results.
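
Three of the features listed above (frequency counting, stop word removal, and the frequency cutoff) can be illustrated on a single piece of text. This is only a sketch of the general idea, not the module's code: the helper name frequent_unigrams, the tokenizing rule (runs of ASCII letters), the sample text, and the tiny stop list are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WebService::GoogleHack.
# Counts unigrams in $text, skips stop words, and keeps only the words
# whose frequency is at least $cutoff.
sub frequent_unigrams {
    my ($text, $cutoff, $stop_words) = @_;
    my %is_stop = map { (lc($_) => 1) } @$stop_words;

    my %freq;
    for my $word (lc($text) =~ /[a-z]+/g) {
        next if $is_stop{$word};
        $freq{$word}++;
    }

    my %kept = map { $_ => $freq{$_} } grep { $freq{$_} >= $cutoff } keys %freq;
    return \%kept;
}

# Made-up page text and a tiny made-up stop list.
my $text = "The rifle and the pistol. A rifle case, a pistol case, and a bullet.";
my $kept = frequent_unigrams($text, 2, [qw(the a and)]);
print "$_ => $kept->{$_}\n" for sort keys %$kept;
```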

__METHOD__->Algorithm2(searchTerms,N,C,BC, I,S,SC,trace, html)

Purpose: Given two or more words, this function tries to find a set of related words. This is the Google-Hack algorithm 2.

   The features of the Second Approach (Algorithm 2) are given below:

                  - Accepts more than 2 terms

                  - Has 3 relatedness scores
                
                  - Accepts unigrams and 2-word collocations as input
                
                  - Results also contain 2-word collocations
                
                  - A score cutoff is also included along with frequency cutoff
                
                  - A bigram cutoff is also included.

                  - Stop words and web stop words are removed.

                  - Stop phrases and web stop phrases are removed.

returns : Returns an html or text version of the results.
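
The 2-word collocations and the bigram cutoff mentioned in the feature list can be pictured as counting adjacent word pairs and discarding the rare ones. The sketch below shows that idea only; the helper name frequent_bigrams, the tokenizing rule, and the sample text are assumptions, and it deliberately ignores sentence boundaries and stop phrases for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WebService::GoogleHack.
# Counts adjacent word pairs (bigrams) in $text and keeps those that
# occur at least $bigram_cutoff times.
sub frequent_bigrams {
    my ($text, $bigram_cutoff) = @_;
    my @words = lc($text) =~ /[a-z]+/g;

    my %freq;
    for my $i (0 .. $#words - 1) {
        my $bigram = join ' ', @words[$i, $i + 1];
        $freq{$bigram}++;
    }

    my %kept = map { $_ => $freq{$_} }
               grep { $freq{$_} >= $bigram_cutoff } keys %freq;
    return \%kept;
}

# Made-up text, for illustration only.
my $kept = frequent_bigrams("machine gun fire and machine gun noise", 2);
print "$_ => $kept->{$_}\n" for sort keys %$kept;
```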

__METHOD__->getWordsInPage(searchTerms,N,C,I, NT,BI,trace)

Purpose: Given a set of search terms, this function retrieves the resulting URLs from Google, follows those links, and retrieves the text from each page. Once all the text is collected, the function finds the intersecting or co-occurring words between the top N results. This function is mainly used by the function Algorithm1.

Valid arguments are :

returns : Returns nothing.
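
The intersection step described above, finding the words shared by the top N pages, can be sketched independently of any web access. The helper name common_words, the tokenizing rule, and the sample page texts below are assumptions; the sketch takes already-downloaded page text as input and is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WebService::GoogleHack.
# Given the text of several pages, returns the words that occur in
# every page, mapped to their total frequency across all pages.
sub common_words {
    my @pages = @_;
    my (%total, %pages_with);

    for my $page (@pages) {
        my %seen;
        for my $word (lc($page) =~ /[a-z]+/g) {
            $total{$word}++;
            # Count each page at most once per word.
            $pages_with{$word}++ unless $seen{$word}++;
        }
    }

    my %common = map { $_ => $total{$_} }
                 grep { $pages_with{$_} == scalar @pages } keys %total;
    return \%common;
}

# Made-up page texts, for illustration only.
my $common = common_words("gun rifle bullet", "rifle bullet rifle", "bullet rifle case");
print "$_ => $common->{$_}\n" for sort keys %$common;
```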

SENTIMENT CLASSIFICATION

This set of functions deals with sentiment classification. The functions include PMI-IR and some similar functions that try to classify whether a given word or phrase is positively or negatively oriented in its sentiment.

__METHOD__->predictSemanticOrientation(rfile,posInf,negInf,trace)

Purpose: This function tries to predict the semantic orientation of a paragraph of text. The semantic orientation of a paragraph is calculated according to the paper "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews" by Peter Turney. The difference between Peter Turney's implementation of the PMI-IR algorithm and the implementation in Google-Hack is small, but very important.

Peter Turney's implementation of the PMI-IR algorithm uses the AltaVista search engine, whereas Google-Hack uses Google. More importantly, AltaVista provides a "near" operator, which the original PMI-IR uses, but Google does not. Hence, Google-Hack uses the "AND" operator instead.

Valid arguments are :

Returns: The PMI measure and the prediction, which is 0 or 1.
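
In Turney's paper, a review is classified by averaging the semantic orientation of its phrases and checking the sign of the average. The sketch below shows that final step only. The helper name predict_orientation and the sample scores are hypothetical, and mapping a positive average to 1 and anything else to 0 is an assumption, since the documentation does not say which value stands for which orientation.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper, not part of WebService::GoogleHack.
# Takes per-phrase semantic orientation scores and returns the average
# together with a 0/1 prediction (1 is assumed to mean "positive").
sub predict_orientation {
    my @phrase_scores = @_;
    return if !@phrase_scores;

    my $average    = sum(@phrase_scores) / scalar @phrase_scores;
    my $prediction = $average > 0 ? 1 : 0;
    return ($average, $prediction);
}

# Made-up phrase scores, for illustration only.
my ($average, $prediction) = predict_orientation(1.5, -0.5, 0.5);
print "average = $average, prediction = $prediction\n";
```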

__METHOD__->predictWordSentiment(infile,posInf,negInf,html,trace)

Purpose: Given a file containing text, this function tries to find the positive and negative words. The formula used to calculate the sentiment of a word is based on the PMI-IR formula given in Peter Turney's paper.

    log2( (hits(word AND "excellent") * hits("poor")) / (hits(word AND "poor") * hits("excellent")) )

For more information, refer to the paper "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews" by Peter Turney.

returns : Returns an html or text version of the results.
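
The PMI-IR ratio above reduces to a few lines of arithmetic once the four hit counts are available. The sketch below is an illustration under stated assumptions, not the module's code: the helper names and the sample counts are hypothetical, and returning nothing when a count is zero is a simplification made here to avoid dividing by zero or taking log of zero.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of WebService::GoogleHack.

# Base-2 logarithm built from Perl's natural log().
sub log_base2 {
    my ($x) = @_;
    return log($x) / log(2);
}

# Arguments: hits(word AND "excellent"), hits(word AND "poor"),
#            hits("excellent"), hits("poor").
sub semantic_orientation {
    my ($hits_with_pos, $hits_with_neg, $hits_pos, $hits_neg) = @_;
    return if grep { $_ <= 0 } ($hits_with_pos, $hits_with_neg, $hits_pos, $hits_neg);

    return log_base2( ($hits_with_pos * $hits_neg) / ($hits_with_neg * $hits_pos) );
}

# Made-up hit counts, for illustration only.
my $so = semantic_orientation(800, 100, 5000, 5000);
printf "semantic orientation = %.2f\n", $so;
```

A positive value means the word co-occurs relatively more often with "excellent" than with "poor"; a negative value means the opposite.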

__METHOD__->predictPhraseSentiment(infile,posInf,negInf,html,trace)

Purpose: Given a file containing text, this function tries to find the positive and negative phrases. The formula used to calculate the sentiment of a phrase is based on the PMI-IR formula given in Peter Turney's paper.

    log2( (hits(phrase AND "excellent") * hits("poor")) / (hits(phrase AND "poor") * hits("excellent")) )

For more information, refer to the paper "Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews" by Peter Turney.

returns : Returns an html or text version of the results.

SPELLING SUGGESTION

__METHOD__->phraseSpelling(searchString)

Purpose: This function is used to retrieve a spelling suggestion from Google.

Valid arguments are :

Returns: Returns the suggested spelling if there is one; otherwise returns "No Spelling Suggested".

GOOGLE SEARCH

Use this function to issue queries to Google.

__METHOD__->Search(searchString,num_results)

Purpose: This function is used to query Google.

Valid arguments are :

Returns: Returns a GoogleHack object containing the search results.

MANIPULATE WEB TEXT

This set of functions deals with retrieving text from the World Wide Web. The user can use these functions to retrieve sentences, words, or phrases that occur in web pages (in snippets, cached web pages, links, etc.).

__METHOD__->getSearchSnippetWords(searchString,numResults,trace_file)

Purpose: Given a search word, this function tries to retrieve the text surrounding the search word in the retrieved snippets.

Valid arguments are :

returns : Returns an object which contains the parsed information

__METHOD__->getCachedSurroundingWords(searchString,trace_file)

Purpose: Given a search word, this function tries to retrieve the text surrounding the search word in the retrieved cached web pages. It performs the search and passes the search results to the WebService::GoogleHack::Text::getCachedSurroundingWords function.

Valid arguments are :

returns : Returns a hash with the keys being the words and the values being the frequency of occurrence.
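
The kind of hash described above, words near the search word mapped to how often they occur, can be built from plain text as follows. This is a sketch only: the helper name surrounding_words, the fixed window size, the single-word search term, and the sample text are all assumptions, and the module may define "surrounding" differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WebService::GoogleHack.
# Counts the words that appear within $window positions of each
# occurrence of the single word $target in $text.
sub surrounding_words {
    my ($text, $target, $window) = @_;
    my @words = lc($text) =~ /[a-z]+/g;
    $target = lc $target;

    my %freq;
    for my $i (0 .. $#words) {
        next unless $words[$i] eq $target;

        # Clamp the window to the bounds of the word list.
        my $from = $i - $window < 0       ? 0       : $i - $window;
        my $to   = $i + $window > $#words ? $#words : $i + $window;

        for my $j ($from .. $to) {
            next if $j == $i;
            $freq{ $words[$j] }++;
        }
    }
    return \%freq;
}

# Made-up text, for illustration only.
my $near = surrounding_words("White House staff met at the White House lawn", "house", 1);
print "$_ => $near->{$_}\n" for sort keys %$near;
```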

__METHOD__->getSearchSnippetSentences(searchString,trace_file)

Purpose: Given a search word, this function tries to retrieve the sentences in the snippet. It performs the search and passes the search results to the WebService::GoogleHack::Text::getSnippetSentences function.

Valid arguments are :

returns : Returns an array of strings.

__METHOD__->getCachedSurroundingSentences(searchString,trace_file)

Purpose: Given a search word, this function tries to retrieve the sentences in the cached web page.

Valid arguments are :

returns : Returns a hash which contains the parsed sentences as values and the key being the web URL.

__METHOD__->getSearchCommonWords(searchString1,searchString2,trace_file,stemmer)

Purpose: Given two search words, this function tries to retrieve the common text/words surrounding the search strings in the retrieved snippets.

Valid arguments are :

returns : Returns a hash which contains the intersecting words.

__METHOD__->getWordClustersInSnippets(searchString1,iterations,number,trace_file)

Purpose: Given a search string, this function retrieves the top-frequency words, searches on those words, and builds a list of words that can be regarded as a cluster of related words.

Valid arguments are :

returns : Returns a set of words as a hash.

__METHOD__->getClustersInSnippets(searchString1,searchString2,iterations,number,trace_file)

Purpose: Given two search strings, this function retrieves the snippets for each string, finds the intersection of words, and then repeats the search with the intersection of words.

Valid arguments are :

returns : Returns a hash which contains the intersecting words as keys and the values being the frequency of occurrence.

__METHOD__->getText(searchString,iterations,number,path_to_data_directory)

Purpose: Given a search string, this function retrieves the resulting URLs from Google, follows those links, and retrieves the text from each page. The function then cleans up the text and stores it in a file along with the URL and the date and time of retrieval. The file is stored under the name of the search string.

Valid arguments are :

returns : Returns nothing.

AUTHOR

Pratheepan Raveendranathan, <rave0029@d.umn.edu>

Ted Pedersen, <tpederse@d.umn.edu>

BUGS

SEE ALSO

WebService::GoogleHack home page - http://google-hack.sourceforge.net

Pratheepan Raveendranathan - http://www.d.umn.edu/~rave0029/research

Ted Pedersen - www.d.umn.edu./~tpederse

Google-Hack Mailing List <google-hack-users@lists.sourceforge.net>

COPYRIGHT AND LICENSE

Copyright (c) 2005 by Pratheepan Raveendranathan, Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
