

NAME

vocabulary -- extract vocabularies from Penn treebank files

SYNOPSIS

vocabulary [-NT ntfile] [-POS posfile] [-word wordfile] [-count] [-binarized] [-verbose] file1 [file2...]

file1, file2, etc. are the names of Penn treebank files. If none are specified, input is read from STDIN.

OPTIONS

NT

Write the non-terminal node vocabulary to ntfile.

POS

Write the part-of-speech vocabulary to posfile.

word

Write the word vocabulary to wordfile.

count

Print the frequency counts for each of the categories.

binarized

The input files are in binarized format.

verbose

Print filenames as they are processed.

DESCRIPTION

Given a list of Penn treebank files, this script extracts the words, parts of speech, and non-terminal node names and emits each in a separate file in order of frequency.
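
The extraction described above can be illustrated with a short Python sketch. It is not the script's actual implementation (the script itself is Perl); it assumes plain bracketed trees such as "(S (NP (DT the) (NN dog)) ...)", treats any "(TAG word)" pair as a part-of-speech/word leaf, and treats any label followed by another bracket as a non-terminal. The function name extract_vocabularies is hypothetical, and sorting from most to least frequent is an assumption, since the text above only says "in order of frequency".

```python
import re
from collections import Counter

# "(TAG word)": a label followed by a single token and a closing bracket.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\s*\)")
# "(LABEL (": a label whose first child is another bracketed constituent.
NONTERMINAL = re.compile(r"\(([^\s()]+)\s*(?=\()")


def extract_vocabularies(text):
    """Count non-terminal labels, POS tags, and words in bracketed trees.

    Hypothetical helper for illustration only. Empty elements such as
    "(-NONE- *T*-1)" are counted like any other leaf.
    """
    nonterminals = Counter(NONTERMINAL.findall(text))
    pos_tags = Counter()
    words = Counter()
    for tag, word in PRETERMINAL.findall(text):
        pos_tags[tag] += 1
        words[word] += 1
    return nonterminals, pos_tags, words


if __name__ == "__main__":
    sample = "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT the) (NN cat))))"
    nts, pos, words = extract_vocabularies(sample)
    for name, counts in (("NT", nts), ("POS", pos), ("word", words)):
        print(name)
        # Assumed ordering: most frequent entries first.
        for entry, count in counts.most_common():
            print(f"  {entry}\t{count}")
```

Printing the count next to each entry mirrors what the -count option is described as doing; the tab-separated layout here is only a guess at a reasonable format.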

Note that giving a "-" argument for any of ntfile, posfile, or wordfile causes the results to be written to STDOUT.
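
The "-" convention is common among command-line tools. As a minimal Python sketch of the idea (not the script's own code, and with a hypothetical helper name open_output), an output argument of "-" is mapped to standard output and anything else is opened as a file:

```python
import sys
from contextlib import contextmanager


@contextmanager
def open_output(path):
    """Yield a writable handle: STDOUT for "-", otherwise the named file.

    Hypothetical helper for illustration only.
    """
    if path == "-":
        # Do not close sys.stdout when the block exits.
        yield sys.stdout
    else:
        with open(path, "w", encoding="utf-8") as handle:
            yield handle


if __name__ == "__main__":
    # Writes one vocabulary entry to standard output.
    with open_output("-") as out:
        out.write("the\n")
```

Handling "-" in one place keeps the rest of a program unaware of whether it is writing to a file or to the terminal.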

AUTHOR

W.P. McNeill <billmcn@ssli.ee.washington.edu>
