
vocabulary -- extract vocabularies from Penn treebank files

vocabulary [-NT ntfile] [-POS posfile] [-word wordfile] [-count] [-binarized] [-verbose] file1 [file2...]
file1, file2, etc. are the names of Penn treebank files. If none are specified, input is read from STDIN.

-NT ntfile
    Write the non-terminal node vocabulary to ntfile.
-POS posfile
    Write the part-of-speech vocabulary to posfile.
-word wordfile
    Write the word vocabulary to wordfile.
-count
    Print the frequency counts for each of the categories.
-binarized
    The input files are in binarized format.
-verbose
    Print filenames as they are processed.

Given a list of Penn treebank files, this script extracts the words, parts of speech, and non-terminal node names and emits each in a separate file in order of frequency.
Note that giving a "-" argument for any of ntfile, posfile, or wordfile causes the results to be written to STDOUT.
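
The extraction described above can be illustrated with a short Python sketch. This is not the script's own code: the function names extract_vocabularies and format_vocabulary are hypothetical, the input is assumed to be standard bracketed treebank text such as "( (S (NP (DT the) (NN dog)) ... ) )", and the output ordering (most frequent first, ties broken alphabetically) and the tab-separated count format are assumptions. The -binarized format is not handled.

```python
import re
from collections import Counter

# A token is an opening bracket, a closing bracket, or a run of other
# non-whitespace characters (a node label or a word).
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def extract_vocabularies(text):
    """Return (non-terminal, part-of-speech, word) Counters for bracketed trees."""
    nts, tags, words = Counter(), Counter(), Counter()
    stack = []            # one [label, is_preterminal] entry per open bracket
    expect_label = False  # True immediately after an opening bracket
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([None, False])
            expect_label = True
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            label, is_preterminal = stack.pop()
            # Nodes that dominate other nodes are non-terminals; the
            # unlabeled wrapper bracket around each tree is skipped.
            if label is not None and not is_preterminal:
                nts[label] += 1
            expect_label = False
        elif expect_label:
            stack[-1][0] = tok  # first token after "(" is the node label
            expect_label = False
        else:
            # Any other token is a word; its parent node is a POS tag.
            if not stack or stack[-1][0] is None:
                raise ValueError("word %r has no part-of-speech node" % tok)
            stack[-1][1] = True
            tags[stack[-1][0]] += 1
            words[tok] += 1
    if stack:
        raise ValueError("unbalanced opening bracket")
    return nts, tags, words


def format_vocabulary(counter, show_counts=False):
    """List items by descending frequency (assumed order), ties alphabetical."""
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return ["%s\t%d" % (item, n) if show_counts else item for item, n in ordered]


SAMPLE = "( (S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT the) (NN cat)))) )"
nts, tags, words = extract_vocabularies(SAMPLE)
for name, vocab in (("NT", nts), ("POS", tags), ("word", words)):
    print(name, format_vocabulary(vocab, show_counts=True))
```

In this sketch, empty elements such as (-NONE- *T*-1) are counted like any other part of speech and word; whether the script filters them is not stated here.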

W.P. McNeill <billmcn@ssli.ee.washington.edu>