NAME

Treex::Block::W2A::EN::FixTokenization - fix some issues in output of tokenizer

VERSION

version 0.13095

DESCRIPTION

This block merges some abbreviations that contain periods into a single token. For example, "e. g." is one token in the Penn Treebank (with tag FW). Using only Treex::Block::W2A::EN::Tokenize, we get four tokens (e . g .), which the parser may distribute into different clauses, and that is hard to fix afterwards.
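The merging idea can be illustrated with a minimal, self-contained sketch in plain Perl. It is not the implementation of this block, which operates on a-tree nodes in process_atree. The abbreviation table, the function name merge_abbreviations and the choice to join the pieces without spaces are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical abbreviation table (an assumption, not the block's real list):
# each entry is the token sequence a plain tokenizer would produce.
my @ABBREVIATIONS = (
    [qw(e . g .)],
    [qw(i . e .)],
);

# Merge known abbreviation sequences in a flat list of token forms.
# The merged form is built by joining the pieces without spaces
# (also an assumption made for this sketch).
sub merge_abbreviations {
    my @tokens = @_;
    my @merged;
    my $i = 0;
    TOKEN:
    while ( $i < @tokens ) {
        for my $abbr (@ABBREVIATIONS) {
            my $len = scalar @$abbr;
            next if $i + $len > @tokens;    # not enough tokens left
            my @window = @tokens[ $i .. $i + $len - 1 ];
            if ( lc( join ' ', @window ) eq join( ' ', @$abbr ) ) {
                push @merged, join( '', @window );
                $i += $len;
                next TOKEN;
            }
        }
        # No abbreviation starts here, so keep the token unchanged.
        push @merged, $tokens[$i];
        $i++;
    }
    return @merged;
}

my @fixed = merge_abbreviations(qw(See e . g . the manual .));
print join( ' ', @fixed ), "\n";    # prints "See e.g. the manual ."
```

Merging before parsing means the parser sees the abbreviation as one unit, so its parts cannot end up in different clauses.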

OVERRIDDEN METHODS

from Treex::Core::Block

process_atree

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2009 - 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
