NAME

Treex::Block::W2A::EN::FixTokenization - fix some issues in output of tokenizer

VERSION

version 0.08171

DESCRIPTION

This block merges some abbreviations that contain periods into a single token. For example, "e. g." is one token in the Penn Treebank (tagged FW). Using only SEnglishW_to_SEnglishM::Penn_style_tokenization, we get four tokens: e . g . The parser may then distribute these tokens across different clauses, which is hard to fix afterwards.
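
The merging step described above can be illustrated with a minimal standalone sketch. It is not the module's actual code and does not use the Treex API: it works on a plain list of token strings, and the %MERGE table, the merged forms, and the merge_abbreviations function are all hypothetical names chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical table of abbreviations to merge (an assumption for this
# sketch). Keys are the space-joined four-token sequences produced by a
# tokenizer that splits off every period; values are the merged tokens.
my %MERGE = (
    'e . g .' => 'e.g.',
    'i . e .' => 'i.e.',
);

# Scan the token list left to right and replace every known four-token
# sequence with its single merged token. Matching is case-sensitive.
sub merge_abbreviations {
    my @tokens = @_;
    my @out;
    my $i = 0;
    while ( $i < @tokens ) {
        # Only test a window if four tokens remain from position $i.
        if ( $i + 3 < @tokens ) {
            my $key = join ' ', @tokens[ $i .. $i + 3 ];
            if ( exists $MERGE{$key} ) {
                push @out, $MERGE{$key};
                $i += 4;    # skip the four tokens just merged
                next;
            }
        }
        push @out, $tokens[$i];
        $i++;
    }
    return @out;
}

my @fixed = merge_abbreviations(
    'Use', 'fruit', ',', 'e', '.', 'g', '.', 'apples', '.'
);
print join( ' ', @fixed ), "\n";    # prints: Use fruit , e.g. apples .
```

Merging before parsing means the parser sees the abbreviation as one unit, so it cannot attach its pieces to different clauses. In the real block the same idea would presumably be applied to the nodes of an a-tree inside process_atree rather than to a flat list of strings.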

OVERRIDDEN METHODS

from Treex::Core::Block

process_atree

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2009-2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
