The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
package Treex::Core;
$Treex::Core::VERSION = '2.20150928';
use strict;
use warnings;
use Treex::Core::Document;
use Treex::Core::Node;
use Treex::Core::Bundle;
use Treex::Core::Scenario;

1;

__END__

=pod

=encoding utf8

=head1 NAME

Treex::Core - interface to linguistic structures and processing units in Treex

=head1 VERSION

version 2.20150928

=head1 SYNOPSIS

 use Treex::Core;
 
 my $doc = Treex::Core::Document->new;
 
 my $bundle = $doc->create_bundle;
 my $zone   = $bundle->create_zone('en');
 my $atree  = $zone->create_atree;
 
 my $predicate = $atree->create_child({form=>'loves'});
 
 foreach my $argument (qw(John Mary)) {
   my $child = $atree->create_child( { form=>$argument } );
   $child->set_parent($predicate);
 }
 
 $doc->save('demo.treex');


=head1 DESCRIPTION

C<Treex::Core> is a library of modules for processing linguistic data,
especially tree-shaped syntactic representations of natural language
sentences, both for language analysis and synthesis purposes.

C<Treex::Core> is meant to be as language universal as possible.
It makes only a few assumptions: the language's written form must be
representable by Unicode characters, and it should be possible to segment
texts in such language into sentences (or sentence-like units) and words
(or word-like units).

C<Treex::Core> is tightly coupled with the tree editor TrEd, which
makes browsing the linguistic data structures very comfortable.

C<Treex::Core> uses TrEd's L<Treex::PML> for the memory
representation, as well as for storing the data into *.treex files, using
the XML-based Prague Markup Language.


=head2 Zones parametrized by language codes and selectors

Treex documents can contain parallel texts in two or more languages,
as well as alternative linguistic representations (such as two
dependency parses of a same sentence, resulting from different parsers).
Such contents of the same type are separated by introducing zones.

Zones (classes derived from L<Treex::Core::Zone>) are 
parametrized by language ISO codes, and optionally also by so called 
selectors. Selector can be any string identifying the source or purpose of the 
given piece of data. It can distinguish e.g. reference translation from 
machine-translated text, or the most probable parse of a given sentence from 
the second most probable parse. In Treex data structures, zones are used at 
two levels:

- L<Treex::Core::DocZone> - allows to have multiple texts
  stored in the same document

- L<Treex::Core::BundleZone> - allows to have multiple
  sentences and their representations in each bundle.

As for Treex processing units (scenarios and blocks, see below), each
processing unit either limits itself to a certain zone, or it can be
zone-parametrized (especially in the case of language-universal blocks).

=head2 Data structure units

In Treex, linguistic representations of running texts are organized
in the following hierarchy:

=head3 Documents

The smallest independently storable unit is a document
(L<Treex::Core::Document>).

Technically, each document consists of a set of document zones, and of a
sequence of bundles.

=head3 Document zone

A document can contain one ore more zone 
(L<Treex::Core::DocZone>), each of them containing a text.

=head3 Bundle

A bundle (L<Treex::Core::Bundle>) corresponds to a 
sentence (or a tuple of parallel or alternative sentences) and all its (or 
their) linguistic analyses.

Technically, a bundle contains a set of bundle zones.

=head3 Bundle zone

Bundle zone (L<Treex::Core::Bundle>) contains one sentence 
and at most one its linguistic analysis for each layer of analysis. The 
following layers are currently distinguished:

* a-layer - analytical layer (surface syntax dependency layer) merged with the
  morphological layer, as defined in the Prague Dependency Treebank.

* t-layer - tectogrammatical layer (deep-syntactic dependency)

* p-layer - phrase-structure layer

* n-layer - named entity layer

Each layer representation has a form of a tree, represented by the tree's root node.

=head3 Node

Each node has a parent (unless it is the root) and a set of predefined 
attributes, depending on the layer it belongs to. There is an abstract class 
L<Treex::Core::Node> defining the functionality which is 
common to all types of trees (such as functions for accessing node's parent or 
children). Functionality specific for the individual linguistic layers is 
implemented in the derived classes:

* L<Treex::Core::Node::A>

* L<Treex::Core::Node::T>

* L<Treex::Core::Node::P>

* L<Treex::Core::Node::N>

=head3 Attributes

Nodes contain attribute-value pairs. Some attributes are universal (such as 
identifier), but most of them are specific for a certain layer. Even if node 
instances are regular Moose objects (i.e., blessed hashes), node's attributes 
should be accessed exclusively via predefined accessors.

Attribute values can be plain or further structured using PML data types (e.g. 
sequences), according to the PML schema.


=head2 Processing units

=head3 Block

Blocks (descendants of L<Treex::Core::Block>) are the 
smallest processing units applicable on Treex documents.

=head3 Scenario

Scenarios (instances of L<Treex::Core::Scenario>) are 
sequences of blocks. Blocks from a scenario are applied on a document one 
after another.

=head2 Support for visualizing Treex trees in TrEd

C<Treex::Core> also contains a TrEd extension for browsing .treex files.
The extension itself is only a thin wrapper for the viewing functionality
implemented in L<Treex::Core::TredView>.


=head1 AUTHOR

Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>

Martin Popel <popel@ufal.mff.cuni.cz>

David Mareček <marecek@ufal.mff.cuni.cz>

=head1 COPYRIGHT AND LICENSE

Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.