<html><head><title>Lingua::Interset::Atom</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >

<style type="text/css">
 <!--/*--><![CDATA[/*><!--*/
BODY {
  background: white;
  color: black;
  font-family: arial,sans-serif;
  margin: 0;
  padding: 1ex;
}

A:link, A:visited {
  background: transparent;
  color: #006699;
}

A[href="#POD_ERRORS"] {
  background: transparent;
  color: #FF0000;
}

DIV {
  border-width: 0;
}

DT {
  margin-top: 1em;
  margin-left: 1em;
}

.pod { margin-right: 20ex; }

.pod PRE     {
  background: #eeeeee;
  border: 1px solid #888888;
  color: black;
  padding: 1em;
  white-space: pre;
}

.pod H1      {
  background: transparent;
  color: #006699;
  font-size: large;
}

.pod H1 A { text-decoration: none; }
.pod H2 A { text-decoration: none; }
.pod H3 A { text-decoration: none; }
.pod H4 A { text-decoration: none; }

.pod H2      {
  background: transparent;
  color: #006699;
  font-size: medium;
}

.pod H3      {
  background: transparent;
  color: #006699;
  font-size: medium;
  font-style: italic;
}

.pod H4      {
  background: transparent;
  color: #006699;
  font-size: medium;
  font-weight: normal;
}

.pod IMG     {
  vertical-align: top;
}

.pod .toc A  {
  text-decoration: none;
}

.pod .toc LI {
  line-height: 1.2em;
  list-style-type: none;
}

  /*]]>*/-->
</style>


</head>
<body class='pod'>
<!--
  generated by Pod::Simple::HTML v3.28,
  using Pod::Simple::PullParser v3.28,
  under Perl v5.018002 at Tue Jul 11 10:46:51 2017 GMT.

 If you want to change this HTML document, you probably shouldn't do that
   by changing it directly.  Instead, see about changing the calling options
   to Pod::Simple::HTML, and/or subclassing Pod::Simple::HTML,
   then reconverting this document from the Pod source.
   When in doubt, email the author of Pod::Simple::HTML for advice.
   See 'perldoc Pod::Simple::HTML' for more info.

-->

<!-- start doc -->
<a name='___top' class='dummyTopAnchor' ></a>

<div class='indexgroup'>
<ul   class='indexList indexList1'>
  <li class='indexItem indexItem1'><a href='#NAME'>NAME</a>
  <li class='indexItem indexItem1'><a href='#VERSION'>VERSION</a>
  <li class='indexItem indexItem1'><a href='#SYNOPSIS'>SYNOPSIS</a>
  <li class='indexItem indexItem1'><a href='#DESCRIPTION'>DESCRIPTION</a>
  <li class='indexItem indexItem1'><a href='#ATTRIBUTES'>ATTRIBUTES</a>
  <ul   class='indexList indexList2'>
    <li class='indexItem indexItem2'><a href='#surfeature'>surfeature</a>
    <li class='indexItem indexItem2'><a href='#decode_map'>decode_map</a>
    <li class='indexItem indexItem2'><a href='#encode_map'>encode_map</a>
    <li class='indexItem indexItem2'><a href='#tagset'>tagset</a>
  </ul>
  <li class='indexItem indexItem1'><a href='#METHODS'>METHODS</a>
  <ul   class='indexList indexList2'>
    <li class='indexItem indexItem2'><a href='#decode()'>decode()</a>
    <li class='indexItem indexItem2'><a href='#decode_and_merge_hard()'>decode_and_merge_hard()</a>
    <li class='indexItem indexItem2'><a href='#decode_and_merge_soft()'>decode_and_merge_soft()</a>
    <li class='indexItem indexItem2'><a href='#encode()'>encode()</a>
    <li class='indexItem indexItem2'><a href='#list()'>list()</a>
    <li class='indexItem indexItem2'><a href='#merge_atoms()'>merge_atoms()</a>
  </ul>
  <li class='indexItem indexItem1'><a href='#SEE_ALSO'>SEE ALSO</a>
  <li class='indexItem indexItem1'><a href='#AUTHOR'>AUTHOR</a>
  <li class='indexItem indexItem1'><a href='#COPYRIGHT_AND_LICENSE'>COPYRIGHT AND LICENSE</a>
</ul>
</div>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="NAME"
>NAME</a></h1>

<p>Lingua::Interset::Atom - Atomic driver for a surface feature.</p>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="VERSION"
>VERSION</a></h1>

<p>version 3.006</p>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="SYNOPSIS"
>SYNOPSIS</a></h1>

<pre>  use Lingua::Interset::Atom;

  my $atom = Lingua::Interset::Atom-&#62;new
  (
      &#39;surfeature&#39;    =&#62; &#39;gender&#39;,
      &#39;decode_map&#39; =&#62;

          { &#39;M&#39; =&#62; [&#39;gender&#39; =&#62; &#39;masc&#39;, &#39;animateness&#39; =&#62; &#39;anim&#39;],
            &#39;I&#39; =&#62; [&#39;gender&#39; =&#62; &#39;masc&#39;, &#39;animateness&#39; =&#62; &#39;inan&#39;],
            &#39;F&#39; =&#62; [&#39;gender&#39; =&#62; &#39;fem&#39;],
            &#39;N&#39; =&#62; [&#39;gender&#39; =&#62; &#39;neut&#39;] },

      &#39;encode_map&#39; =&#62;

          { &#39;gender&#39; =&#62; { &#39;masc&#39; =&#62; { &#39;animateness&#39; =&#62; { &#39;inan&#39; =&#62; &#39;I&#39;,
                                                         &#39;@&#39;    =&#62; &#39;M&#39; }},
                          &#39;fem&#39;  =&#62; &#39;F&#39;,
                          &#39;@&#39;    =&#62; &#39;N&#39; }}
  );</pre>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="DESCRIPTION"
>DESCRIPTION</a></h1>

<p>Atom is a special case of a tagset driver. As the name suggests, the surface tags are considered atomic, i.e. indivisible. It provides an environment for easy mapping between surface strings and Interset features.</p>

<p>While Atom can be used to implement drivers of tagsets whose tags are not structured (such as en::penn or sv::mamba), it should also provide a means of defining &#8220;sub-drivers&#8221; for individual surface features within drivers of complex tagsets. For example, the Czech tags in the Prague Dependency Treebank are always strings of 15 characters where the <i>i</i>-th position in the string encodes the <i>i</i>-th surface feature (which may or may not directly correspond to a feature in Interset). A driver for the PDT tagset could internally construct atomic drivers for PDT gender, number, case etc.</p>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="ATTRIBUTES"
>ATTRIBUTES</a></h1>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="surfeature"
>surfeature</a></h2>

<p>Name of the surface feature the atom describes. If the atom describes a whole tagset, the tagset id could be stored here. The surface features may be structured differently from Interset, e.g. there might be an <i>agreement</i> feature, which would map to the Interset features of <code>person</code> and <code>number</code>.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="decode_map"
>decode_map</a></h2>

<p>A compact description of the mapping from the surface tags to the Interset feature values. It is a hash reference. Hash keys are surface tags. Hash values are references to arrays of assignments. Each array must have an even number of elements, and every pair of elements is a feature-value pair.</p>

<p>Example:</p>

<pre>  { &#39;M&#39; =&#62; [&#39;gender&#39; =&#62; &#39;masc&#39;, &#39;animateness&#39; =&#62; &#39;anim&#39;],
    &#39;I&#39; =&#62; [&#39;gender&#39; =&#62; &#39;masc&#39;, &#39;animateness&#39; =&#62; &#39;inan&#39;],
    &#39;F&#39; =&#62; [&#39;gender&#39; =&#62; &#39;fem&#39;],
    &#39;N&#39; =&#62; [&#39;gender&#39; =&#62; &#39;neut&#39;] }</pre>

<p>Vertical bars may be used to separate multiple values of one feature. The <code>other</code> feature can have a structured value, so you can use standard Perl syntax to describe hash and/or array references.</p>

<pre>  { &#39;name_of_dog&#39; =&#62; [ &#39;pos&#39; =&#62; &#39;noun&#39;, &#39;nountype&#39; =&#62; &#39;prop&#39;, &#39;other&#39; =&#62; { &#39;named_entity_type&#39; =&#62; &#39;dog&#39; } ],
    &#39;wh_word&#39;     =&#62; [ &#39;pos&#39; =&#62; &#39;noun|adj|adv&#39;, &#39;prontype&#39; =&#62; &#39;int|rel&#39; ] }</pre>
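
<p>The following is a minimal sketch of how one entry of such a map can be read. It is not the code of this module: the helper <code>assignments_to_hash()</code> is hypothetical and a plain Perl hash stands in for the feature structure. It checks that the array has an even number of elements and splits values at vertical bars.</p>

<pre>  use strict;
  use warnings;

  my %decode_map =
  (
      &#39;M&#39;       =&#62; [&#39;gender&#39; =&#62; &#39;masc&#39;, &#39;animateness&#39; =&#62; &#39;anim&#39;],
      &#39;wh_word&#39; =&#62; [&#39;pos&#39; =&#62; &#39;noun|adj|adv&#39;, &#39;prontype&#39; =&#62; &#39;int|rel&#39;]
  );

  # Hypothetical helper (not part of the module): converts one array of
  # assignments to a hash where every feature maps to a list of values.
  sub assignments_to_hash
  {
      my @assignments = @{$_[0]};
      die(&#39;Odd number of elements in the array of assignments&#39;) if(scalar(@assignments) % 2);
      my %features;
      while(my ($feature, $value) = splice(@assignments, 0, 2))
      {
          # Structured values (such as a hash for &#39;other&#39;) are kept as they are.
          $features{$feature} = ref($value) ? $value : [split(/\|/, $value)];
      }
      return \%features;
  }

  my $features = assignments_to_hash($decode_map{&#39;wh_word&#39;});
  print(join(&#39;, &#39;, @{$features-&#62;{pos}}), &#34;\n&#34;); # prints: noun, adj, adv</pre>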

<h2><a class='u' href='#___top' title='click to go to top of document'
name="encode_map"
>encode_map</a></h2>

<p>A compact description of the mapping from the Interset feature structure to the surface tags. It is a hash reference, possibly with nested hashes. The top-level hash must always have just one key, which is the name of an Interset feature. (It could be encoded without the hash but I believe that the whole map looks better this way.)</p>

<p>The top-level key leads to a second-level hash, which is indexed by the values of the feature. It is not necessary that all possible values are listed. A special value <code>@</code>, if present, means &#8220;everything else&#8221;. It is recommended to always mark the default value using <code>@</code>. Even if we list all currently known values of the feature, new values may be introduced to Interset in the future and we do not want to have to go back to all tagsets and update their encoding maps. (On the other hand, if there are values that the <code>decode()</code> method of the current atom does not generate but we still have a preferred output for them, the preference must be made explicit. For instance, if the language does not have the pluperfect tense, it may still define that it be encoded the same way as the past tense.)</p>

<p>A feature may have a <i>multi-value</i> (several values joined and separated by vertical bars). A value (multi- or not) is always first sought using the exact match. If the search fails, both the current feature value and the keys of the value hash are treated as lists of values and their largest intersection is sought. If no overlap is found, the default <code>@</code> decision is taken.</p>

<p>Example:</p>

<pre>  { &#39;gender&#39; =&#62; { &#39;masc&#39;      =&#62; { &#39;animateness&#39; =&#62; { &#39;inan&#39; =&#62; &#39;I&#39;,
                                                      &#39;@&#39;    =&#62; &#39;M&#39; }},
                  &#39;fem|masc&#39;  =&#62; &#39;T&#39;,
                  &#39;fem&#39;       =&#62; &#39;F&#39;,
                  &#39;@&#39;         =&#62; &#39;N&#39; }}</pre>
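
<p>The lookup order described above (exact match, then largest intersection, then <code>@</code>) can be sketched in plain Perl as follows. This is an illustration under assumptions, not the implementation used by the module: <code>encode_sketch()</code> is a hypothetical helper, the feature structure is a plain hash, and ties between equally large intersections are broken by the alphabetical order of the keys, which the description above does not specify.</p>

<pre>  use strict;
  use warnings;

  my %encode_map =
  (
      &#39;gender&#39; =&#62; { &#39;masc&#39;     =&#62; { &#39;animateness&#39; =&#62; { &#39;inan&#39; =&#62; &#39;I&#39;,
                                                       &#39;@&#39;    =&#62; &#39;M&#39; }},
                    &#39;fem|masc&#39; =&#62; &#39;T&#39;,
                    &#39;fem&#39;      =&#62; &#39;F&#39;,
                    &#39;@&#39;        =&#62; &#39;N&#39; }
  );

  # Hypothetical helper: walks the map and returns the surface tag.
  # $features is a plain hash, e.g. { &#39;gender&#39; =&#62; &#39;fem|neut&#39; }.
  sub encode_sketch
  {
      my ($map, $features) = @_;
      my ($feature) = keys(%{$map}); # every level of the map has just one feature name
      my $valuehash = $map-&#62;{$feature};
      my $value = $features-&#62;{$feature} // &#39;&#39;;
      my $target;
      if(exists($valuehash-&#62;{$value}))
      {
          $target = $valuehash-&#62;{$value}; # exact match
      }
      else
      {
          # Pick the key that shares the most values with the current value.
          my %current = map {$_ =&#62; 1} split(/\|/, $value);
          my $best = 0;
          foreach my $key (sort(keys(%{$valuehash})))
          {
              my $overlap = grep {$current{$_}} split(/\|/, $key);
              if($overlap &#62; $best)
              {
                  $best = $overlap;
                  $target = $valuehash-&#62;{$key};
              }
          }
          $target = $valuehash-&#62;{&#39;@&#39;} if(!defined($target)); # default decision
      }
      # A nested hash means that another feature must be queried.
      return ref($target) eq &#39;HASH&#39; ? encode_sketch($target, $features) : $target;
  }

  print(encode_sketch(\%encode_map, {&#39;gender&#39; =&#62; &#39;masc&#39;, &#39;animateness&#39; =&#62; &#39;inan&#39;}), &#34;\n&#34;); # prints: I
  print(encode_sketch(\%encode_map, {&#39;gender&#39; =&#62; &#39;fem|neut&#39;}), &#34;\n&#34;); # prints: F
  print(encode_sketch(\%encode_map, {&#39;gender&#39; =&#62; &#39;com&#39;}), &#34;\n&#34;); # prints: N</pre>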

<p>The <code>other</code> feature, if queried by the map, receives special treatment. First, the <code>tagset</code> attribute must be filled in and its value is checked against the <code>tagset</code> feature. The value is only processed if the tagset ids match (otherwise an empty value is assumed). String values and array values (given as vertical-bar-separated strings) are processed similarly to normal features. In addition, it is possible to have a hash of subfeatures stored in <code>other</code>, and to query them as &#39;other/subfeature&#39;.</p>

<p>Example:</p>

<pre>  { &#39;other/subfeature1&#39; =&#62; { &#39;x&#39; =&#62; &#39;X&#39;,
                             &#39;y&#39; =&#62; &#39;Y&#39;,
                             &#39;@&#39; =&#62; { &#39;other/subfeature2&#39; =&#62; { &#39;1&#39; =&#62; &#39;S&#39;,
                                                               &#39;@&#39; =&#62; &#39;&#39; }}}}</pre>

<p>In this case, the corresponding <code>decode_map</code> would be:</p>

<pre>  {
      &#39;X&#39; =&#62; [&#39;other&#39; =&#62; {&#39;subfeature1&#39; =&#62; &#39;x&#39;}],
      &#39;Y&#39; =&#62; [&#39;other&#39; =&#62; {&#39;subfeature1&#39; =&#62; &#39;y&#39;}],
      &#39;S&#39; =&#62; [&#39;other&#39; =&#62; {&#39;subfeature2&#39; =&#62; &#39;1&#39;}]
  }</pre>
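
<p>The special treatment of <code>other</code> can be pictured with the following sketch. It only illustrates the rules stated above and is not taken from the module: the helper <code>get_queried_value()</code>, the plain hash used as the feature structure and the tagset id <code>cs::mytagset</code> are all assumptions made for this example.</p>

<pre>  use strict;
  use warnings;

  # Hypothetical helper: returns the value that the map sees for $query.
  # $fs is a plain hash; $tagset is the tagset id of the atom.
  sub get_queried_value
  {
      my ($fs, $query, $tagset) = @_;
      return $fs-&#62;{$query} // &#39;&#39; unless($query =~ m/^other(?:\/(.+))?$/);
      my $subfeature = $1;
      # Values of &#39;other&#39; count only if they come from the same tagset.
      return &#39;&#39; if(($fs-&#62;{tagset} // &#39;&#39;) ne $tagset);
      my $other = $fs-&#62;{other};
      return $other // &#39;&#39; if(!defined($subfeature));
      return ref($other) eq &#39;HASH&#39; ? $other-&#62;{$subfeature} // &#39;&#39; : &#39;&#39;;
  }

  my %fs = (&#39;tagset&#39; =&#62; &#39;cs::mytagset&#39;, &#39;other&#39; =&#62; {&#39;subfeature1&#39; =&#62; &#39;x&#39;});
  print(get_queried_value(\%fs, &#39;other/subfeature1&#39;, &#39;cs::mytagset&#39;), &#34;\n&#34;); # prints: x
  print(get_queried_value(\%fs, &#39;other/subfeature1&#39;, &#39;en::penn&#39;), &#34;\n&#34;); # prints an empty line</pre>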

<p>Note that in general it is not possible to automatically derive the <code>encode_map</code> from the <code>decode_map</code> or vice versa. However, there are simple instances of atoms where this is possible.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="tagset"
>tagset</a></h2>

<p>Optional identifier of the tagset that this atom is part of. It is required when the encoding map queries values of the <code>other</code> feature (to check against the <code>tagset</code> feature that the values come from the same tagset). The default is the empty string.</p>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="METHODS"
>METHODS</a></h1>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="decode()"
>decode()</a></h2>

<pre>  my $fs  = $driver-&#62;decode ($tag);</pre>

<p>Takes a tag (string) and returns a <a href="http://search.cpan.org/perldoc?Lingua%3A%3AInterset%3A%3AFeatureStructure" class="podlinkpod"
>Lingua::Interset::FeatureStructure</a> object with corresponding feature values set.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="decode_and_merge_hard()"
>decode_and_merge_hard()</a></h2>

<pre>  my $fs  = $driver1-&#62;decode ($tag1);
  $driver2-&#62;decode_and_merge_hard ($tag2, $fs);</pre>

<p>Takes a tag (string) and a <a href="http://search.cpan.org/perldoc?Lingua%3A%3AInterset%3A%3AFeatureStructure" class="podlinkpod"
>Lingua::Interset::FeatureStructure</a> object. Adds the feature values corresponding to the tag to the existing feature structure. Replaces previous values in case of conflict.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="decode_and_merge_soft()"
>decode_and_merge_soft()</a></h2>

<pre>  my $fs  = $driver1-&#62;decode ($tag1);
  $driver2-&#62;decode_and_merge_soft ($tag2, $fs);</pre>

<p>Takes a tag (string) and a <a href="http://search.cpan.org/perldoc?Lingua%3A%3AInterset%3A%3AFeatureStructure" class="podlinkpod"
>Lingua::Interset::FeatureStructure</a> object. Adds the feature values corresponding to the tag to the existing feature structure. Merges lists of values if a feature already had a value set.</p>
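
<p>The difference between the hard and the soft merge can be illustrated on plain hashes. The helpers <code>merge_hard()</code> and <code>merge_soft()</code> below are hypothetical and only mimic the behavior described above. In particular, sorting the merged values alphabetically is an assumption of this sketch.</p>

<pre>  use strict;
  use warnings;

  # Hypothetical helpers working on plain hashes (feature =&#62; &#39;value|value&#39;).
  sub merge_hard
  {
      my ($fs, %new) = @_;
      $fs-&#62;{$_} = $new{$_} foreach(keys(%new)); # the new value replaces the old one
      return $fs;
  }

  sub merge_soft
  {
      my ($fs, %new) = @_;
      foreach my $feature (keys(%new))
      {
          # Union of the old and the new values, without duplicates.
          my %values = map {$_ =&#62; 1} split(/\|/, $fs-&#62;{$feature} // &#39;&#39;), split(/\|/, $new{$feature});
          $fs-&#62;{$feature} = join(&#39;|&#39;, sort(keys(%values)));
      }
      return $fs;
  }

  my %decoded = (&#39;gender&#39; =&#62; &#39;fem&#39;, &#39;number&#39; =&#62; &#39;sing&#39;);
  print(merge_hard({&#39;gender&#39; =&#62; &#39;masc&#39;}, %decoded)-&#62;{gender}, &#34;\n&#34;); # prints: fem
  print(merge_soft({&#39;gender&#39; =&#62; &#39;masc&#39;}, %decoded)-&#62;{gender}, &#34;\n&#34;); # prints: fem|masc</pre>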

<h2><a class='u' href='#___top' title='click to go to top of document'
name="encode()"
>encode()</a></h2>

<pre>  my $tag = $driver-&#62;encode ($fs);</pre>

<p>Takes a <a href="http://search.cpan.org/perldoc?Lingua%3A%3AInterset%3A%3AFeatureStructure" class="podlinkpod"
>Lingua::Interset::FeatureStructure</a> object and returns the tag (string) in the given tagset that corresponds to the feature values. Note that some features may be ignored because they cannot be represented in the given tagset.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="list()"
>list()</a></h2>

<pre>  my $list_of_tags = $driver-&#62;list();</pre>

<p>Returns a reference to the list of all known tags in this particular tagset. This is not directly needed to decode, encode or convert tags but it is very useful for testing and advanced operations over the tagset. Note however that many tagset drivers contain only an approximate list, created by collecting tag occurrences in some corpus.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="merge_atoms()"
>merge_atoms()</a></h2>

<pre>  $atom0-&#62;merge_atoms($atom1, $atom2, ..., $atomN);</pre>

<p>Takes references to one or more other atoms and merges (adds) their decoding maps to our decoding map. Ordering of the atoms matters: if several atoms define decoding of the same feature, the first definition will be used and the others will be ignored. The atom <code>$self</code> comes first.</p>

<p>Note that the <i>encoding</i> map will <i>not change</i>. This method is useful for tagsets where feature values appear without naming the feature. For example, instead of</p>

<pre>  gender=masc|number=sing|case=nom</pre>

<p>the tag only contains</p>

<pre>  masc|sing|nom</pre>

<p>Such tagsets require asymmetric processing. There is one big atom that decodes any feature value regardless of which feature it belongs to. But it does not encode anything. Then there are many small atoms for individual features. We cannot use them for decoding because we do not know which atom to pick until we have decoded the value. But we will use them for encoding because we know which features and in what order we want to encode for a particular part of speech.</p>

<p>We could define both the big decoding atom and the small encoding atoms manually. There is a drawback to it: we would be describing each feature twice, in two different places in the source code. The <code>merge_atoms()</code> method gives us a better way: we will define the small atoms (both for decoding and encoding) and then create the big decoding atom by merging the small ones:</p>

<pre>  # This code goes in a tagset driver, e.g. Lingua::Interset::Tagset::CS::Mytagset,
  # in a function that builds all necessary atoms, e.g. sub _create_atoms.
  my %atoms;
  $atoms{genderanim} = $self-&#62;create_atom
  (
      &#39;surfeature&#39; =&#62; &#39;genderanim&#39;,
      &#39;decode_map&#39; =&#62;
      {
          &#39;ma&#39; =&#62; [&#39;gender&#39; =&#62; &#39;masc&#39;, &#39;animateness&#39; =&#62; &#39;anim&#39;],
          &#39;mi&#39; =&#62; [&#39;gender&#39; =&#62; &#39;masc&#39;, &#39;animateness&#39; =&#62; &#39;inan&#39;],
          &#39;f&#39;  =&#62; [&#39;gender&#39; =&#62; &#39;fem&#39;],
          &#39;n&#39;  =&#62; [&#39;gender&#39; =&#62; &#39;neut&#39;]
      },
      &#39;encode_map&#39; =&#62;
      {
          &#39;gender&#39; =&#62; { &#39;masc&#39; =&#62; { &#39;animateness&#39; =&#62; { &#39;inan&#39; =&#62; &#39;mi&#39;,
                                                       &#39;@&#39;    =&#62; &#39;ma&#39; }},
                        &#39;fem&#39;  =&#62; &#39;f&#39;,
                        &#39;@&#39;    =&#62; &#39;n&#39; }
      }
  );
  $atoms{number} = $self-&#62;create_simple_atom
  (
      &#39;intfeature&#39; =&#62; &#39;number&#39;,
      &#39;simple_decode_map&#39; =&#62;
      {
          &#39;sg&#39; =&#62; &#39;sing&#39;,
          &#39;pl&#39; =&#62; &#39;plur&#39;
      }
  );
  $atoms{feature} = $self-&#62;create_atom
  (
      &#39;surfeature&#39; =&#62; &#39;feature&#39;,
      &#39;decode_map&#39; =&#62; {},
      &#39;encode_map&#39; =&#62; { &#39;pos&#39; =&#62; {} } # The encoding map cannot be empty even if we are not going to use it.
  );
  $atoms{feature}-&#62;merge_atoms($atoms{genderanim}, $atoms{number});</pre>
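
<p>The ordering rule of <code>merge_atoms()</code> (the first definition wins) can be pictured on plain decoding maps. The sketch below is not the code of the module. The helper <code>merge_decode_maps()</code> is hypothetical, it assumes that clashes are detected per surface value (hash key), and the clashing entry <code>n</code> in the second map is invented for the illustration.</p>

<pre>  use strict;
  use warnings;

  # Hypothetical helper: adds entries of the other maps to the first map.
  # An entry that is already defined is never overwritten.
  sub merge_decode_maps
  {
      my ($target, @others) = @_;
      foreach my $map (@others)
      {
          foreach my $tag (keys(%{$map}))
          {
              $target-&#62;{$tag} = $map-&#62;{$tag} unless(exists($target-&#62;{$tag}));
          }
      }
      return $target;
  }

  my %big    = ();
  my %gender = (&#39;f&#39;  =&#62; [&#39;gender&#39; =&#62; &#39;fem&#39;],  &#39;n&#39; =&#62; [&#39;gender&#39; =&#62; &#39;neut&#39;]);
  my %number = (&#39;sg&#39; =&#62; [&#39;number&#39; =&#62; &#39;sing&#39;], &#39;n&#39; =&#62; [&#39;number&#39; =&#62; &#39;sing&#39;]); # &#39;n&#39; clashes on purpose
  merge_decode_maps(\%big, \%gender, \%number);
  print(join(&#39; &#39;, sort(keys(%big))), &#34;\n&#34;); # prints: f n sg
  print(join(&#39;=&#39;, @{$big{&#39;n&#39;}}), &#34;\n&#34;); # prints: gender=neut</pre>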

<h1><a class='u' href='#___top' title='click to go to top of document'
name="SEE_ALSO"
>SEE ALSO</a></h1>

<p><a href="http://search.cpan.org/perldoc?Lingua%3A%3AInterset%3A%3ATagset" class="podlinkpod"
>Lingua::Interset::Tagset</a>, <a href="http://search.cpan.org/perldoc?Lingua%3A%3AInterset%3A%3AFeatureStructure" class="podlinkpod"
>Lingua::Interset::FeatureStructure</a></p>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="AUTHOR"
>AUTHOR</a></h1>

<p>Dan Zeman &#60;zeman@ufal.mff.cuni.cz&#62;</p>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="COPYRIGHT_AND_LICENSE"
>COPYRIGHT AND LICENSE</a></h1>

<p>This software is copyright (c) 2017 by Univerzita Karlova (Charles University).</p>

<p>This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.</p>

<!-- end doc -->

</body></html>