<html><head><title>Unique Identifiers for GEDCOM Records and Their Attributes</title>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >
<link rel="stylesheet" type="text/css" title="pod_stylesheet" href="/assets/css/local/pod.css">

</head>
<body class='pod'>
<!--
  generated by Local::Pod::Simple::HTML v1.02,
  using Pod::Simple::PullParser v3.16,
  under Perl v5.012002 at Wed Aug 31 02:47:02 2011 GMT.

 If you want to change this HTML document, you probably shouldn't do that
   by changing it directly.  Instead, see about changing the calling options
   to Local::Pod::Simple::HTML, and/or subclassing Local::Pod::Simple::HTML,
   then reconverting this document from the Pod source.
   When in doubt, email the author of Local::Pod::Simple::HTML for advice.
   See 'perldoc Local::Pod::Simple::HTML' for more info.

-->

<!-- start doc -->
<a name='___top' class='dummyTopAnchor' ></a>
<h1><div class="toc_title">Unique Identifiers for GEDCOM Records and Their Attributes</div></h1>
<h3><div class="toc_toc">Table of contents</div></h3>
<table align="center" summary = "Table of Contents">
<tr><td align="center"><a href="#Unique_Identifiers_for_GEDCOM_Records_and_Their_Attributes">Unique Identifiers for GEDCOM Records and Their Attributes</a></td></tr>
<tr><td align="center"><a href="#Terminology">Terminology</a></td></tr>
<tr><td align="center"><a href="#Purpose_of_this_document">Purpose of this document</a></td></tr>
<tr><td align="center"><a href="#Limitations_of_this_document">Limitations of this document</a></td></tr>
<tr><td align="center"><a href="#A_Use_Case_for_UUIDs">A Use Case for UUIDs</a></td></tr>
<tr><td align="center"><a href="#Identifying_Unique_Identifiers_per_INDI">Identifying Unique Identifiers per INDI</a></td></tr>
<tr><td align="center"><a href="#Avoiding_Unnecessary_Proliferation_of_UUIDs">Avoiding Unnecessary Proliferation of UUIDs</a></td></tr>
<tr><td align="center"><a href="#Generation_of_UUIDs">Generation of UUIDs</a></td></tr>
<tr><td align="center"><a href="#Lifetime_of_UUIDs">Lifetime of UUIDs</a></td></tr>
<tr><td align="center"><a href="#Importation_of_GEDCOM_data">Importation of GEDCOM data</a></td></tr>
<tr><td align="center"><a href="#Merging_and_Matching_Sources">Merging and Matching Sources</a></td></tr>
<tr><td align="center"><a href="#Exporting_UUIDs">Exporting UUIDs</a></td></tr>
<tr><td align="center"><a href="#The_Header">The Header</a></td></tr>
<tr><td align="center"><a href="#Exporting_more_than_1_UUID_per_INDI">Exporting more than 1 UUID per INDI</a></td></tr>
<tr><td align="center"><a href="#References">References</a></td></tr>
</table>
<h1><a class='u' href='#___top' title='click to go to top of document'
name="Unique_Identifiers_for_GEDCOM_Records_and_Their_Attributes"
>Unique Identifiers for GEDCOM Records and Their Attributes</a></h1>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="Terminology"
>Terminology</a></h2>

<dl>
<dt><a name="o_Must_means_mandatory_and_visa_versa"
>o Must means mandatory and vice versa</a></dt>

<dd>
<p>You know the drill.</p>

<dt><a name="o_UUID_&#39;v&#39;_GUID"
>o UUID &#39;v&#39; GUID</a></dt>

<dd>
<p>UUIDs (Universally Unique Identifiers) are also known as GUIDs (Globally Unique Identifiers).
This document uses UUID.</p>

<dt><a name="o_HEADER/INDI/FAM/etc"
>o HEADER/INDI/FAM/etc</a></dt>

<dd>
<p>To avoid endlessly referring to HEADER/INDI/FAM/etc,
this document just uses INDI.</p>

<dt><a name="o_Page_numbers"
>o Page numbers</a></dt>

<dd>
<p>Page numbers refer to DRAFT Release 5.5.1,
in Ged551-5.pdf.
See <a href="http://metacpan.org/module/Unique Identifiers for GEDCOM Records and Their Attributes#References" class="podlinkpod"
>&#34;References&#34;</a> for downloading details.</p>
</dd>
</dl>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="Purpose_of_this_document"
>Purpose of this document</a></h2>

<p>This document is intended to formalize a set of proposals to add Unique Identifiers (UUIDs) to the GEDCOM specification,
and hence to GEDCOM-structured data.</p>

<p>These proposals are expected to be implemented within various software packages,
without any expectation that they will be accepted by the original authors of the GEDCOM specification.</p>

<p>UUIDs make it possible to combine GEDCOM files from multiple sources (files or systems),
while retaining enough information to identify uniquely the specific source from which any particular assertion about an individual or family originated.</p>

<p>This document is written from the point of view that a specific software system handling GEDCOM data has the capability of storing UUIDs,
without necessarily being able to generate them.</p>

<p>This means that such a system which cannot generate UUIDs must nevertheless have the capability of reliably processing 1 or more UUIDs per GEDCOM record and/or attribute.</p>

<p>This means handling an INDI record which has 1 or more UUIDs,
and further it means handling a NAME (within that INDI) which itself has 1 or more UUIDs.</p>

<p>Also,
for a system which can generate UUIDs,
the question arises: At what stage should these UUIDs be generated?
This is discussed below.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="Limitations_of_this_document"
>Limitations of this document</a></h2>

<p>This document does not attempt to define any mechanism whereby data is stored in any particular system.</p>

<p>It discusses everything in terms of GEDCOM-compatible (i.e.
stand-alone) files.</p>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="A_Use_Case_for_UUIDs"
>A Use Case for UUIDs</a></h2>

<dl>
<dt><a name="o_Import"
>o Import</a></dt>

<dd>
<p>Import 2 GEDCOM files into a program which uses UUIDs,
assigning a UUID to each input file.</p>

<dt><a name="o_Match_individuals"
>o Match individuals</a></dt>

<dd>
<p>The program (somehow) decides an INDI record from file A matches an INDI record from file B.
Let&#39;s say the names match.</p>

<dt><a name="o_Flag_inconsistencies"
>o Flag inconsistencies</a></dt>

<dd>
<p>Then when there is inconsistent data (e.g.
the asserted birthdays of the individual are different in the 2 files),
the program saves the respective files&#39; UUIDs <i>attached to each birthday</i>.</p>

<dt><a name="o_Multiple_versions_of_data"
>o Multiple versions of data</a></dt>

<dd>
<p>In this manner inconsistent data is saved permanently (carried forward in time),
without any specific determination that one version is &#39;right&#39; and one is &#39;wrong&#39;.</p>

<dt><a name="o_Returning_to_the_roots"
>o Returning to the roots</a></dt>

<dd>
<p>At some later time we can then use these UUIDs to determine which source the specific values of the birthdays came from.
Naturally any piece of data can be treated likewise.</p>
</dd>
</dl>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="Identifying_Unique_Identifiers_per_INDI"
>Identifying Unique Identifiers per INDI</a></h1>

<p>Proposal: Use _UID,
UUID and TYPE as the GEDCOM-like tags for UUIDs.</p>

<p>Usage:</p>

<dl>
<dt><a name="o_Extension_of_the_definition_of_HEADER_and_INDI"
>o Extension of the definition of HEADER and INDI</a></dt>

<dd>
<pre>          HEADER :=
          n HEAD                                    {1:1}
          ...
            +1 _UID &#60;&#60;UNIQUE_IDENTIFIER_STRUCTURE&#62;&#62; {0:M}

          INDIVIDUAL_RECORD :=
          n @XREF:INDI@ INDI                        {1:1}
          ...
            +1 _UID &#60;&#60;UNIQUE_IDENTIFIER_STRUCTURE&#62;&#62; {0:M}</pre>

<p>Note:</p>

<dl>
<dt><a name="o_The_{0:M}_limit"
>o The {0:M} limit</a></dt>

<dd>
<p>The limit is not {0:1} because there needs to be provision for storing multiple UUIDs per HEADER or INDI, assuming these have been generated on, or on behalf of, different source systems.</p>

<p>Perhaps every UUID in the file should be specified once in the header, for the convenience of the software importing the data. This also helps validation of the data.</p>

<dt><a name="o_REFN"
>o REFN</a></dt>

<dd>
<p>REFN is rejected as unsuitable for this purpose because its length is limited to 20 characters.</p>

<dt><a name="o_RIN"
>o RIN</a></dt>

<dd>
<p>RIN is rejected as unsuitable for this purpose because its length is limited to 12 characters.</p>
</dd>
</dl>

<dt><a name="o_UNIQUE_IDENTIFIER_STRUCTURE"
>o UNIQUE_IDENTIFIER_STRUCTURE</a></dt>

<dd>
<pre>          UNIQUE_IDENTIFIER_STRUCTURE :=
          n UUID &#60;UNIQUE_IDENTIFIER&#62;         {34:34}
            +1 TYPE &#60;UNIQUE_IDENTIFIER_TYPE&#62; {1:248}</pre>

<dt><a name="o_UUID"
>o UUID</a></dt>

<dd>
<pre>          UUID := {Size=34:34}
          A unique identifier assigned to the INDI and, perhaps implicitly, to all attributes of the INDI.</pre>

<p>The following is a quotation from the documentation of the Perl module <a href="http://metacpan.org/module/Data::UUID" class="podlinkpod"
>Data::UUID</a>:</p>

<p>&#34;The algorithm for UUID generation, used by this extension is described in the Internet Draft &#34;UUIDs and GUIDs&#34; by Paul J. Leach and Rich Salz. (See RFC 4122.) It provides reasonably efficient and reliable framework for generating UUIDs and supports fairly high allocation rates -- 10 million per second per machine -- and therefore is suitable for identifying both extremely short-lived and very persistent objects on a given system as well as across the network.&#34;</p>

<p>The following is an adaptation of a quotation from the documentation of the Perl module <a href="http://metacpan.org/module/Data::Session::ID::UUID34" class="podlinkpod"
>Data::Session::ID::UUID34</a>:</p>

<p>&#34;Perl usage: my($uuid) = Data::UUID -&#62; new -&#62; create_hex.</p>

<p>Note: Data::UUID returns &#39;0x&#39; as the prefix of the 34-byte hex digest. You have been warned.&#34;</p>

<p>The point of including the &#39;0x&#39; in the UUID is to indicate to any system that the characters in the UUID are meant to be hex digits.</p>

<dt><a name="o_TYPE"
>o TYPE</a></dt>

<dd>
<pre>          TYPE := {Size=1:248}
          A user-defined description of the UUID.</pre>
</dd>
</dl>
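<p>The 34-character format described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python&#39;s standard <code>uuid</code> module (random UUIDs) in place of Data::UUID, the helper names <code>new_uuid34</code>, <code>is_uuid34</code> and <code>uid_lines</code> are hypothetical, and the output follows the flat <code>_UID value</code> line shape used in the export examples later in this document.</p>

<pre>
```python
import re
import uuid

# 34 characters: the literal prefix '0x' followed by 32 hex digits.
UUID34_PATTERN = re.compile(r"0x[0-9A-Fa-f]{32}")


def new_uuid34():
    """Return a random UUID as '0x' plus 32 hex digits (34 characters)."""
    # Assumption: a random (version 4) UUID is acceptable here.
    return "0x" + uuid.uuid4().hex.upper()


def is_uuid34(value):
    """True when value has exactly the 34-character shape sketched here."""
    return UUID34_PATTERN.fullmatch(value) is not None


def uid_lines(level, value, uid_type=None):
    """Render a _UID line, plus an optional TYPE line one level below it."""
    if not is_uuid34(value):
        raise ValueError("not a 34-character hex UUID: " + repr(value))
    lines = [f"{level} _UID {value}"]
    if uid_type is not None:
        lines.append(f"{level + 1} TYPE {uid_type}")
    return lines
```
</pre>

<p>A system which can store but not generate UUIDs would need only the validation and rendering parts of such a sketch.</p>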

<h1><a class='u' href='#___top' title='click to go to top of document'
name="Avoiding_Unnecessary_Proliferation_of_UUIDs"
>Avoiding Unnecessary Proliferation of UUIDs</a></h1>

<p>Proposal: A system must generate only the minimum number of UUIDs needed to satisfy the requirements of this document.</p>

<p>This means only a single UUID need be stored in the header of a GEDCOM file, on the understanding it automatically applies (&#39;trickles down&#39;) to each record within the file.</p>

<p>Likewise, only a single UUID need be stored in an INDI record, on the understanding it trickles down to each attribute (e.g. NAME) within that INDI.</p>

<p>The attribute is said to &#39;inherit&#39; the UUID of its parent.</p>
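<p>The trickle-down rule can be sketched as a lookup which walks from the attribute up to the header. This is a minimal illustration, not a definitive implementation: the function name <code>effective_uuids</code> and the use of plain lists of UUID strings are assumptions.</p>

<pre>
```python
def effective_uuids(attribute_uuids, indi_uuids, header_uuids):
    """Return the UUIDs which apply to one attribute (e.g. NAME).

    Each argument is the list of UUIDs stored at that level, possibly
    empty. The nearest level which stores any UUIDs wins; an attribute
    with none of its own inherits from its INDI, then from the header.
    """
    for candidates in (attribute_uuids, indi_uuids, header_uuids):
        if candidates:
            return list(candidates)
    # No UUIDs anywhere: the file simply does not use them.
    return []
```
</pre>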

<h1><a class='u' href='#___top' title='click to go to top of document'
name="Generation_of_UUIDs"
>Generation of UUIDs</a></h1>

<p>Proposal: That UUIDs, by default, need not be generated.</p>

<p>In other words, UUIDs are not, by default, mandatory.</p>

<p>Hence any (isolated) system can continue to operate in the absence of UUIDs.</p>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="Lifetime_of_UUIDs"
>Lifetime of UUIDs</a></h1>

<p>Proposal: When a system does generate UUIDs for an INDI, only 1 UUID is permanently associated with the corresponding data.</p>

<p>Hence, within any (isolated) system, any number of UUIDs can be generated per INDI record, but only the most recently generated one is to be preserved. All trace of all preceding UUIDs is to be expunged from the system.</p>

<h1><a class='u' href='#___top' title='click to go to top of document'
name="Importation_of_GEDCOM_data"
>Importation of GEDCOM data</a></h1>

<p>When a file of GEDCOM data is imported, various cases arise:</p>

<dl>
<dt><a name="o_Importation_into_an_empty_system"
>o Importation into an empty system</a></dt>

<dd>
<p>In this case, one UUID must be generated, which will then apply to all of the incoming data.</p>

<dt><a name="o_Importation_into_a_system_with_pre-existing_data"
>o Importation into a system with pre-existing data</a></dt>

<dd>
<dl>
<dt><a name="o_When_the_pre-existing_data_has_at_least_1_UUID"
>o When the pre-existing data has at least 1 UUID</a></dt>

<dd>
<p>Then, the imported data must have a UUID, or a UUID must be generated on-the-fly for the incoming data.</p>

<dt><a name="o_When_the_pre-exising_data_does_not_have_UUIDs"
>o When the pre-existing data does not have UUIDs</a></dt>

<dd>
<p>Then a UUID must be generated for the pre-existing data.</p>

<p>After that, the incoming data is treated as in the preceding case.</p>

<p>The point of this is to specifically exclude any attempt to combine 2 datasets in some manner based on the false assumption that they can&#39;t have any matching data.</p>
</dd>
</dl>
</dd>
</dl>

<p>In each case, at least 2 UUIDs must be output in the header when the data is exported.</p>
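<p>The import cases above can be sketched as a single decision function. This is a hedged illustration only: <code>plan_import</code> and <code>new_uuid34</code> are hypothetical names, Python&#39;s <code>uuid</code> module stands in for whatever generator a real system uses, and the sketch assumes that incoming data which already carries a UUID keeps it rather than receiving a fresh one.</p>

<pre>
```python
import uuid


def new_uuid34():
    """Return '0x' plus 32 hex digits (a stand-in UUID generator)."""
    return "0x" + uuid.uuid4().hex.upper()


def plan_import(system_has_data, existing_uuids, incoming_uuids):
    """Return (existing_uuids, incoming_uuids) after applying the rules.

    A new UUID is generated only where the rules demand one.
    """
    existing = list(existing_uuids)
    incoming = list(incoming_uuids)
    if system_has_data and not existing:
        # Pre-existing data without UUIDs: generate one for it first.
        existing.append(new_uuid34())
    if not incoming:
        # Incoming data without a UUID: generate one on the fly.
        # Assumption: incoming data with a UUID keeps the one it has.
        incoming.append(new_uuid34())
    return existing, incoming
```
</pre>

<p>Under these assumptions, the UUIDs to list in the header on export are the two returned lists taken together.</p>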

<h1><a class='u' href='#___top' title='click to go to top of document'
name="Merging_and_Matching_Sources"
>Merging and Matching Sources</a></h1>

<dl>
<dt><a name="o_Exactly_matching_data"
>o Exactly matching data</a></dt>

<dd>
<p>When data coming from the 2 sources exactly matches, there is no need to store a UUID with each attribute of the combined result. Here, attribute refers to, say, NAME within INDI.</p>

<p>It suffices to store only the 2 UUIDs, if they exist, at the INDI level.</p>

<p>Actually, if every data record matches, then there is only a need to store the UUIDs at the header level, and not at each INDI level.</p>

<dt><a name="o_Mis-matched_data"
>o Mis-matched data</a></dt>

<dd>
<p>When the data does not match, then the UUIDs, if they exist, are to be stored at the INDI level. In practice, this means 1 or more levels below the INDI tag itself. See <a href="http://metacpan.org/module/Unique Identifiers for GEDCOM Records and Their Attributes#Exporting-UUIDs" class="podlinkpod"
>&#34;Exporting UUIDs&#34;</a> just below.</p>

<p>At the level of the attribute which mis-matches, say NAME within INDI, it suffices to store only 1 UUID, although both can be stored. The one not stored can be inferred from the pair stored at the INDI level. This optimization is designed to reduce file sizes.</p>

<p>Perhaps this can be extended to the header, where the first mentioned UUID becomes the default (or the default is explicitly tagged), so that any data lacking but needing a UUID takes the default as its UUID.</p>
</dd>
</dl>
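<p>The matching and mis-matching rules can be sketched as follows. This is an illustration under stated assumptions: INDI records are reduced to hypothetical tag-to-value dictionaries, the names <code>merge_attribute</code> and <code>merge_indi</code> are invented here, and both UUIDs are stored on a mis-matched attribute (the text permits storing just one).</p>

<pre>
```python
def merge_attribute(value_a, uuid_a, value_b, uuid_b):
    """Merge one attribute (say NAME) of two matched INDI records.

    Returns a list of (value, uuid_or_None) pairs. A matching value is
    kept once with no UUID of its own; mis-matched values are both
    kept, each tagged with the UUID of the source which asserted it.
    """
    if value_a == value_b:
        return [(value_a, None)]
    return [(value_a, uuid_a), (value_b, uuid_b)]


def merge_indi(indi_a, uuid_a, indi_b, uuid_b):
    """Merge two tag-to-value dicts, keeping both UUIDs at INDI level."""
    merged = {"uuids": [uuid_a, uuid_b], "attributes": {}}
    for tag in sorted(set(indi_a) | set(indi_b)):
        if tag in indi_a and tag in indi_b:
            merged["attributes"][tag] = merge_attribute(
                indi_a[tag], uuid_a, indi_b[tag], uuid_b)
        elif tag in indi_a:
            # Asserted by source A only (assumption: tag it with A).
            merged["attributes"][tag] = [(indi_a[tag], uuid_a)]
        else:
            merged["attributes"][tag] = [(indi_b[tag], uuid_b)]
    return merged
```
</pre>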

<h1><a class='u' href='#___top' title='click to go to top of document'
name="Exporting_UUIDs"
>Exporting UUIDs</a></h1>

<h2><a class='u' href='#___top' title='click to go to top of document'
name="The_Header"
>The Header</a></h2>

<p>Proposal: The header must list all UUIDs which appear in the data records.</p>

<p>This allows importing code to better prepare for the data, and to perform validation of the data.</p>
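<p>The validation mentioned above might look like the following sketch. It is not a full GEDCOM parser: the function name <code>check_header_uuids</code> is hypothetical, and the sketch assumes that UUIDs appear as flat <code>_UID value</code> lines with a 34-character value, as in the export examples in this document.</p>

<pre>
```python
import re

# Assumed line shape: a level number, the _UID tag, a 34-character value.
UID_LINE = re.compile(r"\s*(\d+)\s+_UID\s+(0x[0-9A-Fa-f]{32})\s*")


def check_header_uuids(lines):
    """Return the set of UUIDs used in records but absent from the header."""
    header_uuids, record_uuids = set(), set()
    in_header = False
    for line in lines:
        fields = line.split()
        if fields[:1] == ["0"]:
            # A level-0 line starts a new record; only HEAD is the header.
            in_header = fields[1:2] == ["HEAD"]
        match = UID_LINE.fullmatch(line)
        if match:
            target = header_uuids if in_header else record_uuids
            target.add(match.group(2))
    return record_uuids - header_uuids
```
</pre>

<p>An empty result means every UUID in the data records is also listed in the header.</p>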

<h2><a class='u' href='#___top' title='click to go to top of document'
name="Exporting_more_than_1_UUID_per_INDI"
>Exporting more than 1 UUID per INDI</a></h2>

<p>If we start with 2 GEDCOM files:</p>

<pre>   0 @I1@ INDI
         1 NAME Alfred E. /Newman/

   0 @I2@ INDI
         1 NAME Alfred Einstein /Newman/</pre>

<p>Then, after they have been combined, how should they be exported? There are 2 obvious possibilities:</p>

<dl>
<dt><a name="o_As_above,_but_with_cross-referencing"
>o As above, but with cross-referencing</a></dt>

<dd>
<pre>   0 @I1@ INDI
         1 NAME Alfred E. /Newman/
           2 ALIA @I2@
           2 _UID 0x1...

   0 @I2@ INDI
         1 NAME Alfred Einstein /Newman/
           2 ALIA @I1@
           2 _UID 0x2...</pre>

<dt><a name="o_As_one_INDI_with_an_AKA_(alias)"
>o As one INDI with an AKA (alias)</a></dt>

<dd>
<pre>   0 @I1@ INDI
         1 NAME Alfred E. /Newman/
           2 _UID 0x1...
         1 NAME Alfred Einstein /Newman/
           2 TYPE aka
           2 _UID 0x2...</pre>
</dd>
</dl>

<p>Proposal: That only the second format be supported.</p>

<p>There is less processing complexity in handling the second format.</p>
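<p>Exporting in the second format can be sketched as below. This is a minimal illustration, assuming each name arrives paired with the UUID of its source; <code>export_indi</code> is a hypothetical name, and the output lines are not indented.</p>

<pre>
```python
def export_indi(xref, names):
    """Render one INDI as GEDCOM lines in the single-record format.

    names is a list of (name, uuid) pairs. The first name is treated
    as the primary one; each later name is tagged as an alias.
    """
    lines = [f"0 @{xref}@ INDI"]
    for position, (name, uid) in enumerate(names):
        lines.append(f"1 NAME {name}")
        if position:
            # Every name after the first is an alias of the primary name.
            lines.append("2 TYPE aka")
        lines.append(f"2 _UID {uid}")
    return lines
```
</pre>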

<h1><a class='u' href='#___top' title='click to go to top of document'
name="References"
>References</a></h1>

<p>General:</p>

<dl>
<dt><a name="o_GEDCOM_Specification"
>o <a href="http://wiki.webtrees.net/File:Ged551-5.pdf" class="podlinkurl"
>GEDCOM Specification</a></a></dt>

<dd>
<dt><a name="o_The_Perl_Genealogy::*_Namespace"
>o <a href="http://savage.net.au/Perl-modules/html/genealogy/rationale.html" class="podlinkurl"
>The Perl Genealogy::* Namespace</a></a></dt>
</dl>

<p>Perl modules:</p>

<dl>
<dt><a name="o_Gedcom"
>o <a href="http://metacpan.org/module/Gedcom" class="podlinkpod"
>Gedcom</a></a></dt>

<dd>
<dt><a name="o_Data::UUID"
>o <a href="http://metacpan.org/module/Data::UUID" class="podlinkpod"
>Data::UUID</a></a></dt>

<dd>
<dt><a name="o_Data::Session::ID::UUID34"
>o <a href="http://metacpan.org/module/Data::Session::ID::UUID34" class="podlinkpod"
>Data::Session::ID::UUID34</a></a></dt>
</dl>

<!-- end doc -->

</body></html>