=pod

=encoding utf8

=head1 NAME

Rosetta::Language -
Design document of the Rosetta D language

=head1 DESCRIPTION

The native command language of a L<Rosetta> DBMS (database management
system) / virtual machine is called B<Rosetta D>; this document,
Rosetta::Language ("Language"), is the human readable authoritative design
document for that language, and for the Rosetta virtual machine in which it
executes.  If there's a conflict between any other document and this one,
then either the other document is in error, or the developers were
negligent in updating it before Language, so you can yell at them.

Rosetta D is intended to qualify as a "D" language as defined by "The Third
Manifesto" (TTM), a formal proposal for a solid foundation for data and
database management systems, written by Christopher J. Date and Hugh
Darwen; see
L<http://www.aw-bc.com/catalog/academic/product/0,1144,0321399420,00.html>
for a publishers link to the book that formally publishes TTM.  See
L<http://www.thethirdmanifesto.com/> for some references to what TTM is,
and also copies of some documents I used in writing Rosetta D.  The initial
main reference I used when creating Rosetta D was the book "Database in
Depth" (2005; L<http://www.oreilly.com/catalog/databaseid/>), written by
Date and published by O'Reilly.

It should be noted that Rosetta D, being quite new, may initially omit
some features that are mandatory for a "D" language, to speed the way to a
usable partial solution, but you can be comforted in knowing that they
will be added as soon as possible.  Also, it contains some features that go
beyond the scope of a "D" language, so Rosetta D is technically a "D plus
extra"; examples of this are constructs for creating the databases
themselves and managing connections to them.  However, Rosetta D should
never directly contradict The Third Manifesto; for example, its relations
never contain duplicates, and it does not allow nulls anywhere, and you can
not specify attributes by ordinal position instead of by name.  That's not
to say you can't emulate all the SQL features over Rosetta D; you can, at
least once it's complete.

Rosetta D also incorporates design aspects and constructs that are taken
from or influenced by Perl 6, pure functional languages like Haskell,
Tutorial D, various TTM implementations, and various SQL dialects and
implementations (see the L<Rosetta::SeeAlso> file).  While most of these
languages or projects aren't specifically related to TTM, none of Rosetta's
adaptations from these are incompatible with TTM.

Note that the Rosetta documentation will be focusing mainly on how Rosetta
itself works, and will not spend much time in providing rationales; you can
read TTM itself and various other external documentation for much of that.

=head1 10,000 MILE VIEW

Rosetta D is a computationally complete (and industrial strength)
high-level programming language with fully integrated database
functionality; you can use it to define, query, and update relational
databases.  It is mainly imperative in style, since at the higher levels,
users provide sequential instructions; but in many respects it is also
functional or declarative, in that many constructs are pure or
deterministic, and the constructs focus on defining what needs to be
accomplished rather than how to accomplish that.

This permits a lot of flexibility on the part of implementers of the
language (usually Rosetta Engine classes) to be adaptive to changing
constraints of their environment and deliver efficient solutions.  This
also makes things a lot easier for users of the language because they can
focus on the meaning of their data rather than worrying about
implementation details, which relieves burdens on their creativity, and
saves them time.  In short, this system improves everyone's lives.

=head2 Environment

The Rosetta DBMS / virtual machine, which by definition is the environment
in which Rosetta D executes, conceptually resembles a hardware PC, having a
command processor (CPU), standard user input and output channel, persistent
read-only memory (ROM), volatile read-write memory (RAM), and read-write
persistent disk or network storage.

Within this analogy, the role of the PC's user, that feeds it through
standard input and accepts its standard output, is fulfilled by the
application that is using the Rosetta DBMS; similarly, the application
itself will activate the virtual machine when wanting to use it (done in
this distribution by instantiating a new Rosetta::Interface::DBMS object),
and deactivate the virtual machine when done (letting that object expire).

When a new virtual machine is activated, the virtual machine has a default
state where the CPU is ready to accept user-input commands to process, and
there is a built-in (to the ROM) set of system-defined data types and
operators which are ready to be used to define or be invoked by said
user-input commands; the RAM starts out effectively empty and the
persistent disk or network storage is ignored.

Following this activation, the virtual machine is mostly idle except when
executing Rosetta D commands that it receives via the standard input (done
in this distribution by invoking methods on the DBMS object).  The virtual
machine effectively handles just one command at a time, and executes each
separately and in the order received; any results or side-effects of each
command provide a context for the next command.

At some point in time, as the result of appropriate commands, data
repositories (either newly created or previously existing) that live in the
persistent disk or network storage will be mounted within the virtual
machine, at which point subsequent commands can read or update them, then
later unmount them when done.  Speaking in the terms of a typical database
access solution like the Perl DBI, this mounting and unmounting of a
repository usually corresponds to connecting to and disconnecting from a
database.  Speaking in the terms of a typical disk file system, this is
mounting or unmounting a logical volume.

Any mounted persistent repository, as well as the temporary repository
which is most of the conceptual PC's RAM, is home to all user-defined data
variables, data types, operators, constraints, packages, and routines; they
collectively are the database that the Rosetta DBMS is managing.  Most
commands against the DBMS would typically involve reading and updating the
data variables, which in typical database terms is performing queries and
data manipulation.  Much less frequently, you would also see changes to
what variables, types, etcetera exist, which in typical terms is data
definition.  Any updates to a persistent repository will usually last
between multiple activations of the virtual machine, while any updates to
the temporary repository are lost when the machine deactivates.

All virtual machine commands are subject to a collection of both
system-defined and user-defined constraints (also known as business rules),
which are always active over the period that they are defined.  The
constraints restrict what state the database can be in, and any commands
which would cause the constraints to be violated will fail; this mechanism
is a large part of what makes the Rosetta DBMS a reliable modeller of
anything in reality, since it only stores values that are reasonable.
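
To make this mechanism concrete, here is a minimal language-neutral sketch
in Python (not Rosetta D; the C<Database> class and the sample age
constraint are hypothetical): an update is applied to a candidate state,
every active constraint is checked against it, and the command fails with
no effect if any constraint would be violated.

```python
class ConstraintViolation(Exception):
    pass

class Database:
    """Toy database: named variables plus a set of always-active constraints."""

    def __init__(self):
        self.state = {}
        self.constraints = []  # list of (name, predicate over a candidate state)

    def add_constraint(self, name, predicate):
        self.constraints.append((name, predicate))

    def update(self, changes):
        """Apply changes atomically; reject any state violating a constraint."""
        candidate = {**self.state, **changes}
        for name, predicate in self.constraints:
            if not predicate(candidate):
                raise ConstraintViolation(name)  # command fails, state unchanged
        self.state = candidate

db = Database()
db.add_constraint("age_is_nonnegative", lambda s: s.get("age", 0) >= 0)
db.update({"age": 42})          # succeeds
try:
    db.update({"age": -1})      # would violate the constraint
except ConstraintViolation:
    pass
assert db.state["age"] == 42    # the failed command left no trace
```

The key point is that the violating command leaves the database exactly as
it was, which is what keeps the stored values reasonable at all times.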

=head2 Command Structure and Processing

Rosetta D commands are structured as arbitrarily complex routines /
operators, either named or anonymous, and they can have (named) parameters,
can contain single (usually) or multiple Rosetta D statements or value
expressions, and can return one or more values.

Rosetta D command routine definitions can either be named and stored in a
persistent repository for reuse like a repository's data types or
variables, or they can be anonymous and furnished by an application at
run-time for temporary use.  A command routine can take the form of either
a function / read-only operator or a procedure / update operator; the
former has a special return value which is the value of the evaluated
function invocation within a value expression; the latter has no such
special return value, and can not be invoked within a value expression.

An application can only ever directly define and invoke an anonymous
command routine, but an anonymous routine can in turn invoke (and define if
it is a procedure) named command routines within the DBMS environment.

Speaking in terms of SQL, a Rosetta D statement or value expression
corresponds to a SQL statement or value expression, a Rosetta D named
command routine corresponds to a SQL named stored procedure or function, a
Rosetta D anonymous command procedure corresponds to a SQL anonymous
subroutine or series of SQL statements, the parameters of a Rosetta D named
routine correspond to the parameters of a SQL named stored procedure or
function, and the parameters of a Rosetta D anonymous routine correspond
to SQL host parameters or bind variables.

A Rosetta D procedure parameter can be read-only or read-write (which
corresponds to SQL's IN or OUT+INOUT parameter types), but a Rosetta D
function parameter can only be read-only (a function may not have any
side-effects).  When invoking a routine, an argument corresponding to a
read-only parameter can be an arbitrarily complex value expression (which
is passed in by value), but an argument corresponding to a read-write
parameter must be a valid target variable (which is passed in by
reference), and that target variable may be updated during the procedure
invocation.  A function always returns its namesake mandatory special
return value using the standard "return" keyword, and it may not update any
global variables.  For a procedure, the only way to pass output directly to
its invoker (meaning, without updating global variables) is to assign that
output to read-write parameters.  Note that "return" can be used in a
procedure too for flow control, but it doesn't pass a value as well.
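
The parameter-passing rules above can be sketched in Python (the routines
here are hypothetical examples; a C<Var> holder stands in for a target
variable passed by reference to a read-write parameter):

```python
class Var:
    """A named target variable; passed by reference to read-write parameters."""
    def __init__(self, value):
        self.value = value

def cube(n):
    """A function: read-only parameter, mandatory special return value."""
    return n ** 3

def fetch_and_clear(source, target):
    """A procedure: both parameters are read-write; output reaches the
    invoker only by assigning to such parameters."""
    target.value = source.value
    source.value = None

x = Var(10)
out = Var(None)
fetch_and_clear(x, out)            # both Var targets are updated in place
assert out.value == 10 and x.value is None
assert cube(3) == 27               # the function's only output is its return value
```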

Orthogonal to the procedure/function and named/anonymous classifications of
Rosetta D routines is the deterministic/nondeterministic classification.  A
routine that is deterministic does not directly reference (for reading or
updating) any global variables nor invoke any nondeterministic routines;
its behaviour is governed solely by its arguments, so given only identical
arguments it has identical behaviour; if the routine is a function, this
means that the return value is always the same for the same arguments.  A
routine that is nondeterministic does directly reference (for reading or
updating) one or more global variable or does invoke a nondeterministic
routine; its behaviour can change, or it can return different results, even
if given all of the same arguments.  Generally speaking, all routines /
operators that are specific to a data type (such as typical comparison and
assignment operators) must be deterministic, while routines / operators
that are not specific to a data type do not need to be.  Most built-in
routines are deterministic.  Note that a deterministic routine can indeed
operate with or on global variables if they are passed to it as arguments.
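
A Python sketch of the distinction (the routines shown are hypothetical
examples, not built-ins):

```python
counter = {"calls": 0}   # a global variable, i.e. part of the environment

def add(x, y):
    """Deterministic: behaviour governed solely by its arguments."""
    return x + y

def next_id():
    """Nondeterministic: reads and updates a global variable, so
    identical invocations can return different results."""
    counter["calls"] += 1
    return counter["calls"]

assert add(2, 3) == add(2, 3) == 5      # same arguments, same result
first, second = next_id(), next_id()
assert (first, second) == (1, 2)        # same (empty) arguments, different results

# A deterministic routine can still operate on a global's value,
# provided that value is passed to it as an argument:
assert add(counter["calls"], 1) == 3
```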

The Rosetta DBMS is designed to allow user-applications to furnish the
definition of an anonymous command routine once and then execute it
multiple times (for efficiency and ease of use); speaking in terms of SQL,
the Rosetta DBMS supports prepared statements.  The arguments for any
routine parameters are provided at execution time, and they are used for
values that are intended to be different for each execution of the command,
as well as to return results that probably differ with each execution; as
an exception to the latter, the application does not have to pre-define an
anonymous function's special return value, which doesn't correspond to a
parameter.  Presumably, any values that will be constant through the life
of a command routine will be coded as literal values in its definition
rather than parameters.

(In this distribution, you furnish an anonymous command routine definition
for reuse using a DBMS object's "prepare" or "compile" method; that method
returns a new Rosetta::Interface::Command object.  You then associate a
Rosetta::Interface::Variable/::Value object with each of the routine's
parameters using the Command object's "bind_param" method, and then invoke
the Command object's "execute" method.  Any Variable/Value objects
corresponding to input parameters need to be set by the application prior
to "execute", and following the "execute", the application can read the
routine's output from the Variable objects associated with output
parameters.  When the Command is a function, "execute" will generate and
return a new Value object with the special return value.)
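
The prepare-once, execute-many pattern can be sketched in Python; note that
the classes below are simplified stand-ins for illustration, not the actual
Rosetta::Interface API:

```python
class Command:
    """Stand-in for a prepared anonymous routine: define once, execute many."""

    def __init__(self, func, param_names):
        self.func = func
        self.params = dict.fromkeys(param_names)   # name -> bound holder

    def bind_param(self, name, holder):
        self.params[name] = holder

    def execute(self):
        # Read the currently bound input holders, run the routine, and
        # return the function's special return value as a fresh value.
        args = {name: holder["value"] for name, holder in self.params.items()}
        return self.func(**args)

# "Prepare" the routine definition once...
doubler = Command(lambda n: n * 2, ["n"])
n_holder = {"value": None}
doubler.bind_param("n", n_holder)

# ...then execute it repeatedly with different argument values.
results = []
for i in (1, 2, 3):
    n_holder["value"] = i
    results.append(doubler.execute())
assert results == [2, 4, 6]
```

The efficiency win is that the routine definition is furnished (and could
be compiled) a single time, while only the bound argument values change
between executions.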

The Rosetta D language has all the standard imperative language keywords,
any of which a Rosetta D routine (both anonymous and named) can contain,
including: conditionals ("if"), loops ("for", "while"), procedure
invocation ("call"), normal routine exit ("return"), plus exception
creation and resolution ("throw", "try", "catch").  For all types of
routines, the "throw" keyword takes a value expression whose resolved value
is the exception to be thrown, and visually looks like "return" does for
functions.  Note that a thrown exception which falls out of an anonymous
procedure will result in an exception thrown out to the application (in
this distribution, it will be as a thrown new Rosetta::Interface::Exception
object).  For our purposes, transaction control statements ("start",
"commit", "rollback") and resource locking statements are also grouped with
these standard keywords.  Note that value assignment of a value
expression's result to a named target is not accomplished with a keyword,
but rather with an update procedure that is defined for the value's data
type, with the target provided to it as a read-write argument.

Value assignment, which updates a target variable to a new value provided
by a Rosetta D expression, is used frequently in Rosetta D, and is the form
taken by all of its major functionality.  If the target variable is an anonymous
procedure's read-write parameter, the statement corresponds to an
unencapsulated SQL "select" query; or, the same task is usually done using
"return" in a function.  If the target variable is an ordinary variable,
and particularly if it is a repository's component data variable, the
statement's effect corresponds to SQL "data manipulation" (usually "insert"
or "update" or "delete").  If the target variable is a repository's special
catalog variable, the statement's effect corresponds to SQL "data
definition" (usually "create" or "alter" or "drop"); this is also how all
named command routines are defined, by such statements in other
usually-anonymous routines.  If the target variable is the DBMS' own
special catalog of repositories, then the effect is to mount or unmount a
repository, which corresponds to SQL client statements like "connect to".

All types of Rosetta D command routines can have assignment statements
which target their own lexical variables, but only procedures (that are not
invoked by operators / functions) are allowed to target global variables,
which are declared in a repository directly, or have read-write parameters.
In other words, a function may not have side-effects, though it can
I<read> from global variables.  Moreover, any procedure that is invoked by
a function is subject to the same restriction against targeting globals,
since it is effectively part of the function.  A few special exceptions may
be made to this restriction on functions, but for the most part, the
restriction is in place to prevent inconsistencies between reads of the
environment/globals from multiple functions that are invoked in the same
Rosetta D expression; all reads in the same expression need to see the same
state, so the expression's result is the same regardless of any
logically-equivalent changes to order of execution of the sub-expressions.
Further to this goal, no target variable may be used more than once in
the same Rosetta D statement; "target" meaning a read-write procedure
parameter's argument, or a directly referenced global variable.

=head2 Named Users and Privileges

The Rosetta DBMS / virtual machine itself does not have its own set of
named users where one must authenticate to use it.  Rather, any concept of
such users is associated with individual persistent repositories, such that
you may have to authenticate in order to mount them within the virtual
machine; moreover, there may be user-specific privileges for that
repository that restrict what users can do in regards to its contents.

The Rosetta privilege system is orthogonal to the standard Rosetta
constraint system, though both have the same effect of conditionally
allowing or barring a command from executing.  The constraint system is
strictly charged with maintaining the logical integrity of the database,
and so only comes into effect when an update of a repository or its
contents is attempted; it usually ignores which users were attempting the
changes.  By contrast, the privilege system is strictly user-centric, and
gates a lot of activities which don't involve any updates or threaten
integrity.

The privilege system mainly controls, per user, what individual repository
contents they are allowed to see / read from, what they are allowed to
update, and what routines they are allowed to execute; it also controls
other aspects of their possible activity.  The concerns here are analogous
to privileges on a computer's file system, or a typical SQL database.

=head2 States, Transactions and Concurrency

This official specification of the Rosetta DBMS includes full ACID
compliance as part of the core feature set; moreover, all types of changes
within a repository are subject to transactions and can be rolled back,
including both data manipulation and schema manipulation; moreover, an
interrupted session with a repository must result in an automatic rollback,
not an automatic commit.

I<It is important to point out that any attempt to implement the Rosetta
DBMS (a Rosetta Engine) which does not include full ACID compliance, with
all aspects described above, is not a true Rosetta DBMS implementation, but
rather is at best a partial implementation, and should be treated with
suspicion concerning reliability.  Of course, such partial implementations
will likely be made and used, such as ones implemented over existing
database products that are themselves not ACID compliant, but you should
see them for what they are and weigh the corruption risks of using them.>

Each individual instance of the Rosetta DBMS is a single process virtual
machine, and conceptually only one thing is happening in it at a time; each
individual Rosetta D statement executes in sequence, following the
completion or failure of its predecessor.  During the life of a statement's
execution, the state of the virtual machine is constant, except for any
updates (and side-effects of such) that the statement makes.  Breaking this
down further, a statement's execution has 2 sequential phases; all reads
from the environment are done in the first phase, and all writes to the
environment are done in the second phase.  Therefore, regardless of the
complexity of the statement, and even if it is a multi-update statement,
the final values of all the expressions to be assigned are determined prior
to any target variables being updated.  Moreover, since no function may
have side-effects, the issue isn't complicated by environment updates
occurring during their invoker statement's first phase.
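
The two-phase rule can be sketched with a hypothetical Python helper; note
how a multi-update statement that swaps two variables works correctly,
because every read of the environment precedes every write:

```python
def execute_multi_update(env, assignments):
    """Two-phase statement execution: evaluate every source expression
    against the unchanged environment (phase 1), then perform every
    assignment (phase 2)."""
    # Phase 1: all reads, against one static view of the environment.
    new_values = [(target, expr(env)) for target, expr in assignments]
    # Phase 2: all writes.
    for target, value in new_values:
        env[target] = value

env = {"a": 1, "b": 2}
# A multi-update statement swapping a and b; naive statement-by-statement
# execution would instead leave both variables equal to 2.
execute_multi_update(env, [
    ("a", lambda e: e["b"]),
    ("b", lambda e: e["a"]),
])
assert env == {"a": 2, "b": 1}
```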

To account for situations where external processes are concurrently using
the same persistent (and externally visible) repository as a Rosetta DBMS
instance, the Rosetta DBMS will maintain a lock on the whole repository (or
appropriate subset thereof) during any active read-only and/or for-update
transaction, to ensure that the transaction sees a consistent environment
during its life.  The lock is a shared lock if the transaction only does
reading, and it is an exclusive lock if the transaction also does writing.
Speaking in terms of SQL, the Rosetta DBMS supports only the serializable
transaction isolation level.

I<Note that there is currently no official support for using Rosetta in a
multi-threaded application, where its structures are shared between
threads, or where multiple thread-specific structures want to use the same
repositories.  But such support is expected in the future.>

No multi-update statement may target both catalog and non-catalog
variables.  If you want to perform the equivalent of SQL's "alter"
statement on a relation variable that already contains data, you must have
separate statements to change the definition of the relation variable and
change what data is in it, possibly more than one of each; the combination
can still be wrapped in an explicit transaction for atomicity.

Transactions can be nested, by starting a new one before concluding a
previous one, and the parent-most transaction has the final say on whether
all of its committed children actually have a final committed effect or
not.  The layering of transactions can involve any combination of explicit
and implicit transactions (the combination should behave intuitively).

The lifetimes of all transactions in Rosetta D (except those declared in
anonymous routines) are bound to specific lexical scopes, such that they
begin when that scope is entered and end when that scope is exited; if the
scope is exited normally, its transaction commits; if the scope terminates
early due to a thrown exception, its transaction rolls back.
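
A Python sketch of scope-bound transaction semantics, using a context
manager as the lexical scope; the snapshot-based rollback here is a
simplification of what a real implementation would do:

```python
class ScopeBoundTransaction:
    """A transaction bound to a lexical scope: it begins when the scope
    is entered, commits if the scope is exited normally, and rolls back
    if the scope terminates early via a thrown exception."""

    def __init__(self, db):
        self.db = db

    def __enter__(self):
        self.snapshot = dict(self.db)       # begin: remember the prior state
        return self.db

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            self.db.clear()
            self.db.update(self.snapshot)   # exception: roll back
        return False                        # normal exit: changes simply persist

db = {"balance": 100}
with ScopeBoundTransaction(db):
    db["balance"] -= 30                     # normal exit: committed
assert db["balance"] == 70

try:
    with ScopeBoundTransaction(db):
        db["balance"] -= 999
        raise RuntimeError("constraint violated")
except RuntimeError:
    pass
assert db["balance"] == 70                  # exception: rolled back
```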

Each Rosetta D named routine as a whole (being a lexical scope), whether
built-in or user-defined, is implicitly atomic, so invoking one will
either succeed or have no side-effect, and the environment will remain
frozen during its execution, save for the routine's own changes.  The
implicit transaction of a function is always read-only, and the implicit
transaction of a procedure is either read-only or for-update depending on
what it wants to do.  Each try-block is also implicitly atomic, committing
if it exits normally or rolling back if it traps an exception.

Every Rosetta D statement (including multi-update statements) is atomic;
all parts of that statement and its child expressions will see the same
static view of the environment; if the statement is an update, either all
parts of that update will succeed and commit, or none of it will
(accompanied by a thrown exception) and no changes are left.

Explicit atomic statement blocks can also be declared within a routine.

Rosetta D also supports the common concept of explicit open-ended
transaction statements that start or end transactions which are not bound
to lexical scopes; however, these statements may only be invoked within
anonymous routines, that an application invokes directly, and not in any
named routines, nor within atomic statement blocks in anonymous routines.

While scope-bound transactions always occur entirely within one invocation
of the DBMS by an application, the open-ended transactions are intended for
transactions which last over multiple DBMS invocations of an application.

All currently mounted repositories (persistent and temporary both) are
joined at the hip with respect to transactions; a commit or rollback is
performed on all of them simultaneously, and a commit either succeeds for
all or fails for all (a repository suddenly becoming inaccessible counts as
a failure).  I<Note that if a Rosetta DBMS implementation can not guarantee
such synchronization between multiple repositories, then it must refuse to
mount more than one repository at a time under the same virtual machine
(users can still employ multiple virtual machines, that are not
synchronized); by doing one of those two actions, a less capable
implementation can still be considered reliable and recommendable.>
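
A sketch of this all-or-nothing joint commit in Python; it is a
simplification that ignores the two-phase commit protocol a real
implementation would need to make the joint commit itself atomic:

```python
class Repository:
    """A mounted repository that may fail at commit time."""

    def __init__(self, name, will_fail=False):
        self.name = name
        self.will_fail = will_fail      # e.g. suddenly becoming inaccessible
        self.committed = False
        self.rolled_back = False

    def commit(self):
        if self.will_fail:
            raise IOError(self.name + " became inaccessible")
        self.committed = True

    def rollback(self):
        self.committed = False
        self.rolled_back = True

def commit_all(repositories):
    """A commit either succeeds for every mounted repository or fails for
    every one; any failure rolls all of them back."""
    try:
        for repo in repositories:
            repo.commit()
        return True
    except IOError:
        for repo in repositories:
            repo.rollback()
        return False

repos = [Repository("persistent"), Repository("temporary", will_fail=True)]
assert commit_all(repos) is False
assert all(r.rolled_back and not r.committed for r in repos)
```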

Certain Rosetta D commands can not be executed within the context of a
parent transaction; in other words, they can only be executed directly by
an anonymous routine, the main examples being those that mount or unmount a
persistent repository; this is because such a change in the environment
mid-transaction would result in an inconsistent state.

I<Rosetta D lets you explicitly place locks on resources that you don't
want external processes to change out from under you, and these locks do
not automatically expire when transactions end; or maybe they do; this
feature has to be thought out more.>

=head2 Data Types

I<TODO.>

=head2 Grammar and Name Spaces

I<TODO.>

=head1 OLDER DOCUMENTATION TO REWRITE/REMOVE: 10,000 MILE VIEW OF ROSETTA D

=head2 Operational Context

Rosetta D is designed for a specific virtual environment that is
implemented by a B<DBMS> (database management system).  This environment is
home to zero or more data repositories, each of which users may create,
have a dialog with (over a B<connection>), and delete; the components of
the dialog, including queries and updates of the database, are the scope of
the "D proper" language, and the other actions framing the dialog are the
"D plus extra".

From an application's point of view, a DBMS is a library that provides
services for storing data "some where" (which may be in memory, or the file
system, or a network service, depending on implementation), like using
files but more abstract and flexible; its API provides functions or methods
for reading data from and writing data to the store.  This API takes richly
structured commands which are written in Rosetta D, either AST (abstract
syntax tree) form or string form.  Considering the distribution that
contains the Language document you are reading now, L<Rosetta> is the main
API that uses Rosetta D, and L<Rosetta::Model> provides the AST
representation of Rosetta D.

A B<database> is a fully self-contained and fully addressable entity.  Fully
self-contained means that nothing in the database depends on anything that
is external to the database (such as in type or constraint definitions),
save the DBMS implementing that database.  Fully addressable means that the
database I<is> what an application opens a "data source" connection to, and
its address can include such things as a file name or network server
location or abstract DSN, depending on the implementation.

A database is a usually-persistent container for B<relvars> (relation
variables), in which all kinds of data are stored, and it provides
relational operators for querying, updating, creating, and deleting those
relvars.  A database also stores user-defined data types and operators for
working with them, and relvars can be defined in terms of those
user-defined types (as well as built-in types).  A database also defines
various kinds of logical constraints that must be satisfied at all times,
some system defined and some user defined, which complete the picture such
that relvars are capable on their own of modelling anything in the real
world.  A database also defines users that are authorized to access it,
mediated by the DBMS.

=head2 Grammar and Name Spaces

Rosetta D is a low-sugar language, such that its string form, which will be
used for illustrative purposes in this documentation, has a very simple
grammar and visually corresponds one-to-one with its abstract syntax tree.

All Rosetta D B<types> and B<operators> are grouped into a hierarchical
name space for simplicity and ease of use; the fully-qualified name of any
type and operator includes its namespace hierarchy, with the highest level
namespace appearing first (left to right).  You are recommended to use the
fully-qualified names at all times (eg: C<root.branch.leaf>), although you
may also use partially qualified (eg: C<branch.leaf>) or unqualified
versions (eg: C<leaf>) if they are unambiguous.  For that matter, all
standard relvars and constraints are likewise in that namespace hierarchy.

All Rosetta D entity identifiers, built-in and user-defined, including
names of types, operators, relvars, constraints, and users, are all
case-sensitive and may contain characters from the entire Unicode 4+
character repertoire (even whitespace), in each level of their namespace.
But fully or partially qualified identifiers always use the radix point
character (.) to delimit each level.  Each namespace level may be written
either with or without double-quote (") delimiters if said name contains
only non-punctuation, non-whitespace characters; if it does contain
punctuation or whitespace, then it must always appear in delimited format.
built-in entities only use characters that don't require delimiters (the
upper and lowercase letters A-Z, and the underscore, and sometimes are
partially composed of the digits 0-9), and your code will be simpler if you
do likewise.  All built-in type and possrep names conform to best practices
for Perl package names (eg: C<CharStr>), and all built-in names for
operators, constraints, relvars, and users, conform to best practices for
Perl routine and variable names (eg: C<the_x>), and certain pre-defined
constant value names conform to best practices (eg: C<TRUE>).  No built-in
operators have symbols like "+" or "=" as names, but rather use letters,
"add" and "eq" in this case.
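
A Python sketch of the delimiting rule (conservatively treating only ASCII
letters, digits, and the underscore as delimiter-free, though the actual
rule admits any non-punctuation, non-whitespace Unicode characters):

```python
import re

def format_identifier(levels):
    """Render a qualified identifier: levels joined by the radix point,
    each level double-quote delimited only when required by its
    characters."""
    def format_level(name):
        if re.fullmatch(r"[A-Za-z0-9_]+", name):   # no delimiters needed
            return name
        return '"' + name + '"'                    # punctuation/whitespace
    return ".".join(format_level(level) for level in levels)

assert format_identifier(["root", "branch", "leaf"]) == "root.branch.leaf"
assert format_identifier(["local", "my table"]) == 'local."my table"'
```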

All Rosetta D expressions are formatted in prefix notation, where the
operator appears before (left to right) all of its arguments, and the
argument list is delimited by parentheses and separated by commas (eg: C<<
<op>( <arg>, <arg> ) >>).  This is like most programming languages but
unlike most logical or mathematical expressions, which use infix notation
(eg: C<< <arg> <op> <arg> >>).

In addition, all arguments are named (formatted in name/value pairs),
rather than positional, so they can be passed in any order (eg: C<< <op>(
<arg-name> => <arg-val> ) >>), and so the expressions are more
self-documenting about what the arguments mean (eg: source vs target).  As
an extension to this, if an operator takes a variable number of arguments
that are all being used for the same purpose (eg: a list of numbers to add,
or a list of relations to join), then those are collected into a single
named argument whose value is a parenthesized and comma-delimited but
unordered list (eg: C<< <op>( <arg-name> => (<arg-val>, <arg-val>) ) >>).
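
A Python sketch that renders expressions in this notation (the operator
and argument names shown are hypothetical illustrations):

```python
def render(op, **args):
    """Render an expression in prefix notation: the operator first, then
    a parenthesized list of named arguments; a list-valued argument
    becomes a parenthesized, comma-delimited collection."""
    def render_value(v):
        if isinstance(v, (list, tuple)):
            return "(" + ", ".join(render_value(x) for x in v) + ")"
        return str(v)
    body = ", ".join(f"{name} => {render_value(v)}" for name, v in args.items())
    return f"{op}( {body} )"

# A variable number of same-purpose arguments collected into one named list:
assert render("add", addends=[3, 4, 5]) == "add( addends => (3, 4, 5) )"
# Distinct-purpose arguments as self-documenting name/value pairs:
assert render("eq", lhs=1, rhs=2) == "eq( lhs => 1, rhs => 2 )"
```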

The root level of a database's name hierarchy contains these 4 name-spaces,
which are searched in top-down order when trying to resolve unqualified
entity identifiers:

=over

=item C<system>

All system-defined entities go here, including built-in data types and
operators, and the catalog relvars that allow user introspection of the
database using just relational operators (analogous to SQL's "information
schema", but the provided meta-data is always fully decomposed), and
constraints on the above.

The standard way to create, alter, or drop user-defined entities is to
update the catalog relvars concerning them (although some short-hand
"create", etc, operators are provided to simplify those tasks).  It is as
if the user-defined entities are views defined in terms of the catalog
relvars, and so explicitly changing the latter results in implicitly
changing the former.

For uniformity, the system-defined entities are also listed in the catalog
relvars (or, for the types, at least their interfaces are), but constraints
on the catalog relvars forbid users from updating or removing the
built-ins, or adding entities that say they are built-ins.

=item C<local>

All persistent user-defined entities go here, including real and virtual
relvars, types, operators, and constraints.  This is the "normal" or "home"
namespace for users.  All entities here may only be defined in terms of
either C<system> entities or other entities here.

Typically, the next name-space level down under C<local> would be
functionally similar to a list of schemata, which larger SQL databases
typically provide so that each of a database's users can have a separate
place for the types, relvars, etc, that they create.  To best represent
various existing DBMSs, which have anywhere from zero to 2 or 3 such name
spaces, Rosetta D allows you to have an arbitrary number of such
intermediate name-space levels, or none at all.  Unless you actually need
these intermediate levels, it is highly recommended that you don't use
them, to reduce complexity.  But as mentioned earlier, unless the database
has more than one entity with the same unqualified or semi-qualified name,
you can just use those shorter names everywhere, which results in the
optional hierarchy being abstracted away.

=item C<temporary>

All user-defined entities go here whose fates are tied to open connections;
each connection to a database has its own private instance of this
name-space, and its contents disappear when the connection is closed.
These entities can be all of the same kinds as those that go in C<local>.
They can be defined in terms of C<local> entities, but the reverse isn't
true.  Generally, C<temporary> is the name-space for entities that are
specific to the current application, but that it makes sense to keep within
the Rosetta D virtual environment for efficiency.

=item C<remote>

If the current DBMS has support for federating access to external
databases, effectively by "mounting" their contents within the current
database as an extension to it, so users with a connection to the current
database can access those other databases through the same connection, then
those contents appear under C<remote>.

Whether this counts as the current DBMS acting as a proxy for the external
databases is a matter of perspective.

In terms of a hypothetical federated DBMS that lets you use a single
"connection" to access multiple remote databases at once, such as for a
database cloning utility or a multiplexer, I<all> of the interesting
contents would be C<remote>, and the C<local> name space would be empty.

Typically, the next name-space level down under C<remote> will contain a
single name per distinct mounted external database, and then below each of
those may be that database's C<local> items, or, alternatively and more
likely, literal C<system>, C<local>, etc name-spaces like our own root.

I<This feature is more experimental and has yet to be fleshed out.>

=back

Types and relvars would then have their unqualified names sitting just
below the above name spaces, per root space; so, for example, we would have
fully qualified names like C<system.CharStr> or C<local.suppliers>; simple.

However, operators have mandatory "package" name-spaces under which their
otherwise unqualified names would go, and these are usually identical to
the data type name that they are primarily associated with.  So, for
example, we would have fully qualified names like C<system.NumInt.add> or
C<system.CharStr.substr> or C<system.Relation.join>.  Note that type
selector operators and such would be named in exactly the same way.

Constraints on data types, which are specifically part of the definitions
of those types, have their names package-qualified like operators, while
constraints on relvars are not package-qualified.
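
This naming scheme can be sketched mechanically; the following Python
fragment is purely illustrative (the catalog layout and function name are
invented, not part of any Rosetta API), showing how an unqualified
identifier could be resolved by searching the 4 root name-spaces in their
fixed top-down order:

```python
# Hypothetical sketch: resolve an unqualified entity name by searching
# the 4 root name-spaces in their fixed top-down order.
SEARCH_ORDER = ("system", "local", "temporary", "remote")

def resolve(catalog, name):
    """Return the fully qualified form of an unqualified identifier."""
    for root in SEARCH_ORDER:
        if name in catalog.get(root, {}):
            return root + "." + name
    raise LookupError("no entity named " + repr(name))

# A toy catalog, loosely following the examples in the text.
catalog = {
    "system": {"CharStr": "type", "NumInt": "type"},
    "local": {"suppliers": "relvar"},
}

print(resolve(catalog, "suppliers"))  # local.suppliers
print(resolve(catalog, "CharStr"))    # system.CharStr; system is searched first
```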

=head2 Overview of Data Types

Rosetta D is a strongly typed language, where every value and variable is
of a specific B<data type>, and every operator and expression is defined in
terms of specific data types.  A variable can only store a value that is
of its type, and every operator can only take argument values or
expressions whose types match those of its parameters.

Values can be converted from one data type to another (such as when
comparing two values for equality) only by using operators explicitly
defined for that purpose (this includes selectors, which typically convert
from character strings to something else); value type conversions can not
happen implicitly.  The sole exception is when one of the two involved
types is defined as a constraint-restricted sub-type of the other, or when
both are similarly restricted from a common third type.
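
A minimal host-language analogy of this rule (in Python, with invented
names; Rosetta D itself defines no such API) is that conversion always
goes through an explicitly named selector operator, while a
constraint-restricted sub-type's values are usable directly wherever the
parent type is expected:

```python
# Hypothetical sketch of explicit conversion plus the sub-type exception.
class NumInt(int):
    """Stand-in for a system-defined integer type."""

class SmallInt(NumInt):
    """A constraint-restricted sub-type of NumInt: values in 1..100."""
    def __new__(cls, v):
        if not 1 <= int(v) <= 100:
            raise ValueError("SmallInt must be in 1..100")
        return super().__new__(cls, v)

def numint_from_text(s, radix=10):
    """An explicit selector: character string -> NumInt; the caller must
    choose the radix, since no conversion happens implicitly."""
    return NumInt(int(s, radix))

n = numint_from_text("2a", radix=16)  # explicit conversion yields 42
small = SmallInt(42)                  # usable anywhere a NumInt is expected
```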

All data types in Rosetta D fit into 3 main categories, which are B<scalar
types>, B<tuple types>, and B<relation types>.  For our purposes, every
data type that is not a tuple type or relation type is a scalar type.

A scalar type is a named set of scalar values; its sub-types mainly include
booleans, numerics, character strings, bit strings, temporals, spatials,
and any custom / user-defined data types.

A custom data type can be defined as a sub-type of a system-defined or
user-defined scalar type with extra named constraints; for example, to
restrict its set of scalar values to a sub-set of its parent type's set of
scalar values (eg: restrict from an integer to an integer that has to be in
the range 1 to 100).

Alternately, a custom data type can be defined to have one or more named
B<possreps> (possible representations), each of which differs from the
others in appearance but is identical in meaning; every possible value
of that type should be renderable in each of its possible representations.
For example, we could represent a point in space using either cartesian
coordinates or polar coordinates.  Each possrep is defined in terms of a
list of components, where each component has a name and a type, and that
type is some other system-defined or user-defined type.  Such a custom data
type can also have named constraints as part of its definition (eg: the
point may not be more than a certain distance from the origin).
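
The point example can be sketched as follows (in Python, as an
illustration only; the component and constraint names are invented):

```python
import math

# Hypothetical sketch of a type with two possreps (cartesian and polar);
# every value is renderable in either representation, and a named
# constraint limits the distance from the origin.
MAX_RADIUS = 10.0  # the illustrative named constraint

class Point:
    def __init__(self, x, y):          # cartesian possrep components
        if math.hypot(x, y) > MAX_RADIUS:
            raise ValueError("point too far from origin")
        self.x, self.y = x, y

    @classmethod
    def from_polar(cls, rho, theta):   # selector via the polar possrep
        return cls(rho * math.cos(theta), rho * math.sin(theta))

    def to_polar(self):                # render the value in polar form
        return (math.hypot(self.x, self.y), math.atan2(self.y, self.x))

p = Point.from_polar(5.0, math.pi / 2)  # one value, two representations
rho, theta = p.to_polar()
```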

You can not declare named custom tuple types or relation types, as you can
with scalar types, but rather all values and variables of these types carry
with them a definition provided by a tuple or relation generator operator.

A tuple value or variable consists of an unordered set of zero or more
attributes, where each attribute consists of a name (that is distinct
within the tuple) and a type (that type is some other system-defined or
user-defined type); it also has exactly one value per attribute which is of
the same type as the attribute.

A relation value or variable consists of an unordered set of zero or more
attributes, where each attribute consists of a name (that is distinct
within the relation) and a type (that type is some other system-defined or
user-defined type); it also has an unordered set of zero or more tuples
whose set of attributes are all identical to those of the relation, and
where every tuple value is distinct within the relation.
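
These definitions can be modelled directly in a host language; this Python
sketch (an invented helper, not a Rosetta API) represents a relation as a
heading plus a set of conforming tuples, so duplicate tuples collapse
automatically:

```python
# Hypothetical sketch: a relation value as an unordered heading of
# name/type pairs plus an unordered set of distinct conforming tuples.
def make_relation(heading, tuples):
    body = set()
    for t in tuples:
        if set(t) != set(heading):
            raise ValueError("tuple attributes must match the heading")
        for name, value in t.items():
            if not isinstance(value, heading[name]):
                raise TypeError("attribute %r has the wrong type" % name)
        body.add(frozenset(t.items()))  # a set, so duplicates collapse
    return frozenset(body)

heading = {"sno": str, "status": int}   # illustrative attributes
r = make_relation(heading, [
    {"sno": "S1", "status": 20},
    {"sno": "S2", "status": 10},
    {"sno": "S1", "status": 20},        # duplicate tuple; collapses
])
print(len(r))  # 2
```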

=head1 OLDER DOCUMENTATION TO REWRITE/REMOVE: DATA TYPES AND VALUES

Generally speaking, any two data types are considered to be mutually
exclusive, such that a data value can only be in the domain of one of the
types, and not the other.  (The exception is if type A is declared to be a
restriction of type B, or both are restrictions of type C.)

If you want to compare two values that have different data types, you must
explicitly cast one or both values into the same data type.  Likewise, if
you want to use a value of one type in an operation that requires a value
of a different type, such as when assigning to a container, the value must
be cast into the needed type.  The details of casting depend on the two
types involved, but often you must choose from several possible methods of
casting; for example, when casting between a numeric and a character
string, you must choose what numeric radix to use.  However, no casting
choice is necessary if the data type of the value in hand is a restriction
of the needed data type.

IRL gains rigor from this requirement for strong typing and explicit
casting methods because you have to be very explicit as to what behaviour
is expected; as a result, there should be no ambiguity in the system and
the depot manager should perform exactly as you intended. This reduces
troublesome subtle bugs in your programs, making development faster, and
making your programs more reliable.  Your data integrity is greatly
improved, with certain causes of corruption removed, which is an important
goal of any data management system, and supports the ideals of the
relational data model.

IRL gains simplicity from this same requirement, because your depot-centric
routines can neatly avoid the combinatorial complexity that comes from
being given a range of data types as values when you conceptually just want
one type, and your code doesn't have to deal with all the possible cases.
The simpler routines are easier for developers to write, as they don't have
to worry about several classes of error detection and handling (due to
improper data formats), and the routines would also execute faster since
they do less actual work.  Any work needed to move less strict data from
the outside into the depot manager environment is handled by the depot
manager itself and/or your external application components (the latter is
where any user interaction takes place), so that work is unnecessary once
the data is inside the depot manager environment.

=head2 Classifications

IRL has 2 main classes of data types, which are opaque data types and
transparent data types.

An opaque data type is like a black box whose internal representation is
completely unknown to the user (and is determined by the depot manager),
though its external interface and behaviour are clearly defined.  Or, an
opaque data type is like an object in a typical programming language whose
attributes are all private.  Conceptually speaking, all opaque data values
are atomic and no sub-components are externally addressable for reading and
changing, although the data type can provide its own specific methods or
operators to extract or modify sub-components of an opaque data value.  An
example is extracting a sub-string of a character string to produce a new
character string, or extracting a calendar-specific month-day from a
temporal type.

A transparent data type is like a partitioned open box, such that each
partition is a visibly distinct container or value that can be directly
addressed for reading or writing.  Or, a transparent data type is like an
object in a typical programming language whose attributes are all public.
Conceptually speaking, all transparent data types are named collections of
zero or more other data types, as if the transparent data value or
container was an extra level of namespace.  Accessing these sub-component
partitions individually is unambiguous and can be done without an accessor
method.  An example is a single element in an array, or a single member of
a set, or a single field in a tuple, or a single tuple in a relation.

Opaque data types are further classified into unrestricted opaque data
types and restricted opaque data types.  An unrestricted opaque type has
the full natural domain of possible values, and that domain is infinite in
size for most of them; eg, the unrestricted numerical type can accommodate
any number from negative to positive infinity, though the unrestricted
boolean type still only has 2 values, false and true.  A restricted opaque
type is defined as a sub-type of another opaque type (restricted or not)
which excludes part of the parent type's domain; eg a new type of numerical
type can be defined that can only represent integers between 1 and 100.  A
trivial case of a restricted type is one declared to be identical in range
to the parent type, such as if it simply served as an alias; that is also
how you always declare a boolean type.  A restricted type can implicitly be
used as input to all operations that its parent type could be, though it
can only be used conditionally as output.

Note that, for considerations of practicality, as computers are not
infinite, IRL requires you to explicitly declare a container (but not a
value) to be of a restricted opaque type, having a defined finite range in
its domain, though that domain can still be very large.  This allows depot
manager implementations to know whether or not they need to do something
very inefficient in order to store extremely large possible values (such as
implement a numeric using a LOB), or whether a more efficient but more
limited solution will work (using an implementation-native numeric type);
stating your intentions by defining a finite range helps everything work
better.

Transparent data types are further classified into collective and
disjunctive transparent data types.  A collective transparent data type is
what you normally think of with transparent types, and includes arrays,
sets, relations, and tuples; each one can contain zero or more distinct
sub-values at once.  A disjunctive transparent data type is the means that
IRL provides to simulate both weak data types and normal-but-nullable data
types.  It looks like a tuple where only one field is allowed to contain a
non-empty value at once, and it has a distinct field for each possible
strong data type that the weak type can encompass (one being of the null
type when simulating nullability); it actually has one more field than
that, always valued, which says which of the other fields contains the
important value.
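
A disjunctive value of this kind is essentially a tagged union; a minimal
Python sketch (with field names invented for illustration) looks like:

```python
# Hypothetical sketch of a disjunctive transparent type: one field per
# possible strong type, plus an always-valued discriminator field that
# names which other field holds the important value.
def make_weak_value(kind, value):
    fields = {"numeric": None, "char_str": None, "null": None}
    if kind not in fields:
        raise ValueError("unknown field " + repr(kind))
    fields[kind] = value
    fields["which"] = kind              # the extra, always-valued field
    return fields

v = make_weak_value("numeric", 42)      # a weakly typed number
w = make_weak_value("null", None)       # simulating nullability
```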

=head1 OLDER DOCUMENTATION TO REWRITE/REMOVE: DATA TYPES OVERVIEW

IRL is strongly typed, following the relational model's ideals of stored
data integrity, and the actual practice of SQL and many database products,
and Rosetta's own ideals of being rigorously defined.  However, its
native set of data types also includes ones that have the range of typical
weak types such as some database products and languages like Perl use.

A data type is a set of representable values.  All data types are based on
the concept of domains; any variable or literal that is of a particular
data type may only hold a value that is part of the domain that defines the
data type.  IRL has some native data types that it implicitly understands
(eg, booleans, integers, rational numbers, character strings, bit strings,
arrays, rows, tables), and you can define custom ones too that are based on
these (eg, counting numbers, character strings that are limited to 10
characters in length, rows having 3 specific fields).

All Rosetta::Model "domain" Nodes (and schema objects) are user defined,
having a name that you pick, regardless of whether the domain corresponds
directly to a native data type, or to one you customized; this is so
there won't be any name conflicts regardless of any same named data types
that a particular database implementation used in conjunction with
Rosetta::Model may have.

=head2 Generalities

It is the general case that every data type defines a domain of values that
is mutually exclusive from every other data type; 2 artifacts having a
common data type (eg, 2 character strings) can always be compared for
equality or inequality, and 2 artifacts of different data types (eg, 1
character string and 1 bit string) can not be compared and hence are always
considered unequal.  Following this, it is mandatory that every native
and custom data type define the 'eq' (equal) and 'ne' (not equal) operators
for comparing 2 artifacts that are of that same data type.  Moreover, it is
mandatory that no data type defines for itself any 'eq' or 'ne'
operators for comparing 2 artifacts of different data types.

In order to compare 2 artifacts of different data types for equality or
inequality, either one must be cast into the other's data type, or they
must both be cast into a common third data type.  How exactly this is done
depends on the situation at hand.

The simplest casting scenario is when there is a common domain that both
artifacts belong to, such as happens when either one artifact's data type
is a sub-domain of the other (eg, an integer and a rational number), or the
data types of both are sub-domains of a common third data type (eg, even
numbers and square whole numbers).  Then both artifacts are cast as the
common parent type (eg, rationals and integers respectively).

A more difficult but still common casting scenario is when the data types
of two artifacts do not have a common actual domain, but yet there is one
or more commonly known or explicitly defined way of mapping members of one
type's domain to members of the other type's domain.  Then both artifacts
can be cast according to one of the candidate mappings.  A common example
of this is numbers and character strings, since numbers are often expressed
as characters, such as when they come from user input or will be displayed
to the user; sometimes characters are expressed as numbers too, as an
encoding.  One reason the number/character scenario is more difficult is
that there are multiple ways to express numbers in character strings, such
as octal vs decimal vs hexadecimal, so you have to explicitly choose among
multiple casting methods or formats for the version you want; in other
words, multiple members of one domain map to the same member of the other
domain, so a cast method can not be picked simply from the data types of
the operands.

A different casting scenario occurs when one or both of the data types are
composite types, such as 2 tuples that are either of different degrees or
that have different attribute names or value types.  Dealing with these
involves mapping all the attributes of each tuple against the other, with
or without casting of the individual attributes, possibly into a third data
type having attributes to represent all of those from the two.

Most data types support the extraction of part of an artifact to form a new
artifact, which is either of the same data type or a different one.  In
some cases, even if 2 artifacts can't be compared as wholes, it is possible
to compare an extract from one with the other, or extractions from both
with each other.  Commonly this is done with composite data types like
tuples, where some attributes are extracted for comparison, such as when
joining the tuples, or filtering a tuple from a relation.

Aside from the 'eq' and 'ne' comparison operators, there are no other
mandatory operators that must be defined for a given custom data type,
though the native ones will include others.  However, it is strongly
recommended that each data type implement the 'cmp' (comparison) operator
so that linearly sorting 2 artifacts of that common data type is a
deterministic activity.

IRL requires that all data types are actually self-contained, regardless of
their complexity or size.  So nothing analogous to a "reference" or
"pointer" in the Perl or C or SQL:2003 sense may be stored; the only valid
way to say that two artifacts are related is for them to be equal, or have
attributes that are equal, or be stored in common or adjoining locations.

=head2 Native Null Type

IRL natively supports the special NULL data type, whose value domain is by
definition mutually exclusive of the domains of all other data types; in
practice, a NULL is distinct from all possible values that the other IRL
native primitive types can have.  But some native complex types and user
customized types could be defined where their domains are a super-set of
NULL; those latter types are described as "nullable", while types whose
domains are not a super-set of NULL are described as "not nullable".

The NULL data type represents situations where a value of an arbitrary data
type is desired but none is yet known; it sits in place of the absent value
to indicate that fact.  NULL artifacts will always explicitly compare as
being unequal to each other; since they all represent unknowns, we can not
logically say any are equal, so they are all treated as distinct.  This
data type corresponds to SQL's concept of NULL, and is similar to Perl's
concept of "undef".  A NULL does not natively cast between any data types.

Rosetta::Model does not allow you to declare "domain" Nodes that are simply
of or based on the data type NULL; rather, to use NULL you must declare
"domain" Nodes that are either based on a not-nullable data type unioned
with the NULL type, or are based on a nullable data type.  The "domain"
Node type provides a short-hand to indicate the union of its base type with
NULL, in the form of the boolean "is_nullable" attribute; if the attribute
is undefined, then the nullability status of the base data type is
inherited; if it is defined, then it overrides the parent's status.

All not-nullable native data types default to their concept of empty or
nothingness, such as zero or the empty string.  All nullable native
types, and all not-nullable native types that you customize with a true
is_nullable, will default to NULL.  In either case, you can define an
explicit default value for your custom data type, which will override those
behaviours; details are given further below.

=head2 Native Primitive Types

These are the simplest data types, from which all others are derived:

=over

=item C<BOOLEAN>

This data type is a single logical truth value, and can only be FALSE or
TRUE.  Its concept of nothingness is FALSE.

=item C<NUMERIC>

This data type is a single rational number.  Its concept of nothingness is
zero.  A subtype of NUMERIC must specify the radix-agnostic "num_precision"
and "num_scale" attributes, which determine the maximum valid range of the
subtype's values, and from which the subtype's storage representation can
often be derived too.

The "num_precision" attribute is an integer >= 1; it specifies the maximum
number of significant values that the subtype can represent.  The
"num_scale" attribute is an integer >= 0 and <= "num_precision"; if it is
>= 1, the subtype is a fixed radix point rational number, such that 1 /
"num_scale" defines the increment size between adjacent possible values;
the trivial case of "num_scale" = 1 means the increment size is 1, and the
number is an integer; if "num_scale" = 0, the subtype is a floating radix
point rational number where "num_precision" represents the product of the
maximum number of significant values that the subtype's mantissa and
exponent can represent.  IRL does not currently specify how much
of a floating point number's "num_precision" is for the mantissa and how
much for the exponent, but commonly the exponent takes a quarter.

The meanings of "precision" and "scale" are more generic for IRL than they
are in the SQL:2003 standard; in SQL, "precision" (P) means the maximum
number of significant figures, and the "scale" (S) says how many of those
are on the right side of the radix point.  Translating from base-R (eg, R
being 10 or 2) to the IRL meanings is as follows (assuming negative
numbers are allowed and zero is always in the middle of a range). For
fixed-point numerics, a (P,S) becomes (2*R^P,R^S), meaning an integer (P,0)
becomes (2*R^P,1).  For floating-point numerics, a (P) sort-of becomes
(2*R^P,0); I say sort-of because SQL:2003 says that the P shows significant
figures in just the mantissa, but IRL currently says that the size of the
exponent eats away from that, commonly a quarter.

As examples, a base-10 fixed in SQL defined as [p=10,s=0] (an integer in
-10^10..10^10-1) becomes [p=20_000_000_000,s=1] in IRL; the base-10
[p=5,s=2] (a fixed in -1_000.00..999.99) becomes [p=200_000,s=100]; the
base-2 [p=15,s=0] (a 16-bit int in -32_768..32_767) becomes [p=65_536,s=1];
the base-2 float defined as [p=31] (a 32-bit float in
+/-8_388_608*2^+/-128) becomes [p=4_294_967_296,s=0].
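
Since these translations are plain arithmetic, they are easy to check;
this Python sketch (with helper names invented for illustration)
recomputes all four examples above:

```python
# Recomputing the SQL -> IRL precision/scale translations shown above.
def fixed_to_irl(radix, p, s):
    """A base-R fixed-point (P,S) becomes (2*R^P, R^S) in IRL terms."""
    return (2 * radix ** p, radix ** s)

def float_to_irl(radix, p):
    """A base-R float (P) sort-of becomes (2*R^P, 0) in IRL terms."""
    return (2 * radix ** p, 0)

assert fixed_to_irl(10, 10, 0) == (20_000_000_000, 1)  # base-10 integer
assert fixed_to_irl(10, 5, 2) == (200_000, 100)        # base-10 fixed
assert fixed_to_irl(2, 15, 0) == (65_536, 1)           # 16-bit integer
assert float_to_irl(2, 31) == (4_294_967_296, 0)       # 32-bit float
```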

A subtype of NUMERIC may specify the "num_min_value" and/or "num_max_value"
attributes, which further reduces the subtype's valid range.  For example,
a minimum of 1 and maximum of 10 specifies that only numbers in the range
1..10 (inclusive) are allowed.  Simply setting the minimum to zero and
leaving the maximum unset is the recommended way in IRL to specify that you
want to allow any non-negative number.  Setting the minimum >= 0 also
causes the maximum value range allowable by "num_precision" to shift into
the positive, rather than it being half there and half in the negative.
Eg, a (P,S) of (256,1) becomes 0..255 when the minimum = 0, whereas it
would be -128..127 if the min/max are unset.

=item C<CHAR_STR>

This data type is a string of characters.  Its concept of nothingness is
the empty string.  A subtype of CHAR_STR must specify the "char_max_length"
and "char_repertoire" attributes, which determine the maximum valid range
of the subtype's values, and from which the subtype's storage
representation can often be derived too.

The "char_max_length" attribute is an integer >= 0; it specifies the
maximum length of the string in characters (eg, a 100 means a string of
0..100 characters can be stored).  The "char_repertoire" enumerated
attribute specifies what individual characters there are to choose from
(eg, Unicode 4.1, Ascii 7-bit, Ansel; Unicode is the recommended choice).

A subtype of CHAR_STR may specify the "char_min_length" attribute, which
means the length of the character string must be at least that long (eg, to
say strings of length 6..10 are required, set min to 6 and max to 10).

=item C<BIT_STR>

This data type is a string of bits.  Its concept of nothingness is the
empty string.  A subtype of BIT_STR must specify the "bit_max_length"
attribute, which determines the maximum valid range of the subtype's
values, and the subtype's storage representation can often be derived from
it too.

The "bit_max_length" attribute is an integer >= 0; it specifies the maximum
length of the string in bits (eg, an 8000 means a string of 0..8000 bits
can be stored).

A subtype of BIT_STR may specify the "bit_min_length" attribute, which
means the length of the bit string must be at least that long (eg, to say
strings of length 24..32 are required, set min to 24 and max to 32).

=back

A subtype of any of these native primitive types can define a default value
for the subtype, it can define whether the subtype is nullable or not (they
are all not-nullable by default), and it can enumerate an explicit list of
allowed values (eg, [4, 8, 15, 16, 23, 42], or ['foo', 'bar', 'baz'], or
[B'1100', B'1001']), one each in a child Node (these must fall within the
specified range/size limits otherwise defined for the subtype).

=head2 Native Scalar Type

IRL has native support for a special SCALAR data type, which is akin to
SQLite's weakly typed table columns, or to Perl's weakly typed default
scalar variables.  This data type is a union of the domains of the BOOLEAN,
NUMERIC, CHAR_STR, and BIT_STR data types; it is not-nullable by default.
Its concept of nothingness is the empty string.

=head1 SEE ALSO

Go to L<Rosetta> for the majority of distribution-internal references, and
L<Rosetta::SeeAlso> for the majority of distribution-external references.

=head1 AUTHOR

Darren R. Duncan (C<perl@DarrenDuncan.net>)

=head1 LICENCE AND COPYRIGHT

This file is part of the Rosetta DBMS framework.

Rosetta is Copyright (c) 2002-2006, Darren R. Duncan.

See the LICENCE AND COPYRIGHT of L<Rosetta> for details.

=head1 ACKNOWLEDGEMENTS

The ACKNOWLEDGEMENTS in L<Rosetta> apply to this file too.

=cut