Darren Duncan > Rosetta > Rosetta::Language

Download:
Rosetta-v0.724.0.tar.gz

NAME

Rosetta::Language - Design document of the Rosetta D language

DESCRIPTION

The native command language of a Rosetta DBMS (database management system) / virtual machine is called Rosetta D; this document, Rosetta::Language ("Language"), is the human readable authoritative design document for that language, and for the Rosetta virtual machine in which it executes. If there's a conflict between any other document and this one, then either the other document is in error, or the developers were negligent in updating it before Language, so you can yell at them.

Rosetta D is intended to qualify as a "D" language as defined by "The Third Manifesto" (TTM), a formal proposal for a solid foundation for data and database management systems, written by Christopher J. Date and Hugh Darwen; see http://www.aw-bc.com/catalog/academic/product/0,1144,0321399420,00.html for a publisher's link to the book that formally publishes TTM. See http://www.thethirdmanifesto.com/ for some references to what TTM is, and also copies of some documents I used in writing Rosetta D. The initial main reference I used when creating Rosetta D was the book "Database in Depth" (2005; http://www.oreilly.com/catalog/databaseid/), written by Date and published by O'Reilly.

It should be noted that Rosetta D, being quite new, may initially omit some features that are mandatory for a "D" language, to speed the way to a usable partial solution, but you can be comforted in knowing that they will be added as soon as possible. Also, it contains some features that go beyond the scope of a "D" language, so Rosetta D is technically a "D plus extra"; examples of this are constructs for creating the databases themselves and managing connections to them. However, Rosetta D should never directly contradict The Third Manifesto; for example, its relations never contain duplicates, it does not allow nulls anywhere, and you cannot specify attributes by ordinal position instead of by name. That's not to say you can't emulate all the SQL features over Rosetta D; you can, at least once it's complete.

Rosetta D also incorporates design aspects and constructs that are taken from or influenced by Perl 6, pure functional languages like Haskell, Tutorial D, various TTM implementations, and various SQL dialects and implementations (see the Rosetta::SeeAlso file). While most of these languages or projects aren't specifically related to TTM, none of Rosetta's adaptations from these are incompatible with TTM.

Note that the Rosetta documentation will be focusing mainly on how Rosetta itself works, and will not spend much time in providing rationales; you can read TTM itself and various other external documentation for much of that.

10,000 MILE VIEW

Rosetta D is a computationally complete (and industrial strength) high-level programming language with fully integrated database functionality; you can use it to define, query, and update relational databases. It is mainly imperative in style, since at the higher levels, users provide sequential instructions; but in many respects it is also functional or declarative, in that many constructs are pure or deterministic, and the constructs focus on defining what needs to be accomplished rather than how to accomplish that.

This permits a lot of flexibility on the part of implementers of the language (usually Rosetta Engine classes) to be adaptive to changing constraints of their environment and deliver efficient solutions. This also makes things a lot easier for users of the language because they can focus on the meaning of their data rather than worrying about implementation details, which relieves burdens on their creativity, and saves them time. In short, this system improves everyone's lives.

Environment

The Rosetta DBMS / virtual machine, which by definition is the environment in which Rosetta D executes, conceptually resembles a hardware PC, having a command processor (CPU), standard user input and output channels, persistent read-only memory (ROM), volatile read-write memory (RAM), and read-write persistent disk or network storage.

Within this analogy, the role of the PC's user, that feeds it through standard input and accepts its standard output, is fulfilled by the application that is using the Rosetta DBMS; similarly, the application itself will activate the virtual machine when wanting to use it (done in this distribution by instantiating a new Rosetta::Interface::DBMS object), and deactivate the virtual machine when done (letting that object expire).

When a new virtual machine is activated, the virtual machine has a default state where the CPU is ready to accept user-input commands to process, and there is a built-in (to the ROM) set of system-defined data types and operators which are ready to be used to define or be invoked by said user-input commands; the RAM starts out effectively empty and the persistent disk or network storage is ignored.

Following this activation, the virtual machine is mostly idle except when executing Rosetta D commands that it receives via the standard input (done in this distribution by invoking methods on the DBMS object). The virtual machine effectively handles just one command at a time, and executes each separately and in the order received; any results or side-effects of each command provide a context for the next command.

At some point in time, as the result of appropriate commands, data repositories (either newly created or previously existing) that live in the persistent disk or network storage will be mounted within the virtual machine, at which point subsequent commands can read or update them, then later unmount them when done. Speaking in the terms of a typical database access solution like the Perl DBI, this mounting and unmounting of a repository usually corresponds to connecting to and disconnecting from a database. Speaking in the terms of a typical disk file system, this is mounting or unmounting a logical volume.

Any mounted persistent repository, as well as the temporary repository which is most of the conceptual PC's RAM, is home to all user-defined data variables, data types, operators, constraints, packages, and routines; they collectively are the database that the Rosetta DBMS is managing. Most commands against the DBMS would typically involve reading and updating the data variables, which in typical database terms is performing queries and data manipulation. Much less frequently, you would also see changes to what variables, types, etcetera exist, which in typical terms is data definition. Any updates to a persistent repository will usually last between multiple activations of the virtual machine, while any updates to the temporary repository are lost when the machine deactivates.

All virtual machine commands are subject to a collection of both system-defined and user-defined constraints (also known as business rules), which are always active over the period that they are defined. The constraints restrict what state the database can be in, and any commands which would cause the constraints to be violated will fail; this mechanism is a large part of what makes the Rosetta DBMS a reliable modeller of anything in reality, since it only stores values that are reasonable.
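The always-active constraint mechanism described above can be illustrated with a small model. This is only a sketch in Python, not Rosetta D and not part of this distribution; the `Database` class, the `ConstraintViolation` exception, and the `stock_not_negative` rule are all hypothetical names chosen for the example.

```python
class ConstraintViolation(Exception):
    """Raised when a command would leave the database in a forbidden state."""


class Database:
    # Hypothetical model: the database is a dict of named variables, and
    # each constraint is a (name, predicate) pair over the whole state.
    def __init__(self, state, constraints):
        self._state = dict(state)
        self._constraints = list(constraints)

    def update(self, **changes):
        # Build the candidate state first; the live state is untouched
        # until every constraint has accepted the candidate.
        candidate = {**self._state, **changes}
        for name, predicate in self._constraints:
            if not predicate(candidate):
                raise ConstraintViolation(name)
        self._state = candidate

    def read(self, name):
        return self._state[name]


db = Database(
    {"stock": 5},
    [("stock_not_negative", lambda s: s["stock"] >= 0)],
)
db.update(stock=3)        # accepted
try:
    db.update(stock=-1)   # rejected; the stored value stays at 3
    rejected = None
except ConstraintViolation as exc:
    rejected = str(exc)
```

The point of the sketch is that a failing command leaves no trace: the candidate state is discarded and the previous state remains in place.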

Command Structure and Processing

Rosetta D commands are structured as arbitrarily complex routines / operators, either named or anonymous, and they can have (named) parameters, can contain single (usually) or multiple Rosetta D statements or value expressions, and can return one or more values.

Rosetta D command routine definitions can either be named and stored in a persistent repository for reuse like a repository's data types or variables, or they can be anonymous and furnished by an application at run-time for temporary use. A command routine can take the form of either a function / read-only operator or a procedure / update operator; the former has a special return value which is the value of the evaluated function invocation within a value expression; the latter has no such special return value, and can not be invoked within a value expression.

An application can only ever directly define and invoke an anonymous command routine, but an anonymous routine can in turn invoke (and define if it is a procedure) named command routines within the DBMS environment.

Speaking in terms of SQL, a Rosetta D statement or value expression corresponds to a SQL statement or value expression, a Rosetta D named command routine corresponds to a SQL named stored procedure or function, a Rosetta D anonymous command procedure corresponds to a SQL anonymous subroutine or series of SQL statements, the parameters of a Rosetta D named routine correspond to the parameters of a SQL named stored procedure or function, and the parameters of a Rosetta D anonymous routine correspond to SQL host parameters or bind variables.

A Rosetta D procedure parameter can be read-only or read-write (which corresponds to SQL's IN or OUT+INOUT parameter types), but a Rosetta D function parameter can only be read-only (a function may not have any side-effects). When invoking a routine, an argument corresponding to a read-only parameter can be an arbitrarily complex value expression (which is passed in by value), but an argument corresponding to a read-write parameter must be a valid target variable (which is passed in by reference), and that target variable may be updated during the procedure invocation. A function always returns its namesake mandatory special return value using the standard "return" keyword, and it may not update any global variables. For a procedure, the only way to pass output directly to its invoker (meaning, without updating global variables) is to assign that output to read-write parameters. Note that "return" can be used in a procedure too for flow control, but it doesn't pass a value as well.

Orthogonal to the procedure/function and named/anonymous classifications of Rosetta D routines is the deterministic/nondeterministic classification. A routine that is deterministic does not directly reference (for reading or updating) any global variables nor invoke any nondeterministic routines; its behaviour is governed solely by its arguments, so given only identical arguments it has identical behaviour; if the routine is a function, this means that the return value is always the same for the same arguments. A routine that is nondeterministic does directly reference (for reading or updating) one or more global variables or does invoke a nondeterministic routine; its behaviour can change, or it can return different results, even if given all of the same arguments. Generally speaking, all routines / operators that are specific to a data type (such as typical comparison and assignment operators) must be deterministic, while routines / operators that are not specific to a data type do not need to be. Most built-in routines are deterministic. Note that a deterministic routine can indeed operate with or on global variables if they are passed to it as arguments.
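To make the distinction concrete, here is a minimal Python analogy (not Rosetta D). The names `TAX_RATE`, `price_with_tax`, and `price_with_global_tax` are invented for the example; a module-level variable stands in for a global variable of a repository.

```python
TAX_RATE = 0.25  # stands in for a global variable


def price_with_tax(price, rate):
    # Deterministic in the document's sense: the result depends only
    # on the arguments.
    return price * (1 + rate)


def price_with_global_tax(price):
    # Nondeterministic in the document's sense: it reads a global
    # directly, so the same argument can give different results.
    return price * (1 + TAX_RATE)


first = price_with_global_tax(100)
TAX_RATE = 0.5
second = price_with_global_tax(100)   # same argument, different result

# Passing the global as an argument keeps the routine deterministic.
via_argument = price_with_tax(100, TAX_RATE)
```

The last line mirrors the closing note of the paragraph above: the deterministic routine still works with the global's value, but only because the caller supplied it.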

The Rosetta DBMS is designed to allow user-applications to furnish the definition of an anonymous command routine once and then execute it multiple times (for efficiency and ease of use); speaking in terms of SQL, the Rosetta DBMS supports prepared statements. The arguments for any routine parameters are provided at execution time, and they are used for values that are intended to be different for each execution of the command, as well as to return results that probably differ with each execution; as an exception to the latter, the application does not have to pre-define an anonymous function's special return value, which doesn't correspond to a parameter. Presumably, any values that will be constant through the life of a command routine will be coded as literal values in its definition rather than parameters.

(In this distribution, you furnish an anonymous command routine definition for reuse using a DBMS object's "prepare" or "compile" method; that method returns a new Rosetta::Interface::Command object. You then associate a Rosetta::Interface::Variable/::Value object with each of the routine's parameters using the Command object's "bind_param" method, and then invoke the Command object's "execute" method. Any Variable/Value objects corresponding to input parameters need to be set by the application prior to "execute", and following the "execute", the application can read the routine's output from the Variable objects associated with output parameters. When the Command is a function, "execute" will generate and return a new Value object with the special return value.)
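The prepare / bind / execute sequence can be sketched with stand-in classes. The following is a Python model of the call order only; it is not the Perl API of this distribution, and the signatures of `prepare`, `bind_param`, and `execute` shown here, along with the `add_into` routine, are assumptions made for illustration.

```python
class Variable:
    """Hypothetical holder for one parameter's value."""
    def __init__(self, value=None):
        self.value = value


class Command:
    """Hypothetical prepared command: a routine plus parameter bindings."""
    def __init__(self, routine, param_names):
        self._routine = routine
        self._param_names = tuple(param_names)
        self._bound = {}

    def bind_param(self, name, variable):
        if name not in self._param_names:
            raise KeyError(f"unknown parameter: {name}")
        self._bound[name] = variable

    def execute(self):
        missing = [n for n in self._param_names if n not in self._bound]
        if missing:
            raise RuntimeError(f"unbound parameters: {missing}")
        return self._routine(**self._bound)


class DBMS:
    def prepare(self, routine, param_names):
        return Command(routine, param_names)


def add_into(a, b, total):
    # Stand-in for an anonymous procedure: a and b are read-only,
    # total is read-write and carries the output back to the caller.
    total.value = a.value + b.value


dbms = DBMS()
command = dbms.prepare(add_into, ["a", "b", "total"])
a, b, total = Variable(), Variable(), Variable()
for name, var in (("a", a), ("b", b), ("total", total)):
    command.bind_param(name, var)

results = []
for x, y in [(1, 2), (10, 20)]:
    a.value, b.value = x, y       # set inputs before each execute
    command.execute()
    results.append(total.value)   # read output after execute
```

The command is defined once and executed twice with different inputs, which is the prepared-statement pattern the text describes.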

The Rosetta D language has all the standard imperative language keywords, any of which a Rosetta D routine (both anonymous and named) can contain, including: conditionals ("if"), loops ("for", "while"), procedure invocation ("call"), normal routine exit ("return"), plus exception creation and resolution ("throw", "try", "catch"). For all types of routines, the "throw" keyword takes a value expression whose resolved value is the exception to be thrown, and visually looks like "return" does for functions. Note that a thrown exception which falls out of an anonymous procedure will result in an exception thrown out to the application (in this distribution, it will be as a thrown new Rosetta::Interface::Exception object). For our purposes, transaction control statements ("start", "commit", "rollback") and resource locking statements are also grouped with these standard keywords. Note that value assignment of a value expression's result to a named target is not accomplished with a keyword, but rather with an update procedure that is defined for the value's data type, with the target provided to it as a read-write argument.

Value assignment, which updates a target variable to a new value provided by a Rosetta D expression, is used frequently in Rosetta D, and is the form of all its major functionality. If the target variable is an anonymous procedure's read-write parameter, the statement corresponds to an unencapsulated SQL "select" query; or, the same task is usually done using "return" in a function. If the target variable is an ordinary variable, and particularly if it is a repository's component data variable, the statement's effect corresponds to SQL "data manipulation" (usually "insert" or "update" or "delete"). If the target variable is a repository's special catalog variable, the statement's effect corresponds to SQL "data definition" (usually "create" or "alter" or "drop"); this is also how all named command routines are defined, by such statements in other usually-anonymous routines. If the target variable is the DBMS' own special catalog of repositories, then the effect is to mount or unmount a repository, which corresponds to SQL client statements like "connect to".

All types of Rosetta D command routines can have assignment statements which target their own lexical variables, but only procedures (that are not invoked by operators / functions) are allowed to target global variables, which are declared in a repository directly, or have read-write parameters. In other words, a function may not have side-effects, though it can read from global variables. Moreover, any procedure that is invoked by a function is subject to the same restriction against targeting globals, since it is effectively part of the function. A few special exceptions may be made to this restriction on functions, but for the most part, the restriction is in place to prevent inconsistencies between reads of the environment/globals from multiple functions that are invoked in the same Rosetta D expression; all reads in the same expression need to see the same state, so the expression's result is the same regardless of any logically-equivalent changes to order of execution of the sub-expressions. Further to this goal, any target variable may not be used more than once in the same Rosetta D statement; target meaning a read-write procedure parameter's argument, or directly referenced global variable.

Named Users and Privileges

The Rosetta DBMS / virtual machine itself does not have its own set of named users where one must authenticate to use it. Rather, any concept of such users is associated with individual persistent repositories, such that you may have to authenticate in order to mount them within the virtual machine; moreover, there may be user-specific privileges for that repository that restrict what users can do in regards to its contents.

The Rosetta privilege system is orthogonal to the standard Rosetta constraint system, though both have the same effect of conditionally allowing or barring a command from executing. The constraint system is strictly charged with maintaining the logical integrity of the database, and so only comes into effect when an update of a repository or its contents is attempted; it usually ignores what users were attempting the changes. By contrast, the privilege system is strictly user-centric, and gates a lot of activities which don't involve any updates or threaten integrity.

The privilege system mainly controls, per user, what individual repository contents they are allowed to see / read from, what they are allowed to update, and what routines they are allowed to execute; it also controls other aspects of their possible activity. The concerns here are analogous to privileges on a computer's file system, or a typical SQL database.

States, Transactions and Concurrency

This official specification of the Rosetta DBMS includes full ACID compliance as part of the core feature set; moreover, all types of changes within a repository are subject to transactions and can be rolled back, including both data manipulation and schema manipulation; moreover, an interrupted session with a repository must result in an automatic rollback, not an automatic commit.

It is important to point out that any attempt to implement the Rosetta DBMS (a Rosetta Engine) which does not include full ACID compliance, with all aspects described above, is not a true Rosetta DBMS implementation, but rather is at best a partial implementation, and should be treated with suspicion concerning reliability. Of course, such partial implementations will likely be made and used, such as ones implemented over existing database products that are themselves not ACID compliant, but you should see them for what they are and weigh the corruption risks of using them.

Each individual instance of the Rosetta DBMS is a single process virtual machine, and conceptually only one thing is happening in it at a time; each individual Rosetta D statement executes in sequence, following the completion or failure of its predecessor. During the life of a statement's execution, the state of the virtual machine is constant, except for any updates (and side-effects of such) that the statement makes. Breaking this down further, a statement's execution has 2 sequential phases; all reads from the environment are done in the first phase, and all writes to the environment are done in the second phase. Therefore, regardless of the complexity of the statement, and even if it is a multi-update statement, the final values of all the expressions to be assigned are determined prior to any target variables being updated. Moreover, functions may not have side-effects, which avoids complicating the issue with environment updates occurring during their invoker statement's first phase.
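The two-phase rule is easiest to see with a multi-update statement that swaps two variables. The sketch below is a Python model, not Rosetta D; `execute_statement` and the dict-based environment are hypothetical. Using a dict keyed by target name also mirrors the rule that a target may appear only once per statement.

```python
def execute_statement(env, assignments):
    """Model one multi-update statement.

    env:         dict of variable name -> current value
    assignments: dict of target name -> function computing the new value
                 from the environment
    """
    # Phase 1: every expression is evaluated against the unchanged
    # environment, so all reads see the same state.
    new_values = {target: expr(env) for target, expr in assignments.items()}
    # Phase 2: only now are the targets written.
    env.update(new_values)


env = {"x": 1, "y": 2}
execute_statement(env, {
    "x": lambda e: e["y"],   # x := y
    "y": lambda e: e["x"],   # y := x
})
```

Because both right-hand sides are computed before either target is written, the swap needs no temporary variable; with interleaved reads and writes, both variables would end up holding the same value.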

To account for situations where external processes are concurrently using the same persistent (and externally visible) repository as a Rosetta DBMS instance, the Rosetta DBMS will maintain a lock on the whole repository (or appropriate subset thereof) during any active read-only and/or for-update transaction, to ensure that the transaction sees a consistent environment during its life. The lock is a shared lock if the transaction only does reading, and it is an exclusive lock if the transaction also does writing. Speaking in terms of SQL, the Rosetta DBMS supports only the serializable transaction isolation level.

Note that there is currently no official support for using Rosetta in a multi-threaded application, where its structures are shared between threads, or where multiple thread-specific structures want to use the same repositories. But such support is expected in the future.

No multi-update statement may target both catalog and non-catalog variables. If you want to perform the equivalent of SQL's "alter" statement on a relation variable that already contains data, you must have separate statements to change the definition of the relation variable and change what data is in it, possibly more than one of each; the combination can still be wrapped in an explicit transaction for atomicity.

Transactions can be nested, by starting a new one before concluding a previous one, and the parent-most transaction has the final say on whether all of its committed children actually have a final committed effect or not. The layering of transactions can involve any combination of explicit and implicit transactions (the combination should behave intuitively).

The lifetimes of all transactions in Rosetta D (except those declared in anonymous routines) are bound to specific lexical scopes, such that they begin when that scope is entered and end when that scope is exited; if the scope is exited normally, its transaction commits; if the scope terminates early due to a thrown exception, its transaction rolls back.
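Scope-bound and nested transactions behave much like a context manager that snapshots state on entry. The following Python sketch is an analogy only; the `Repository` class and its `transaction` method are hypothetical, and the shallow-copy snapshot is adequate here only because the stored values are immutable.

```python
from contextlib import contextmanager


class Repository:
    def __init__(self):
        self.data = {}

    @contextmanager
    def transaction(self):
        # Entering the scope starts the transaction.
        snapshot = dict(self.data)
        try:
            yield self
        except Exception:
            # Leaving the scope through an exception rolls back.
            self.data = snapshot
            raise
        # Leaving the scope normally commits: the changes simply stay.


repo = Repository()

with repo.transaction():
    repo.data["kept"] = 1                 # scope exits normally: committed

try:
    with repo.transaction():              # parent transaction
        with repo.transaction():          # child transaction
            repo.data["child"] = 2        # child commits ...
        raise RuntimeError("parent fails")
except RuntimeError:
    pass                                  # ... but the parent rolled back
```

After this runs, only the `kept` entry remains: the child's committed change was undone when its parent rolled back, which is the "parent-most transaction has the final say" rule.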

Each Rosetta D named routine as a whole (being a lexical scope), whether built-in or user-defined, is implicitly atomic, so invoking one will either succeed or have no side-effect, and the environment will remain frozen during its execution, save for the routine's own changes. The implicit transaction of a function is always read-only, and the implicit transaction of a procedure is either read-only or for-update depending on what it wants to do. Each try-block is also implicitly atomic, committing if it exits normally or rolling back if it traps an exception.

Every Rosetta D statement (including multi-update statements) is atomic; all parts of that statement and its child expressions will see the same static view of the environment; if the statement is an update, either all parts of that update will succeed and commit, or none of it will (accompanied by a thrown exception) and no changes are left.

Explicit atomic statement blocks can also be declared within a routine.

Rosetta D also supports the common concept of explicit open-ended transaction statements that start or end transactions which are not bound to lexical scopes; however, these statements may only be invoked within anonymous routines, that an application invokes directly, and not in any named routines, nor within atomic statement blocks in anonymous routines.

While scope-bound transactions always occur entirely within one invocation of the DBMS by an application, the open-ended transactions are intended for transactions which last over multiple DBMS invocations of an application.

All currently mounted repositories (persistent and temporary both) are joined at the hip with respect to transactions; a commit or rollback is performed on all of them simultaneously, and a commit either succeeds for all or fails for all (a repository suddenly becoming inaccessible counts as a failure). Note that if a Rosetta DBMS implementation cannot guarantee such synchronization between multiple repositories, then it must refuse to mount more than one repository at a time under the same virtual machine (users can still employ multiple virtual machines, that are not synchronized); by doing one of those two actions, a less capable implementation can still be considered reliable and recommendable.
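The all-or-nothing commit across mounted repositories can be modelled as a check-then-apply step. This Python sketch is a simplification under stated assumptions: the `Repo` class, its `reachable` flag, and `commit_all` are hypothetical, and a real implementation would also have to handle a failure that occurs partway through the apply step.

```python
class Repo:
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.committed = {}   # durable state
        self.pending = {}     # uncommitted changes

    def can_commit(self):
        # An inaccessible repository counts as a failure.
        return self.reachable


def commit_all(repos):
    """Commit every repository, or none of them."""
    if not all(r.can_commit() for r in repos):
        for r in repos:
            r.pending.clear()          # roll back everywhere
        return False
    for r in repos:
        r.committed.update(r.pending)  # apply everywhere
        r.pending.clear()
    return True


a, b = Repo("a"), Repo("b", reachable=False)
a.pending["x"] = 1
b.pending["y"] = 2
first_attempt = commit_all([a, b])     # b is unreachable: nothing commits
state_after_failure = dict(a.committed)

b.reachable = True
a.pending["x"] = 1
b.pending["y"] = 2
second_attempt = commit_all([a, b])    # both commit together
```

Note that the failed attempt discards the pending change in `a` as well, even though `a` itself was healthy.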

Certain Rosetta D commands can not be executed within the context of a parent transaction; in other words, they can only be executed directly by an anonymous routine, the main examples being those that mount or unmount a persistent repository; this is because such a change in the environment mid-transaction would result in an inconsistent state.

Rosetta D lets you explicitly place locks on resources that you don't want external processes to change out from under you, and these locks do not automatically expire when transactions end; or maybe they do; this feature has to be thought out more.

Data Types

TODO.

Grammar and Name Spaces

TODO.

OLDER DOCUMENTATION TO REWRITE/REMOVE: 10,000 MILE VIEW OF ROSETTA D

Operational Context

Rosetta D is designed for a specific virtual environment that is implemented by a DBMS (database management system). This environment is home to zero or more data repositories, each of which users may create, have a dialog with (over a connection), and delete; the components of the dialog, including queries and updates of the database, are the scope of the "D proper" language, and the other actions framing the dialog are the "D plus extra".

From an application's point of view, a DBMS is a library that provides services for storing data "some where" (which may be in memory, or the file system, or a network service, depending on implementation), like using files but more abstract and flexible; its API provides functions or methods for reading data from and writing data to the store. This API takes richly structured commands which are written in Rosetta D, either AST (abstract syntax tree) form or string form. Considering the distribution that contains the Language document you are reading now, Rosetta is the main API that uses Rosetta D, and Rosetta::Model provides the AST representation of Rosetta D.

A database is a fully self-contained and fully addressable entity. Fully self-contained means that nothing in the database depends on anything that is external to the database (such as in type or constraint definitions), save the DBMS implementing that database. Fully addressable means that the database is what an application opens a "data source" connection to, and its address can include such things as a file name or network server location or abstract DSN, depending on the implementation.

A database is a usually-persistent container for relvars (relation variables), in which all kinds of data are stored, and it provides relational operators for querying, updating, creating, and deleting those relvars. A database also stores user-defined data types and operators for working with them, and relvars can be defined in terms of those user-defined types (as well as built-in types). A database also defines various kinds of logical constraints that must be satisfied at all times, some system defined and some user defined, which complete the picture such that relvars are capable on their own of modelling anything in the real world. A database also defines users that are authorized to access it, mediated by the DBMS.

Grammar and Name Spaces

Rosetta D is a low-sugar language, such that its string form, which will be used for illustrative purposes in this documentation, has a very simple grammar and visually corresponds one-to-one with its abstract syntax tree.

All Rosetta D types and operators are grouped into a hierarchical name space for simplicity and ease of use; the fully-qualified name of any type and operator includes its namespace hierarchy, with the highest level namespace appearing first (left to right). You are recommended to use the fully-qualified names at all times (eg: root.branch.leaf), although you may also use partially qualified (eg: branch.leaf) or unqualified versions (eg: leaf) if they are unambiguous. For that matter, all standard relvars and constraints are likewise in that namespace hierarchy.

All Rosetta D entity identifiers, built-in and user-defined, including names of types, operators, relvars, constraints, and users, are all case-sensitive and may contain characters from the entire Unicode 4+ character repertoire (even whitespace), in each level of their namespace. But fully or partially qualified identifiers always use the radix point character (.) to delimit each level. Each namespace level may be formatted either with or without double-quote (") delimiters, if said name only contains non-punctuation and non-whitespace characters; if it does contain either of those, then it must always appear in delimited format. All built-in entities only use characters that don't require delimiters (the upper and lowercase letters A-Z, and the underscore, and sometimes are partially composed of the digits 0-9), and your code will be simpler if you do likewise. All built-in type and possrep names conform to best practices for Perl package names (eg: CharStr), and all built-in names for operators, constraints, relvars, and users, conform to best practices for Perl routine and variable names (eg: the_x), and certain pre-defined constant value names conform to best practices (eg: TRUE). No built-in operators have symbols like "+" or "=" as names, but rather use letters, "add" and "eq" in this case.

All Rosetta D expressions are formatted in prefix notation, where the operator appears before (left to right) all of its arguments, and the argument list is enclosed in parentheses and delimited by commas (eg: <op>( <arg>, <arg> )). This is like most programming languages but unlike most logical or mathematical expressions, which use infix notation (eg: <arg> <op> <arg>).

In addition, all arguments are named (formatted in name/value pairs), rather than positional, so they can be passed in any order (eg: <op>( <arg-name> => <arg-val> )), and so the expressions are more self-documenting about what the arguments mean (eg: source vs target). As an extension to this, if an operator takes a variable number of arguments that are all being used for the same purpose (eg: a list of numbers to add, or a list of relations to join), then those are collected into a single named argument whose value is a parenthesized and comma-delimited but un-ordered list (eg: <op>( <arg-name> => (<arg-val>, <arg-val>) )).
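Since the string form is said to correspond one-to-one with the abstract syntax tree, rendering a tree to that form is a short recursive function. The sketch below is in Python and follows only the shapes quoted above; the tuple/list encoding of the tree, the `render` function, the argument name `addends`, and the way integer literals are printed are all assumptions for illustration.

```python
def render(node):
    """Render a tiny expression tree in prefix notation with named arguments.

    Assumed encoding (hypothetical):
      (op_name, {arg_name: node, ...})  an operator invocation
      [node, ...]                       a list-valued argument
      anything else                     a literal, printed with str()
    """
    if isinstance(node, tuple):
        op, args = node
        inner = ", ".join(f"{name} => {render(value)}"
                          for name, value in args.items())
        return f"{op}( {inner} )"
    if isinstance(node, list):
        return "(" + ", ".join(render(item) for item in node) + ")"
    return str(node)


# A variable number of same-purpose arguments is collected into one
# named, list-valued argument.
text = render(("system.NumInt.add", {"addends": [1, 2, 3]}))

# Invocations nest naturally as argument values.
nested = render(("system.NumInt.add", {
    "addends": [1, ("system.NumInt.add", {"addends": [2, 3]})],
}))
```

Here `text` is `system.NumInt.add( addends => (1, 2, 3) )`, matching the `<op>( <arg-name> => (<arg-val>, <arg-val>) )` shape given in the paragraph above.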

The root level of a database's name hierarchy contains these 4 name-spaces, which are searched in top-down order when trying to resolve unqualified entity identifiers:

system

All system-defined entities go here, including built-in data types and operators, and the catalog relvars that allow user introspection of the database using just relational operators (analogous to SQL's "information schema", but the provided meta-data is always fully decomposed), and constraints on the above.

The standard way to create, alter, or drop user-defined entities is to update the catalog relvars concerning them (although some short-hand "create", etc, operators are provided to simplify those tasks). It is like the user-defined entities are views defined in terms of the catalog relvars, and so explicitly changing the former results in implicitly changing the latter.

For uniformity, the system-defined entities are also listed in the catalog relvars (or, for the types, at least their interfaces are), but constraints on the catalog relvars forbid users from updating or removing the built-ins, or adding entities that say they are built-ins.

local

All persistent user-defined entities go here, including real and virtual relvars, types, operators, and constraints. This is the "normal" or "home" namespace for users. All entities here may only be defined in terms of either system entities or other entities here.

Typically, the next name space level down under local would be functionally similar to a list of schemata as larger SQL databases typically provide so that each of a database's users can have a separate place for the types, relvars, etc, that they create. In fact, to best be able to represent various existing DBMSs that have anywhere from zero to 2 or 3 such name spaces, Rosetta D allows you to have an arbitrary number of such intermediate name space levels, or use none at all. In fact, unless you actually need these intermediate levels, it is highly recommended that you don't use them at all, to reduce complexity. But as I mentioned earlier, unless the database has more than one entity with the same unqualified or semi-qualified name, you can just use those shorter names everywhere, which results in the optional hierarchy being abstracted away.

temporary

All user-defined entities go here whose fates are tied to open connections; each connection to a database has its own private instance of this name-space, and its contents disappear when the connection is closed. These entities can be all of the same kinds as those that go in local. They can be defined in terms of local entities, but the reverse isn't true. Generally, temporary is the name-space for entities that are specific to the current application, but which it makes sense to keep within the Rosetta D virtual environment for efficiency.

remote

If the current DBMS has support for federating access to external databases, effectively by "mounting" their contents within the current database as an extension to it, so users with a connection to the current database can access those other databases through the same connection, then those contents appear under remote.

This may or may not count as the current DBMS being a proxy.

In terms of a hypothetical federated DBMS that lets you use a single "connection" to access multiple remote databases at once, such as for a database cloning utility or a multiplexer, all of the interesting contents would be remote, and the local name space would be empty.

Typically, the next name space level down under remote will contain a single name per distinct mounted external database, and then below each of those may be that database's local items, or alternately and more likely we would see literal system, local, etc folders like our own root.

This feature is more experimental and has yet to be fleshed out.

Types and relvars would then have their unqualified names sitting just below the above name spaces, per root space; so, for example, we would have fully qualified names like system.CharStr or local.suppliers; simple.

However, operators have mandatory "package" name-spaces under which their otherwise unqualified names would go, and these are usually identical to the data type name that they are primarily associated with. So, for example, we would have fully qualified names like system.NumInt.add or system.CharStr.substr or system.Relation.join. Note that type selector operators and such would be named in exactly the same way.

Constraints on data types, that are specifically part of the definitions of the data types, have their names package-qualified like operators, while constraints on relvars don't have to be, or aren't.
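
The naming scheme above, together with the rule that shorter names may be used when they are unambiguous, can be sketched in Python as follows. This is only an illustration under assumed rules: the function resolve is hypothetical, and it simply treats a short name as a match when its dot-separated parts equal the trailing parts of exactly one fully qualified name.

```python
def resolve(name, qualified_names):
    """Return the single fully qualified name whose trailing parts equal `name`.

    Raises LookupError when zero or several entities match, in which case a
    longer (more qualified) name would have to be used instead.
    """
    parts = name.split(".")
    matches = [
        q for q in qualified_names
        if q.split(".")[-len(parts):] == parts
    ]
    if len(matches) != 1:
        raise LookupError(f"{name!r} matches {len(matches)} entities")
    return matches[0]
```

For example, with the names system.NumInt.add and local.suppliers present, both "add" and "NumInt.add" would resolve to system.NumInt.add, while "suppliers" would stop resolving as soon as a temporary.suppliers also existed.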

Overview of Data Types

Rosetta D is a strongly typed language, where every value and variable is of a specific data type, and every operator and expression is defined in terms of specific data types. A variable can only store a value which is of its type, and every operator can only take argument values or expressions that are of the same types as its parameters.

Values can only be explicitly converted from one data type to another (such as when comparing two values for equality) using explicitly defined operators for that purpose (this includes selectors, which typically convert from character strings to something else), and value type conversions can not happen implicitly; the sole exception to this is if one of the two involved types is defined as a constraint-restricted sub-type of the other, or if both are similarly restricted from a common third type.

All data types in Rosetta D fit into 3 main categories, which are scalar types, tuple types, and relation types. For our purposes, every data type that is not a tuple type or relation type is a scalar type.

A scalar type is a named set of scalar values; its sub-types mainly include booleans, numerics, character strings, bit strings, temporals, spatials, and any custom / user-defined data types.

A custom data type can be defined as a sub-type of a system-defined or user-defined scalar type that has extra constraints, which are named; for example, to restrict its set of scalar values to a sub-set of its parent type's set of scalar values (eg: restrict from an integer to an integer that has to be in the range 1 to 100).
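
A minimal Python sketch of the idea above, assuming a type can be modelled as a membership predicate and a sub-type as its parent's predicate plus a set of named constraints. The helper names is_integer and make_subtype, and the constraint name in_range_1_to_100, are invented for illustration.

```python
def is_integer(value):
    """Membership test for the assumed parent type (bool is excluded)."""
    return isinstance(value, int) and not isinstance(value, bool)


def make_subtype(parent_contains, constraints):
    """Build a membership test from a parent test and named constraints.

    `constraints` maps each constraint name to a predicate on the value.
    """
    def contains(value):
        # A value must first belong to the parent type, then satisfy
        # every named constraint of the sub-type.
        return parent_contains(value) and all(
            check(value) for check in constraints.values()
        )
    return contains


# An integer restricted to the range 1 to 100.
is_int_1_to_100 = make_subtype(
    is_integer,
    {"in_range_1_to_100": lambda v: 1 <= v <= 100},
)
```

Every value accepted by is_int_1_to_100 is also accepted by is_integer, which is what makes the new type a sub-set of its parent.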

Alternately, a custom data type can be defined to have one or more named possreps (possible representations), each being different from the others in appearance but identical in meaning; every possible value of that type should be renderable in each of its possible representations. For example, we could represent a point in space using either cartesian coordinates or polar coordinates. Each possrep is defined in terms of a list of components, where each component has a name and a type, and that type is some other system-defined or user-defined type. Such a custom data type can also have named constraints as part of its definition (eg: the point may not be more than a certain distance from the origin).
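
The point example can be sketched in Python as a class with two possreps, cartesian and polar, and one constraint on the distance from the origin. The class name Point, its method names, and the limit MAX_DISTANCE are assumptions made for this illustration; how the value is stored internally is deliberately hidden from callers.

```python
import math


class Point:
    """A 2D point with two possreps: cartesian (x, y) and polar (r, theta)."""

    MAX_DISTANCE = 1000.0  # assumed limit for the example constraint

    def __init__(self, x, y):
        # Named constraint: the point may not be too far from the origin.
        if math.hypot(x, y) > self.MAX_DISTANCE:
            raise ValueError("point is too far from the origin")
        self._x = float(x)
        self._y = float(y)

    @classmethod
    def from_cartesian(cls, x, y):
        return cls(x, y)

    @classmethod
    def from_polar(cls, r, theta):
        return cls(r * math.cos(theta), r * math.sin(theta))

    def cartesian(self):
        return (self._x, self._y)

    def polar(self):
        return (math.hypot(self._x, self._y), math.atan2(self._y, self._x))
```

Any Point can be selected through either possrep and rendered in either one, so the two representations differ in appearance only.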

You can not declare named custom tuple types or relation types, as you can with scalar types, but rather all values and variables of these types carry with them a definition provided by a tuple or relation generator operator.

A tuple value or variable consists of an unordered set of zero or more attributes, where each attribute consists of a name (that is distinct within the tuple) and a type (that type is some other system-defined or user-defined type); it also has exactly one value per attribute which is of the same type as the attribute.

A relation value or variable consists of an unordered set of zero or more attributes, where each attribute consists of a name (that is distinct within the relation) and a type (that type is some other system-defined or user-defined type); it also has an unordered set of zero or more tuples whose set of attributes are all identical to those of the relation, and where every tuple value is distinct within the relation.
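
The two definitions above can be approximated in Python with frozen sets, which are unordered and hold no duplicates. This is a sketch only; make_tuple and make_relation are hypothetical helpers, and Python classes such as int and str stand in for attribute types.

```python
def make_tuple(heading, values):
    """Build a tuple value: one value per attribute, each of the attribute's type.

    `heading` maps attribute names to types; `values` maps names to values.
    """
    if set(values) != set(heading):
        raise TypeError("tuple attributes must match the heading exactly")
    for attr, attr_type in heading.items():
        if not isinstance(values[attr], attr_type):
            raise TypeError(f"attribute {attr!r} has the wrong type")
    # A frozenset of (name, value) pairs has no attribute ordering.
    return frozenset(values.items())


def make_relation(heading, rows):
    """Build a relation value as (heading, body); duplicate tuples collapse."""
    body = frozenset(make_tuple(heading, row) for row in rows)
    return (frozenset(heading.items()), body)
```

Two relations built from the same tuples compare equal regardless of the order in which attributes or tuples were supplied, and supplying the same tuple twice leaves only one copy in the body.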

OLDER DOCUMENTATION TO REWRITE/REMOVE: DATA TYPES AND VALUES ^

Generally speaking, any two data types are considered to be mutually exclusive, such that a data value can only be in the domain of one of the types, and not the other. (The exception is if type A is declared to be a restriction of type B, or both are restrictions of type C.)

If you want to compare two values that have different data types, you must explicitly cast one or both values into the same data type. Likewise, if you want to use a value of one type in an operation that requires a value of a different type, such as when assigning to a container, the value must be cast into the needed type. The details of casting depend on the two types involved, but often you must choose from several possible methods of casting; for example, when casting between a numeric and a character string, you must choose what numeric radix to use. However, no casting choice is necessary if the data type of the value in hand is a restriction of the needed data type.

IRL gains rigor from this requirement for strong typing and explicit casting methods because you have to be very explicit as to what behaviour is expected; as a result, there should be no ambiguity in the system and the depot manager should perform exactly as you intended. This reduces troublesome subtle bugs in your programs, making development faster, and making your programs more reliable. Your data integrity is greatly improved, with certain causes of corruption removed, which is an important goal of any data management system, and supports the ideals of the relational data model.

IRL gains simplicity from this same requirement, because your depot-centric routines can neatly avoid the combinatorial complexity that comes from being given a range of data types as values when you conceptually just want one type, and your code doesn't have to deal with all the possible cases. The simpler routines are easier for developers to write, as they don't have to worry about several classes of error detection and handling (due to improper data formats), and the routines would also execute faster since they do less actual work. Any necessary work to move less strict data from the outside to within the depot manager environment is handled by the depot manager itself and/or your external application components (the latter is where any user interaction takes place), so that work is un-necessary to do once the data is inside the depot manager environment.

Classifications

IRL has 2 main classes of data types, which are opaque data types and transparent data types.

An opaque data type is like a black box whose internal representation is completely unknown to the user (and is determined by the depot manager), though its external interface and behaviour are clearly defined. Or, an opaque data type is like an object in a typical programming language whose attributes are all private. Conceptually speaking, all opaque data values are atomic and no sub-components are externally addressable for reading and changing, although the data type can provide its own specific methods or operators to extract or modify sub-components of an opaque data value. An example is extracting a sub-string of a character string to produce a new character string, or extracting a calendar-specific month-day from a temporal type.

A transparent data type is like a partitioned open box, such that each partition is a visibly distinct container or value that can be directly addressed for reading or writing. Or, a transparent data type is like an object in a typical programming language whose attributes are all public. Conceptually speaking, all transparent data types are named collections of zero or more other data types, as if the transparent data value or container was an extra level of namespace. Accessing these sub-component partitions individually is unambiguous and can be done without an accessor method. An example is a single element in an array, or a single member of a set, or a single field in a tuple, or a single tuple in a relation.

Opaque data types are further classified into unrestricted opaque data types and restricted opaque data types. An unrestricted opaque type has the full natural domain of possible values, and that domain is infinite in size for most of them; eg, the unrestricted numerical type can accommodate any number from negative to positive infinity, though the unrestricted boolean type still only has 2 values, false and true. A restricted opaque type is defined as a sub-type of another opaque type (restricted or not) which excludes part of the parent type's domain; eg a new type of numerical type can be defined that can only represent integers between 1 and 100. A trivial case of a restricted type is one declared to be identical in range to the parent type, such as if it simply served as an alias; that is also how you always declare a boolean type. A restricted type can implicitly be used as input to all operations that its parent type could be, though it can only be used conditionally as output.

Note that, for considerations of practicality, as computers are not infinite, IRL requires you to explicitly declare a container (but not a value) to be of a restricted opaque type, having a defined finite range in its domain, though that domain can still be very large. This allows depot manager implementations to know whether or not they need to do something very inefficient in order to store extremely large possible values (such as implement a numeric using a LOB), or whether a more efficient but more limited solution will work (using an implementation-native numeric type); stating your intentions by defining a finite range helps everything work better.

Transparent data types are further classified into collective and disjunctive transparent data types. A collective transparent data type is what you normally think of with transparent types, and includes arrays, sets, relations, and tuples; each one can contain zero or more distinct sub-values at once. A disjunctive transparent data type is the means that IRL provides to simulate both weak data types and normal-but-nullable data types. It looks like a tuple where only one field is allowed to contain a non-empty value at once, and it has a distinct field for each possible strong data type that the weak type can encompass (one being of the null type when simulating nullability); it actually has one more field than that, always valued, which says which of the other fields contains the important value.
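
A disjunctive type as described above can be sketched in Python as a record with one field per candidate type plus an always-valued field saying which one is in use. The field names num, char_str, null and which, and the helpers make_disjunctive and payload, are assumptions for this example; Python's None is used here only to mark an empty field.

```python
FIELDS = ("num", "char_str", "null")  # assumed candidate fields


def make_disjunctive(which, value=None):
    """Build a record in which only the field named by `which` is filled."""
    if which not in FIELDS:
        raise ValueError(f"unknown field {which!r}")
    record = {field: None for field in FIELDS}  # every field starts empty
    record[which] = value
    record["which"] = which  # the extra, always-valued discriminator field
    return record


def payload(record):
    """Return the value held in the one field that is in use."""
    return record[record["which"]]
```

A nullable numeric would then be either make_disjunctive("num", 42) or make_disjunctive("null"), and code that reads it must consult the which field before using the value.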

OLDER DOCUMENTATION TO REWRITE/REMOVE: DATA TYPES OVERVIEW ^

IRL is strongly typed, following the relational model's ideals of stored data integrity, and the actual practice of SQL and many database products, and Rosetta's own ideals of being rigorously defined. However, its native set of data types also includes ones that have the range of typical weak types such as some database products and languages like Perl use.

A data type is a set of representable values. All data types are based on the concept of domains; any variable or literal that is of a particular data type may only hold a value that is part of the domain that defines the data type. IRL has some native data types that it implicitly understands (eg, booleans, integers, rational numbers, character strings, bit strings, arrays, rows, tables), and you can define custom ones too that are based on these (eg, counting numbers, character strings that are limited to 10 characters in length, rows having 3 specific fields).

All Rosetta::Model "domain" Nodes (and schema objects) are user defined, having a name that you pick, regardless of whether the domain corresponds directly to a native data type, or to one you customized; this is so there won't be any name conflicts regardless of any same named data types that a particular database implementation used in conjunction with Rosetta::Model may have.

Generalities

It is the general case that every data type defines a domain of values that is mutually exclusive from every other data type; 2 artifacts having a common data type (eg, 2 character strings) can always be compared for equality or inequality, and 2 artifacts of different data types (eg, 1 character string and 1 bit string) can not be compared and hence are always considered unequal. Following this, it is mandatory that every native and custom data type define the 'eq' (equal) and 'ne' (not equal) operators for comparing 2 artifacts that are of that same data type. Moreover, it is mandatory that no data type defines for itself any 'eq' or 'ne' operators for comparing 2 artifacts of different data types.

In order to compare 2 artifacts of different data types for equality or inequality, either one must be cast into the other's data type, or they must both be cast into a common third data type. How exactly this is done depends on the situation at hand.

The simplest casting scenario is when there is a common domain that both artifacts belong to, such as happens when either one artifact's data type is a sub-domain of the other (eg, an integer and a rational number), or the data types of both are sub-domains of a common third data type (eg, even numbers and square whole numbers). Then both artifacts are cast as the common parent type (eg, rationals and integers respectively).

A more difficult but still common casting scenario is when the data types of two artifacts do not have a common actual domain, but yet there is one or more commonly known or explicitly defined way of mapping members of one type's domain to members of the other type's domain. Then both artifacts can be cast according to one of the candidate mappings. A common example of this is numbers and character strings, since numbers are often expressed as characters, such as when they come from user input or will be displayed to the user; sometimes characters are expressed as numbers too, as an encoding. One reason the number/character scenario is said to be more difficult is due to there being multiple ways to express numbers in character strings, such as octal vs decimal vs hexadecimal, so you have to explicitly choose between multiple casting methods or formats for the version you want; in other words, there are multiple members of one domain that map to the same member of another domain, so you have to choose; a cast method can not be picked simply on the data type of the operands.
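
The need to choose a radix explicitly can be seen in a short Python sketch. The function names char_str_to_int and int_to_char_str are invented for this example, and only octal, decimal and hexadecimal are handled.

```python
def char_str_to_int(text, radix):
    """Cast a character string to an integer using an explicit radix."""
    return int(text, radix)


def int_to_char_str(number, radix):
    """Cast an integer to a character string using an explicit radix."""
    formats = {8: "o", 10: "d", 16: "x"}
    if radix not in formats:
        raise ValueError("only radix 8, 10 and 16 are handled in this sketch")
    return format(number, formats[radix])
```

The strings "17" (octal), "15" (decimal) and "f" (hexadecimal) all cast to the same number, so the operand types alone can not tell a cast which of the three strings to produce in the other direction.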

A different casting scenario occurs when one or both of the data types are composite types, such as 2 tuples that are either of different degrees or that have different attribute names or value types. Dealing with these involves mapping all the attributes of each tuple against the other, with or without casting of the individual attributes, possibly into a third data type having attributes to represent all of those from the two.

Most data types support the extraction of part of an artifact to form a new artifact, which is either of the same data type or a different one. In some cases, even if 2 artifacts can't be compared as wholes, it is possible to compare an extract from one with the other, or extractions from both with each other. Commonly this is done with composite data types like tuples, where some attributes are extracted for comparison, such as when joining the tuples, or filtering a tuple from a relation.

Aside from the 'eq' and 'ne' comparison operators, there are no other mandatory operators that must be defined for a given custom data type, though the native ones will include others. However, it is strongly recommended that each data type implement the 'cmp' (comparison) operator so that linearly sorting 2 artifacts of that common data type is a deterministic activity.

IRL requires that all data types are actually self-contained, regardless of their complexity or size. So nothing analogous to a "reference" or "pointer" in the Perl or C or SQL:2003 sense may be stored; the only valid way to say that two artifacts are related is for them to be equal, or have attributes that are equal, or be stored in common or adjoining locations.

Native Null Type

IRL natively supports the special NULL data type, whose value domain is by definition mutually exclusive of the domains of all other data types; in practice, a NULL is distinct from all possible values that the other IRL native primitive types can have. But some native complex types and user customized types could be defined where their domains are a super-set of NULL; those latter types are described as "nullable", while types whose domains are not a super-set of NULL are described as "not nullable".

The NULL data type represents situations where a value of an arbitrary data type is desired but none is yet known; it sits in place of the absent value to indicate that fact. NULL artifacts will always explicitly compare as being unequal to each other; since they all represent unknowns, we can not logically say any are equal, so they are all treated as distinct. This data type corresponds to SQL's concept of NULL, and is similar to Perl's concept of "undef". A NULL does not natively cast between any data types.

Rosetta::Model does not allow you to declare "domain" Nodes that are simply of or based on the data type NULL; rather, to use NULL you must declare "domain" Nodes that are either based on a not-nullable data type unioned with the NULL type, or are based on a nullable data type. The "domain" Node type provides a short-hand to indicate the union of its base type with NULL, in the form of the boolean "is_nullable" attribute; if the attribute is undefined, then the nullability status of the base data type is inherited; if it is defined, then it overrides the parent's status.

All not-nullable native data types default to their concept of empty or nothingness, such as zero or the empty string. All nullable native types, and all not-nullable native types that you customize with a true is_nullable, will default to NULL. In either case, you can define an explicit default value for your custom data type, which will override those behaviours; details are given further below.

Native Primitive Types

These are the simplest data types, from which all others are derived:

BOOLEAN

This data type is a single logical truth value, and can only be FALSE or TRUE. Its concept of nothingness is FALSE.

NUMERIC

This data type is a single rational number. Its concept of nothingness is zero. A subtype of NUMERIC must specify the radix-agnostic "num_precision" and "num_scale" attributes, which determine the maximum valid range of the subtype's values, and the subtype's storage representation can often be derived from it too.

The "num_precision" attribute is an integer >= 1; it specifies the maximum number of significant values that the subtype can represent. The "num_scale" attribute is an integer >= 0 and <= "num_precision"; if it is >= 1, the subtype is a fixed radix point rational number, such that 1 / "num_scale" defines the increment size between adjacent possible values; the trivial case of "num_scale" = 1 means the increment size is 1, and the number is an integer; if "num_scale" = 0, the subtype is a floating radix point rational number where "num_precision" represents the product of the maximum number of significant values that the subtype's mantissa and exponent can represent. IRL does not currently specify how much of a floating point number's "num_precision" is for the mantissa and how much for the exponent, but commonly the exponent takes a quarter.

The meanings of "precision" and "scale" are more generic for IRL than they are in the SQL:2003 standard; in SQL, "precision" (P) means the maximum number of significant figures, and the "scale" (S) says how many of those are on the right side of the radix point. Translating from base-R (eg, R being 10 or 2) to the IRL meanings works as follows (assuming negative numbers are allowed and zero is always in the middle of a range). For fixed-point numerics, a (P,S) becomes (2*R^P,R^S), meaning an integer (P,0) becomes (2*R^P,1). For floating-point numerics, a (P) sort-of becomes (2*R^P,0); I say sort-of because SQL:2003 says that the P shows significant figures in just the mantissa, but IRL currently says that the size of the exponent eats away from that, commonly a quarter.

As examples, a base-10 fixed in SQL defined as [p=10,s=0] (an integer in -10^10..10^10-1) becomes [p=20_000_000_000,s=1] in IRL; the base-10 [p=5,s=2] (a fixed in -1_000.00..999.99) becomes [p=200_000,s=100]; the base-2 [p=15,s=0] (a 16-bit int in -32_768..32_767) becomes [p=65_536,s=1]; the base-2 float defined as [p=31] (a 32-bit float in +/-8_388_608*2^+/-128) becomes [p=4_294_967_296,s=0].
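
The translation rules and the examples above can be checked with a few lines of Python. The function names sql_fixed_to_irl and sql_float_to_irl are made up for this sketch; the formulas are the (2*R^P,R^S) and (2*R^P,0) mappings stated in the text.

```python
def sql_fixed_to_irl(radix, precision, scale):
    """Map a base-R SQL fixed-point (P,S) to the IRL (num_precision, num_scale)."""
    return (2 * radix ** precision, radix ** scale)


def sql_float_to_irl(radix, precision):
    """Map a base-R SQL floating-point (P) to the IRL pair; num_scale is 0."""
    return (2 * radix ** precision, 0)
```

Under these definitions, sql_fixed_to_irl(10, 5, 2) gives a num_precision of 200_000 and a num_scale of 100, matching the second example.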

A subtype of NUMERIC may specify the "num_min_value" and/or "num_max_value" attributes, which further reduces the subtype's valid range. For example, a minimum of 1 and maximum of 10 specifies that only numbers in the range 1..10 (inclusive) are allowed. Simply setting the minimum to zero and leaving the maximum unset is the recommended way in IRL to specify that you want to allow any non-negative number. Setting the minimum >= 0 also causes the maximum value range allowable by "num_precision" to shift into the positive, rather than it being half there and half in the negative. Eg, a (P,S) of (256,1) becomes 0..255 when the minimum = 0, whereas it would be -128..127 if the min/max are unset.
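
One way to read the range rules above is sketched below in Python for fixed-point subtypes only. The function value_range is hypothetical, and the handling of a minimum greater than zero (shift the precision range to start at zero, then clip to the stated minimum and maximum) is an assumption, since the text only spells out the minimum = 0 case.

```python
from fractions import Fraction


def value_range(num_precision, num_scale, num_min_value=None, num_max_value=None):
    """Return (lowest, highest) allowed value of a fixed-point NUMERIC subtype."""
    if num_scale < 1:
        raise ValueError("this sketch only handles fixed-point subtypes")
    if num_min_value is not None and num_min_value >= 0:
        # The whole precision range shifts into the non-negative numbers.
        low_steps, high_steps = 0, num_precision - 1
    else:
        # Half of the range is negative, with zero in the middle.
        low_steps, high_steps = -(num_precision // 2), num_precision // 2 - 1
    # Adjacent values are 1 / num_scale apart.
    low = Fraction(low_steps, num_scale)
    high = Fraction(high_steps, num_scale)
    if num_min_value is not None:
        low = max(low, Fraction(num_min_value))
    if num_max_value is not None:
        high = min(high, Fraction(num_max_value))
    return (low, high)
```

With a (P,S) of (256,1) this yields -128..127 when the minimum is unset and 0..255 when the minimum is 0, as in the example above.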

CHAR_STR

This data type is a string of characters. Its concept of nothingness is the empty string. A subtype of CHAR_STR must specify the "char_max_length" and "char_repertoire" attributes, which determine the maximum valid range of the subtype's values, and the subtype's storage representation can often be derived from it too.

The "char_max_length" attribute is an integer >= 0; it specifies the maximum length of the string in characters (eg, a 100 means a string of 0..100 characters can be stored). The "char_repertoire" enumerated attribute specifies what individual characters there are to choose from (eg, Unicode 4.1, Ascii 7-bit, Ansel; Unicode is the recommended choice).

A subtype of CHAR_STR may specify the "char_min_length" attribute, which means the length of the character string must be at least that long (eg, to say strings of length 6..10 are required, set min to 6 and max to 10).

BIT_STR

This data type is a string of bits. Its concept of nothingness is the empty string. A subtype of BIT_STR must specify the "bit_max_length" attribute, which determines the maximum valid range of the subtype's values, and the subtype's storage representation can often be derived from it too.

The "bit_max_length" attribute is an integer >= 0; it specifies the maximum length of the string in bits (eg, an 8000 means a string of 0..8000 bits can be stored).

A subtype of BIT_STR may specify the "bit_min_length" attribute, which means the length of the bit string must be at least that long (eg, to say strings of length 24..32 are required, set min to 24 and max to 32).

A subtype of any of these native primitive types can define a default value for the subtype, it can define whether the subtype is nullable or not (they are all not-nullable by default), and it can enumerate an explicit list of allowed values (eg, [4, 8, 15, 16, 23, 42], or ['foo', 'bar', 'baz'], or [B'1100', B'1001']), one each in a child Node (these must fall within the specified range/size limits otherwise defined for the subtype).
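
As an illustration of these subtype attributes, the following Python sketch builds a membership test for a CHAR_STR subtype from a length range and an optional enumerated list. The function make_char_str_subtype is hypothetical, and the mandatory "char_repertoire" attribute is left out to keep the example short.

```python
def make_char_str_subtype(char_max_length, char_min_length=0, allowed_values=None):
    """Return a membership test for a CHAR_STR subtype."""
    def within_length(value):
        return (isinstance(value, str)
                and char_min_length <= len(value) <= char_max_length)

    if allowed_values is not None:
        allowed_values = frozenset(allowed_values)
        # Enumerated values must fall within the subtype's own size limits.
        if not all(within_length(v) for v in allowed_values):
            raise ValueError("an enumerated value violates the length limits")

    def contains(value):
        return within_length(value) and (
            allowed_values is None or value in allowed_values
        )
    return contains
```

A subtype for strings of length 6..10 would be make_char_str_subtype(10, char_min_length=6), and one that allows only 'foo', 'bar' and 'baz' would pass those as allowed_values with a maximum length of at least 3.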

Native Scalar Type

IRL has native support for a special SCALAR data type, which is akin to SQLite's weakly typed table columns, or to Perl's weakly typed default scalar variables. This data type is a union of the domains of the BOOLEAN, NUMERIC, CHAR_STR, and BIT_STR data types; it is not-nullable by default. Its concept of nothingness is the empty string.

SEE ALSO ^

Go to Rosetta for the majority of distribution-internal references, and Rosetta::SeeAlso for the majority of distribution-external references.

AUTHOR ^

Darren R. Duncan (perl@DarrenDuncan.net)

LICENCE AND COPYRIGHT ^

This file is part of the Rosetta DBMS framework.

Rosetta is Copyright (c) 2002-2006, Darren R. Duncan.

See the LICENCE AND COPYRIGHT of Rosetta for details.

ACKNOWLEDGEMENTS ^

The ACKNOWLEDGEMENTS in Rosetta apply to this file too.
