SPECIFICATION VERSION

    0.9

STATUS

    In the 0.9.0 series, there will probably still be incompatible syntax
    changes between revisions before the spec stabilizes into the 1.0 series.

ABOUT

    This document specifies Sah, a schema language for validating data
    structures.

    In this document, schemas and data structures are mostly written in
    pseudo-JSON (JSON with comments // ..., ellipsis ..., or some
    JavaScript).

    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
    document are to be interpreted as described in RFC 2119.

SCHEMA

    Although it can contain extra information, a schema is essentially a
    type definition, stating the set of valid values for data.

    Sah schemas are regular data structures, specifically arrays:

     [TYPE_NAME, CLAUSE_SET, EXTRAS]

    TYPE_NAME is a string, CLAUSE_SET is a hash of clauses, and EXTRAS is a
    hash and is optional. Some examples:

     ["int", {"min": 0, "max": 100}]
    
     // a definition of pos_even (positive even natural numbers). "pos" is defined
     // in the EXTRAS part.
     ["pos", {"div_by": 2}, {"def": {"pos": ["int", {"min": 0}]}}]

    A shortcut string form containing only the type name is allowed when
    there are no clauses. It will be normalized into the array form:

     "int"

    The type name can have a * suffix as a shortcut for the "req": 1
    clause. This shortcut exists because stating something is required is
    very common.

     "int*"
    
     // equivalent to
     ["int", {"req": 1}]
    
     ["int*", {"min": 0}]
    
     // equivalent to
     ["int", {"req": 1, "min": 0}]

    A flattened array form is also supported when there are no EXTRAS. It
    will be normalized into the non-flattened form. This shortcut saves a
    few keystrokes and reduces the number of nested structures, which can
    get unwieldy in complex schemas.

     ["int", "min", 1, "max", 10]
    
     // is equivalent to
     ["int", {"min": 1, "max": 10}]
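    The normalization rules above can be sketched as follows. This is a
    minimal illustration, not the reference implementation: the function
    name normalize_schema is hypothetical, and filling in an empty hash
    for a missing EXTRAS is an assumption made for this sketch.

```python
def normalize_schema(schema):
    """Return [type_name, clause_set, extras] for any accepted schema form."""
    if isinstance(schema, str):
        # String shortcut: only a type name, no clauses.
        type_name, clause_set, extras = schema, {}, {}
    elif isinstance(schema, list) and schema and isinstance(schema[0], str):
        type_name, rest = schema[0], schema[1:]
        if not rest:
            clause_set, extras = {}, {}
        elif isinstance(rest[0], dict):
            # Regular array form: [TYPE_NAME, CLAUSE_SET, EXTRAS].
            if len(rest) > 2:
                raise ValueError("too many elements in schema")
            clause_set = dict(rest[0])
            extras = dict(rest[1]) if len(rest) == 2 else {}
        else:
            # Flattened form: [TYPE_NAME, name1, value1, name2, value2, ...].
            if len(rest) % 2:
                raise ValueError("flattened form needs name/value pairs")
            clause_set = dict(zip(rest[0::2], rest[1::2]))
            extras = {}
    else:
        raise ValueError("schema must be a string or a non-empty array")

    if type_name.endswith("*"):
        # "int*" is a shortcut for the "req": 1 clause.
        type_name = type_name[:-1]
        clause_set["req"] = 1
    return [type_name, clause_set, extras]
```

    With this sketch, "int*", ["int*", {"min": 0}] and the flattened
    ["int", "min", 1, "max", 10] all end up in the same three-element
    array form.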

TYPE

    A type classifies data and specifies its possible values.

    Sah defines several standard types like bool, int, float, str, array,
    hash, and a few others. Please see Sah::Type for the complete list.

    Type name must match this regular expression:

     \A[A-Za-z_][A-Za-z0-9_]+(::[A-Za-z_][A-Za-z0-9_]+)*\z
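    As an illustration, the pattern above can be checked in Python like
    this. The helper name is_valid_type_name is hypothetical. The pattern
    is copied as written; the only change is that the end-of-string anchor
    is spelled \Z instead of \z.

```python
import re

# Same pattern as in the text, with Python's \Z end-of-string anchor.
TYPE_NAME_RE = re.compile(r"\A[A-Za-z_][A-Za-z0-9_]+(::[A-Za-z_][A-Za-z0-9_]+)*\Z")

def is_valid_type_name(name):
    """Return True if name matches the type name pattern."""
    return TYPE_NAME_RE.match(name) is not None
```

    Note that, as written, the pattern requires every ::-separated part to
    be at least two characters long, so a one-letter name does not match.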

    A type can have clauses. Most clauses declare constraints (thus,
    constraint clauses). Constraint clauses are like functions: they accept
    an argument, are evaluated against data, and return a value. The
    returned value need not strictly be boolean, but for the clause to
    succeed, the return value must evaluate to true. The notion of
    true/false follows Perl's: the undefined value, the empty string (""),
    the string "0", and the number 0 are considered false. Everything else
    is true.

    For the schema to succeed, all constraint clauses must evaluate to
    true.
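    The truthiness rules above can be approximated in another language as
    in the sketch below. The names perl_true and schema_succeeds are
    hypothetical, and None stands in for the undefined value.

```python
def perl_true(value):
    """Approximate Perl truthiness for a clause's return value."""
    if value is None:                    # undefined value
        return False
    if isinstance(value, str):
        return value not in ("", "0")    # empty string and the string "0"
    if isinstance(value, (bool, int, float)):
        return value != 0                # number 0
    return True                          # everything else is true

def schema_succeeds(clause_results):
    """A schema succeeds only if every constraint clause result is true."""
    return all(perl_true(result) for result in clause_results)
```

    Under these rules the string "0.0" is true while the number 0.0 is
    false, which is the usual surprise when porting this notion.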

    Aside from declaring constraints, clauses can also declare other
    things. The default clause specifies a default value. Metadata clauses
    specify metadata, e.g. the summary, description, and tags clauses.

    Aside from clauses, a type can also have type properties. Properties
    differ from clauses in the following ways: 1) they are used to find
    out something about the data, not to test/validate data; 2) they are
    allowed to not accept any argument. A type can have a property and a
    clause with the same name; for example, the str type has a len clause
    to test its length against an integer, as well as a len property which
    returns its length. Properties are differentiated from clauses so that
    compilers to human text can generate a description like "string where
    its length is at least 1".

    Type properties can be validated against a schema using the prop or if
    clause.

    Base schema. You can define a schema, declare it as a new type, and
    then write subsequent schemas against that type, along with additional
    clauses. This is very much like subtyping. See "BASE SCHEMA" for more
    information.

BASE SCHEMA

    As mentioned before, you can define a schema as a type and then write
    other schemas against that type. For example:

     // defined as pos_int type
     ["int", {"min": 0}]

    and later:

     // a positive integer, divisible by 5
     ["pos_int", {"div_by": 5}]

    During data validation, each base schema is replaced by its original
    definition, and all the clause sets are evaluated. This is illustrated
    below with a plus sign:

     ["int", {"min": 0} + {"div_by": 5}]
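    The replacement of base schemas can be sketched as follows. This is
    only an illustration: resolve and BUILTIN_TYPES are hypothetical
    names, the set of built-in types is a small assumed subset, and
    schemas are assumed to be already in the two-element array form.

```python
BUILTIN_TYPES = {"int", "str", "array"}  # assumed subset, for illustration only

def resolve(schema, defs):
    """Return (builtin_type, [clause_set, ...]) for a [type, clause_set] schema.

    defs maps a schema-defined type name to its [type, clause_set] definition.
    """
    type_name, clause_set = schema
    chain = [clause_set]
    seen = set()
    while type_name not in BUILTIN_TYPES:
        if type_name in seen or type_name not in defs:
            raise ValueError(f"unknown or circular type: {type_name}")
        seen.add(type_name)
        type_name, base_clause_set = defs[type_name]
        chain.insert(0, base_clause_set)  # base clause sets come first
    return type_name, chain
```

    Resolving ["pos_int", {"div_by": 5}] against a pos_int definition of
    ["int", {"min": 0}] yields the int type with the two clause sets
    {"min": 0} and {"div_by": 5}, in that order.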

    You can also declare base schemas/types locally using the def key in
    EXTRAS, for example:

     ["throws", {},
      {
          "def": {
              "single_dice_throw":  ["int", {"in": [1, 2, 3, 4, 5, 6]}],
              "sdt":                "single_dice_throw", // short notation
              "dice_pair_throw":    ["array", {"len": 2, "elems": ["sdt", "sdt"]}],
              "dpt":                "dice_pair_throw",   // short notation
              "throw":              ["any", {"of": ["sdt", "dpt"]}],
              "throws":             ["array", {"of": "throw"}]
          }
      }
     ]

    The above schema describes a list of dice throws (throws). Each throw
    can be a single dice throw (sdt), which is a number between 1 and 6, or
    a throw of two dice (dpt), which is a 2-element array (where each
    element is a number between 1 and 6).

    Examples of valid data for this schema:

     [1, [1,3], 6, 4, 2, [3,5]]

    Examples of invalid data:

     1                  // not an array
     [1, [2, 3], 0]     // the third throw is invalid
     [1, [2, 0, 4], 4]  // the second throw is invalid

    All the base schema names (throw, throws, sdt, etc.) are declared only
    locally and are unknown outside the schema. You can even nest such
    definitions.
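    To make the meaning of the dice schema concrete, here is a hand-written
    Python equivalent of its checks. It is not generated from the schema;
    the function names simply mirror the definition names and are
    otherwise hypothetical.

```python
def is_sdt(value):
    # single_dice_throw: an integer from 1 to 6 (bool is excluded on purpose,
    # because Python treats it as a subclass of int).
    return isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 6

def is_dpt(value):
    # dice_pair_throw: an array of exactly two single dice throws.
    return isinstance(value, list) and len(value) == 2 and all(is_sdt(v) for v in value)

def is_throw(value):
    # throw: either a single dice throw or a dice pair throw.
    return is_sdt(value) or is_dpt(value)

def is_throws(value):
    # throws: an array in which every element is a throw.
    return isinstance(value, list) and all(is_throw(v) for v in value)
```

    is_throws accepts the valid example above and rejects each of the
    three invalid ones.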

 Optional/conditional definition

    If you put a ? suffix after the definition name then it means that the
    definition is optional and can be skipped if the type is already
    defined, e.g.:

      "def": {
          "emailaddr?": ["str", {"req": 1, "match": ".+\@.+"}],
          "username":   ["str", {"req": 1, "match": "^[a-z0-9_]+$"}]
      }

    In the above example, if there is already an emailaddr type defined at
    that time, the definition will be skipped instead of a "cannot redefine
    type" error being generated.

    Optional definition is useful if you want to provide some defaults
    (e.g. a rudimentary validation for email address) but don't mind if the
    validator already has something probably better (a stricter or more
    precise definition of email address).
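    A sketch of how a validator might process the ? suffix follows. The
    function add_definitions and the plain dict used as a type registry
    are assumptions made for illustration.

```python
def add_definitions(known_types, defs):
    """Register the entries of a "def" hash into known_types (a dict)."""
    for name, schema in defs.items():
        optional = name.endswith("?")
        if optional:
            name = name[:-1]
        if name in known_types:
            if optional:
                continue  # keep the existing, probably better, definition
            raise ValueError(f"cannot redefine type: {name}")
        known_types[name] = schema
    return known_types
```

    With an emailaddr type already registered, "emailaddr?" is skipped
    silently, whereas a second plain "username" definition raises an
    error.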

CLAUSE AND CLAUSE SET

    A clause set is a defhash (see DefHash) that maps clause names to
    clause values, and clause attribute names to clause attribute values.
    Defhash properties map to Sah clauses, while defhash property
    attributes map to Sah clause attributes.

     {
         "CLAUSENAME1": CLAUSEVALUE,
         "CLAUSENAME1.ATTRNAME1": ATTRVALUE1,
         "CLAUSENAME1.ATTRNAME2": ATTRVALUE2,
         "CLAUSENAME1.ATTRNAME1.SUBATTR1": ...,
         ...
         "_IGNORED": ...,
         "CLAUSENAME1._IGNORED": ...
     }

    For convenience, there are also some shortcuts:

      * & suffix (multiple clause values, all must succeed)

       "CLAUSENAME&": [VAL, ...]

      is equivalent to:

       "CLAUSENAME":    [VAL, ...],
       "CLAUSENAME.op": "and"

      * | suffix (multiple clause values, only one must succeed)

       "CLAUSENAME|": [VAL, ...]

      is equivalent to:

       "CLAUSENAME":    [VAL, ...],
       "CLAUSENAME.op": "or"

      * ! prefix (negation)

       "!CLAUSENAME": VAL

      is a shortcut for this:

       "CLAUSENAME": VAL,
       "CLAUSENAME.op": "not"

      * = suffix (expression)

       "CLAUSENAME=": EXPR
      
       "CLAUSENAME.ATTRNAME1=": EXPR

      are respectively equivalent to:

       "CLAUSENAME.is_expr": 1
       "CLAUSENAME": EXPR
      
       "CLAUSENAME.ATTRNAME1.is_expr": 1
       "CLAUSENAME.ATTRNAME1": EXPR

      * (LANG) suffix (value for alternate languages)

       "CLAUSENAME(LANG)": VAL
      
       "CLAUSENAME.ATTRNAME1(LANG)": VAL

      are respectively equivalent to:

       "CLAUSENAME.alt.lang.LANG": VAL
      
       "CLAUSENAME.ATTRNAME1.alt.lang.LANG": VAL

      Examples:

       "name(id_ID)": "bilangan bulat positif"
       "name(en_US)": ["positive integer", "positive integers"]

      are equivalent to:

       "name.alt.lang.id_ID": "bilangan bulat positif"
       "name.alt.lang.en_US": ["positive integer", "positive integers"]
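    The shortcut expansions listed above can be sketched as a key
    rewriting step. The function name expand_shortcuts and the regular
    expression are assumptions of this sketch; it handles one shortcut per
    key, since the text does not say how shortcuts combine.

```python
import re

# Optional "!" prefix, a dotted name, optional "(LANG)", optional &, | or = suffix.
SHORTCUT_RE = re.compile(
    r"(!?)([A-Za-z_][A-Za-z0-9_.]*)(?:\(([A-Za-z0-9_]+)\))?([&|=]?)"
)
OPS = {"&": "and", "|": "or"}

def expand_shortcuts(clause_set):
    """Rewrite shortcut keys into plain CLAUSE and CLAUSE.ATTR keys."""
    out = {}
    for key, value in clause_set.items():
        m = SHORTCUT_RE.fullmatch(key)
        if m is None:
            raise ValueError(f"unrecognized clause key: {key!r}")
        negate, name, lang, suffix = m.groups()
        if lang is not None:
            name = f"{name}.alt.lang.{lang}"   # (LANG) suffix
        out[name] = value
        if suffix in OPS:
            out[name + ".op"] = OPS[suffix]    # & and | suffixes
        elif suffix == "=":
            out[name + ".is_expr"] = 1         # = suffix
        if negate:
            out[name + ".op"] = "not"          # ! prefix
    return out
```

    For example, {"!in": ["root"]} becomes {"in": ["root"], "in.op":
    "not"}, and {"min_len=": "2*2"} becomes {"min_len": "2*2",
    "min_len.is_expr": 1}.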

    Every clause has a priority between 0 and 100 to determine the order of
    evaluation (the lower the number, the higher the priority and the
    earlier the clause is evaluated). Most constraint clauses are at
    priority 50 (normal) so the order does not matter, but some clauses are
    early (like default and prefilters) and some are late (like
    postfilters). Variables mentioned in expressions also determine
    ordering, for example:

     ["int", {"min=": "0.5*$clause:max", "max": 10}]

    In the above example, although max and min are both at priority 50, min
    needs to be evaluated first because it refers to max (XXX syntax of
    variable not yet finalized).
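    Ordering by priority alone can be sketched as a stable sort. The
    numeric priorities in ASSUMED_PRIORITY are hypothetical; the text only
    states that most clauses are at 50, that default and prefilters are
    early, and that postfilters is late. The sketch does not model
    ordering caused by variables in expressions.

```python
# Hypothetical numbers: only "50 for most clauses" comes from the text.
ASSUMED_PRIORITY = {"default": 1, "prefilters": 10, "postfilters": 90}

def evaluation_order(clause_names):
    """Order clause names by priority; lower numbers are evaluated earlier."""
    # sorted() is stable, so clauses of equal priority keep their given order.
    return sorted(clause_names, key=lambda name: ASSUMED_PRIORITY.get(name, 50))
```

    Given ["max", "postfilters", "min", "default"], this returns
    ["default", "max", "min", "postfilters"].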

 Clause name

    This specification comes from DefHash: clause names must begin with a
    letter or underscore and contain only letters, numbers, and
    underscores. All clauses which begin with an _ (underscore) are
    ignored. You can use this to embed extra data for other purposes.

 Clause attribute

    This specification comes from DefHash: attribute names must also
    contain only letters, numbers, and underscores, but they can be a
    dot-separated series of parts, e.g. alt.lang.id_ID. As with clauses,
    clause attributes which begin with _ (underscore) are ignored. You can
    use this to embed extra data.

    Currently known general attributes:

      * prio : INT

      Change the clause's priority for this clause set. Note that this only
      works for clauses which have equal priorities. Otherwise, priority
      value from clause definition takes precedence.

      Example:

       // both "min" and "max" clauses have priority of 50, but we want to make sure
       // that "min" is evaluated first
       ["int*", {"min=": "some expr", "min.prio": 1, "max": 10}]

      * op : STR

      Specify operator for (multiple) clause values. Possible values for
      this attribute include: and, or, none, not. Except for not, the
      presence of op signifies that clause contains multiple values instead
      of a single one.

      There are shortcuts for and, or, and not; see "CLAUSE AND CLAUSE
      SET".

      and specifies that all clause values must succeed for the clause to
      succeed. Example:

       ["str", {"clause": [["min_len", 8], ["match", "\\W"]], "clause.op": "and"}]

      The above schema requires a string to be at least 8 characters long
      and to contain a non-word character. Strings that would validate
      include: $abcdefg or abcdefg$. Strings that would not validate
      include: abcd (fails both the min_len and match clauses), abcdefgh
      (fails the match clause), or $ (fails the min_len clause).

      or specifies that any one of clause values must succeed for the
      clause to succeed. Example:

       ["str", {"match": [RE1, RE2, RE3], "match.op": "or"}]

      The above schema specifies that string can match any of the regexes
      RE1/RE2/RE3.

      none specifies that all clause values must fail for the clause to
      succeed. For example:

       ["str", {"match": [RE1, RE2, RE3], "match.op": "none"}]

      The above schema specifies that string must not match any of the
      regexes RE1/RE2/RE3.

      not inverts the success status of the clause (in other words, the
      clause must fail for validation to succeed). Example:

       ["str", {"match": RE, "match.op": "not"}]

      The above schema specifies that the string must not match regex RE.

      * is_expr : BOOL

      Signify that clause contains expression (see "EXPRESSION") instead of
      literal value. Example:

       // a string, minimum 4 characters
       ["str", {"min_len": 4}]
      
       // same thing, albeit a bit fancier
       ["str", {"min_len.is_expr": 1, "min_len": "2*2"}]
      
       // same thing, shortcut notation
       ["str", {"min_len=": "2*2"}]
      
       // for default, we pick a random number between 1 and 10
       ["int", {"default=": "int(10*rand())+1"}]

      Expression is useful for more complex schema, when a clause/attribute
      value needs to be calculated in terms of other values, and/or using
      functions.

      Note that an implementation might not support expressions in some
      clauses or attributes, especially clauses whose arguments contain
      schemas, because dynamically generated schemas require the compiler
      to embed an interpreter or compiler in the generated code.

      When is_expr attribute is true, and op is also one that requires
      multiple clause values (like and, or, none), then the expression is
      expected to return an array of values. Otherwise, the clause will
      fail. Example:

       // number which must be divisible by 2, 3, 5
       ["int", {"div_by.is_expr": 1, "div_by.op": "and", "div_by": "[2, 3, 5]"}]
      
       // string must not contain any of the blacklisted substrings
       ["str", {
           "contains.is_expr": 1,
           "contains.op": "none",
           "contains": "get_blacklist()"
       }]

      * err_level : STR (default: error)

      Valid value: fatal, error, warn. Normally, when clause checking
      fails, an error is generated and it causes validation of the whole
      schema to fail. If err_level is set to warn, however, this only
      generates a warning and does not cause the validation to fail.

       // password
       ["str*", {"clset&": [
         {"min_len": 4},
         {"min_len": 8,
          "min_len.err_level": "warn",
          "min_len.err_msg":   "Although a password shorter than 8 letters is " +
                               "valid, it is highly recommended that a password be " +
                               "at least 8 letters long, for security reasons"}
       ]}]

      In the above example, err_level and err_msg are attributes of the
      min_len clause. The second clause set adds an optional restriction
      on the password: when its min_len clause is not satisfied, only a
      warning is issued instead of the data failing validation.

      fatal is the same as error but will make validation exit early,
      without collecting further errors. This only takes effect when
      validation collects full errors instead of just stopping after the
      first error is found.

      * err_msg[.alt.lang.LANGCODE]

      This tells the compiler that instead of the default error message
      from the type handler, a custom error message is supplied. You can
      add translations by adding more attributes with language code
      suffixes. For example:

       ["str", {"!match":                       "[^A-Za-z0-9_-]",
                "match.err_msg":                "Must not contain naughty characters",
                "match.err_msg.alt.lang.id_ID": "Tidak boleh mengandung karakter aneh-aneh"
       }]

      Another example:

       ["str", {"!in": ["root", "admin"],
                "in.err_msg":                "Sorry, username is reserved",
                "in.err_msg.alt.lang.id_ID": "Maaf, nama user dilarang digunakan"
       }]

      * human[.alt.lang.LANGCODE]

      This is also ignored when validating data, but will be used by the
      human compiler to supply description. You can add translations by
      adding more attributes.

       ["str", {"!match":                     "[^A-Za-z0-9_-]",
                "match.human":                "Must not contain naughty characters",
                "match.human.alt.lang.id_ID": "Tidak boleh mengandung karakter aneh-aneh"
       }]

      * alt

      This comes from DefHash, mainly used to store translations for name,
      summary, description.

      * result_var : VARNAME (EXPERIMENTAL)

      Specify variable name to store results in.

      Aside from pass/failure, a clause or clause set can also produce some
      value. This attribute specifies where to put the results in. The
      value can then be used by referring to the variable in expression.
      Example:

       ["any", {
           "of": [
               ["str*",    {"min_len": 1, "max_len": 10}], // 1
               ["str*",    {"min_len": 11}],               // 2
               ["array*",  {}],                            // 3
               ["hash*",   {}]                             // 4
           ],
          "of.result_var": "a"
       }]

      Aside from passing/failing the validation, the of clause above also
      produces an index to the schema in the list which matches. So if you
      validate an array, $a in the schema will be set to 3. If you validate
      a string with length 12, $a will be set to 2. If you pass an empty
      string (which does not pass the of clause), $a will not be set.

      Refer to each clause's documentation to find out what value the
      clause returns.

      * c.COMPILER

      This is a namespace for specifying compiler options. Each compiler
      will have its specific options; see documentation on respective
      compiler to see available options. For example:

       // skip clauses which are not implemented in JavaScript. we'll check on the
       // server-side anyway.
       ["str", {
         "soundex": "E460",
         "c.js.ignore_missing_clause_handler": true
       }]

      * x.WHATEVER

      This comes from DefHash and is an alternative to underscore prefix
      for putting extra data in a schema. The difference is that some
      processing tool might strip the underscore clause/attribute.

    Aside from the above general attributes, each clause might recognize
    its own specific attributes. See documentation of respective clauses.
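    The semantics of the op attribute described above (and, or, none, not)
    can be sketched as follows. The names apply_op and matcher are
    hypothetical; a clause check is modelled as a function that takes one
    clause value and reports whether the data passes.

```python
import re

def apply_op(check, values, op=None):
    """Combine a clause check according to the op attribute.

    Without op, or with "not", values is a single clause value.
    With "and", "or" or "none", values is a list of clause values.
    """
    if op is None:
        return bool(check(values))
    if op == "not":
        return not check(values)
    results = [bool(check(v)) for v in values]
    if op == "and":
        return all(results)        # every clause value must succeed
    if op == "or":
        return any(results)        # at least one clause value must succeed
    if op == "none":
        return not any(results)    # every clause value must fail
    raise ValueError(f"unknown op: {op}")

def matcher(data):
    """Build a check for a match clause against one piece of data."""
    return lambda regex: re.search(regex, data) is not None
```

    For instance, apply_op(matcher("abc"), ["^x", "c$"], "or") succeeds
    because the second regex matches, while the same call with "none"
    fails.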

 Clause set merging

    Clause set merging happens when a schema is based on another schema and
    the child schema's clause set contains merge prefixes (explained later)
    in its keys. For example:

     // schema1
     [TYPE1, CLSET1]
    
     // schema2, based on schema1
     [schema1, CLSET2]
    
     // schema3, based on schema2
     [schema2, CLSET3]

    When compiling/evaluating schema2, Sah will check against TYPE1 and
    CLSET1 and then CLSET2. However, when CLSET2 contains a merge prefix
    (marked with an asterisk here for illustration), then Sah will check
    against TYPE1 and merge(CLSET1, *CLSET2).

    When compiling/evaluating schema3, Sah will check against TYPE1 and
    CLSET1 and then CLSET2 and then CLSET3. However, when CLSET2 contains a
    merge prefix, then Sah will check against TYPE1, merge(CLSET1,
    *CLSET2), and then CLSET3. When CLSET2 and CLSET3 both contain merge
    prefixes, Sah will check against TYPE1 and merge(CLSET1, *CLSET2,
    *CLSET3). So merging will be done from left to right.

    The base schema's clause set must not contain any merge prefixes.

    Merging is done using Data::ModeMerge, with merge prefixes changed to
    'merge.add.', 'merge.delete.' and so on. In merging, Data::ModeMerge
    allows keys on the right side hash not only to replace but also add,
    subtract, remove keys from the left side. This is powerful because it
    allows schema definition to not only add clauses (restrict types even
    more), but also replace clauses (change type restriction) as well as
    delete clauses (relax type restriction). For more information, refer to
    the Data::ModeMerge documentation.

    Illustration:

     int + {"div_by": 2} + {"div_by": 3}               // must be divisible by 2 & 3
    
     int + {"div_by": 2} + {"merge.normal.div_by": 3} // will be merged and become:
     int + {"div_by": 3}                              // must be divisible by 3 ONLY
    
     int + {"div_by": 2} + {"merge.delete.div_by": 0}  // will be merged and become:
     int + {}                                          // need not be divisible
    
     int + {"in": [1,2,3,4,5]} + {"in": [6]}           // impossible to satisfy
    
     int + {"in": [1,2,3,4,5]} + {"merge.add.in": [6]} // will be merged and become:
     int + {"in": [1,2,3,4,5,6]}
    
     int + {"in": [1,2,3,4,5]} + {"merge.subtract.in": [4]} // will become:
     int + {"in": [1,2,3,  5]}

    Merging is performed before schema is normalized.

    Merging is not recursive.
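    The illustration above can be reproduced with a small hand-rolled
    sketch. It does not use Data::ModeMerge and covers only the four
    prefixes shown, with add and subtract limited to arrays. Treating
    unprefixed keys in a merged clause set as plain replacement, and the
    names merge_clause_sets and collapse, are assumptions of this sketch.

```python
MERGE_PREFIX = "merge."

def has_merge_prefix(clause_set):
    return any(key.startswith(MERGE_PREFIX) for key in clause_set)

def merge_clause_sets(left, right):
    """Merge right into a copy of left, honoring merge.MODE. key prefixes."""
    result = dict(left)
    for key, value in right.items():
        if not key.startswith(MERGE_PREFIX):
            result[key] = value            # assumed: unprefixed keys replace
            continue
        mode, name = key[len(MERGE_PREFIX):].split(".", 1)
        if mode == "normal":
            result[name] = value           # replace the left value
        elif mode == "delete":
            result.pop(name, None)         # drop the clause entirely
        elif mode == "add":
            result[name] = result.get(name, []) + value
        elif mode == "subtract":
            result[name] = [v for v in result.get(name, []) if v not in value]
        else:
            raise ValueError(f"unsupported merge mode: {mode}")
    return result

def collapse(clause_sets):
    """Fold a base-to-child list of clause sets from left to right."""
    out = []
    for clause_set in clause_sets:
        if has_merge_prefix(clause_set):
            if not out:
                raise ValueError("the base clause set must not contain merge prefixes")
            out[-1] = merge_clause_sets(out[-1], clause_set)
        else:
            out.append(dict(clause_set))
    return out
```

    collapse returns the list of clause sets that remain to be checked:
    two clause sets when nothing is merged, a single merged one when the
    child uses a merge prefix.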

EXPRESSION

    XXX: Syntax of variables not yet finalized.

    Sah supports expressions, using Language::Expr minilanguage. See
    Language::Expr::Manual::Syntax for details on the syntax. You can
    specify expression in the check clause, e.g.:

     ["int", {"check": "$_ >= 4"}]

    Alternatively, an expression can also be specified in any clause or
    clause attribute:

     ["int", {"min=": "floor(4.9)"}]

    The above two schemas are equivalent to:

     ["int", {"min": 4}]

    Expression can refer to elements of data and (normalized) schema, and
    can call functions, enabling more complex schema to be defined, for
    example:

     ["array*", {"len": 2, "elems": [
       ["str*", {"match": "^\w+$"}],
       ["str*", {"match=": "${../../0/clause_sets/0/match}",
                 "min_len=": "2*length(${data:../0})"}]
     ]}]

    The above schema requires data to be a two-element array containing
    strings, where the length of the second string has to be at least twice
    the length of the first. Both strings have to comply with the same regex,
    ^\w+$ (which is declared on the first string's clause and referred to
    in the second string's clause).
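    For comparison, the checks expressed by the schema above can be written
    by hand as in the sketch below. It is not produced by a Sah compiler,
    and the name validate_pair is hypothetical.

```python
import re

WORD_RE = re.compile(r"^\w+$")

def validate_pair(data):
    """Hand-coded check mirroring the two-element array schema."""
    if not isinstance(data, list) or len(data) != 2:
        return False
    first, second = data
    if not (isinstance(first, str) and isinstance(second, str)):
        return False
    # Both strings share one regex; the second clause reuses the first's.
    if not (WORD_RE.search(first) and WORD_RE.search(second)):
        return False
    # The second string must be at least twice as long as the first.
    return len(second) >= 2 * len(first)
```

    Here ["ab", "abcd"] passes, while ["ab", "abc"] fails the length
    requirement and ["a b", "abcdef"] fails the regex.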

FUNCTION

    Functions can be used in expressions. The syntax of calling function
    is:

     func()
     func(ARG, ...)

    Functions in Sah can sometimes accept several types of arguments, e.g.
    len(ARRAY) will return the number of elements in the ARRAY, while
    len(STR) will return the number of characters in the string. However,
    when an inappropriate argument is given, an exception will be thrown.
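    A function that accepts several argument types, as len does, might be
    sketched like this. The name sah_len and the choice of TypeError as
    the exception are assumptions made for illustration.

```python
def sah_len(arg):
    """len(ARRAY) counts elements; len(STR) counts characters."""
    if isinstance(arg, (list, tuple, str)):
        return len(arg)
    # An inappropriate argument raises an exception.
    raise TypeError(f"len() does not accept {type(arg).__name__}")
```

    sah_len([1, 2, 3]) returns 3 and sah_len("abcd") returns 4, while
    sah_len(5) raises an exception.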

EXTRAS

    The extras part of a schema (the third element) is a DefHash that can
    contain these keys:

      * def

      Subschema definitions.

HISTORY

    2012-07-21 split specification to Sah

    2011-11-23 Data::Sah

    2009-03-30 Data::Schema (first CPAN release)

    Previous incarnation as Schema-Nested (internal)

SEE ALSO

    DefHash

    Sah::Type, Sah::FAQ