The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
/* ------------------------------------------------------------------------
@NAME       : bibtex.g
@DESCRIPTION: PCCTS-based lexer and parser for BibTeX files.  (Or rather,
              for the BibTeX data description language.  This parser
              enforces nothing about the structure and contents of
              entries; that's up to higher-level processors.  Thus, there's
              nothing either particularly bibliographic or TeXish about
              the language accepted by this parser, apart from the affinity
              for curly braces.)

              There are a few minor differences from the language accepted
              by BibTeX itself, but these are generally improvements over
              BibTeX's behaviour.  See the comments in the grammar, at least
              until I write a decent description of the language.

              I have used Gerd Neugebauer's BibTool (yet another BibTeX
              parser, along with a prettyprinter and specialized language
              for a common set of bibhacks) as another check of correctness
              -- there are a few screwball things that BibTeX accepts and
              BibTool doesn't, so I felt justified in rejecting them as
              well.  In general, this parser is a little stricter than
              BibTeX, but a little looser than BibTool.  YMMV.

              Another source of inspiration is Nelson Beebe's bibclean, or
              rather Beebe's article describing bibclean (from TUGboat
              vol. 14 no. 4; also included with the bibclean distribution).

              The product of the parser is an abstract syntax tree that can
              be traversed to be printed in a simple form (see
              print_entry() in bibparse.c) or perhaps transformed to a
              format more convenient for higher-level languages (see my
              Text::BibTeX Perl module for an example).

              Whole files may be parsed by entering the parser at `bibfile';
              in this case, the parser really returns a forest (list of
              ASTs, one per entry).  Alternately, you can enter the parser
              at `entry', which reads and parses a single entry.
@GLOBALS    : the usual DLG and ANTLR cruft
@CALLS      : 
@CREATED    : first attempt: May 1996, Greg Ward
              second attempt (complete rewrite): July 25-28 1996, Greg Ward
@MODIFIED   : Sep 1996, GPW: changed to generate an AST rather than print
                             out each entry as it's encountered
              Jan 1997, GPW: redid the above, because it was lost when
                             my !%&$#!@ computer was stolen
              Jun 1997, GPW: greatly simplified the lexer, and added handling
                             of %-comments, @comment and @preamble entries,
                             and proper scanning of between-entry junk
@VERSION    : $Id$
@COPYRIGHT  : Copyright (c) 1996-99 by Gregory P. Ward.  All rights reserved.

              This file is part of the btparse library.  This library is
              free software; you can redistribute it and/or modify it under
              the terms of the GNU Library General Public License as
              published by the Free Software Foundation; either version 2
              of the License, or (at your option) any later version.
-------------------------------------------------------------------------- */

#header
<<
#define ZZCOL
#define USER_ZZSYN

#include "config.h"
#include "btparse.h"
#include "attrib.h"
#include "lex_auxiliary.h"
#include "error.h"
#include "my_dmalloc.h"

extern char * InputFilename;            /* for zzcr_ast call in pccts/ast.c */
>>

/*
 * The lexer has three modes -- START (between entries, hence it's what
 * we're in initially), LEX_ENTRY (entered once we see an '@' at
 * top-level), and LEX_STRING (for scanning quoted strings).  Note that all
 * the functions called from lexer actions can be found in lex_auxiliary.c.
 *
 * The START mode just looks for '@', discards comments and whitespace,
 * counts lines, and keeps track of any other junk.  The "keeping track"
 * just consists of counting the number of junk characters, which is then
 * reported at the next '@' sign.  This will hopefully let users clean up
 * "old style" implicit comments, and possibly catch some legitimate errors
 * in their files (eg. a complete entry that's missing an '@').
 */

#token AT           "\@"            << at_sign (); >>
#token              "\n"            << newline (); >>
#token COMMENT      "\%~[\n]*\n"    << comment (); >>
#token              "[\ \r\t]+"     << zzskip (); >>
#token              "~[\@\n\ \r\t]+"<< toplevel_junk (); >>

#lexclass LEX_ENTRY

/*
 * The LEX_ENTRY mode is where most of the interesting stuff is -- these
 * tokens control most of the syntax of BibTeX.  First, we duplicate most
 * of the START lexer, in order to handle newlines, comments, and
 * whitespace.
 *
 * Next comes a "number", which is trivial.  This is needed because a
 * BibTeX simple value may be an unquoted digit string; it has to precede
 * the definition of "name" tokens, because otherwise a digit string would
 * be a legitimate "name", which would cause an ambiguity inside entries
 * ("is this a macro or a number?")
 * 
 * Then comes the regexp for a BibTeX "name", which is used for entry
 * types, entry keys, field names, and macro names.  This is basically the
 * same as BibTeX's definition of such "names", with two differences.  The
 * key, fundamental difference is that I have defined names by inclusion
 * rather than exclusion: this regex lists all characters allowed in a
 * type/key/field name/macro name, rather than listing those characters not
 * allowed (as the BibTeX documentation does).  The trivial difference is
 * that I have disallowed a few extra characters: @ \ ~.  Allowing @ could
 * cause confusing BibTeX syntax, and allowing \ or ~ can cause bogus TeX
 * code: try putting "\cite{foo\bar}" in your LaTeX document and see what
 * happens!  I'm also rather skeptical about some of the more exotic
 * punctuation characters being allowed, but since people have been using
 * BibTeX's definition of "names" for a decade or so now, I guess we're
 * stuck with it.  I could always amend name() to warn about any exotic
 * punctuation that offends me, but that should be an option -- and I don't
 * have a mechanism for user selectable warnings yet, so it'll have to
 * wait.
 * 
 * Also note that defining "number" ahead of "name" precludes a string of
 * digits from being a name.  This is usually a good thing; we don't want
 * to accept digit strings as article types or field names (BibTeX
 * doesn't).  However -- dubious as it may seem -- digit strings are
 * legitimate entry keys, so we should accept them there.  This is handled
 * by the grammar; see the `contents' rule below.
 * 
 * Finally, it should be noted that BibTeX does not seem to apply the same
 * lexical rules to entry types, entry keys, and field names -- so perhaps
 * doing so here is not such a great idea.  One immediate manifestation of
 * this is that my grammar in its unassisted state would accept a field
 * name with leading digits; BibTeX doesn't accept this.  I correct this
 * with the check_field_name() function, called from the `field' rule in
 * the grammar and defined in parse_auxiliary.c.
 */
#token              "\n"            << newline (); >>
#token COMMENT      "\%~[\n]*\n"    << comment (); >>
#token              "[\ \r\t]+"     << zzskip (); >>
#token NUMBER       "[0-9]+"
#token NAME         "[a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+"
                                    << name (); >>

/* 
 * Now come the (apparently) easy tokens, i.e. punctuation.  There are a
 * number of tricky bits here, though.  First, '{' can have two very
 * different meanings: at top-level, it's an entry delimiter, and inside an
 * entry it's a string delimiter.  This is handled (in lbrace()) by keeping
 * track of the "entry state" (top-level, after '@', after type, in
 * comment, or in entry) and using that to determine what to do on a '{'.
 * If we're in an entry, lbrace() will switch to the string lexer by
 * calling start_string(); if we're immediately after an entry type token
 * (which is just a name following a top-level '@'), then we force the
 * current token to ENTRY_OPEN, so that '{' and '(' appear identical to the
 * parser.  (This works because the scanner generated by DLG just happens
 * to assign the token number first, and then executes the action.)
 * Anywhere else (ie. at top level or immediately after an '@', we print a
 * warning and leave the token as LBRACE, which will cause a syntax error
 * (because LBRACE is not used anywhere in the grammar).
 *
 * '(' has some similarities to '{', but it's different enough that it 
 * has its own function.  In particular, it may be an entry opener just 
 * like '{', but in one particular case it may be a string opener.  That
 * particular case is where it follows '@' and 'comment'; in that case,
 * lparen() will call start_string() to enter the string lexer.
 *
 * The other delimiter characters are easier, but still warrant an
 * explanation.  '}' should only occur inside an entry, and if found there
 * the token is forced to ENTRY_CLOSER; anywhere else, a warning is printed
 * and the parser should find a syntax error.  ')' should only occur inside
 * an entry, and likewise will trigger a warning if seen elsewhere.
 * (String-closing '}' and ')' are handled by the string lexer, below.)
 *
 * The other punctuation characters are trivial.  Note that a double quote
 * can start a string anywhere (except at top-level!), but if it occurs in
 * a weird place a syntax error will eventually occur.
 */
#token LBRACE       "\{"            << lbrace (); >>
#token RBRACE       "\}"            << rbrace (); >>
#token ENTRY_OPEN   "\("            << lparen (); >>
#token ENTRY_CLOSE  "\)"            << rparen (); >>
#token EQUALS       "="
#token HASH         "\#"
#token COMMA        ","
#token              "\""            << start_string ('"'); >>


#lexclass LEX_STRING

/*
 * Here's a reasonably decent attempt at lexing BibTeX strings.  There are
 * a couple of sneaky tricks going on here that aren't strictly necessary,
 * but can make the user's life a lot easier.
 *
 * First, here's what a simple and straightforward BibTeX string lexer 
 * would do:
 *   - keep track of brace-depth by incrementing/decrementing a counter
 *     whenever it sees `{' or `}'
 *   - if the string was started with a `{' and it sees a `}' which
 *     brings the brace-depth to 0, end the string
 *   - if the string was started with a `"' and it sees another `"' at
 *     brace-depth 0, end the string
 *   - any other characters are left untouched and become part of the
 *     string
 *
 * (Note that the simple act of counting braces makes this lexer
 * non-regular -- there's a bit more going on here than you might
 * think from reading the regexps.  So sue me.)
 *
 * The first, most obvious refinement to this is to check for newlines
 * and other whitespace -- we should convert either one to a single
 * space (to simplify future processing), as well as increment zzline on
 * newline.  Note that we don't do any collapsing of whitespace yet --
 * newlines surrounded by spaces make that rather tricky to handle
 * properly in the lexer (because newlines are handled separately, in
 * order to increment zzline), so I put it off to a later stage.  (That
 * also gives us the flexibility to collapse whitespace or not,
 * according to the user's whim.)
 * 
 * A PCCTS lexer to handle these requirements would look something like this:
 * 
 * #token     "\n"             << newline_in_string (); >>
 * #token     "[\r\t]"         << zzreplchar (' '); zzmore (); >>
 * #token     "\{"             << open_brace(); >>
 * #token     "\}"             << close_brace(); >>
 * #token     "\""             << quote_in_string (); >>
 * #token     "~[\n\{\}\"]+"   << zzmore (); >>
 *
 * where the functions called are the same as currently in lex_auxiliary.c.
 * 
 * However, I've added some trickery here that lets us heuristically detect
 * runaway strings.  The heuristic is as follows: anytime we have a newline
 * in a string, that's reason to suspect a runaway.  We follow up on this
 * suspicion by slurping everything that could reasonably be part of the
 * string and still be in the same line (i.e., a string of anything except
 * newline, braces, parentheses, double-quote, and backslash), and then
 * calling check_runaway_string().  This function then "backs up" to the
 * beginning of the slurped string (the newline), and scans ahead looking
 * for one of two patterns: "@name[{(]", or "name=" (with optional
 * whitespace between the "tokens").  (Actually, it first makes a pass over
 * the string to convert all whitespace characters -- including the sole
 * newline -- to spaces.  So, it's effectively looking for "\ *\@\ *NAME\
 * *[\{\(]" (DLG regexp syntax) or "\ *NAME\ *=", where
 * NAME="[a-z][a-z0-9+/:'.-]*" -- that is, something that looks like the
 * start of an entry or a new field, but in a string (where they almost
 * certainly shouldn't occur).  Of course, there are no explicit regexps
 * there -- it's all coded as a little hand-crafted automaton in C.
 *
 * At any rate, if either one of these patterns is matched,
 * check_runaway_string() prints a warning and sets a flag so that we don't
 * print that warning -- or indeed, even scan for the suspect patterns --
 * more than once for the current string.  (Because chances are if it
 * occurs once, it'll occur again and again and again.)
 *
 * There is also some trickery going on to deal with '@comment' entries.
 * Syntactically, these are just AT NAME STRING, where NAME must be
 * 'comment'.  This means that an '@comment' entry has no delimiters, it
 * just has a string.  To make them look a bit more like the other kinds of
 * entries (which are delimited with '{' ... '}' or '(' ... ')', the STRING
 * here is special: it's delimited either by braces or parentheses, rather
 * than by the usual braces or double-quotes.  Thus, we treat parentheses
 * much like braces in this lexer, to handle the '@comment(...)' case.  And
 * there's an explicit check for the erroneous '@comment"..."' case in
 * start_string(), just to be complete.
 *
 * So that explains all the regexps in this lexer: the first one (starting
 * with newline) triggers the check for a runaway string.  Then, we have a
 * pattern to convert any single whitespace char (apart from newline) to a
 * space; note that any whitespace chars that are matched in the
 * newline-regexp will be converted by check_runaway_string(), and won't be
 * matched by the whitespace regexp here.  Then, we check for braces;
 * open_brace() and close_brace() take care of counting brace-depth and
 * determining if we have hit the end of the string.  lparen_in_string()
 * and rparen_in_string() do the same for parentheses, to handle
 * '@comment(...)'.  Then, if a double quote is seen, we call
 * quote_in_string(); this takes care of ending strings quoted by double
 * quotes.  Finally, the "fall-through" regexp handles most strings (except
 * for stuff that comes after a newline).
 */
#token        "\n~[\n\{\}\(\)\"\\]*" << check_runaway_string (); >>
#token        "[\r\t]"           << zzreplchar (' '); zzmore (); >>
#token        "\{"               << open_brace (); >>
#token        "\}"               << close_brace (); >>
#token        "\("               << lparen_in_string (); >>
#token        "\)"               << rparen_in_string (); >>
#token STRING "\""               << quote_in_string (); >>
#token        "~[\n\{\}\(\)\"]+" << zzmore (); >>

#lexclass START


/* At last, the grammar!  After that lexer, this is a snap. */

/* 
 * `bibfile' is the rule to recognize an entire BibTeX file.  Note that I
 * don't actually use this as the start rule myself; I have a function
 * bt_parse_entry() (in input.c), which takes care of setting up the lexer
 * and parser state in such a way that the parser can be entered multiple
 * times (at the `entry' rule) on the same input stream.  Then, the user
 * calls bt_parse_entry() until end of file is reached, at which point it
 * cleans up its mess.  The `bibfile' rule should work, but I never
 * actually use it, so it hasn't been tested in quite a while.
 */
bibfile!     : << AST *last; #0 = NULL; >>
               ( entry
                 <<                       /* a little creative forestry... */
                    if (#0 == NULL)
                       #0 = #1;
                    else
                       last->right = #1;
                    last = #1;
                 >>    
               )* ;

/*
 * `entry' is the rule that I actually use to enter the parser -- it parses
 * a single entry from the input stream (that is, the lexer scans past
 * junk until an '@' is seen at top-level, and that '@' becomes the AT 
 * token which starts an entry).
 *
 * `entry_metatype()' returns the value of a global variable maintained by
 * lex_auxiliary.c that tells us how to parse the entry.  This is needed
 * because, while the different things that look like BibTeX entries
 * (string definition, preamble, actual entry, etc.) have a similar lexical
 * makeup, the syntax is different.  In `entry', we just use the entry
 * metatype to determine the nodetype field of the AST node for the entry;
 * below, in `body' and `contents', we'll actually use it (in the form of
 * semantic predicates) to select amongst the various syntax options.
 */
entry        : << bt_metatype metatype; >>
               AT! NAME^
               <<
                  metatype = entry_metatype();
                  #1->nodetype = BTAST_ENTRY;
                  #1->metatype = metatype;
               >>
               body[metatype]
             ;

/*
 * `body' is what comes after AT NAME: either a single string, delimited by
 * {} or () (where NAME == 'comment'), or the more usual case of the entry
 * contents, delimited by an entry 'opener' and 'closer' (either
 * parentheses or braces).
 */
body [bt_metatype metatype]
             : << metatype == BTE_COMMENT >>?
               STRING     << #1->nodetype = BTAST_STRING; >>
             | ENTRY_OPEN! contents[metatype] ENTRY_CLOSE!
             ;

/* 
 * `contents' is where we select and accept the syntax for the guts of the
 * entry, based on the type of entry that we're parsing.  We find this
 * out from the `nodetype' field of the top AST node for the entry, which
 * is passed in as `entry_type'.  General entries (ie. any unrecognized
 * entry type) and `modify' entries have a name (the key), a comma, and
 * list of "field = value" assignments.  Macro definitions ('@string') are
 * similar, but without the key-comma pair.  Preambles have just a single
 * value, and aliases have a single "field = value" assignment.  (Note that
 * '@modify' and '@alias' are BibTeX 1.0 additions -- I'll have to check
 * the compatibility of my syntax with BibTeX 1.0 when it is released.)
 * '@comment' entries are handled differently, by the `body' rule above.
 */
contents [bt_metatype metatype]
             : << metatype == BTE_REGULAR /* || metatype == BTE_MODIFY */ >>?
               ( NAME | NUMBER ) << #1->nodetype = BTAST_KEY; >> 
               COMMA!
               fields
             | << metatype == BTE_MACRODEF >>?
               fields
             | << metatype == BTE_PREAMBLE >>?
               value
//           | << metatype == BTE_ALIAS >>?
//             field
             ;

/*
 * `fields' is a comma-separated list of fields.  Note that BibTeX has a
 * little wart in that it allows a single extra comma after the last field
 * only.  This is easy enough to handle, we just have to do it in the
 * traditional BNFish way (loop by recursion) rather than use EBNF
 * trickery.
 */
fields       : field { COMMA! fields }
             | /* epsilon */
             ;

/* `field' recognizes a single "field = value" assignment. */
field        : NAME^
               << #1->nodetype = BTAST_FIELD; check_field_name (#1); >>
               EQUALS! value
               << 
#if DEBUG > 1
                  printf ("field: fieldname = %p (%s)\n"
                          "       first val = %p (%s)\n",
                          #1->text, #1->text, #2->text, #2->text);
#endif
               >>
             ;

/* `value' is a sequence of simple_values, joined by the '#' operator. */
value        : simple_value ( HASH! simple_value )* ;

/* `simple_value' is a single string, number, or macro invocation. */
simple_value : STRING      << #1->nodetype = BTAST_STRING; >>
             | NUMBER      << #1->nodetype = BTAST_NUMBER; >>
             | NAME        << #1->nodetype = BTAST_MACRO; >>
             ;