/* ------------------------------------------------------------------------
@NAME       : bibtex.g
@DESCRIPTION: PCCTS-based lexer and parser for BibTeX files.  (Or rather,
              for the BibTeX data description language.  This parser
              enforces nothing about the structure and contents of
              entries; that's up to higher-level processors.  Thus, there's
              nothing either particularly bibliographic or TeXish about
              the language accepted by this parser, apart from the affinity
              for curly braces.)

              There are a few minor differences from the language accepted
              by BibTeX itself, but these are generally improvements over
              BibTeX's behaviour.  See the comments in the grammar, at least
              until I write a decent description of the language.

              I have used Gerd Neugebauer's BibTool (yet another BibTeX
              parser, along with a prettyprinter and specialized language
              for a common set of bibhacks) as another check of correctness
              -- there are a few screwball things that BibTeX accepts and
              BibTool doesn't, so I felt justified in rejecting them as
              well.  In general, this parser is a little stricter than
              BibTeX, but a little looser than BibTool.  YMMV.

              Another source of inspiration is Nelson Beebe's bibclean, or
              rather Beebe's article describing bibclean (from TUGboat
              vol. 14 no. 4; also included with the bibclean distribution).

              The product of the parser is an abstract syntax tree that can
              be traversed to be printed in a simple form (see
              print_entry() in bibparse.c) or perhaps transformed to a
              format more convenient for higher-level languages (see my
              Text::BibTeX Perl module for an example).

              Whole files may be parsed by entering the parser at `bibfile';
              in this case, the parser really returns a forest (list of
              ASTs, one per entry).  Alternatively, you can enter the parser
              at `entry', which reads and parses a single entry.
@GLOBALS    : the usual DLG and ANTLR cruft
@CALLS      :
@CREATED    : first attempt: May 1996, Greg Ward
              second attempt (complete rewrite): July 25-28 1996, Greg Ward
@MODIFIED   : Sep 1996, GPW: changed to generate an AST rather than print
                             out each entry as it's encountered
              Jan 1997, GPW: redid the above, because it was lost when
                             my !%&$#!@ computer was stolen
              Jun 1997, GPW: greatly simplified the lexer, and added handling
                             of %-comments, @comment and @preamble entries,
                             and proper scanning of between-entry junk
@VERSION    : $Id$
@COPYRIGHT  : Copyright (c) 1996-99 by Gregory P. Ward.  All rights reserved.

              This file is part of the btparse library.  This library is
              free software; you can redistribute it and/or modify it under
              the terms of the GNU Library General Public License as
              published by the Free Software Foundation; either version 2
              of the License, or (at your option) any later version.
-------------------------------------------------------------------------- */
#header
<<
#define ZZCOL
#define USER_ZZSYN
#include "config.h"
#include "btparse.h"
#include "attrib.h"
#include "lex_auxiliary.h"
#include "error.h"
#include "my_dmalloc.h"
extern char * InputFilename; /* for zzcr_ast call in pccts/ast.c */
>>
/*
* The lexer has three modes -- START (between entries, hence it's what
* we're in initially), LEX_ENTRY (entered once we see an '@' at
* top-level), and LEX_STRING (for scanning quoted strings). Note that all
* the functions called from lexer actions can be found in lex_auxiliary.c.
*
* The START mode just looks for '@', discards comments and whitespace,
* counts lines, and keeps track of any other junk. The "keeping track"
* just consists of counting the number of junk characters, which is then
* reported at the next '@' sign. This will hopefully let users clean up
* "old style" implicit comments, and possibly catch some legitimate errors
* in their files (eg. a complete entry that's missing an '@').
*/
#token AT "\@" << at_sign (); >>
#token "\n" << newline (); >>
#token COMMENT "\%~[\n]*\n" << comment (); >>
#token "[\ \r\t]+" << zzskip (); >>
#token "~[\@\n\ \r\t]+" << toplevel_junk (); >>
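/*
 * For instance, given this (hypothetical) input, everything before the
 * '@' is counted as junk by the START lexer, and the junk-character count
 * is reported when the '@' is finally seen:
 *
 *     Some leftover prose between entries...
 *     @article{key1, title = "..."}
 */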
#lexclass LEX_ENTRY
/*
* The LEX_ENTRY mode is where most of the interesting stuff is -- these
* tokens control most of the syntax of BibTeX. First, we duplicate most
* of the START lexer, in order to handle newlines, comments, and
* whitespace.
*
* Next comes a "number", which is trivial. This is needed because a
* BibTeX simple value may be an unquoted digit string; it has to precede
* the definition of "name" tokens, because otherwise a digit string would
* be a legitimate "name", which would cause an ambiguity inside entries
* ("is this a macro or a number?").
*
* Then comes the regexp for a BibTeX "name", which is used for entry
* types, entry keys, field names, and macro names. This is basically the
* same as BibTeX's definition of such "names", with two differences. The
* key, fundamental difference is that I have defined names by inclusion
* rather than exclusion: this regex lists all characters allowed in a
* type/key/field name/macro name, rather than listing those characters not
* allowed (as the BibTeX documentation does). The trivial difference is
* that I have disallowed a few extra characters: @ \ ~. Allowing @ could
* cause confusing BibTeX syntax, and allowing \ or ~ can cause bogus TeX
* code: try putting "\cite{foo\bar}" in your LaTeX document and see what
* happens! I'm also rather skeptical about some of the more exotic
* punctuation characters being allowed, but since people have been using
* BibTeX's definition of "names" for a decade or so now, I guess we're
* stuck with it. I could always amend name() to warn about any exotic
* punctuation that offends me, but that should be an option -- and I don't
* have a mechanism for user-selectable warnings yet, so it'll have to
* wait.
*
* Also note that defining "number" ahead of "name" precludes a string of
* digits from being a name. This is usually a good thing; we don't want
* to accept digit strings as article types or field names (BibTeX
* doesn't). However -- dubious as it may seem -- digit strings are
* legitimate entry keys, so we should accept them there. This is handled
* by the grammar; see the `contents' rule below.
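 *
 * For example (hypothetical entries), the key may be a pure digit string
 * even though a field name may not begin with a digit:
 *
 *     @article{1984, author = "..."}    <- "1984" is a legal key
 *     @article{key, 2nd = "..."}        <- "2nd" is rejected by
 *                                          check_field_name()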
*
* Finally, it should be noted that BibTeX does not seem to apply the same
* lexical rules to entry types, entry keys, and field names -- so perhaps
* doing so here is not such a great idea. One immediate manifestation of
* this is that my grammar in its unassisted state would accept a field
* name with leading digits; BibTeX doesn't accept this. I correct this
* with the check_field_name() function, called from the `field' rule in
* the grammar and defined in parse_auxiliary.c.
*/
#token "\n" << newline (); >>
#token COMMENT "\%~[\n]*\n" << comment (); >>
#token "[\ \r\t]+" << zzskip (); >>
#token NUMBER "[0-9]+"
#token NAME "[a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+"
<< name (); >>
/*
* Now come the (apparently) easy tokens, i.e. punctuation. There are a
* number of tricky bits here, though. First, '{' can have two very
* different meanings: at top-level, it's an entry delimiter, and inside an
* entry it's a string delimiter. This is handled (in lbrace()) by keeping
* track of the "entry state" (top-level, after '@', after type, in
* comment, or in entry) and using that to determine what to do on a '{'.
* If we're in an entry, lbrace() will switch to the string lexer by
* calling start_string(); if we're immediately after an entry type token
* (which is just a name following a top-level '@'), then we force the
* current token to ENTRY_OPEN, so that '{' and '(' appear identical to the
* parser. (This works because the scanner generated by DLG just happens
* to assign the token number first, and then executes the action.)
* Anywhere else (ie. at top level or immediately after an '@'), we print a
* warning and leave the token as LBRACE, which will cause a syntax error
* (because LBRACE is not used anywhere in the grammar).
*
* '(' has some similarities to '{', but it's different enough that it
* has its own function. In particular, it may be an entry opener just
* like '{', but in one particular case it may be a string opener. That
* particular case is where it follows '@' and 'comment'; in that case,
* lparen() will call start_string() to enter the string lexer.
*
* The other delimiter characters are easier, but still warrant an
* explanation. '}' should only occur inside an entry, and if found there
* the token is forced to ENTRY_CLOSE; anywhere else, a warning is printed
* and the parser should find a syntax error. ')' should only occur inside
* an entry, and likewise will trigger a warning if seen elsewhere.
* (String-closing '}' and ')' are handled by the string lexer, below.)
*
* The other punctuation characters are trivial. Note that a double quote
* can start a string anywhere (except at top-level!), but if it occurs in
* a weird place a syntax error will eventually occur.
*/
#token LBRACE "\{" << lbrace (); >>
#token RBRACE "\}" << rbrace (); >>
#token ENTRY_OPEN "\(" << lparen (); >>
#token ENTRY_CLOSE "\)" << rparen (); >>
#token EQUALS "="
#token HASH "\#"
#token COMMA ","
#token "\"" << start_string ('"'); >>
#lexclass LEX_STRING
/*
* Here's a reasonably decent attempt at lexing BibTeX strings. There are
* a couple of sneaky tricks going on here that aren't strictly necessary,
* but can make the user's life a lot easier.
*
* First, here's what a simple and straightforward BibTeX string lexer
* would do:
* - keep track of brace-depth by incrementing/decrementing a counter
* whenever it sees `{' or `}'
* - if the string was started with a `{' and it sees a `}' which
* brings the brace-depth to 0, end the string
* - if the string was started with a `"' and it sees another `"' at
* brace-depth 0, end the string
* - any other characters are left untouched and become part of the
* string
*
* (Note that the simple act of counting braces makes this lexer
* non-regular -- there's a bit more going on here than you might
* think from reading the regexps. So sue me.)
*
* The first, most obvious refinement to this is to check for newlines
* and other whitespace -- we should convert either one to a single
* space (to simplify future processing), as well as increment zzline on
* newline. Note that we don't do any collapsing of whitespace yet --
* newlines surrounded by spaces make that rather tricky to handle
* properly in the lexer (because newlines are handled separately, in
* order to increment zzline), so I put it off to a later stage. (That
* also gives us the flexibility to collapse whitespace or not,
* according to the user's whim.)
*
* A PCCTS lexer to handle these requirements would look something like this:
*
* #token "\n" << newline_in_string (); >>
* #token "[\r\t]" << zzreplchar (' '); zzmore (); >>
* #token "\{" << open_brace(); >>
* #token "\}" << close_brace(); >>
* #token "\"" << quote_in_string (); >>
* #token "~[\n\{\}\"]+" << zzmore (); >>
*
* where the functions called are the same as currently in lex_auxiliary.c.
*
* However, I've added some trickery here that lets us heuristically detect
* runaway strings. The heuristic is as follows: anytime we have a newline
* in a string, that's reason to suspect a runaway. We follow up on this
* suspicion by slurping everything that could reasonably be part of the
* string and still be in the same line (i.e., a string of anything except
* newline, braces, parentheses, double-quote, and backslash), and then
* calling check_runaway_string(). This function then "backs up" to the
* beginning of the slurped string (the newline), and scans ahead looking
* for one of two patterns: "@name[{(]", or "name=" (with optional
* whitespace between the "tokens"). (Actually, it first makes a pass over
* the string to convert all whitespace characters -- including the sole
* newline -- to spaces.)  So, it's effectively looking for "\ *\@\ *NAME\
* *[\{\(]" (DLG regexp syntax) or "\ *NAME\ *=", where
* NAME="[a-z][a-z0-9+/:'.-]*" -- that is, something that looks like the
* start of an entry or a new field, but in a string (where they almost
* certainly shouldn't occur). Of course, there are no explicit regexps
* there -- it's all coded as a little hand-crafted automaton in C.
*
* At any rate, if either one of these patterns is matched,
* check_runaway_string() prints a warning and sets a flag so that we don't
* print that warning -- or indeed, even scan for the suspect patterns --
* more than once for the current string. (Because chances are if it
* occurs once, it'll occur again and again and again.)
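 *
 * For example, in this (hypothetical) input with a missing close brace,
 * the newline inside the unterminated string triggers the scan, which
 * then spots "author =" and warns of a possible runaway:
 *
 *     @article{k1,
 *       title = {An unterminated string
 *       author = {Somebody}
 *     }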
*
* There is also some trickery going on to deal with '@comment' entries.
* Syntactically, these are just AT NAME STRING, where NAME must be
* 'comment'.  This means that an '@comment' entry has no delimiters; it
* just has a string. To make them look a bit more like the other kinds of
* entries (which are delimited with '{' ... '}' or '(' ... ')'), the STRING
* here is special: it's delimited either by braces or parentheses, rather
* than by the usual braces or double-quotes. Thus, we treat parentheses
* much like braces in this lexer, to handle the '@comment(...)' case. And
* there's an explicit check for the erroneous '@comment"..."' case in
* start_string(), just to be complete.
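 *
 * That is, both of these (hypothetical) forms are accepted:
 *
 *     @comment{arbitrary text, braces balanced}
 *     @comment(arbitrary text, parens balanced)
 *
 * whereas @comment"..." is flagged as an error.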
*
* So that explains all the regexps in this lexer: the first one (starting
* with newline) triggers the check for a runaway string. Then, we have a
* pattern to convert any single whitespace char (apart from newline) to a
* space; note that any whitespace chars that are matched in the
* newline-regexp will be converted by check_runaway_string(), and won't be
* matched by the whitespace regexp here. Then, we check for braces;
* open_brace() and close_brace() take care of counting brace-depth and
* determining if we have hit the end of the string. lparen_in_string()
* and rparen_in_string() do the same for parentheses, to handle
* '@comment(...)'. Then, if a double quote is seen, we call
* quote_in_string(); this takes care of ending strings quoted by double
* quotes. Finally, the "fall-through" regexp handles most strings (except
* for stuff that comes after a newline).
*/
#token "\n~[\n\{\}\(\)\"\\]*" << check_runaway_string (); >>
#token "[\r\t]" << zzreplchar (' '); zzmore (); >>
#token "\{" << open_brace (); >>
#token "\}" << close_brace (); >>
#token "\(" << lparen_in_string (); >>
#token "\)" << rparen_in_string (); >>
#token STRING "\"" << quote_in_string (); >>
#token "~[\n\{\}\(\)\"]+" << zzmore (); >>
#lexclass START
/* At last, the grammar! After that lexer, this is a snap. */
/*
* `bibfile' is the rule to recognize an entire BibTeX file. Note that I
* don't actually use this as the start rule myself; I have a function
* bt_parse_entry() (in input.c), which takes care of setting up the lexer
* and parser state in such a way that the parser can be entered multiple
* times (at the `entry' rule) on the same input stream. Then, the user
* calls bt_parse_entry() until end of file is reached, at which point it
* cleans up its mess. The `bibfile' rule should work, but I never
* actually use it, so it hasn't been tested in quite a while.
*/
bibfile! : << AST *last; #0 = NULL; >>
           ( entry
             << /* a little creative forestry... */
                if (#0 == NULL)
                   #0 = #1;
                else
                   last->right = #1;
                last = #1;
             >>
           )* ;
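/*
 * A sketch of the entry-at-a-time interface mentioned above.  (The exact
 * signature of bt_parse_entry() is declared in btparse.h; treat this as an
 * approximation from memory, not a copy of it.)
 *
 *     AST *   entry;
 *     boolean ok;
 *
 *     bt_initialize ();
 *     while ((entry = bt_parse_entry (infile, filename, 0, &ok)) != NULL)
 *     {
 *        if (ok) process (entry);       -- one AST per entry
 *        bt_free_ast (entry);
 *     }
 *     bt_cleanup ();
 */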
/*
* `entry' is the rule that I actually use to enter the parser -- it parses
* a single entry from the input stream (that is, the lexer scans past
* junk until an '@' is seen at top-level, and that '@' becomes the AT
* token which starts an entry).
*
* `entry_metatype()' returns the value of a global variable maintained by
* lex_auxiliary.c that tells us how to parse the entry. This is needed
* because, while the different things that look like BibTeX entries
* (string definition, preamble, actual entry, etc.) have a similar lexical
* makeup, the syntax is different. In `entry', we just use the entry
* metatype to determine the nodetype field of the AST node for the entry;
* below, in `body' and `contents', we'll actually use it (in the form of
* semantic predicates) to select amongst the various syntax options.
*/
entry : << bt_metatype metatype; >>
        AT! NAME^
        <<
           metatype = entry_metatype();
           #1->nodetype = BTAST_ENTRY;
           #1->metatype = metatype;
        >>
        body[metatype]
      ;
/*
* `body' is what comes after AT NAME: either a single string, delimited by
* {} or () (where NAME == 'comment'), or the more usual case of the entry
* contents, delimited by an entry 'opener' and 'closer' (either
* parentheses or braces).
*/
body [bt_metatype metatype]
   : << metatype == BTE_COMMENT >>?
     STRING << #1->nodetype = BTAST_STRING; >>
   | ENTRY_OPEN! contents[metatype] ENTRY_CLOSE!
   ;
/*
* `contents' is where we select and accept the syntax for the guts of the
* entry, based on the type of entry that we're parsing.  We find this
* out from the entry metatype (determined by the lexer when the entry
* type name is scanned), which is passed in as `metatype'.  General
* entries (ie. any unrecognized
* entry type) and `modify' entries have a name (the key), a comma, and
* list of "field = value" assignments. Macro definitions ('@string') are
* similar, but without the key-comma pair. Preambles have just a single
* value, and aliases have a single "field = value" assignment. (Note that
* '@modify' and '@alias' are BibTeX 1.0 additions -- I'll have to check
* the compatibility of my syntax with BibTeX 1.0 when it is released.)
* '@comment' entries are handled differently, by the `body' rule above.
*/
contents [bt_metatype metatype]
   : << metatype == BTE_REGULAR /* || metatype == BTE_MODIFY */ >>?
     ( NAME | NUMBER ) << #1->nodetype = BTAST_KEY; >>
     COMMA!
     fields
   | << metatype == BTE_MACRODEF >>?
     fields
   | << metatype == BTE_PREAMBLE >>?
     value
// | << metatype == BTE_ALIAS >>?
//   field
   ;
/*
* `fields' is a comma-separated list of fields. Note that BibTeX has a
* little wart in that it allows a single extra comma after the last field
* only. This is easy enough to handle, we just have to do it in the
* traditional BNFish way (loop by recursion) rather than use EBNF
* trickery.
*/
fields : field { COMMA! fields }
       | /* epsilon */
       ;
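/*
 * For example, both of these (hypothetical) entries parse, with and
 * without the trailing comma after the last field:
 *
 *     @book{k1, title = "T", year = 1999}
 *     @book{k2, title = "T", year = 1999,}
 */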
/* `field' recognizes a single "field = value" assignment. */
field : NAME^
        << #1->nodetype = BTAST_FIELD; check_field_name (#1); >>
        EQUALS! value
        <<
#if DEBUG > 1
        printf ("field: fieldname = %p (%s)\n"
                "       first val = %p (%s)\n",
                #1->text, #1->text, #2->text, #2->text);
#endif
        >>
      ;
/* `value' is a sequence of simple_values, joined by the '#' operator. */
value : simple_value ( HASH! simple_value )* ;
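/*
 * For example (a hypothetical field), this value is three simple_values
 * joined by '#': a quoted string, a macro invocation, and a number:
 *
 *     note = "Revised " # jan # 1999
 */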
/* `simple_value' is a single string, number, or macro invocation. */
simple_value : STRING << #1->nodetype = BTAST_STRING; >>
             | NUMBER << #1->nodetype = BTAST_NUMBER; >>
             | NAME   << #1->nodetype = BTAST_MACRO; >>
             ;