The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
ucpp-1.3 is a C preprocessor compliant to ISO-C99.

Author: Thomas Pornin <pornin@bolet.org>
Main site: http://pornin.nerim.net/ucpp/



INTRODUCTION
------------

A C preprocessor is a part of a C compiler responsible for macro
replacement, conditional compilation and inclusion of header files.
It is often found as a stand-alone program on Unix systems.

ucpp is such a preprocessor; it is designed to be quick and light,
but anyway fully compliant to the ISO standard 9899:1999, also known
as C99. ucpp can be compiled as a stand-alone program, or linked to
some other code; in the latter case, ucpp will output tokens, one
at a time, on demand, as an integrated lexer.

ucpp operates in two modes:
-- lexer mode: ucpp is linked to some other code and outputs a stream of
tokens (each call to the lex() function will yield one token)
-- non-lexer mode: ucpp preprocesses text and outputs the resulting text
to a file descriptor; if linked to some other code, the cpp() function
must be called repeatedly, otherwise ucpp is a stand-alone binary.



INSTALLATION
------------

1. Uncompress the archive file and extract the source files.

2. Edit tune.h. Here is a short explanation of compile-time options:

  LOW_MEM
     Enable memory-saving functions; this is for low-end and old systems,
     but seems to be good for larger systems too. Keep it.
  NO_LIBC_BUF
  NO_UCPP_BUF
     Two options used to disable the two bufferings inside ucpp. Define
     both options for maximum memory savings but you will probably want
     to keep libc buffering for decent performance. Define none on large
     systems (modern 32 or 64-bit systems).
  UCPP_MMAP
     With this option, if ucpp internal buffering is active, ucpp will
     try to mmap() the input files. This might yield a slight performance
     improvement, but will work only on a limited set of architectures.
  PRAGMA_TOKENIZE
     Make ucpp generate tokenized PRAGMA tokens on #pragma and _Pragma();
     tokenization is made this way: tokens are assembled as a null
     terminated array of unsigned chars; if a token has a string value
     (as defined by the STRING_TOKEN macro), the value follows the token,
     terminated by PRAGMA_TOKEN_END (by default, a newline character cast
     to unsigned char). Whitespace tokens are skipped. The "name" value
     of the PRAGMA token is a pointer to that array. This setting is
     irrelevant in non-lexer mode.
  PRAGMA_DUMP
     In non-lexer mode, keep #pragma in output; non-void _Pragma() are
     translated to the equivalent #pragma. Irrelevant in lexer mode.
  NO_PRAGMA_IN_DIRECTIVE
     Do not evaluate _Pragma() inside #if, #include, #include_next and #line
     directives; instead, emit an error (since the remaining _Pragma will
     surely imply a syntax error).
  DSHARP_TOKEN_MERGE
     When two tokens are to be merged with the `##' operator, but fail
     because they do not merge into a single valid token, ucpp keeps those
     two tokens separate by adding an extra space between them in text
     output. With this option on, that extra space is not added, which means
     that some tokens may merge partially if the text output is preprocessed
     again. See tune.h for details.
  INMACRO_FLAG
     In lexer mode, set the inmacro flag to 1 if the current token comes
     from a macro replacement, 0 otherwise. macro_count maintains an
     increasing counter of such replacements. CONTEXT tokens count as
     one macro replacement each. #pragma, and _Pragma() that do not come
     from a macro replacement, also count as one macro replacement each.
     This setting is irrelevant in non-lexer mode.
  STD_INCLUDE_PATH
     Default include path in stand-alone ucpp.
  STD_MACROS
     Default predefined macros in stand-alone ucpp.
  STD_ASSERT
     Default assertions in stand-alone ucpp.
  NATIVE_SIGNED
  NATIVE_UNSIGNED
  NATIVE_UNSIGNED_BITS
  NATIVE_SIGNED_MIN
  NATIVE_SIGNED_MAX
  SIMUL_ARITH_SUBTYPE
  SIMUL_SUBTYPE_BITS
  SIMUL_NUMBITS
  WCHAR_SIGNEDNESS
     Those options define how #if expressions are evaluated; see the
     cross-compilation section of this file for more info, and the
     comments in tune.h. Extra info is found in arith.h and arith.c,
     at the possible expense of your mental health.
  DEFAULT_LEXER_FLAGS
  DEFAULT_CPP_FLAGS
     Default flags in respectively lexer and non-lexer modes.
  POSIX_JMP
     Define this if your architecture defines sigsetjmp() and
     siglongjmp(); it is known to (very slightly) improve performance
     on AIX systems.
  MAX_CHAR_VAL
     ucpp will consider characters whose value is equal or above
     MAX_CHAR_VAL as outside the C source charset (so they will be
     treated just like '@', for instance). For ASCII systems, 128
     is fine. 256 is a safer value, but uses more (static) memory.
     For performance reasons, use a power of two. If MAX_CHAR_VAL is
     correctly adjusted, ucpp should be compatible with any character
     set.
  UNBREAKABLE_SPACE
     If you want an extra-whitespace character, define this macro to that
     character. For instance, define this to 160 on an ISO-8859-1 system
     if you want the 'unbreakable space' to be considered as whitespace.
  SEMPER_FIDELIS
     With this option set, ucpp, when used as a lexer, will pass
     whitespace tokens to its caller, and those tokens will have their
     true content; this is intended for reconstruction of the source
     line. Beware that some comments may have embedded newlines.
  COPY_LINE_LENGTH
     ucpp can maintain a copy of the current source line, up to that
     length. Irrelevant to stand-alone version.
  *_MEMG
     Those settings modify ucpp behaviour, wrt memory allocations. With
     higher values, ucpp will perform less malloc() calls and will run
     faster, but it will use more memory. Reduce INPUT_BUF_MEMG and
     OUTPUT_BUF_MEMG on low-memory systems, if you kept ucpp buffering
     (see NO_UCPP_BUF option).

3. Edit the Makefile. You should define the variables CC and FLAGS;
   there are the following options:

  -DAUDIT
     Enable internal sanity checks; this slows down a bit ucpp. Do not
     define unless you plan to debug ucpp.
  -DMEM_CHECK
     With this setting, ucpp will check for the return value of malloc()
     and exit with a diagnostic when out of memory. MEM_CHECK is implied
     by AUDIT.
  -DMEM_DEBUG
     Enable memory debug code. This will track memory leaks and several
     occurrences of memory management errors; it will also slow down
     things and increase memory consumption, so you probably do not
     want to use this option.
  -DINLINE=foobar
     The ucpp code uses "inline" qualifier for some functions; by
     default, that qualifier is macro-replaced with nothing. Define
     INLINE to the correct replacement for your compiler, if supported.
     Note that all "inline" functions in ucpp are also "static". For any
     C99-compliant compiler, the GNU compiler (gcc), and the Compaq C
     compiler under Linux/Alpha, no -DINLINE is needed (see tune.h for
     details).

4. Compile by typing "make". This should produce the ucpp executable
   file. You might see some warning messages, especially with gcc:
   gcc believes some variables might be used prior to their
   initialization; ignore those messages.

5. Install wherever you want the binary and the man page ucpp.1. I
   have not provided an install sequence because I didn't bother.

6. If you do not have the make utility, compile each file separately
   and link them together. The exact details depend on your compiler.
   You must define the macro STAND_ALONE when compiling cpp.c (there
   is such a definition, commented out, in cpp.c, line 34).

There is no "configure" script because:
-- I do not like the very idea of a "configure" script.
-- ucpp is written in ANSI-C and should be fairly portable.
-- There is no such thing as "standard" settings for a C preprocessor.
   The predefined system macros, standard assertions,... must be tuned
   by the sysadmin.
-- The primary goal of ucpp is to be included in compilers. The
   stand-alone version is mainly a debugging tool.

Please note that you need an ISO-C90 (formerly ANSI) C compiler suite
(including the standard library) to compile ucpp. If your compiler is
not C99 (or later), read the cross-compilation section in this README
file.

The C90 and C99 standards state that external linkage names might be
considered equal or different based upon only their first 6 characters;
this rule might make ucpp not compile on a conformant C implementation.
I have yet to see such an implementation, however.

If you want to use ucpp as an integrated preprocessor and lexer, see the
section REUSE. Compiling ucpp as a library is an exercise left to the
reader.

With the LOW_MEM code enabled, ucpp can run on a Minix-i86 or Msdos
16-bit small-memory-model machine. It will not be fully compliant
on such an architecture to C99, since C99 states that at least one
source code with 4095 simultaneously defined macros must be processed;
ucpp will be limited to about 1500 macros (at most) due to memory
restrictions. At least ucpp can preprocess its own code in these
conditions. LOW_MEM is on by default because it seems to improve
performance on large systems.



LICENSE
-------

The copyright notice and license is at the beginning of the Makefile and
each source file. It is basically a BSD license, without the advertising
subclause (which BSD dropped recently anyway) and with no reference to
Berkeley (since the code is all mine, written from scratch). Informally,
this means that you can reuse and redistribute the code as you want,
provided that you state in the documentation (or any substantial part of
the software) of redistributed code that I am the original author. (If
you press a cdrom with 200 software packages, I do not insist on having
my name on the cover of the cdrom -- just keep a Readme file somewhere
on the cdrom, with the copyright notice included.)

As a courteous gesture, if you reuse my code, please drop me a mail.
It raises my self-esteem.



REUSE
-----

The code has been thought as part of a bigger project; it might be
used as an integrated lexer, that will read files, process them as a
C preprocessor, and output a stream of C tokens. To include this code
into a project, compile with STAND_ALONE undefined.

To use the preprocessor and lexer, several steps should be performed.
See the file 'sample.c' for an example.

1. call init_cpp(). This function initializes the lexer automaton.

2. set the following global variables:
	no_special_macros
		non-zero if the special macros (__FILE__ and others)
		should not be defined. This is a global flag since
		it affects the redefinition of such macros (which are
		allowed if the special macros are not defined)
	c99_compliant
		if non-zero, define __STDC_VERSION__ to 199901L; this
		is the default; otherwise, do not define __STDC_VERSION__.
		Note that ucpp will accept to undefine __STDC_VERSION__
		with a #undef directive.
	c99_hosted
		if strictly positive, define __STDC_HOSTED__ to 1.
		If zero, define __STDC_HOSTED__ to 0. If negative,
		do not define __STDC_HOSTED__. The default is 1.
	emit_defines and emit_assertions should be set to 0 for
	the step 3.

3. call init_tables(). This function initializes the macro table
   and other things; it will intialize assertions if it has a non-zero
   argument.

4. call init_include_path(). This function will reset the include
   path to the list of paths given as argument.

5. set the following global variables
	emit_dependencies
		set to 1 if dependencies should be emitted during
		preprocessing
		set to 2 if dependencies should also be emitted for
		system include files
	emit_defines
		set to non-zero if #define macro definitions should be
		emitted when macros are defined
	emit_assertions
		set to non-zero if #define macro definitions should be
		emitted when macros are defined
	emit_output
		the FILE * where the above items are sent if one of the
		three emit_ variables is set to non zero
	transient_characters
		this is for some cross-compilation; see the relevant
		part in this README file for details

6. call set_init_filename() with the initial filename as argument;
   the second argument indicates whether the filename is real or
   conventional ("real" means "an fopen() on it will work").

7. initialize your struct lexer_state:
	call init_lexer_state()
	call init_lexer_mode() if the preprocessor is supposed to
	   output a list of tokens, otherwise set the flags field
	   to DEFAULT_CPP_FLAGS and set the output field to the
	   FILE * where output should be sent
	(init_lexer_mode(), if called at all, must be called after
	 init_lexer_state())
	adjust the flags field; here is the meaning of flags:

WARN_STANDARD
	emit the standard warnings
WARN_ANNOYING
	emit the useless and annoying warnings
WARN_TRIGRAPHS
	count trigraphs encountered; it is up to the caller to emit
	a warning if some trigraphs were indeed encountered; the count
	is stored in the count_trigraphs field of the struct lexer_state
WARN_TRIGRAPHS_MORE
	emit a warning for each trigraph encountered
WARN_PRAGMA
	emit a warning for each non-void _Pragma encountered in non-lexer
	mode (because these are dumped as #pragma in the output) and for each
	#pragma too, if ucpp was compiled without PRAGMA_DUMP
FAIL_SHARP
	emit errors on '#' tokens beginning a line and not followed
	by a valid cpp directive
CCHARSET
	emit errors when non-C characters are encountered; if this flag
	is not set, each non-C character will be considered as a BUNCH
	token (since C99 states that non-C characters are allowed as
	long as they "disappear" during preprocessing [through macro
	replacement and stringification for instance], this flag must
	not be set, for maximum C99 compliance)
DISCARD_COMMENTS
	do not keep comments in output (irrelevant in lexer mode)
CPLUSPLUS_COMMENTS
	understand new style comments (//) (mandatory for C99)
LINE_NUM
	emit #line directives when entering a file, if not in lexer mode;
	emit CONTEXT token in lexer mode for #line and new files
GCC_LINE_NUM
	if LINE_NUM is set, emit gcc-like directives instead of #line
HANDLE_ASSERTIONS
	understand assertions in #if expressions (and #assert, #unassert)
HANDLE_PRAGMA
	make PRAGMA tokens for #pragma; irrelevant in non-lexer mode
	(handling of some pragmas is required in C99 but is not of
	the competence of the preprocessor; without this flag, ucpp will
	ignore the contents of #pragma and _Pragma directives)
MACRO_VAARG
	understand macros with a variable number of arguments (mandatory
	for C99)
UTF8_SOURCE
	understand UTF-8 encoding: multibyte characters are considered
	equivalent to letters as far as syntax is concerned (they can
	be used in identifiers)
LEXER
	act as a lexer, outputting tokens
TEXT_OUTPUT
	this flag should be set to 0 if ucpp works as a lexer, 1 otherwise.
	It is somehow redundant with the LEXER flag, but the presence of
	those two different flags is needed in ucpp.
KEEP_OUTPUT
	in non-lexer mode, emit the result of preprocessing
COPY_LINE
	maintain a copy of the last read line in the copy_line field of
	the struct lexer_state ; see below for how to use this buffer
HANDLE_TRIGRAPHS
	understand trigraphs, such as ??/ for \. This option should be
	set by default, except for some legacy code.

	There are other flags, but they are for private usage of ucpp.

8. adjust the input field in the lexer_state to the FILE * from where
   source file is read. If you use the UCPP_MMAP compile-time option,
   and your input file is eligible to mmap(), then you can call
   fopen_mmap_file() to open it, then set_input_file() to set ls->input
   and some other internal options. Do not call set_input_file() unless
   you just called fopen_mmap_file() just before on the same file.

9. call add_incpath() to add an include path, define_macro() and
   undef_macro() to add or remove macros, make_assertion() and
   destroy_assertion() to add or remove assertions.

10. call enter_file() (this is needed only in non-lexer mode, or if
    LINE_NUM is set).


Afterwards:

-- if you are in lexer mode, call lex(); each call will make the ctok
   field point to the next token. A non-zero return value is an error.
   lex() skips whitespace tokens. The memory used by the string value
   of some tokens (identifiers, numbers...) is automatically freed,
   so copy the contents of each such token if you want to keep it
   (tokens with a string content are identified by the STRING_TOKEN
   macro applied to their type).
   When lex() returned a non-zero value: if it is CPPERR_EOF, then
   end-of-input was reached. Otherwise, it is a genuine error and
   ls->ctok is an undefined token; skip it and call lex() again to
   ignore the error.

-- otherwise, call cpp(); each call will analyze one or more tokens
   (one token if it did find neither a cpp directive nor a macro name).
   A positive return value is an error.

For both functions, if the return value is CPPERR_EOF (which is a
strictly positive value), then it means that the end of file was
reached. Call check_cpp_errors() after end of file for pending errors
(unfinished #if constructions for instance). In non-lexer mode,
call flush_output().

In the struct lexer_state, the following fields might be read:
	line		   the current input line number
	oline		   the current output line number (in non-lexer mode)
	flags		   the flags described above
	count_trigraphs	   the number of trigraphs encountered
	inmacro		   the current token comes from a macro
	macro_count	   the current macro counter
"flags" is an unsigned long and might be modified; the three others
are of long type.


To perform another preprocessing: use free_lexer_state() to release
memory used by the buffers referenced in lexer_state, and go back to
step 2. The different tables (macros, assertions...) should be reset to
their respective initial contents.

There is also the wipeout() function: when called, it should release
(almost) all memory blocks allocated dynamically. After a wipeout(),
ucpp should be back to its state at step 2 (init_cpp() initializes only
static tables, that are never freed nor modified afterwards).


The COPY_LINE buffer: the struct lexer_state contains two interesting
fields, copy_line[] and cli. If the COPY_LINE flag is on, each read
line is stored in this buffer, up to (at most) COPY_LINE_LENGTH - 1
characters (COPY_LINE_LENGTH is defined in tune.h). The last character
of the buffer is always a zero, and if the line was read entirely, it is
zero terminated; the trailing newline is not included.

The purpose of this buffer is error-reporting. When an error occurs
(cpp() returns a strictly positive value, or lex() returns a non-zero
value), if your struct lexer_state is called ls, use this code:

	if (ls.cli != 0) ls.copy_line[ls.cli] = 0;

This will add a trailing 0 if the line was not read entirely.

You can disable the COPY_LINE buffer by defining NO_UCPP_COPY_LINE
(in tune.h, for example). This will make the code slightly faster.


ucpp may be configured at runtime to accept alternate characters as
possible parts of identifiers. Typical intended usage is for the '$'
and '@' characters. The two relevant functions are set_identifier_char()
and unset_identifier_char(). When this call is issued:
	set_identifier_char('$');
then for all the remaining input, the '$' character will be considered
as just another letter, as far as identifier tokenizing is concerned. This
is for identifiers only; numeric constants are not modified by that setting.
This call resets things back:
	unset_identifier_char('$');
Those two functions modify the static table which is initialized by
init_cpp(). You may call init_cpp() at any time to restore the table
to its standard state.

When using this feature, take care of the following points:

-- Do NOT use a character whose numeric value (as an `unsigned char'
cast into an `int') is greater than or equal to MAX_CHAR_VAL (in tune.h).
This would lead to unpredictable results, including an abrupt crash of
ucpp. ucpp makes absolutely no check whatsoever on that matter: this is
the programmer's responsibility.

-- If you use a standard character such as '+' or '{', tokens which
begin with those characters cease to exist. This can be troublesome.
If you use set_identifier_char() on the '<' character, the handling of
#include directives will be greatly disturbed. Therefore the use of any
standard C character in set_identifier_char() of unset_identifier_char()
is declared unsupported, forbidden and altogether unwise.

-- Stricto sensu, when an extra character is declared as part of an
identifier, ucpp behaviour cease to conform to C99, which mandates that
characters such as '$' or '@' must be treated as independant tokens of
their own. Therefore, if your purpose is to use ucpp in a conformant
C implementation, the use of set_identifier_char() should be made at
least a runtime option.

-- When enabling a new character in the middle of a macro replacement,
the effect of that replacement may be delayed up to the end of that
macro (but this is a "may" !). If you wish to trigger this feature with
a custom #pragma or _Pragma(), you should remember it (for instance,
usine _Pragma() in a macro replacement, and then the extra character
in the same macro replacement, is not reliable).



REENTRANT API
-------------

You can build ucpp with UCPP_REENTRANT defined if you plan to use
the ucpp lexer in a multithreaded application.

See the file 'sample_r.c' for an example.

When using the reentrant API, you can create multiple preprocessor
objects. A new object is created using new_cpp(). It should finally
be destroyed using del_cpp(). Furthermore, most API functions expect
a pointer to valid preprocessor object as their first argument.
These functions are:

	add_incpath()
	check_cpp_errors()
	cpp()
	destroy_assertion()
	enter_file()
	flush_output()
	fopen_mmap_file()
	init_assertions()
	init_cpp()
	init_include_path()
	init_tables()
	define_macro()
	undef_macro()
	lex()
	make_assertion()
	print_assertions()
	report_context()
	set_identifier_char()
	set_init_filename()
	set_input_file()
	unset_identifier_char()
	init_macros()
	print_defines()
	is_macro_defined()
	get_macro_definition()
	iterate_macros()
	wipeout()

Additionally, ucpp_ouch(), ucpp_error() and ucpp_warning() are no
longer global functions. They can also be defined separately for
each preprocessor object:

	struct CPP cpp;
	cpp = new_cpp();
	cpp->ucpp_ouch = my_ouch_func;
	/* ... */
	del_cpp(cpp);

Each of these functions receives a pointer to the corresponding
preprocessor object as its first parameter. There is also a
callback_arg member defined in the preprocessor object structure
which can be used to pass an additional pointer to the callback
function.

If you additionally define UCPP_CLONE, you can also clone an
existing preprocessor object:

	clone = clone_cpp(original);

The cloned object will be identical to the original object, except
for its internal lexer states, which means you cannot clone a
preprocessor object while it is preprocessing source code.



COMPATIBILITY NOTES
-------------------

The C language has a lengthening history. Nowadays, C comes in three
flavours:

-- Traditional C, aka "K&R". This is the language first described by
Brian Kernighan and Dennis Ritchie, and implemented in the first C
compiler that was ever coded. There are actually several dialects of
K&R, and all of them are considered deprecated.

-- ISO 9899:1990, aka C90, aka C89, aka ANSI-C. Formalized by ANSI
in 1989 and adopted by ISO the next year, it is the C flavour many C
compilers understand. It is mostly backward compatible with K&R C, but
with enhancements, clarifications and several new features.

-- ISO 9899:1999, aka C99. This is an evolution on C90, almost fully
backward compatible with C90. C99 introduces many new and useful
features, however, including in the preprocessor.

There was also a normative addendum in 1995, that added a few features
to C90 (for instance, digraphs) that are also present in C99. It is
sometimes refered to as "C95" or "AMD 1".


ucpp implements the C99 standard, but can be used in a stricter mode,
to enforce C90 compatibility (it will, however, still recognize some
constructions that are not in plain C90).

ucpp also knows about several extensions to C99:

-- Assertions: this is an extension to the defined() operator, with
   its own namespace. Assertions seem to be used in several places,
   therefore ucpp knows about them. It is recommended to enable
   assertions by default on Solaris systems.
-- Unicode: the C99 norm specifies that extended characters, from
   the ISO-10646 charset (aka "unicode") can be used in identifiers
   with the notations \u and \U. ucpp also accepts (with the proper
   flag) the UTF-8 encoding in the source file for such characters.
-- #include_next directive: it works as a #include, but will look
   for files only in the directories specified in the include path
   after the one the current file was found. This is a GNU-ism that
   is useful for writing transparent wrappers around header files.

Assertions and unicode are activated by specific flags; the #include_next
support is always active.

The ucpp code itself should be compatible with any ISO-C90 compiler.
The cpp.c file is rather big (~ 64kB), it might confuse old 16-bit C
compilers; the macro.c file is somewhat large also (~ 47kB).

The evaluation of #if expressions is subject to some subtleties, see the
section "cross-compilation".

The lexer code makes no assumption about the source character set, but
the following: source characters (those which have a syntactic value in
C; comment and string literal contents are not concerned) must have a
strictly positive value that is strictly lower than MAX_CHAR_VAL. The
strict positivity is already assured by the C standard, so you just need
to adjust MAX_CHAR_VAL.

ucpp has been tested succesfully on ASCII/ISO-8859-1 and EBCDIC systems.
Beware that UTF-8 is NOT compatible with EBCDIC.

Pragma handling: when used in non-lexer mode, ucpp tries to output a
source text that, when read again, will yield the exact same stream of
tokens. This is not completely true with regards to line numbering in
some tricky macro replacements, but it should work correctly otherwise,
especially with pragma directives if the compile-time option PRAGMA_DUMP
was set: #pragma are dumped, non-void _Pragma() are converted to the
corresponding #pragma and dumped also.

ucpp does not macro-replace the contents of #pragma and _Pragma();
If you want a macro-replaced pragma, use this:

#define pragma_(x)	_Pragma(#x)
#define pragma(x)	pragma_(x)

Anyway, pragmas do not nest (an _Pragma() cannot be evaluated if it is
inside a #pragma or another _Pragma).


I wrote ucpp according to what is found in "The C Programming Language"
from Brian Kernighan and Dennis Ritchie (2nd edition) and the C99
standard; but I could have misinterpreted some points. On some tricky
points I got help from the helpful people from the comp.std.c newsgroup.
For assertions and #include_next, I mimicked the behaviour of GNU cpp,
as is stated in the GNU cpp info documentation. An open question is
related to the following code:

#define undefined	!
#define makeun(x)	un ## x
#if makeun(defined foo)
qux
#else
bar
#endif

ucpp will replace 'defined foo' with 0 first (since foo is not defined),
then it will replace the macro makeun, and the expression will become
'un0', which is replaced by 0 since this is a remaining identifier. The
expression evaluates to false, and 'bar' is emitted.
However, some other preprocessors will replace makeun first, considering
that it is not part of a 'defined' operator application; this will
produce the macro 'undefined', which is replaced, and the expression
becomes '!foo'. 'foo' is replaced by 0, the expression evaluates to
true, and 'qux' is emitted.

My opinion is that the behaviour is undefined, because use of the
'defined' operator does not match an allowed form prior to macro
replacement (I mean, its syntax matches, but its use is reconverted
to inexistant and therefore is not anymore matching). Other people
think that the behaviour is well-specified, and contrary to what ucpp
does. The only thing clear to me is that the wording of the standard
(paragraph 6.10.1.3) is unclear.

Since the ucpp behaviour makes ucpp code simpler and cleaner, and
that it is unlikely that any real-life code would ever be disturbed
by that interpretation of the standard, ucpp will keep its current
behaviour until convincing evidence of my misinterpretation of the
standard is given to me. The problem can only occur if one uses ## to
make a 'defined' operator disappear from a #if expression (everybody
agrees that the generation of a 'defined' operator triggers undefined
behaviour).


Another point about macro replacement has been discussed at length in
several occasions. It is about the following code:

#define CAT(a, b)    CAT_(a, b)
#define CAT_(a, b)   a ## b
#define AB(x, y)     CAT(x, y)
CAT(A, B)(X, Y)

ucpp will produce `CAT(X,Y)' as replacement for the last line, whereas
some other preprocessors output `XY'. The answer to the question
"which behaviour is correct" seems to be "this is not defined by the
C standard". It is the answer that has been actually given by the C
standardization committee in 1992, to the defect report #017, question
23, which asked that very same question. Since the wording of the
standard has not changed in these parts from the 1990 to the 1999
version, the preprocessor behaviour on the above-stated code should
still be considered as undefined.

It seems, however, that there used to be a time (around 1988) when the
committee members agreed upon a precise macro-replacement algorithm,
which specified quite clearly the preprocessor behaviour in such
situation. ucpp behaviour is occasionnaly claimed as "incorrect" with
regards to that algorithm. Since that macro replacement algorithm has
never been published, and the committee itself backed out from it in
1992, I decided to disregard those feeble claims.

It is possible, however, that at some point in the future I rewrite the
ucpp macro replacement code, since that code is a bit messy and might be
made to use less memory in some occasions. It is then possible that, in
the aftermath of such a rewrite, the ucpp behaviour for the above stated
code become tunable. Don't hold your breath, though.


About _Pragma: the standard is not clear about when this operator is
evaluated, and if it is allowed inside #if directives and such. For
ucpp, I coded _Pragma as a special macro with lazy replacement: it will
be evaluated wherever a macro could be replaced, and only at the end of
the macro replacement (for practical purposes, _Pragma can be considered
as a macro taking one argument, and being replaced by nothing, except
for some tricky uses of the # and ## operators). This means that, by
default, ucpp will evaluate _Pragma inside some directives (mainly, #if,
#include, #include_next and #line), but it can be taught not to do so by
defining NO_PRAGMA_IN_DIRECTIVE in tune.h.



CROSS-COMPILATION
-----------------

If compiled with a C99 development suite, ucpp should be fully
C99-compliant on the host platform (up to my own understanding of the
standard -- remember that this software is distributed as-is, without
any guarantee). However, if a pre-C99 compiler is used, or if the
target machine is not the host machine (for instance when you build a
cross-compiler), the evaluation of #if expressions is subject to some
cross-compiling issues:


-- character constants: when evaluating expressions, character constants
are interpreted in the source character set context; this is allowed
by the standard but this can lead to problems with code that expects
this interpretation to match the one made in the C code. To ease
cross-compilation, you can define a conversion array, and make the
global variable transient_characters point to it. The array should
contain 256 int; transient_characters[x] is the value of the character
whose value is x in the source character set.

This facility is provided for inclusion of ucpp inside another code;
if you want a stand-alone ucpp with that conversion, hard-code the
conversion table into eval.c and make transient_characters[] statically
point to it. Alternatively, you could provide an option syntax to
provide such a table on command-line, if you feel like it.


-- wide character constants signedness: by default, ucpp makes wide
characters as signed as what plain chars are on the build host. To
force wide character constant signedness, define WCHAR_SIGNEDNESS to 0
(for unsigned) or 1 (for signed). Beware, however, that "native" wide
character constants, even signed, are considered positive. Non-wide
character constants are, according to the C99 standard, of type int, and
therefore always signed.


-- evaluation type: C90 states that all constants in #if expressions
are considered as either long or unsigned long, and that the evaluation
is performed with operands of that size. In C99, the situation is
equivalent, except that the types used are intmax_t and uintmax_t, as
defined in <stdint.h>.

ucpp can use two expression evaluators: one uses native integer types
(one signed and one unsigned), the other evaluator emulates big integer
numbers by representing them with two values of some unsigned type. The
emulated type handles signed values in two's complement representation,
and can be any width ranging from 2 bits to twice the size of the
underlying native unsigned type used. An odd width is allowed. When
right shifting an emulated signed negative value, it is left-padded with
bits set to 1 (this is sign extension).

When the ARITHMETIC_CHECKS macro is defined in tune.h, all occurrences
of implementation-defined or undefined behaviour during arithmetic
evaluation are reported as errors or warned upon. This includes all
overflows and underflows on signed quantities, constants too large,
and so on. Errors (which terminate immediately evaluation) are emitted
for division by 0 (on / and % operators) and overflow (on / operator);
otherwise, warnings are emitted and the faulty evaluation takes place.
This prevents ucpp from crashing on typical x86 machines, while still
allowing to use some extensions.



FUTURE EVOLUTIONS
-----------------

ucpp is quite complete now. There was a longstanding project of
"traditional" preprocessing, but I dropped it because it would not
map cleanly on the token-based ucpp structure. Maybe I will code a
string-based preprocessor one day; it would certainly use some of the
code from lexer.c, eval.c, mem.c and nhash.c. However, making such a
tool is almost irrelevant nowadays. If one wants to handle such project,
using ucpp as code base, I would happily provide some help, if needed.



CHANGES
-------

From 1.2 to 1.3:

* brand new integer evaluation code, with precise evaluation and checks
* new hash table implementation, with binary trees
* relaxed attitude on failed `##' operators
* bugfix on macro definition on command-line wrt nesting macros
* support for up to 32766 macro arguments in LOW_MEM code
* support for optional additional "identifier" characters such as '$' or '@'

From 1.1 to 1.2:

* bugfix: numerous memory leaks
* new function: wipeout(); this should release all malloc() blocks
* bugfix: missing "newline" and trailing "context" tokens
* improved included files name caching
* included memory leak detection code

From 1.0 to 1.1:

* bugfix: missing newline when exiting from a non-newline-terminated file
* bugfix: crash when resetting due to definition of the _Pragma pseudo-macro
* bugfix: handling of additional "optional" whitespace with SEMPER_FIDELIS
* improved handling of unreplaced arg macros wrt output line
* tricky handling of utterly tricky #include
* bugfix: spurious token `~=' eliminated

From 0.9 to 1.0:

* bugfix: crash after erroneous #assert
* changed ERR_SHARP to FAIL_SHARP, EMUL_UINTMAX to SIMUL_UINTMAX
* made "inline" default on gcc and DEC ccc (Linux/Alpha)
* semantic of -I is now Unix-like (added directories are looked first)
* added -J flag (to add include directories after the system ones)
* cleaned up non-ascii issues
* bugfix: missing brace in no-LOW_MEM code
* bugfix: argument number check in variadic macros
* bugfix: crash in non-lexer mode after some cases of unreplaced macro
* bugfix: _Pragma() handling wrt # and ##
* made evaluation of _Pragma() optional in #if, #include and #line
* bugfix: re-dump of multiline #pragma
* added the inmacro and macro_count flags
* added mmap() support
* added option to retain whitespace content in lexer mode

From 0.8 to 0.9:

* added check for division by 0 in #if evaluation
* added check for non-standard line numbers
* added check for trailing garbage in most directives
* corrected signedness of char constants (always int, therefore always signed)
* made LOW_MEM code, so that ucpp runs smoothly on low memory architectures
* multiple bugfixes (using the GNU cpp testsuite)
* added handling of _Pragma (as a macro)
* added tokenization of pragma directives
* added conservation of pragma directives in text output
* produced Msdos 16-bit small memory model executable
* produced Minix-86 executable

From 0.7 to 0.8:

* added some support for Amiga systems
* fixed extra spacing in stringified tokens
* fixed bug related to %:% and tolerated rogue sharps
* namespace cleanup
* bugfix for macro redefinition
* added warning for evaluated comma operators in #if (ISO requirement)
* -Dfoo now defines foo with content 1 (and not void content)
* trigraphs can be disabled (for incorrect but legacy code)
* fixed semantics for #include "file" (local directory)
* fixed detection of protected files
* produced a Msdos 16-bit executable

From 0.6 to 0.7:

* officially changed the goal to full C99 compliance
* added the CONTEXT token and let NEWLINE tokens go
* added report_context() for error reporting
* enforced matching of #if/#endif (file-global nesting level = 0)
* added support of C99 digraphs
* added UTF-8 encoding support
* added universal character names
* rewrote #if expressions (sizes fixed, bignum, signed/unsigned fixed)
* fixed incomplete evaluation of #if expressions
* added transient_characters[]

From 0.5 to 0.6:

* disappearance of error_nonl()
* added extra optional warnings for trigraphs
* some bugfixes, especially in lexer mode
* handled MacIntosh files correctly

From 0.4 to 0.5:

* nicer #pragma handling (a token can be emitted)
* bugfix in lexer mode after #line and #error
* sample.c   an example of code linked with ucpp
* made #if expressions conforming to standard signed/unsigned handling
* added the copy_line[] buffer feature

From 0.3 to 0.4:

* relaxed interpretation of '#include foo' when foo ends up, after macro
  substitution, with a '<bar>' content
* corrected the 'double-dot' bug
* corrected two bugs related to the treatment of macro aborted calls (due
  to lack of arguments)
* some namespaces cleanup, to ease integration into other code
* documented the way to include ucpp into another program
* made newlines embedded into strings illegal (and reported as such)

From 0.2 to 0.3:

* added support for system predefined macros
* made several bugfixes
* checked C99 compliance for most of the features
* ucpp now accepts non-C characters on standard when used stand-alone
* removed many useless spaces in the output

From 0.1 to 0.2:

* added support for assertions
* added support for macros with variable arguments
* split the pharaonic cpp.c file into many
* made several bugfixes
* relaxed the behaviour with regards to the void arguments
* made C++-like comments an option



THANKS TO
---------

Volker Barthelmann, Neil Booth, Stephen Davies, Stéphane Ecolivet,
Marc Espie, Marcus Holland-Moritz, Antoine Leca, Cyrille Lefevre,
Dave Rivers, Loic Tortay and Laurent Wacrenier, for suggestions and
beta-testing.

Paul Eggert, Douglas A. Gwyn, Clive D.W. Feather, and the other guys from
comp.std.c, for explanations about the standard.

Dave Brolley, Jamie Lokier and Neil Booth, for discussion about tricky
points on nesting macros.

Brian Kernighan and Dennis Ritchie, for bringing C to mortal Men.