The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
=head1 NAME

makepp_signatures -- How makepp knows when files have changed

=for vc $Id: makepp_signatures.pod,v 1.26 2015/07/13 20:54:05 pfeiffer Exp $

=head1 DESCRIPTION

=for genindex '\w+' makepp_signatures.pod

B<C:>E<nbsp>L<C|/c_compilation_md5>,
  L</c_compilation_md5>,E<nbsp>
B<M:>E<nbsp>L</md5>,E<nbsp>
B<P:>E<nbsp>L</plain>,E<nbsp>
B<S:>E<nbsp>L</shared_object>,E<nbsp>
B<X:>E<nbsp>L</xml>,
  L<xml_space|/xml>

Each file is associated with a I<signature>, which is a string that changes if
the file has changed.  Makepp compares signatures to see whether it needs to
rebuild anything.  The default signature for files is a concatenation of the
file's modification time and its size, unless you're executing a C/C++
compilation command, in which case the default signature is a cryptographic
checksum on the file's contents, ignoring comments and whitespace.  If you
want, you can switch to a different method, or you can define your own
signature functions.

How the signature is actually used is controlled by the I<build check method>
(see L<makepp_build_check|makepp_build_check>).  Normally, if a file's
signature changes, the file itself is considered to have changed, and makepp
forces a rebuild.

If makepp is building a file, and you don't think it should be, you might want
to check the build log (see L<makepplog|makepplog>).  Makepp writes an
explanation of what it thought each file depended on, and why it chose to
rebuild.

There are several signature methods included in makepp.  Makepp usually picks
the most appropriate standard one automatically.  However, you can change the
signature method for an individual rule by using
L<C<:signature>|makepp_rules/signature_signature_method> modifier on the rule
which depends on the files you want to check, or for all rules in a makefile
by using the L<C<signature>|makepp_statements/signature_name> statement, or
for all makefiles at once using the L<C<-m> or
C<--signature-method>|makepp_command/m_method> command line option.

=head2 Mpp::Signature methods included in the distribution

=over 4

=item plain (actually nameless)

The plain signature method is the file's modification time and the file's
size, concatenated.  These values are quickly obtainable from the operating
system and almost always change when the file changes.  For symlinks it uses
the values of the linkee.  If there is no linkee, i.e. it's a dangling
symlink, then it uses its own values, but prepends a C<0> to mark the fact.

Makepp used to look only at the file's modification time, but if you run
makepp several times within a second (e.g., in a script that's building
several small things), sometimes modification times won't change.  Then,
hopefully the file's size will change.

If the case where you may run makepp several times a second is a problem
for you, you may find that using the C<md5> method is somewhat more
reliable.  If makepp builds a file, it flushes its cached MD5 signatures
even if the file's date hasn't changed.

For efficiency's sake, makepp won't reread the file and recompute the complex
signatures below if this plain signature hasn't changed since the last time it
computed it.  This can theoretically cause a problem, since it's possible to
change the file's contents without changing its date and size.  In practice,
this is quite hard to do so it's not a serious danger.  In the future, as more
filesystems switch to timestamps of under a second, hopefully Perl will give us
access to this info, making this failsafe.

=item c_compilation_md5

=item C

This is the method for input files to C like compilers.  It checks if a file's
name looks like C or C++ source code, including things like Corba IDL.  If it
does, this method applies.  If it doesn't, it falls back to L<plain
signatures|/plain (actually nameless)> for binary files (determined by name or
else by content) and else to L</md5>.

The idea is to be independent of formatting changes.  This is done by pulling
everything up as far as possible, and by eliminating insignificant spaces.
Words are exempt from pulling up, since they might be macros containing
C<__LINE__>, so they remain on the line where they were.

    // ignored comment

    #ifdef XYZ
 	#include <xyz.h>
    #endif

    int a = 1;

    #line 20
    void f
    (
 	int b
    )
    {
 	a += b + ++c;
    }

 	/* more ignored comment */

is treated as though it were

    #ifdef XYZ
    #include<xyz.h>
    #endif



    int a=1;
    #line 20
    void f(

    int b){


    a+=b+ ++c;}

That way you can reindent your code or add or change comments without
triggering a rebuild, so long as you don't change the line numbers.  (This
signature method recompiles if line numbers have changed because that causes
calls to C<__LINE__> and most debugging information to change.)  It also
ignores whitespace and comments B<after> the last token.  This is useful for
preventing a useless rebuild if your VC adds lines at a C<$>C<Log$> tag when
checking in.

This method is particularly useful for the following situations:

=over 4

=item *

You want to make changes to the comments in a commonly included header
file, or you want to reformat or reindent part of it.  For one project
that I worked on a long time ago, we were very unwilling to correct
inaccurate comments in a common header file, even when they were
seriously misleading, because doing so would trigger several hours of
rebuilds.  With this signature method, this is no longer a problem.

=item *

You like to save your files often, and your editor (unlike emacs) will
happily write a new copy out even if nothing has changed.

=item *

You have C/C++ source files which are generated automatically by other
build commands (e.g., yacc or some other preprocessor).  For one system
I work with, we have a preprocessor which (like yacc) produces two
output files, a C<.cxx> and a C<.h> file:

    %.h %.cxx: %.qtdlg $(HLIB)/Qt/qt_dialog_generator
 	$(HLIB)/Qt/qt_dialog_generator $(input)

Every time the input file changed, the resulting F<.h> file also was
rewritten, and ordinarily this would trigger a rebuild of everything
that included it.  However, most of the time the contents of the F<.h>
file didn't actually change (except for a comment about the build time
written by the preprocessor), so a recompilation was not actually
necessary.

=back

Actually in practice this saves less recompiles than you'd hope for, because
mere comment changes often add lines.  In order for logging with C<__LINE__>
or the debugger to match your source, this requires recompilation.  So this
signature is specially useless for the C<tangle> family of tools from literate
programming, where your code resides in some bigger file and even changes to a
documentation section irrelevant to code will be reflected in the extracted
source via a C<#line> directive.

If you can live with wrong line numbers during development, you can set the
variable C<makepp_signature_C_flat> (with an uppercase C) to some true value
(like C<1>).  Then, whereas the compiler still sees the real file, the above
example will be flattened for signing as:

    #ifdef XYZ
    #include<xyz.h>
    #endif
    int a=1;void f(int b){a+=b+ ++c;}

Note that signatures are only recalculated when files change.  So you can
build for everyone in a repository without this option, and those who want the
option can set it when building in their sandbox.  When they first locally
change a file, even only trivially, that will cause a recompilation, because
with this option a totally different signature is calculated.  But then they
can reformat the file as much as they want without further recompilation.

The opposite is also true: Just omitting this option after it was set and
recompiling will not fix your line numbers.  So, if line numbers matter, don't
do a production build in the same sandbox without L<cleaning|makeppclean> first.

=item md5

This is the default method, for files not recognized by the L</C> method.
Computes an MD5 checksum of the file's contents, rather than looking at the
file's date or size.  This means that if you change the date on the file but
don't change its contents, makepp won't try to rebuild anything that depends
on it.

This is particularly useful if you have some file which is often
regenerated during the build process that other files depend on, but
which usually doesn't actually change.  If you use the C<md5> signature
checking method, makepp will realize that the file's contents haven't
changed even if the file's date has changed.  (Of course, this won't
help if the files have a timestamp written inside of them, as archive
files do for example.)

=item shared_object

This method only works if you have the utility C<nm> in your path, and it
accepts the C<-P> option to output Posix format.  In that case only the names
and types of symbols in dynamically loaded libraries become part of their
signature.  The result is that you can change the coding of functions without
having to relink the programs that use them.

In the following command the parser will detect an implicit dependency on
F<$(LIBDIR)/libmylib.so>, and build it if necessary.  However the link command
will only be reperformed whenever the library exports a different set of
symbols:

    myprog: $(OBJECTS) :signature shared_object
 	$(LD) -L$(LIBDIR) -lmylib $(inputs) -o $(output)

This works as long as the functions' interfaces don't change.  But in that
case you'd change the declaration, so you'd also need to change the callers.

Note that this method only applies to files whose name looks like a shared
library.  For all other files it falls back to C<c_compilation_md5>, which may
in turn fall back to others.

=item xml

=item xml_space

These are two similar methods which treat xml canonically and differ only in
their handling of whitespace.  The first completely ignores it around tags and
considers it like a single space elsewhere, making the signature immune to
formatting changes.  The second respects any whitespace in the xml, which is
necessary even if just a small part requires that, like a C<< <pre> >> section
in an xhtml document.

Common to both methods is that they sign the essence of each xml document.
Presence or not of a BOM or C<< <?xml?> >> header is ignored.  Comments are
ignored, as is whether text is protected as C<CDATA> or with entities.  Order
and quoting style of attributes doesn't matter, nor does how you render empty
tags.

For any file which is not valid xml, or if the Expat based C<XML::Parser> or
the C<XML::LibXML> parser is not installed, this falls back to method md5.  If
you switch your Perl installation from one of the parsers to the others,
makepp will think the files are different as soon as their timestamp changes.
This is because the result of either parser is logically equivalent, but they
produce different signatures.  In the unlikely case that this is a problem,
you can force use of only C<XML::LibXML> by setting in Perl:

    $Mpp::Signature::xml::libxml = 1;

=back

=head2 Extending applicability

The C<C> or C<c_compilation_md5> method has a built in list of suffixes it
recognizes as being C or C-like.  If it gets applied to other files it falls
back to simpler signature methods.  But many file types are syntactically
close enough to C++ for this method to be useful.  Close enough means C++
comment and string syntax and whitespace is meaningless except one space
between words (and C++'s problem cases C<- ->, C<+ +>, C</ *> and C<< < < >>).

It (and its subclasses) can now easily be extended to other suffixes.
Anyplace you can specify a signature you can now tack on one one of these
syntaxes to make the method accept additional filenames:

=over

=item C.I<suffix1,suffix2,suffix3>

One or more comma-separated suffixes can be added to the method by a colon.
For example C<C.ipp,tpp> means that besides the built in suffixes it will also
apply to files ending in F<.ipp> or F<.tpp>, which you might be using for the
inline and template part of C++ headers.

=item C.(I<suffix-regexp>)

This is like the previous, but instead of enumerating suffixes, you give a
Perl regular expression to match the ones you want.  The previous example
would be C<C.(ipp|tpp)> or C<C.([it]pp)> in this syntax.

=item C(I<regexp>)

Without a dot the Perl regular expression can match anywhere in the file name.
If it includes a slash, it will be tried against the fully qualified filename,
otherwise only against the last part, without any directory.  So if you have
C++ style suffixless headers in a directory F<include>, use C<C(include/)> as
your signature method.  However the above suffix example would be quite nasty
this way, C<C(\.(?:ipp|tpp)$$)> or C<C(\.[it]pp$$)> because C<$> is the
expansion character in makefiles.

=back


=head2 Shortcomings

Signature methods apply to all files of a rule.  Now if you have a compiler
that takes a C like source code and an XML configuration file you'd either
need a combined signature method that smartly handles both file types, or you
must choose an existing method which will not know whether a change in the
other file is significant.

In the future signature method configuration may be changed to
filename-pattern, optionally per command.

=head2 Custom methods

You can, if you want, define your own methods for calculating file
signatures and comparing them.  You will need to write a Perl module to
do this.  Have a look at the comments in C<Mpp/Signature.pm> in the
distribution, and also at the existing signature algorithms in
C<Mpp/Signature/*.pm> for details.

Here are some cases where you might want a custom signature method:

=over 4

=item *

When you want all changes in a file to be ignored.  Say you always want
F<dateStamp.o> to be a dependency (to force a rebuild), but you don't
want to rebuild if only F<dateStamp.o> has changed.  You could define a
signature method that inherits from C<c_compilation_md5> that recognizes
the F<dateStamp.o> file by its name, and always returns a constant value
for that file.

=item *

When you want to ignore part of a file.  Suppose that you have a program
that generates a file that has a date stamp in it, but you don't want to
recompile if only the date stamp has changed.  Just define a signature
method similar to C<c_compilation_md5> that understands your file format
and skips the parts you don't want to take into account.

=back