=head1 NAME

makepp_tutorial_compilation -- Unix compilation commands

=for vc $Id: makepp_tutorial_compilation.pod,v 1.4 2012/02/07 22:26:15 pfeiffer Exp $

=head1 DESCRIPTION

Skip this manual page if you have a good grasp of what the compilation
commands do.

I find that distressingly few people seem to be taught in their programming
classes how to go about compiling programs once they've written them.
Novices rely either on a single memorized command, or else on the builtin
rules in make.  I have been surprised by extremely computer literate people
who learned to compile without optimization because they simply never were
told how important it is.  Rudimentary knowledge of how compilation commands
work may make your programs run twice as fast or more, so it's worth at least
five minutes.  This page describes just about everything you'll need to know
to compile C or C++ programs on just about any variant of Unix.

The examples will be mostly for C, since C++ compilation is identical except
that the name of the compiler is different.  Suppose you're compiling source
code in a file called C<xyz.c> and you want to build a program called C<xyz>.
What must happen?

You may know that you can build your program in one step, using a command like
this:

    cc -g xyz.c -o xyz

This will work, but it conceals a two-step process that you must understand if
you are writing makefiles.  (Actually, there are more than two steps, but you
only have to understand two of them.)  For a program of more than one module,
the two steps are usually explicitly separated.

=head2 Compilation

The first step is the translation of your C or C++ source code into a binary
file called an object file.  Object files usually have an extension of
C<.o>. (For some more recent projects, C<.lo> is also used for a slightly
different kind of object file.)

The command to produce an object file on Unix looks something like this:

    cc -g -c xyz.c -o xyz.o

C<cc> is the C compiler.  Sometimes alternate C compilers are used; a very
common one is called C<gcc>.  A common C++ compiler is the GNU compiler,
usually called C<g++>.  Virtually all C and C++ compilers on Unix have the
same syntax for the rest of the command (at least for basic operations), so
the only difference would be the first word.

We'll explain what the C<-g> option does later.

The C<-c> option tells the C compiler to produce a C<.o> file as output.  (If
you don't specify C<-c>, then it performs the second step, linking,
automatically.)

The S<C<-o xyz.o>> option tells the compiler what the name of the object file
is.  You can omit this, as long as the name of the object file is the same as
the name of the source file except for the C<.o> extension.

For the most part, the order of the options and the file names does not
matter.  One important exception is that the output file must immediately
follow C<-o>.

=head2 Linking

The second step of building a program is called I<linking>.
An object file cannot be run directly; it's an intermediate form
that must be I<linked> to other components in order to produce a
program.  Other components might include:

=over 4

=item *

Libraries.  A I<library>, roughly speaking, is a collection of object modules
that are included as necessary.  For example, if your program calls the
C<printf> function, then the definition of the C<printf> function must be
included from the system C library.  Some libraries are automatically linked
into your program (e.g., the one containing C<printf>) so you never need to
worry about them.

=item *

Object files derived from other source files in your program.  If you write
your program so that it actually has several source files, normally you would
compile each source file to a separate object file and then link them all
together.

=back


The I<linker> is the program responsible for taking a collection of object
files and libraries and linking them together to produce an executable file.
The executable file is the program you actually run.

The command to link the program looks something like this:

    cc -g xyz.o -o xyz

It may seem odd, but we usually run the same program (C<cc>) to perform the
linking.  What happens under the surface is that the C<cc> program immediately
passes off control to a different program (the linker, sometimes called the
loader, or C<ld>) after adding a number of complex pieces of information to
the command line.  For example, C<cc> tells C<ld> where the system library is
that includes the definition of functions like C<printf>.  Until you start
writing shared libraries, you usually do not need to deal directly with C<ld>.

If you do not specify S<C<-o xyz>>, then the output file will be called
C<a.out>, which seems to me to be a completely useless and confusing
convention.  So always specify C<-o> on the linking step.

If your program has more than one object file, you should specify all the
object files on the link command.

=head2 Why you need to separate the steps

Why not just use the simple, one-step command, like this:

    cc -g xyz.c -o xyz

instead of the more complicated two-stage compilation

    cc -g -c xyz.c -o xyz.o
    cc -g xyz.o -o xyz

if internally the first is converted into the second?  The difference is
important only if there is more than one module in your program.  Suppose we
have an additional module, C<abc.c>.  Now our compilation looks like this:

    # One-stage command.
    cc -g xyz.c abc.c -o xyz

or

    # Two-stage command.
    cc -g -c xyz.c -o xyz.o
    cc -g -c abc.c -o abc.o
    cc -g xyz.o abc.o -o xyz


The first method, of course, is converted internally into the second method.
This means that both C<xyz.c> and C<abc.c> are recompiled each time the
command is run.  But if you only changed C<xyz.c>, there's no need to
recompile C<abc.c>, so the second line of the two-stage commands does not need
to be done.  This can make a huge difference in compilation time, especially
if you have many modules.  For this reason, virtually all makefiles keep the
two compilation steps separate.

That's pretty much the basics, but there are a few more little details you
really should know about.

=head2 Debugging vs. optimization

Usually programmers compile a program either for debugging or for speed.
Compilation for speed is called I<optimization>; compiling with optimization
can make your code run as much as 5 times faster, or even more, depending on
your code, your processor, and your compiler.

With such dramatic gains possible, why would you ever not want to optimize?
The most important answer is that optimization makes use of a debugger much
more difficult (sometimes impossible).  (If you don't know anything about a
debugger, it's time to learn.  The half hour or hour you'll spend learning the
basics will be repaid many many times over in the time you'll save later when
debugging.  I'd recommend starting with a GUI debugger like C<kdbg>, C<ddd>,
or C<gdb> run from within emacs (see the info pages on gdb for instructions on
how to do this).)  Optimization reorders and combines statements, removes
unnecessary temporary variables, and generally rearranges your code so that
it's very tough to follow inside a debugger.  The usual procedure is to write
your code, compile it without optimization, debug it, and then turn on
optimization.

In order for the debugger to work, the compiler has to cooperate not only by
not optimizing, but also by putting information about the names of the symbols
into the object file so the debugger knows what things are called.  This is
what the C<-g> compilation option does.

If you're done debugging, and you want to optimize your code, simply replace
C<-g> with C<-O>.  For many compilers, you can specify increasing levels of
optimization by appending a number after C<-O>.  You may also be able to
specify other options that increase the speed under some circumstances
(possibly trading off with increased memory usage).  See your compiler's man
page for details.  For example, here is an optimizing compile command that I
use frequently with the C<gcc> compiler:

    gcc -O6 -malign-double -c xyz.c -o xyz.o

You may have to experiment with different optimization options for the
absolute best performance.  You may need different options for different
pieces of code.  Generally speaking, a simple optimization flag like C<-O2>
works with many compilers and usually produces pretty good results.  (C<gcc>
treats levels above 3, such as C<-O6>, the same as C<-O3>; other compilers may
reject them.)

Warning: on rare occasions, your program doesn't actually do exactly the same
thing when it is compiled with optimization.  This may be due to (1) an
invalid assumption you made in your code that was harmless without
optimization, but causes problems because the compiler takes the liberty of
rearranging things when you optimize; or (2) sadly, compilers have bugs too,
including bugs in their optimizers.  For a stable compiler like C<gcc> on a
common platform like a Pentium, optimization bugs are seldom a problem (as of
the year 2000--there were problems a few years ago).

If you don't specify either C<-g> or C<-O> in your compilation command, the
resulting object file is suitable neither for debugging nor for running fast.
For some reason, this is the default.  So always specify either C<-g> or
C<-O>.

On some systems, you must supply C<-g> on both the compilation and linking
steps; on others (e.g. Linux), it needs to be supplied only on the
compilation step.  On some systems, C<-O> actually does something different in
the linking phase, while on others, it has no effect.  In any case, it's
always harmless to supply C<-g> or C<-O> for both commands.

=head2 Warnings

Most compilers are capable of catching a number of common programming errors
(e.g., forgetting to return a value from a function that's supposed to return
a value).  Usually, you'll want to turn on warnings.  How you do this depends
on your compiler (see the man page), but with the C<gcc> compiler, I usually
use something like this:

    gcc -g -Wall -c xyz.c -o xyz.o

(Sometimes I also add C<-Wno-uninitialized> after C<-Wall>, because the
warning it suppresses crops up when optimizing and is usually wrong.)

These warnings have saved me many many hours of debugging.

=head2 Other useful compilation options

Often, necessary include files are stored in some directory other than the
current directory or the system include directory (F</usr/include>).  This
frequently happens when you are using a library that comes with include files
to define the functions or classes.

Suppose, for example, you are writing an application that uses the Qt
libraries.  You've installed a local copy of the Qt library in
F</home/users/joe/qt>, which means that the include files are stored in the
directory F</home/users/joe/qt/include>.  In your code, you want to be able to
do things like this:

    #include <qwidget.h>

instead of

    #include "/home/users/joe/qt/include/qwidget.h"

You can tell the compiler to look for include files in a different directory
by using the C<-I> compilation option:

    g++ -I/home/users/joe/qt/include -g -c mywidget.cpp -o mywidget.o

There is usually no space between the C<-I> and the directory name.

When the C++ compiler is looking for the file F<qwidget.h>, it will look in
F</home/users/joe/qt/include> before looking in the system include directory.
You can specify as many C<-I> options as you want.

=head2 Using libraries

You will often have to tell the linker to link with specific external
libraries, if you are calling any functions that aren't part of the standard C
library.  The C<-l> (lowercase L) option says to link with a specific library:

    cc -g xyz.o -o xyz -lm

C<-lm> says to link with the system math library, which you will need if you
are using functions like C<sqrt>.

B<Beware:> if you specify more than one C<-l> option, the order can make a
difference on some systems.  If you are getting undefined variables when you
know you have included the library that defines them, you might try moving
that library to the end of the command line, or even including it a second
time at the end of the command line.

Sometimes the libraries you will need are not stored in the default place for
system libraries.  C<-labc> searches for a file called F<libabc.a> or
F<libabc.so> or F<libabc.sa> in the system library directories (F</usr/lib>
and usually a few other places too, depending on what kind of Unix you're
running).  The C<-L> option specifies an additional directory to search for
libraries.  To take the above example again, suppose you've installed the Qt
libraries in F</home/users/joe/qt>, which means that the library files are in
F</home/users/joe/qt/lib>.  Your link step for your program might look
something like this:

    g++ -g test_mywidget.o mywidget.o -o test_mywidget -L/home/users/joe/qt/lib -lqt


(On some systems, if you link in Qt you will need to add other libraries as
well (e.g., S<C<-L/usr/X11R6/lib -lX11 -lXext>>).  What you need to do will
depend on your system.)

Note that there is no space between C<-L> and the directory name.  The C<-L>
option usually goes before any C<-l> options it's supposed to affect.

How do you know which libraries you need?  In general, this is a hard
question, and varies depending on what kind of Unix you are running.  The
documentation for the functions or classes you are using should say what
libraries you need to link with.  If you are using functions or classes from
an external package, there is usually a library you need to link with; if the
library is a file called C<libabc.a> or C<libabc.so> or C<libabc.sa>, the
option you need to add is C<-labc>.

=head2 Some other confusing things

You may have noticed that it is possible to specify options which normally
apply to compilation on the linking step, and options which normally apply to
linking on the compilation step.  For example, the following commands are
valid:

    cc -g -L/usr/X11R6/lib -c xyz.c -o xyz.o
    cc -g -I/somewhere/include xyz.o -o xyz


The irrelevant options are ignored; the above commands are exactly equivalent
to this:

    cc -g -c xyz.c -o xyz.o
    cc -g xyz.o -o xyz