=encoding utf8

=head1 NAME

Bio::Gonzales::Project::Functions - organize your computational experiments

=head1 SYNOPSIS

Inspired by L<A Quick Guide to Organizing Computational Biology
Projects|http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000424>
this module makes it easy to organise computational biology projects.

    $ gonzp init human_genome
    $ cd human_genome/analysis

    $ gonzp analysis genome_assembly
    $ cd genome_assembly

    # set up scripts, Makefile, etc.
    # ...

    $ make human_genome_assembly


    $ gonzp analysis genome_annotation # finds the project directory automatically
    $ cd ../genome_annotation

    # set up scripts, Makefile, etc.
    # ...

    $ make human_genome_annotation


=head1 DESCRIPTION

=head2 Project Layout

Create it with C<< gonzp init <project_name> >>.

A project consists of a root directory that contains everything: the paper
draft, analyses, 3rd-party documentation (and perhaps literature), scripts,
etc. The whole system is based on Makefiles (to start the different analysis
steps) and Perl modules (surprise, surprise!!).

The documentation goes into the C<README> file, in whatever format (plain text,
markdown, textile, ...) you prefer.

Thus, the basic layout of an C<example> project is:

    example/Makefile  (a Makefile to start single analyses)
    example/README    (an overview of the computational experiment)

    example/analysis/ (all analyses go in here)

    example/data/     (3rd-party data, such as the uniprot database or 
                       experimental results, common to the whole computational
                       experiment go in here)

    example/paper/    (the paper draft goes in here)

    example/doc/      (3rd-party documentation)

    example/lib/      (if some scripts or analyses have a lot in common,
                       creating a module/library might be helpful)

=head3 analysis

Create it with C<< gonzp analysis <analysis_name> >>.

The analysis directory contains all analyses that have been done, one
directory per analysis. The layout in C<example/analysis> is therefore:

    ./important_computational_experiment/Makefile     (the Makefile to start single analysis steps)
    ./important_computational_experiment/av           (the analysis version)
    ./important_computational_experiment/README       (some analysis-specific documentation)
    ./important_computational_experiment/gonz.conf.yml (configuration stuff, e.g. file locations or parameters)
    ./important_computational_experiment/2014-01-28/  (the analysis directory derived from the version stored in "av")
    ./important_computational_experiment/data/        (analysis-specific data)
    ./important_computational_experiment/playground/  (here you can try stuff)
    ./important_computational_experiment/bin/         (a directory to store the scripts)

=head2 The Analysis Version

The analysis version is just a single string and defaults to the day the
analysis was created. The C<av> file contains, for example:


    $ cat important_computational_experiment/av
    2014-01-28

Change it to whatever you want. A common use case is to change input data or
parameters without clobbering the previous results: change the analysis
version to a different date and rerun the whole analysis.
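
To illustrate the mechanism, here is a minimal sketch of how a result
directory can be derived from the C<av> file. It uses only core modules and
runs in a temporary directory. C<version_dir> and C<set_version> are
hypothetical helpers written for this example. They are not the module's
actual implementation.

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use 5.010;

    use File::Spec;
    use File::Temp qw(tempdir);
    use File::Path qw(make_path);

    # Hypothetical helper (not part of Bio::Gonzales): read the version string
    # from the "av" file and return the matching result directory.
    sub version_dir {
      my ($analysis_dir) = @_;

      my $av_file = File::Spec->catfile( $analysis_dir, 'av' );
      open my $fh, '<', $av_file or die "Can't open $av_file: $!";
      my $version = <$fh>;
      close $fh;

      die "No analysis version found in $av_file" unless defined $version;
      chomp $version;

      # create the directory if it does not exist yet
      my $dir = File::Spec->catdir( $analysis_dir, $version );
      make_path($dir);
      return $dir;
    }

    # Hypothetical helper: write a new version string to the "av" file.
    sub set_version {
      my ( $analysis_dir, $version ) = @_;

      my $av_file = File::Spec->catfile( $analysis_dir, 'av' );
      open my $fh, '>', $av_file or die "Can't open $av_file: $!";
      say $fh $version;
      close $fh;
    }

    # demo in a throw-away directory
    my $analysis_dir = tempdir( CLEANUP => 1 );
    for my $version (qw(2014-01-28 2014-02-15)) {
      set_version( $analysis_dir, $version );
      say "results for $version go to ", version_dir($analysis_dir);
    }

Changing the version string from C<2014-01-28> to C<2014-02-15> yields a
second directory next to the first one, so the earlier results are left
untouched.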

The analysis version is an integral part of L<Bio::Gonzales::Project::Functions> and is therefore accessible via:

=over 4

=item The Makefile

as the C<$(AV)> variable.

=item Via an exported function of L<Bio::Gonzales::Project::Functions>, B<< nfi($filename) >>

For example, suppose you want to calculate the average number of leaves for 4
plant accessions. With 3 replicates each, that gives 12 records.

Input data C<data/leaves.txt> (tab-separated):

    accession num_leaves
    ACC_001 3
    ACC_001 4
    ACC_001 6
    ACC_002 8
    ACC_002 14
    ACC_002 12
    ACC_003 18
    ACC_003 10
    ACC_003 12
    ACC_004 10
    ACC_004 4
    ACC_004 7


Script C<bin/calc_number_of_avg_leaves.pl>:

    #!/usr/bin/env perl
    # created on 2014-01-28
    
    use warnings;
    use strict;
    use 5.010;
    
    use Bio::Gonzales::Project::Functions;
    use List::Util qw(sum);
    
    # read in some raw data
    open my $fh, '<', 'data/leaves.txt' or die "Can't open filehandle: $!";
    
    my %num_leaves;
    
    <$fh>;    # get rid of the header
    while ( my $line = <$fh> ) {
      chomp $line;
      my ( $acc, $num_leaves ) = split /\t/, $line;
    
      push @{ $num_leaves{$acc} }, $num_leaves;
    }
    
    close $fh;
    
    # nfi = new file in the current analysis version directory
    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = nfi("avg_num_leaves.tsv");
    
    # open the result file
    open my $result_fh, '>', $result_file or die "Can't open filehandle: $!";
    
    # calculate the result and write it
    while ( my ( $acc, $leaves ) = each %num_leaves ) {
      my $sum   = sum @$leaves;
      my $count = scalar @$leaves;
      my $avg   = $sum / $count;
    
      say $result_fh join( "\t", $acc, $avg );
    }
    
    close $result_fh;
 
=item Via an exported variable of L<Bio::Gonzales::Project::Functions>, B<< $ANALYSIS_VERSION >>

The script changes only slightly. Only the relevant lines are shown here.

Original:

    use Bio::Gonzales::Project::Functions;

    ...

    # nfi = new file in the current analysis version directory
    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = nfi("avg_num_leaves.tsv");

Changed:

    use Bio::Gonzales::Project::Functions qw(:DEFAULT $ANALYSIS_VERSION);  # CHANGED

    ...

    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = "$ANALYSIS_VERSION/avg_num_leaves.tsv";  # CHANGED

=back
    
=head2 Configuration

The configuration is stored in C<gonz.conf.yml> and is accessible from the
command line and from Perl functions. The format of the configuration is
L<YAML|http://www.yaml.org>, so you can freely store any configuration in
data structures such as lists or dictionaries.

=head3 Access via commandline

Access from the command line is intended for use in the C<Makefile>. The
command-line script is called C<gonzconf>. See

    gonzconf --help

for help. C<gonzconf> looks for the C<gonz.conf.yml> file and extracts parts
of the configuration. Example:

C<gonz.conf.yml>

    ---
    genotypes:
      - genotype_1
      - genotype_2
      - genotype_3

Make target:

    GENOTYPES=$(shell gonzconf --flat genotypes)
    analysis:
      for g in $(GENOTYPES); do \
        echo "analysing $$g"; \
      done

=head3 Access in perl

In Perl scripts, the configuration can be accessed via the C<gonzconf> function.

=over 4

=item B<< my $config = gonzconf() >>

Calling the function without arguments returns the complete configuration. It
can be accessed as a normal Perl array or hash (depending on the configuration).

Example:

    #!/usr/bin/env perl
    
    use warnings;
    use strict;
    use 5.010;
    
    use Bio::Gonzales::Project::Functions;
    
    my $config = gonzconf();
    my @genotypes = @{$config->{genotypes}};
    
    for my $genotype (@genotypes) {
      say "analysing genotype $genotype";
    }

=item B<< my $config_entry = gonzconf($entry) >>

C<gonzconf> can take one argument to access entries of the top layer directly.
This assumes that the top layer of the configuration is organised as a
hash/dictionary.

Example:

    #!/usr/bin/env perl
    
    use warnings;
    use strict;
    use 5.010;
    
    use Bio::Gonzales::Project::Functions;
    
    my @genotypes = @{gonzconf("genotypes")};
    
    for my $genotype (@genotypes) {
      say "analysing genotype $genotype";
    }


=back
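
Because the configuration is plain YAML, nested entries come back as ordinary
Perl references. The sketch below uses a hand-written hash reference in place
of the return value of C<gonzconf()>, so it runs without a project directory.
The C<params>, C<min_leaves> and C<max_leaves> keys are made up for
illustration and are not part of the module.

    #!/usr/bin/env perl
    use warnings;
    use strict;
    use 5.010;

    # Stand-in for "my $config = gonzconf();" with a gonz.conf.yml such as
    #
    #   ---
    #   genotypes:
    #     - genotype_1
    #     - genotype_2
    #   params:
    #     min_leaves: 5
    #
    # "params" and "min_leaves" are hypothetical keys.
    my $config = {
      genotypes => [qw(genotype_1 genotype_2)],
      params    => { min_leaves => 5 },
    };

    # a list entry is an array reference
    my @genotypes = @{ $config->{genotypes} };
    say "number of genotypes: " . scalar @genotypes;

    # a nested dictionary entry is reached through a hash reference
    my $min_leaves = $config->{params}{min_leaves};
    say "minimum number of leaves: $min_leaves";

    # a missing entry is undefined, so a default is easy to supply
    my $max_leaves = $config->{params}{max_leaves} // 100;
    say "maximum number of leaves: $max_leaves";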

=head2 Logging

L<Bio::Gonzales::Project::Functions> comes with logging included. The logged
information is stored in C<$ANALYSIS_VERSION/gonzlog>. Therefore, every
analysis has its own log file. Five log levels are available: debug, info,
warn, error and fatal.

=head3 Access via commandline

Run

    gonzlog <namespace> <message>

to log something. The log level is hardcoded to "info".

=head3 Access via perl


L<Bio::Gonzales::Project::Functions> exports the function C<gonzlog> by default. To log something, run

    gonzlog->info("message");

    # or
    
    my $log = gonzlog();
    $log->info("message");

The namespace is the filename of the invoking script.

=head1 SEE ALSO

=head1 AUTHOR

jw bargsten, C<< <joachim.bargsten at wur.nl> >>

=cut