
Bio::Gonzales::Project

Inspired by "A Quick Guide to Organizing Computational Biology Projects", this module makes it easy to organise computational biology projects.

Project Layout

Create it with gonzp init <project_name>

A project consists of a root directory that contains everything: the paper draft, analyses, 3rd-party documentation (and perhaps literature), scripts, etc. The whole system is based on Makefiles (to start the different analysis steps) and Perl modules (surprise, surprise!).

The documentation goes into the README file, in whatever format (plain text, markdown, textile, ...) you prefer.

Thus, the basic layout of an example project is:

    example/Makefile  (a Makefile to start single analyses)
    example/README    (an overview documentation of the computational experiment)

    example/analysis/ (all analyses go in here)

    example/data/     (3rd-party data, such as the UniProt database or
                       experimental results, common to the whole computational
                       experiment go in here)

    example/paper/    (the paper draft goes in here)

    example/doc/      (3rd-party documentation)

    example/lib/      (if some scripts or analyses have a lot in common,
                       creating a module/library might be helpful)

analysis

Create it with gonzp add <analysis_name>

The analysis directory contains all analyses that have been done, one directory per analysis. The layout in example/analysis is therefore:

    ./important_computational_experiment/Makefile     (the Makefile to start single analysis steps)
    ./important_computational_experiment/av           (the analysis version)
    ./important_computational_experiment/README       (some analysis-specific documentation)
    ./important_computational_experiment/gonzconf.yml (configuration stuff, e.g. file locations or parameters)
    ./important_computational_experiment/2014-01-28/  (the analysis directory derived from the version stored in "av")
    ./important_computational_experiment/data/        (analysis-specific data)
    ./important_computational_experiment/playground/  (here you can try stuff)
    ./important_computational_experiment/bin/         (a directory to store the scripts)

the analysis version

The analysis version is just a single string and defaults to the day the analysis was created. For example, the av file might contain:

    $ cat important_computational_experiment/av
    2014-01-28

Change it to whatever you want. A common use case is to change input data or parameters without clobbering the previous results. To do so, change the analysis version to a different date and rerun the whole analysis.
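
To make the role of the av file concrete, here is a minimal sketch in plain Perl of how a script could read it and derive the results directory. The helper names read_analysis_version and results_dir are hypothetical and are not part of Bio::Gonzales::Project, which takes care of this for you.

```perl
use warnings;
use strict;

use File::Spec;

# Hypothetical helper (not part of Bio::Gonzales::Project):
# read the analysis version string from the "av" file of an analysis.
sub read_analysis_version {
  my ($analysis_dir) = @_;

  my $av_file = File::Spec->catfile( $analysis_dir, 'av' );
  open my $fh, '<', $av_file or die "Can't open $av_file: $!";
  my $version = <$fh>;
  close $fh;

  die "$av_file is empty" unless defined $version;
  $version =~ s/^\s+|\s+$//g;    # strip surrounding whitespace and the newline
  die "$av_file contains no version string" if $version eq '';

  return $version;
}

# Hypothetical helper: the directory that holds the results of one version.
sub results_dir {
  my ($analysis_dir) = @_;
  return File::Spec->catdir( $analysis_dir, read_analysis_version($analysis_dir) );
}
```

With av containing 2014-01-28, results_dir("important_computational_experiment") gives important_computational_experiment/2014-01-28 on Unix-like systems. After editing av to another string, the same call points to a different directory and the earlier results stay untouched.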

The analysis version is an integral part of Bio::Gonzales::Project and is therefore accessible in three ways:

The Makefile

In the Makefile, the analysis version is available as the $(AV) variable.

Via an exported function of Bio::Gonzales::Project, nfi($filename)

For example, suppose you want to calculate the average number of leaves for 4 plant accessions. You have 3 replicates per accession, so 12 records in total.

Input data, data/leaves.txt (tab-separated, as the script below splits each line on tabs):

    accession num_leaves
    ACC_001 3
    ACC_001 4
    ACC_001 6
    ACC_002 8
    ACC_002 14
    ACC_002 12
    ACC_003 18
    ACC_003 10
    ACC_003 12
    ACC_004 10
    ACC_004 4
    ACC_004 7

Script bin/calc_number_of_avg_leaves.pl

    #!/usr/bin/env perl
    # created on 2014-01-28
    
    use warnings;
    use strict;
    use 5.010;
    
    use Bio::Gonzales::Project;
    use List::Util qw(sum);
    
    # read in some raw data
    open my $fh, '<', 'data/leaves.txt' or die "Can't open filehandle: $!";
    
    my %num_leaves;
    
    <$fh>;    # get rid of the header
    while ( my $line = <$fh> ) {
      chomp $line;
      my ( $acc, $num_leaves ) = split /\t/, $line;
    
      push @{ $num_leaves{$acc} }, $num_leaves;
    }
    
    close $fh;
    
    # nfi = new file in the current analysis version directory
    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = nfi("avg_num_leaves.tsv");
    
    # open the result file
    open my $result_fh, '>', $result_file or die "Can't open filehandle: $!";
    
    # calculate the result and write it
    while ( my ( $acc, $leaves ) = each %num_leaves ) {
      my $sum   = sum @$leaves;
      my $count = scalar @$leaves;
      my $avg   = $sum / $count;
    
      say $result_fh join( "\t", $acc, $avg );
    }
    
    close $result_fh;
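
As a check on the arithmetic, the same averaging step can be run on the twelve example records without any files. This standalone sketch does not use Bio::Gonzales::Project. The function name average_per_accession is made up for illustration, and the keys are sorted so that the output order is reproducible (each returns hash entries in no particular order).

```perl
use warnings;
use strict;
use 5.010;

use List::Util qw(sum);

# Hypothetical helper: map each accession to the mean of its leaf counts.
sub average_per_accession {
  my ($records) = @_;    # array ref of [ accession, num_leaves ] pairs

  my %num_leaves;
  push @{ $num_leaves{ $_->[0] } }, $_->[1] for @$records;

  my %avg;
  for my $acc ( keys %num_leaves ) {
    my $leaves = $num_leaves{$acc};
    $avg{$acc} = sum(@$leaves) / scalar @$leaves;
  }
  return \%avg;
}

# the records from data/leaves.txt
my @records = (
  [ ACC_001 => 3 ],  [ ACC_001 => 4 ],  [ ACC_001 => 6 ],
  [ ACC_002 => 8 ],  [ ACC_002 => 14 ], [ ACC_002 => 12 ],
  [ ACC_003 => 18 ], [ ACC_003 => 10 ], [ ACC_003 => 12 ],
  [ ACC_004 => 10 ], [ ACC_004 => 4 ],  [ ACC_004 => 7 ],
);

my $avg = average_per_accession( \@records );
for my $acc ( sort keys %$avg ) {
  say join( "\t", $acc, sprintf( '%.2f', $avg->{$acc} ) );
}
```

Rounded to two decimals, this prints 4.33 for ACC_001, 11.33 for ACC_002, 13.33 for ACC_003 and 7.00 for ACC_004.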
 
Via an exported variable of Bio::Gonzales::Project, $ANALYSIS_VERSION

The script changes only slightly. Here are the affected lines:

original:

    use Bio::Gonzales::Project;

    ...

    # nfi = new file in the current analysis version directory
    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = nfi("avg_num_leaves.tsv");

changed:

    use Bio::Gonzales::Project qw(:DEFAULT $ANALYSIS_VERSION);  # CHANGED

    ...

    # here the result file will be e.g. "2014-01-28/avg_num_leaves.tsv", depending on the analysis version
    my $result_file = "$ANALYSIS_VERSION/avg_num_leaves.tsv";  # CHANGED
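
When the path is built by hand like this, the version directory has to exist before the result file can be opened for writing. If it might not exist yet (for example right after changing av), a small helper can create it. The function versioned_file below is a hypothetical stand-in written in plain Perl; it is not the implementation of nfi, and whether nfi itself creates directories is not assumed here.

```perl
use warnings;
use strict;

use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper (not the implementation of nfi): build the path of a
# result file inside a version directory and create the directory if needed.
sub versioned_file {
  my ( $version, $filename ) = @_;

  my $path = File::Spec->catfile( $version, $filename );
  make_path( dirname($path) );    # does nothing if the directory already exists
  return $path;
}
```

In the script above it could be called as versioned_file( $ANALYSIS_VERSION, "avg_num_leaves.tsv" ), and the returned path can be passed straight to open.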
    

Configuration

The configuration is stored in gonzconf.yml and is accessible from the command line and via Perl functions. The format of the configuration is YAML, so you can freely store any configuration in various data structures, such as lists or dictionaries.

Access via commandline

Command-line access is intended for use in the Makefile. The command-line script is called gonzconf. See

    gonzconf --help

for help. gonzconf looks for gonzconf.yml and extracts parts of the configuration. Example:

gonzconf.yml

    ---
    genotypes:
      - genotype_1
      - genotype_2
      - genotype_3

Make target:

    # note: in a real Makefile the recipe lines must be indented with a tab
    GENOTYPES=$(shell gonzconf --flat genotypes)
    analysis:
      for g in $(GENOTYPES); do \
        echo "analysing $$g"; \
      done

Access in perl

In Perl scripts, the configuration can be accessed via the gonzconf function.

my $config = gonzconf()

Calling the function without arguments returns the complete configuration. It can be accessed as a normal Perl array or hash reference (depending on the configuration).

Example:

    #!/usr/bin/env perl
    
    use warnings;
    use strict;
    use 5.010;
    
    use Bio::Gonzales::Project;
    
    my $config = gonzconf();
    my @genotypes = @{$config->{genotypes}};
    
    for my $genotype (@genotypes) {
      say "analysing genotype $genotype";
    }

my $config_entry = gonzconf($entry)

gonzconf can take one argument to access a top-level entry directly. "Top-level" here assumes that the configuration is organised as a hash/dictionary.

Example:

    #!/usr/bin/env perl
    
    use warnings;
    use strict;
    use 5.010;
    
    use Bio::Gonzales::Project;
    
    my @genotypes = @{gonzconf("genotypes")};
    
    for my $genotype (@genotypes) {
      say "analysing genotype $genotype";
    }
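
The two calling styles differ only in where the hash lookup happens. The sketch below mimics this with a hard-coded data structure that mirrors the gonzconf.yml example. The function config_entry is a hypothetical stand-in, not the module's gonzconf, and its error handling is an assumption made for illustration.

```perl
use warnings;
use strict;
use 5.010;

# the structure a YAML parser would produce for the gonzconf.yml example
my $config = { genotypes => [qw(genotype_1 genotype_2 genotype_3)] };

# Hypothetical stand-in for the one-argument form: look up a top-level entry.
sub config_entry {
  my ( $conf, $entry ) = @_;

  die "top-level lookup needs a hash-based configuration"
    unless ref $conf eq 'HASH';
  die "no such entry: $entry" unless exists $conf->{$entry};

  return $conf->{$entry};
}

# both lines fetch the same list
my @via_full_config = @{ $config->{genotypes} };
my @via_entry       = @{ config_entry( $config, 'genotypes' ) };

say "analysing genotype $_" for @via_entry;
```

If the top level of the configuration were a list instead of a dictionary, only the first style (fetching the whole configuration and indexing into it) would apply in this sketch.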

Logging

Bio::Gonzales::Project comes with logging included. The logged information is stored in $ANALYSIS_VERSION/gonzlog, so every analysis version has its own log file. Five log levels are available: debug, info, warn, error and fatal.

Access via commandline

Run

    gonzlog <namespace> <message>

to log something. The log level is hardcoded to "info".

Access via perl

Bio::Gonzales::Project exports the function gonzlog by default. To log a message, call

    gonzlog->info("message");

    # or
    
    my $log = gonzlog();
    $log->info("message");

The namespace is the filename of the invoking script.
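
To illustrate how a level, a namespace and a message might come together in one log line, here is a small standalone sketch. The function format_log_line and the line layout are invented for illustration; the actual format that Bio::Gonzales::Project writes to gonzlog is not documented here and may differ.

```perl
use warnings;
use strict;

use File::Basename qw(basename);
use POSIX qw(strftime);

# Hypothetical helper: format one log line. The layout
# "timestamp [LEVEL] namespace: message" is an assumption, not the
# documented gonzlog format.
sub format_log_line {
  my ( $level, $namespace, $message ) = @_;

  # the five levels named in the documentation
  my %known = map { $_ => 1 } qw(debug info warn error fatal);
  die "unknown log level: $level" unless $known{$level};

  my $timestamp = strftime( '%Y-%m-%d %H:%M:%S', localtime );
  return sprintf( '%s [%s] %s: %s', $timestamp, uc($level), $namespace, $message );
}

# use the file name of the invoking script as the namespace
my $namespace = basename($0);
print format_log_line( 'info', $namespace, 'message' ), "\n";
```

Deriving the namespace from $0 means that lines from different scripts of the same analysis can be told apart even though they end up in one file.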
