NAME

PerlIO::via::SeqIO - PerlIO layer for biological sequence formats

SYNOPSIS

 use PerlIO::via::SeqIO;

 # open a FASTA file for reading:
 open( my $f, "<:via(SeqIO)", 'my.fas');

 # open an EMBL file for writing
 open( my $e, ">:via(SeqIO::embl)", 'my.embl');

 # convert
 print $e $_ while (<$f>);

 # add comments (this really works)
 while (<$f>) {
   # get the real sequence object
   my $seq = O($_);
   if ($seq->desc =~ /Pongo/) {
     print $e "# this one is almost human...";
   }
   print $e $_; 
 }

 # a one-liner, sort of
 $ alias scvt="perl -Ilib \"-MPerlIO::via::SeqIO qw(open)\" -e \"open(STDIN, '<:via(SeqIO)'); open(STDOUT, '>:via(SeqIO::'.shift().')'); while (<STDIN>) { print }\""
 $ cat my.fas | scvt gcg > my.gcg

DESCRIPTION

PerlIO::via::SeqIO attempts to provide an easy option for harnessing the magic sequence format I/O of the BioPerl (http://bioperl.org) toolkit. Opening a biological sequence file under via(SeqIO) yields a filehandle that can be used to read and write Bio::Seq objects sequentially with an absolute minimum of setup code.

via(SeqIO) also allows the user to mix plain text and sequence formats on a single filehandle transparently. Different sequence formats can be written to a single file by a simple filehandle tweak.

DETAILS

Basics

Here's the basic idea, in code converting FASTA to EMBL format:

 open($in, '<:via(SeqIO)', 'my.fas');
 open($out, '>:via(SeqIO::embl)', 'my.embl');
 while (<$in>) {
   print $out $_;
 }

Specifying sequence formats (or not)

On reading, you can rely on Bio::SeqIO's format guesser by using an unqualified layer:

 open($in, '<:via(SeqIO)', 'mystery.txt');

or you can specify the format, like so:

 open($in, '<:via(SeqIO::embl)', 'mystery.txt');

On writing, a qualified invocation is required:

 open($out, '>:via(SeqIO)', 'my.fas');        # throws
 open($out, '>:via(SeqIO::fasta)', 'my.fas'); # that's better

Retrieving the sequence object itself

This does what you mean:

 open($in, '<:via(SeqIO)', 'my.fas');
 open($out, '>:via(SeqIO::embl)', 'my.embl');
 while (<$in>) {
   print $out $_;
 }

However, $_ here is not the sequence object itself. To get that use the all-purpose object getter O():

 while (<$in>) {
   print join("\t", O($_)->id, O($_)->desc), "\n";
 }

If you

 use subs qw(O);

then this DWYM:

 while (<$in>) {
   print O->id;
 }

Writing a de novo sequence object

Use the T() mapper to convert a Bio::Seq object into a thing that can be formatted by via(SeqIO):

 open($seqfh, ">:via(SeqIO::embl)", "my.embl");
 my $result = Bio::SearchIO->new( -file=>'my.blast' )->next_result;
 while (my $hit = $result->next_hit()) {
   while (my $hsp = $hit->next_hsp()) {
     my $aln = $hsp->get_aln;
     print $seqfh T($_) for ($aln->each_seq);
   }
 }

Writing plain text

Interspersing plain text among your sequences is easy; just print the desired text to the handle. See the "SYNOPSIS".

Even the following works:

 open($in, "<:via(SeqIO)", 'my.fas');
 open($out, ">:via(SeqIO::embl)", 'annotated.txt');

 $seq = <$in>;
 print $out "In EMBL format, the sequence would be rendered:", $seq;

Pipe through a gzip layer

You can use the PerlIO layer PerlIO::via::gzip to decompress and compress via(SeqIO) input and output.

Compressed output:

 open(my $tfh,"<:via(SeqIO)", "test.fas");
 open(my $zfh,'>:via(SeqIO::embl):via(gzip)', 'test.embl.gz');
 while (<$tfh>) {
     print $zfh $_;
 }
 close($zfh);

GOTCHA: the close is required.

Decompressed input:

 open($tfh,"<:via(gzip):via(SeqIO::fasta)", "test.fas.gz");
 open(my $zfh,'>:via(SeqIO::embl)', 'test.embl');
 while (<$tfh>) {
     print $zfh $_;
 }

When reading via gzip, the sequence format must be explicitly specified in the via(SeqIO) mode spec.
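
The read-side rules (format optional for a plain file, mandatory behind gzip) can be captured in a small helper. This is only a sketch: read_layer() and its format/gzip options are hypothetical names, not part of the module, and the layer order simply mirrors the examples above.

```perl
use strict;
use warnings;

# read_layer() is a hypothetical helper, not provided by PerlIO::via::SeqIO.
# Options (both assumptions of this sketch): format => 'fasta', gzip => 1
sub read_layer {
    my (%opt) = @_;
    # Without a format, via(SeqIO) leaves the guessing to Bio::SeqIO
    my $seqio = defined $opt{format} ? "via(SeqIO::$opt{format})" : "via(SeqIO)";
    return "<:$seqio" unless $opt{gzip};
    # Behind gzip the format must be named explicitly
    die "a sequence format is required when reading via gzip\n"
        unless defined $opt{format};
    return "<:via(gzip):$seqio";
}

print read_layer(format => 'fasta', gzip => 1), "\n";  # prints "<:via(gzip):via(SeqIO::fasta)"
print read_layer(), "\n";                               # prints "<:via(SeqIO)"
```

The returned string is meant to be passed as the mode argument of a three-argument open.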

Conversion, gzip to gzip:

 open(my $tfh, "<:via(gzip):via(SeqIO::fasta)", "test.fas.gz");
 open(my $zfh, ">:via(gzip):via(SeqIO::embl)", "test.embl.gz");
 local $/;
 print $zfh <$tfh>;
 close($zfh);

Redirecting STDIN/STDOUT/DATA through via(SeqIO)

Import the open() function provided by the module, like so:

 use PerlIO::via::SeqIO qw(open);

This enables two-argument open calls such as the following:

 open(STDIN, '<:via(SeqIO)');
 open(STDOUT, '>:via(SeqIO::gcg)');
 while (<STDIN>) {
   print;
 }

which will allow

 cat my.gcg | perl your.pl > out

your.pl can read STDIN and acquire the sequence objects by using the object getter O():

 use PerlIO::via::SeqIO qw(open O);
 open (STDIN, '<:via(SeqIO)');
 while (<STDIN>) {
  $seqobj = O($_);
  ...
 }

The format of the input in this case will be guessed by the Bio::SeqIO machinery.

The imported open() should pass through other uses of open unharmed. This is tested in 001_passthru.t. Please ping the "AUTHOR" if there are issues.

Switching write formats

You can also easily switch write formats. (Why? Because...who knows?) Use set_write_format right off the handle:

 open($in, "<:via(SeqIO)", 'my.fas');
 open($out, ">:via(SeqIO::embl)", 'multi.txt');

 $seq1 = <$in>;
 print $out "This is sequence 1 in embl format:\n";
 print $out $seq1;
 $out->set_write_format('gcg');
 print $out "while this is sequence 1 in GCG format:\n";
 print $out $seq1;

Supported Formats

The supported formats are contained in @PerlIO::via::SeqIO::SUPPORTED_FORMATS. Currently they are

 fasta, embl, gcg, genbank, pir
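
Since a write handle needs a qualified layer, it can help to validate the format name before calling open. The following is a minimal sketch, not part of the module: write_layer() is a hypothetical helper, and the hard-coded list simply copies the formats named above. With the module loaded, @PerlIO::via::SeqIO::SUPPORTED_FORMATS would be the better source.

```perl
use strict;
use warnings;

# Copy of the format list given above (an assumption of this sketch);
# with the module loaded, use @PerlIO::via::SeqIO::SUPPORTED_FORMATS instead.
my @formats = qw(fasta embl gcg genbank pir);

# write_layer() is a hypothetical helper, not provided by PerlIO::via::SeqIO.
# It returns a qualified write-mode spec, or dies on an unknown format.
sub write_layer {
    my ($format) = @_;
    die "unsupported sequence format '$format'\n"
        unless grep { $_ eq $format } @formats;
    return ">:via(SeqIO::$format)";
}

print write_layer('embl'), "\n";   # prints ">:via(SeqIO::embl)"
```

The returned string can then be used as the mode argument, for example open(my $out, write_layer('gcg'), 'my.gcg').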

UTILITIES

The O() and T() functions are exported by default.

The open hook needs to be available for the 2-argument open redirections (see "DETAILS") to work. Do

 use PerlIO::via::SeqIO qw(open);

O()

 Title   : O
 Usage   : $o = O($sym) # not an object method
 Function: get the object "represented" by the argument
 Returns : the right object
 Args    : PerlIO::via::SeqIO GLOB, or 
           *PerlIO::via::SeqIO::TFH (tied fh) or
           scalar string (sprintf-rendered Bio::SeqI object)
 Example : $seqobj = O($s = <$seqfh>);

T()

 Title   : T
 Usage   : T($seqobj) # not an object method
 Function: Transform a real Bio::Seq object to a
           via(SeqIO)-writeable thing
 Returns : A thing writeable as a formatted sequence
           by a via(SeqIO) filehandle
 Args    : a[n array of] Bio::Seq or related object[s]
 Example : print $seqfh T($seqobj);

set_write_format()

 Title   : set_write_format
 Usage   : $fh->set_write_format($format)
 Function: Set a write handle to write a specified 
           sequence format
 Returns : true on success
 Args    : scalar string; a supported format 
           (see @PerlIO::via::SeqIO::SUPPORTED_FORMATS)
 Note    : call off filehandle directly

SEE ALSO

PerlIO, PerlIO::via, Bio::SeqIO, Bio::Seq, http://bioperl.org

AUTHOR - Mark A. Jensen

 Email maj -at- fortinbras -dot- us
 http://fortinbras.us
 http://bioperl.org/wiki/Mark_Jensen