Vcf - Vcf.pm from VCFTools
version 1.123050
From the command line: perl -MVcf -e validate example.vcf perl -I/path/to/the/module/ -MVcf -e validate_v32 example.vcf
From a script: use Vcf;
my $vcf = Vcf->new(file=>'example.vcf.gz',region=>'1:1000-2000'); $vcf->parse_header(); # Do some simple parsing. Most thorough but slowest way how to get the data. while (my $x=$vcf->next_data_hash()) { for my $gt (keys %{$$x{gtypes}}) { my ($al1,$sep,$al2) = $vcf->parse_alleles($x,$gt); print "\t$gt: $al1$sep$al2\n"; } print "\n"; } # This will split the fields and print a list of CHR:POS while (my $x=$vcf->next_data_array()) { print "$$x[0]:$$x[1]\n"; } # This will return the lines as they were read, including the newline at the end while (my $x=$vcf->next_line()) { print $x; } # Only the columns NA00001, NA00002 and NA00003 will be printed. my @columns = qw(NA00001 NA00002 NA00003); print $vcf->format_header(\@columns); while (my $x=$vcf->next_data_array()) { # this will recalculate AC and AN counts, unless $vcf->recalc_ac_an was set to 0 print $vcf->format_line($x,\@columns); } $vcf->close();
About : Validates the VCF file. Usage : perl -MVcf -e validate example.vcf.gz # (from the command line) validate('example.vcf.gz'); # (from a script) validate(\*STDIN); Args : File name or file handle. When no argument given, the first command line argument is interpreted as the file name.
About : Same as validate, but assumes v3.2 VCF version. Usage : perl -MVcf -e validate_v32 example.vcf.gz # (from the command line) Args : File name or file handle. When no argument given, the first command line argument is interpreted as the file name.
About : Creates new VCF reader/writer. Usage : my $vcf = Vcf->new(file=>'my.vcf', version=>'3.2'); Args : fh .. Open file handle. If neither file nor fh is given, open in write mode. file .. The file name. If neither file nor fh is given, open in write mode. region .. Optional region to parse (requires tabix indexed VCF file) silent .. Unless set to 0, warning messages may be printed. strict .. Unless set to 0, the reader will die when the file violates the specification. version .. If not given, '4.0' is assumed. The header information overrides this setting.
About : (Re)Open file. No need to call this explicitly unless reading from a different region is requested. Usage : $vcf->open(); # Read from the start $vcf->open(region=>'1:12345-92345'); Args : region .. Supported only for tabix indexed files
About : Close the filehandle Usage : $vcf->close(); Args : none Returns : close exit status
About : Reads next VCF line. Usage : my $vcf = Vcf->new(); my $x = $vcf->next_line(); Args : none
About : Reads next VCF line and splits it into an array. The last element is chomped. Usage : my $vcf = Vcf->new(); $vcf->parse_header(); my $x = $vcf->next_data_array(); Args : Optional line to parse
About : Parsing big VCF files with many sample columns is slow, not parsing unwanted samples may speed things a bit. Usage : my $vcf = Vcf->new(); $vcf->set_samples(include=>['NA0001']); # Exclude all but this sample. When the array is empty, all samples will be excluded. $vcf->set_samples(exclude=>['NA0003']); # Include only this sample. When the array is empty, all samples will be included. my $x = $vcf->next_data_hash(); Args : Optional line to parse
About : Reads next VCF line and splits it into a hash. This is the slowest way to obtain the data. Usage : my $vcf = Vcf->new(); $vcf->parse_header(); my $x = $vcf->next_data_hash(); # Or having a VCF data line $line my $x = $vcf->next_data_hash($line); Args : Optional line to parse.
About : Reads (and stores) the VCF header. Usage : my $vcf = Vcf->new(); $vcf->parse_header(); Args : silent .. do not warn about duplicate header lines
About : Stores the header lines and meta information, such as fields types, etc. Args : silent .. do not warn about duplicate column names
Usage : $vcf->get_header_line(key=>'INFO', ID=>'AC') $vcf->get_header_line(key=>'FILTER', ID=>'q10') $vcf->get_header_line(key=>'reference') $vcf->get_header_line(key=>'contig',ID=>'20') Args : Header line filter as in the example above Returns : List ref of header line hashes matching the filter
Usage : $vcf->add_header_line({key=>'INFO', ID=>'AC',Number=>-1,Type=>'Integer',Description=>'Allele count in genotypes'}) $vcf->add_header_line({key=>'reference',value=>'1000GenomesPilot-NCBI36'}) Args : Header line hash as in the example above Hash with additional parameters [optional] silent .. do not warn about existing header keys append .. append timestamp to the name of the new one Returns :
Usage : $vcf->remove_header_line(key=>'INFO', ID=>'AC') Args : Returns :
Usage : $vcf->parse_header_line(q[##reference=1000GenomesPilot-NCBI36]) $vcf->parse_header_line(q[##INFO=NS,1,Integer,"Number of Samples With Data"]) Args : Returns :
About : Stores the column names as array $$self{columns} and hash $$self{has_column}{COL_NAME}=index. The indexes go from 1. Usage : $vcf->_read_column_names(); Args : none
About : When no header is present, fake column names as the default mandatory ones + numbers Args : The number of genotype columns; 0 if no genotypes but FORMAT present; <0 if FORMAT and genotypes not present
About : Returns the header. Usage : print $vcf->format_header(); Args : The columns to include on output [optional]
About : Returns the header. Usage : $x = $vcf->next_data_hash(); print $vcf->format_line($x); $x = $vcf->next_data_array(); print $vcf->format_line($x); Args 1 : The columns or hash in the format returned by next_data_hash or next_data_array. 2 : The columns to include [optional]
About : Control if the AC and AN values should be updated. Usage : $vcf->recalc_ac_an(1); $x = $vcf->next_data_hash(); print $vcf->format_line($x); Args 1 : 0 .. never recalculate 1 .. recalculate if present 2 .. recalculate if present and add if missing
Usage : my $idx = $vcf->get_tag_index('GT:PL:DP:SP:GQ','PL',':'); Arg 1 : Field 2 : The tag to find 3 : Tag separator Returns : Index of the tag or -1 when not found
Usage : my $field = $vcf->remove_field('GT:PL:DP:SP:GQ',1,':'); # returns 'GT:DP:SP:GQ' Arg 1 : Field 2 : The index of the field to remove 3 : Field separator Returns : Modified string
Usage : my $col = $vcf->replace_field('GT:PL:DP:SP:GQ','XX',1,':'); # returns 'GT:XX:DP:SP:GQ' Arg 1 : Field 2 : The index of the field to replace 3 : Replacement 4 : Field separator Returns : Modified string
Usage : my $line = $vcf->next_line; my @items = split(/\t/,$line); $af = $vcf->get_info_field('DP=14;AF=0.5;DB','AF'); # returns 0.5 $af = $vcf->get_info_field('DP=14;AF=0.5;DB','DB'); # returns 1 $af = $vcf->get_info_field('DP=14;AF=0.5;DB','XY'); # returns undef Arg 1 : The VCF line broken into an array 2 : The tag to retrieve Returns : undef when tag is not present, the tag value if present, or 1 if flag is present
Usage : my $line = $vcf->next_line; my @items = split(/\t/,$line); my $idx = $vcf->get_tag_index($$line[8],'PL',':'); my $pl = $vcf->get_field($$line[9],$idx) unless $idx==-1; Arg 1 : The VCF line broken into an array 2 : The index of the field to retrieve 3 : The delimiter [Default is ':'] Returns : The tag value
Usage : my $line = $vcf->next_line; my @items = split(/\t/,$line); my $idx = $vcf->get_tag_index($$line[8],'PL',':'); my $pls = $vcf->get_sample_field(\@items,$idx) unless $idx==-1; Arg 1 : The VCF line broken into an array 2 : The index of the field to retrieve Returns : Array of values
About : Faster alternative to regexs, extract the mandatory columns Usage : my $line=$vcf->next_line; my @cols = $vcf->split_mandatory($line); Arg : Returns : Pointer to the array of values
About : Faster alternative to regexs Usage : my ($a1,$a2,$a3) = $vcf->split_gt('0/0/1'); # returns (0,0,1) Arg : Diploid genotype to split into alleles Returns : Array of values
About : Generalization of split_gt Usage : my ($a1,$a2,$a3) = $vcf->split_gt('0/0|1',qw(| /)); # returns (0,0,1) Arg : Diploid genotype to split into alleles Returns : Array of values
About : Faster alternative to regexs Usage : my $gt = $vcf->decode_genotype('G',['A','C'],'0/0'); # returns 'G/G' Arg 1 : Ref allele 2 : Alt alleles 3 : The genotype to decode Returns : Decoded GT string
Usage : my $x = $vcf->next_data_hash(); $vcf->validate_alt_field($$x{ALT}); Args : The ALT arrayref Returns : Error message in case of an error.
Usage : my $x = $vcf->next_data_hash(); my ($alleles,$seps,$is_phased,$is_empty) = $vcf->parse_haplotype($x,'NA00001'); for my $allele (@$alleles) { my ($type,$len,$ht) = $vcf->event_type($x,$allele); } or my ($type,$len,$ht) = $vcf->event_type($ref,$al); Args : VCF data line parsed by next_data_hash or the reference allele : Allele Returns : 's' for SNP and number of SNPs in the record 'i' for indel and a positive (resp. negative) number for the length of insertion (resp. deletion) 'r' identical to the reference, length 0 'o' for other (complex events) and the number of affected bases 'b' breakend 'u' unknown
About : Checks the header for the presence of tags with variable number of fields (Number=A or Number=G, such as GL) Usage : $vcf->parse_header(); my $agtags = $vcf->has_AGtags(); Args : None Returns : Hash {fmtA=>[tags],fmtG=>[tags],infoA=>[tags],infoG=>[tags]} or undef if none is present
About : Breaks tags with variable number of fields (that is where Number is set to 'A' or 'G', such as GL) into hashes Usage : my $x = $vcf->next_data_hash(); my $values = $vcf->parse_AGtags($x); Args : VCF data line parsed by next_data_hash : Mapping between ALT representations based on different REFs [optional] : New REF [optional] Returns : Hash {Allele=>Value}
About : Format tag with variable number of fields (that is where Number is set to 'A' or 'G', such as GL) Usage : Args : : : Returns :
About : Deprecated, use parse_haplotype instead. Usage : my $x = $vcf->next_data_hash(); my ($al1,$sep,$al2) = $vcf->parse_alleles($x,'NA00001'); Args : VCF data line parsed by next_data_hash : The genotype column name Returns : Alleles and the separator. If only one allele is present, $sep and $al2 will be an empty string.
About : Similar to parse_alleles, supports also multiploid VCFs. Usage : my $x = $vcf->next_data_hash(); my ($alleles,$seps,$is_phased,$is_empty) = $vcf->parse_haplotype($x,'NA00001'); Args : VCF data line parsed by next_data_hash : The genotype column name Returns : Two array refs and two boolean flags: List of alleles, list of separators, and is_phased/empty flags. The values can be cashed and must be therefore considered read only!
Usage : my ($alleles,$seps,$is_phased,$is_empty) = $vcf->parse_haplotype($x,'NA00001'); print $vcf->format_haplotype($alleles,$seps);
Usage : my $x = { REF=>'A', gtypes=>{'NA00001'=>{'GT'=>'A/C'}}, FORMAT=>['GT'], CHROM=>1, POS=>1, FILTER=>['.'], QUAL=>-1 }; $vcf->format_genotype_strings($x); print $vcf->format_line($x); Args 1 : VCF data line in the format as if parsed by next_data_hash with alleles written as letters. 2 : Optionally, a subset of columns can be supplied. See also format_line. Returns : Modifies the ALT array and the genotypes so that ref alleles become 0 and non-ref alleles numbers starting from 1. If the key $$vcf{trim_redundant_ALTs} is set, ALT alleles not appearing in any of the sample column will be removed.
Usage : $vcf->format_header_line({key=>'INFO', ID=>'AC',Number=>-1,Type=>'Integer',Description=>'Allele count in genotypes'}) Args : Returns :
Usage : my $rec=$vcf->next_data_hash(); $vcf->remove_columns($rec,remove=>['NA001','NA0002']); Args : VCF hash pointer : list of columns to remove or a lookup hash with column names to keep (remove=>[] or keep=>{}) Returns :
Usage : $vcf->add_columns('NA001','NA0002'); Args : Returns :
Usage : $x=$vcf->next_data_hash(); $vcf->add_format_field($x,'FOO'); $$x{gtypes}{NA0001}{FOO}='Bar'; print $vcf->format_line($x); Args : The record obtained by next_data_hash : The field name Returns :
Usage : $x=$vcf->next_data_hash(); $vcf->remove_format_field($x,'FOO'); print $vcf->format_line($x); Args : The record obtained by next_data_hash : The field name Returns :
Usage : $x=$vcf->next_data_array(); $$x[7]=$vcf->add_info_field($$x[7],'FOO'=>'value','BAR'=>undef,'BAZ'=>''); print join("\t",@$x)."\n"; Args : The record obtained by next_data_array : The INFO field name and value pairs. If value is undef and the key is present in $$x[7], it will be removed. To add fields without a value, use empty string ''. Returns : The formatted INFO.
Usage : $x=$vcf->next_data_array(); $$x[6]=$vcf->add_filter($$x[6],'SnpCluster'=>1,'q10'=>0); print join("\t",@$x)."\n"; Args : The record obtained by next_data_array or next_data_hash : The key-value pairs for filter to be added. If value is 1, the filter will be added. If 0, the filter will be removed. Returns : The formatted filter field.
Usage : my $x = $vcf->next_data_hash(); $vcf->validate_filter_field($$x{FILTER}); Args : The FILTER arrayref Returns : Error message in case of an error.
About : Version specific header validation code. Usage : my $vcf = Vcf->new(); $vcf->parse_header(); $vcf->validate_header(); Args :
About : Version specific line validation code. Usage : my $vcf = Vcf->new(); $vcf->parse_header(); $x = $vcf->next_data_hash; $vcf->validate_line($x); Args :
Usage : my $x = $vcf->next_data_hash(); $vcf->validate_info_field($$x{INFO},$$x{ALT}); Args : The INFO hashref Returns : Error message in case of an error.
Usage : my $x = $vcf->next_data_hash(); $vcf->validate_gtype_field($$x{gtypes}{NA00001},$$x{ALT},$$x{FORMAT}); Args : The genotype data hashref The ALT arrayref Returns : Error message in case of an error.
About : Validates the VCF file. Usage : my $vcf = Vcf->new(file=>'file.vcf'); $vcf->run_validation('example.vcf.gz'); Args : File name or file handle.
About : Get list of chromosomes from the VCF file. Must be bgzipped and tabix indexed. Usage : my $vcf = Vcf->new(); $vcf->get_chromosomes(); Args : none
About : Get list of samples. Usage : my $vcf = Vcf->new(); $vcf->parse_header(); my (@samples) = $vcf->get_samples(); Args : none
About : Convenient way to get data for a sample Usage : my $rec = $vcf->next_data_array(); my $sample_col = $vcf->get_column($rec, 'NA0001'); Args 1 : Array pointer returned by next_data_array 2 : Column/Sample name
About : Mapping between zero-based VCF column and its name Usage : my $vcf = Vcf->new(); $vcf->parse_header(); my $name = $vcf->get_column_name(1); # returns POS Args : Index of the column (0-based)
About : Mapping between VCF column name and its zero-based index Usage : my $vcf = Vcf->new(); $vcf->parse_header(); my $name = $vcf->get_column_index('POS'); # returns 1 Args : Name of the column
Vcf.pm. Module for validation, parsing and creating VCF files. Supported versions: 3.2, 3.3, 4.0, 4.1
VCFv4.0 specific functions
Usage : $vcf->parse_header_line(q[##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">]) $vcf->parse_header_line(q[reference=1000GenomesPilot-NCBI36]) Args : Returns :
About : A tool for merging VCFv4.0 records. The subroutine unifies the REFs and creates a mapping from the original haplotypes to the haplotypes based on the new REF. Consider the following example: REF ALT G GA GT G GT GA GT GAA GTC G G <DEL> my $map={G=>{GA=>1},GT=>{G=>1,GA=>1,GAA=>1},GTC=>{G=>1},G=>{'<DEL>'=>1}}; my $new_ref=$vcf->fill_ref_alt_mapping($map); The call returns GTC and $map is now G GA -> GTC GATC GT G -> GTC GC GT GA -> GTC GAC GT GAA -> GTC GAAC GTC G -> GTC G G <DEL> -> GTC <DEL> Args : Returns : New REF string and fills the hash with appropriate ALT.
VCFv4.1 specific functions
Andrew J. Page <ap13@sanger.ac.uk>
This software is Copyright (c) 2012 by Wellcome Trust Sanger Institute.
This is free software, licensed under:
The GNU General Public License, Version 3, June 2007
To install Bio::Pipeline::Comparison, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Bio::Pipeline::Comparison
CPAN shell
perl -MCPAN -e shell install Bio::Pipeline::Comparison
For more information on module installation, please visit the detailed CPAN module installation guide.