version 0.002
    use Bio::SNP::Inherit;

    my $foo = Bio::SNP::Inherit->new(
        manifest_filename => 'manifest.tab',
        data_filename     => 'data.tab'
    );

    # Upon object construction, this outputs a summary file
    # 'data.tab_summary.tab' and a detailed file 'data.tab_abh.tab'
    # containing parental allele designations for each sample that has
    # parents defined for it in the manifest file
This is a module for converting Single Nucleotide Polymorphism (SNP) genotype data to parental allele designations. This helps with creating files suitable for mapping, identifying and characterizing crossovers, and also helps with quality control.
Since the integrity of the data in the manifest file is absolutely vital, building an object fails if there are duplicate sample ids in the manifest file.
Name of the file containing information for each sample id. Required in the constructor.

The first line contains headers and the remaining lines contain tab-delimited fields in the following order:

    sample id or "Institute Sample Label" (e.g. "WG0096796-DNAA05")
    sample name or "Sample name" (e.g. "B73xB97")
    group name or "Group" (e.g. "NAM F1")
    parentA or "Mother" (e.g. "WG0096795-DNAA01")
    parentB or "Father" (e.g. "WG0096796-DNAF01")
    replicate of or "Replicate(s)" (id of sample that this replicates, e.g. "WG0096796-DNAA05")
    AxB F1 or "F1 of parentA and parentB" (e.g. "WG0096795-DNAA02")

The last four fields can be blank if they are not applicable. However, leaving them blank when they are applicable will cause the program to fail to analyze the data properly.
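The manifest layout and the duplicate-id rule described above can be illustrated with a short sketch. This is not the module's own code: the function name read_manifest and the hash keys are hypothetical, and the two parent rows in the sample data are made up for illustration (only the ids, "B73xB97" and "NAM F1" come from the examples above).

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the module's documented interface.
# Reads a manifest from a filehandle and returns a hash reference keyed
# by sample id. Dies if a sample id appears more than once.
sub read_manifest {
    my ($fh) = @_;

    # Hash keys invented for this sketch, in the documented column order
    my @fields = qw(sample_name group parentA parentB replicate_of f1);

    my $header = <$fh>;    # the first line contains headers
    my %sample_for;

    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;     # remove the line ending, keep trailing tabs
        next if $line !~ /\S/;    # skip blank lines

        # A limit of -1 keeps trailing empty fields
        my ( $id, @values ) = split /\t/, $line, -1;

        die "Duplicate sample id '$id' in manifest\n"
            if exists $sample_for{$id};

        my %record;
        @record{@fields} = map { defined $_ ? $_ : '' } @values[ 0 .. $#fields ];
        $sample_for{$id} = \%record;
    }

    return \%sample_for;
}

# Usage with an in-memory manifest (parent rows are illustrative)
my $manifest_text = join "\n",
    join( "\t", 'Institute Sample Label', 'Sample name', 'Group', 'Mother', 'Father', 'Replicate(s)', 'F1 of parentA and parentB' ),
    join( "\t", 'WG0096795-DNAA01', 'parentA line', 'parents', '', '', '', '' ),
    join( "\t", 'WG0096796-DNAF01', 'parentB line', 'parents', '', '', '', '' ),
    join( "\t", 'WG0096796-DNAA05', 'B73xB97', 'NAM F1', 'WG0096795-DNAA01', 'WG0096796-DNAF01', '', '' );

open my $fh, '<', \$manifest_text or die "Cannot open in-memory manifest: $!";
my $manifest = read_manifest($fh);
close $fh;

print "$_\n" for sort keys %{$manifest};
```

Failing on a duplicate id, rather than letting the later row overwrite the earlier one, matches the stated requirement that object construction fails when the manifest contains duplicate sample ids.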
Name of the tab-delimited file containing the data to be processed. Required in the constructor.

A line containing the text '[Data]' indicates that all remaining lines are data. The next line contains column headers, which are in fact the sample ids. Sample ids missing from the manifest file will not be processed. Each line after that contains the name of the SNP in the first field and data in the remaining fields. Data must be in the format of SNP_name{tab}AA{tab}GG{tab}.
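As a rough illustration of this layout, the sketch below reads such a file into a nested hash. It is not taken from the module: read_genotypes is a hypothetical name, the SNP names and preamble line are made up, and it assumes that the first field of the header line labels the SNP-name column rather than being a sample id.

```perl
use strict;
use warnings;

# Hypothetical reader, not the module's implementation.
# Returns ( \@sample_ids, \%genotype_for ), where
# $genotype_for{$snp_name}{$sample_id} is a genotype such as 'AA'.
sub read_genotypes {
    my ($fh) = @_;

    # Skip everything up to and including the line containing '[Data]'
    my $found_marker = 0;
    while ( my $line = <$fh> ) {
        if ( index( $line, '[Data]' ) >= 0 ) {
            $found_marker = 1;
            last;
        }
    }
    die "No '[Data]' line found\n" if !$found_marker;

    # Header line: assumed to be one label for the SNP column, then sample ids
    my $header = <$fh>;
    die "Missing header line after '[Data]'\n" if !defined $header;
    $header =~ s/\r?\n\z//;
    my ( undef, @sample_ids ) = split /\t/, $header;

    my %genotype_for;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;
        next if $line !~ /\S/;    # skip blank lines

        # Without a limit, split drops the empty field after a trailing tab
        my ( $snp, @calls ) = split /\t/, $line;
        for my $i ( 0 .. $#sample_ids ) {
            $genotype_for{$snp}{ $sample_ids[$i] }
                = defined $calls[$i] ? $calls[$i] : '';
        }
    }

    return ( \@sample_ids, \%genotype_for );
}

# Usage with in-memory data (SNP names are illustrative)
my $data_text = join "\n",
    'lines before the marker are ignored',
    '[Data]',
    join( "\t", 'SNP', 'WG0096795-DNAA01', 'WG0096796-DNAF01', 'WG0096796-DNAA05' ),
    "snp_1\tCC\tGG\tCG\t",
    "snp_2\tAA\tAA\tAA\t";

open my $fh, '<', \$data_text or die "Cannot open in-memory data: $!";
my ( $sample_ids, $genotype_for ) = read_genotypes($fh);
close $fh;

print "snp_1 for $_: $genotype_for->{snp_1}{$_}\n" for @{$sample_ids};
```

A caller would then look up each sample id in the manifest and skip those that are absent, in line with the rule that sample ids missing from the manifest file are not processed.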
Upon object construction, two files are produced: one that summarizes the input and another that describes the genotypes of samples in terms of their "parents". For example, a sample with a genotype of "CG" whose 'parentA' has a genotype of "CC" and whose 'parentB' has a genotype of "GG" would have a heterozygous genotype, labeled as 'H'. Here are the possible allele designations that result:

Allele designations for informative genotypes:

    A  = parentA genotype
    B  = parentB genotype
    H  = heterozygous genotype

Allele designations for noninformative genotypes:

    ~  = nonpolymorphic parents (i.e. both parents have the same genotype)
    -  = missing data
    -- = missing data for at least one parent
    %  = polymorphic parent

Error codes:

    #  = conflict of nonpolymorphic expectation, meaning both parents have the same genotype, but the sample has a different genotype. For example, parentA and parentB both have the genotype 'CC', but the sample has a genotype of 'TT'.
    !  = nonparental genotype, meaning each parent has a different genotype, but the sample has at least one allele not seen in either parent. For example, getting 'AG' for the offspring when the parents have 'GG' and 'TT'. (This should never be seen when the data were obtained from a biallelic assay.)
    !! = genotype of the F1 for parentA x parentB is incongruent with the genotype for parentA

See the bundled tests for examples.
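The designation rules above can be sketched as a single function. This is a simplified illustration under stated assumptions, not the module's actual logic: the name designate is hypothetical, any call that is not two letters from A, C, G, T is treated as missing, '%' is interpreted as a heterozygous parent, the parental check runs before the sample check, and the '!!' code is omitted because it needs the F1 genotype.

```perl
use strict;
use warnings;

# Sort the alleles of a genotype so that 'GC' and 'CG' compare equal
sub _normalize {
    my ($genotype) = @_;
    return '' if !defined $genotype;
    my @alleles = split //, $genotype;
    return join '', sort @alleles;
}

# Hypothetical function: returns one of the designation codes listed
# above for a sample genotype, given the genotypes of its two parents.
sub designate {
    my ( $sample, $parent_a, $parent_b ) = map { _normalize($_) } @_;

    # Assumption: a valid call is exactly two letters from A, C, G, T
    my $is_call = qr/\A[ACGT]{2}\z/;
    return '--' if $parent_a !~ $is_call || $parent_b !~ $is_call;
    return '-'  if $sample !~ $is_call;

    # Assumption: 'polymorphic parent' means a heterozygous parent
    my $a_is_het = substr( $parent_a, 0, 1 ) ne substr( $parent_a, 1, 1 );
    my $b_is_het = substr( $parent_b, 0, 1 ) ne substr( $parent_b, 1, 1 );
    return '%' if $a_is_het || $b_is_het;

    # Both parents homozygous and identical: noninformative or a conflict
    if ( $parent_a eq $parent_b ) {
        return $sample eq $parent_a ? '~' : '#';
    }

    return 'A' if $sample eq $parent_a;
    return 'B' if $sample eq $parent_b;

    # One allele from each parent
    my $expected_het
        = _normalize( substr( $parent_a, 0, 1 ) . substr( $parent_b, 0, 1 ) );
    return 'H' if $sample eq $expected_het;

    return '!';    # at least one allele not seen in either parent
}

# Usage with the examples given in the text
print designate( 'CG', 'CC', 'GG' ), "\n";    # H
print designate( 'TT', 'CC', 'CC' ), "\n";    # #
print designate( 'AG', 'GG', 'TT' ), "\n";    # !
```

The order of the checks is a design choice of this sketch: missing parental data is reported first because none of the later comparisons are meaningful without both parental genotypes.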
Output report detailing which samples have been processed and in what way. Also give descendant and ancestor relationships.

Document ability to process files using F1 and parentA info (i.e. in the absence of parentB info).

Add simple means of adding map info so that distances and chromosomes are output along with the marker names.

Give crossover info?

Give introgressions/regions attributable to specific ancestor(s).

Use benchmarking to find out which (if any) to memoize:

    _nonredundant_chars
    _trim
    _is_comprised_from
    _sorted_characters
    _sort_and_join
    _chars_from
    _sorted_first_two_char

Test bad file names
TODO
Please report any bugs you find. None have been reported as of the current release.
This is ALPHA code. Use at your own risk. I plan to make some major changes to it.
Be conscientious with the preparation of your input files (i.e. manifest file and data file). Correct results depend on correct input files.
Christopher Bottoms, <molecules at cpan.org>
<molecules at cpan.org>
You can find documentation for this module with the perldoc command.
perldoc Bio::SNP::Inherit
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.
Copyright 2010 Christopher Bottoms.
To install Bio::SNP::Inherit, copy and paste the appropriate command into your terminal.
cpanm
cpanm Bio::SNP::Inherit
CPAN shell
perl -MCPAN -e shell
install Bio::SNP::Inherit
For more information on module installation, please visit the detailed CPAN module installation guide.