Manual of the intial release
Prepare binary input files
The bash script prep.sh converts a bam file and a bai file of a sample into a binary format *.reveel.
We provide this script for two reasons: (i) to save space.
The created binary file costs much smaller space than the bam and bai files.
(ii) to facilitate the parallelism of the main tool.
We allow users to partition a chromosome into a series of segments;
the reveel genotyper can be applied on these segments separately.
To do so, please specify an interested segment in the command line when running bash prep.sh.
Once the *.reveel files of all the samples for this segment are created,
simply use the "cat" command in Unix to concatenate all the *.reveel files for this segment and name the output as *.reveels.
This *.reveels file will be used as an input of the genotyper.
Prerequisite
SAMtools, reference fasta
Usage
Command line is
bash prep.sh <bam> <reference.fasta> <chr> <lo> <hi> <output_prefix>
Arguments
Argument name | Type | Details |
bam | string | the path + filename of the bam file. The bai file is not explicitly specified in the command; it should be named as bam_filename+".bai". |
reference.fasta | string | the path + filename of the reference fasta file |
chr | string | chromosome |
lo | int | the start position of a segment (inclusive) |
hi | int | the end position of a segment (exclusive) |
output_prefix | string | the prefix of output files |
Output
<output_prefix>.reveel
Usage example
bash prep.sh HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam chrom20.fasta 20 43000000 48000000 HG00096
Population genotyper
We provides three modes: regular Reveel, Reveel-lite, and Reveel with Beagle. Reveel-lite is the most time efficient mode; Reveel with Beagle is the most accurate mode; regular Reveel balances the time efficiency and genotyping accuracy.
To give you a flavor of how these three modes compare with each other, here we show their performance on a simulated data set with 1,000 samples.
No matter which mode you choose, we wrap the whole pipeline into a single bash file: reveel_genotyper.sh to give users a friendly interface.
In case you need more information, the explanation of our population genotyper reveel_caller, which is wrapped in reveel_genotyper.sh, can be found at the end of this section.
Mode | Genotyping accuracy | Running time |
regular Reveel | 99.9477% | 33.7 minutes |
Reveel-lite | 99.9365% | 19.7 minutes |
Reveel with Beagle | 99.9644% | 47.0 minutes |
Prerequisite
beagle.jar and java (if choose Reveel with Beagle mode)
Usage
Command line is
bash reveel_genotyper.sh <reveels> <samples> <region> <mode> <block> <threshold> <error_rate> <repick_iter> <chr> <offset> <output_prefix>
Arguments
Argument name | Type | Details |
reveels | string | the *.reveels file prepared in the pre-processing step |
samples | int | number of samples in the data set |
region | int | region length. The number of base pairs within the selected region. Remind that, the pre-processing has two parameters: lo and hi. The parameter region should be hi minus lo. |
mode | int | mode. 0: regular Reveel; 1: Reveel-lite; 2: Reveel with Beagle |
block | int | block size. For each SNP, we look for its similar sites within the block the SNP belongs to. We recommend 500,000 or 1,000,000 for this parameter. Larger block size usually results in better genotyping accuracy but longer CPU time. |
threshold | float | we set a threshold to differentiate likely polymorphic sites from constant sites. If the data set includes thousands of samples, we recommend setting this parameter to be 1. Higher threshold gives better precision but lower recall of SNP calling. |
error_rate | float | sequencing base error rate. Users can give a rough estimation, because the experiments show our tool is not sensitive to the estimation error. In our experiments, we used 0.001. Users can try this value to begin with. |
repick_iter | int | how many times the genotyper re-picks k-nearest neighbors. The current version supports 1 to 5 times. We recommend 3 or 5. |
chr | string | chromosome |
offset | int | This is just for output. Write lo here. |
output_prefix | string | the prefix of output files |
Output
<output_prefix>.out Each line is a polymorphic site. Column 1, chromosome. Column 2, position. Column 3, major allele. Column 4, minor allele. Column 5, minor allele frequency. Column 6-end, genotypes of samples, in the order of users concatenating the reveel files.
Usage example
bash reveel_genotyper.sh chr20.43000000-48000000.reveels 100 5000000 0 500000 0.5 0.001 3 chr20 43000000 chr20.43000000-48000000
Options of command reveel_caller shortlist
Option name | Type | Details |
-b | int | block size [default: 1000000] |
-s | int | sample number [default: 2000] |
-r | int | region length [default: 1000000] |
-t | float | candidate threshold [default: 0.5] |
-p | float | per base error rate [default: 0.001] |
-i | int | refinement iteration [default: 10] |
-k | int | k-nearest neighbor selection iteration [default: 5] |
-M | int | mode. 0: regular, 1: lite, 2: with beagle [default: 0] |
Caveats
Suppose the samples are from different populations, our experiments show that running the reveel genotyper on each population separately could give better results.
The prep.sh script uses samtools' default minimum mapping quality for an alignment to be used and the default minimum base quality for a base to be considered. If you have a desired setting, please edit prep.sh.
If launching multiple jobs on the same machine, please run the jobs within containers.
|