Manual of the intial release

Please contact Lin Huang <linhuang@cs.stanford.edu> for questions or comments

Prepare binary input files

The bash script prep.sh converts a bam file and a bai file of a sample into a binary format *.reveel. We provide this script for two reasons: (i) to save space. The created binary file costs much smaller space than the bam and bai files. (ii) to facilitate the parallelism of the main tool. We allow users to partition a chromosome into a series of segments; the reveel genotyper can be applied on these segments separately. To do so, please specify an interested segment in the command line when running bash prep.sh.

Once the *.reveel files of all the samples for this segment are created, simply use the "cat" command in Unix to concatenate all the *.reveel files for this segment and name the output as *.reveels. This *.reveels file will be used as an input of the genotyper.

Prerequisite

SAMtools, reference fasta

Usage

Command line is

bash prep.sh <bam> <reference.fasta> <chr> <lo> <hi> <output_prefix>

Arguments

Argument name Type Details
bam string the path + filename of the bam file. The bai file is not explicitly specified in the command; it should be named as bam_filename+".bai".
reference.fasta string the path + filename of the reference fasta file
chr string chromosome
lo int the start position of a segment (inclusive)
hi int the end position of a segment (exclusive)
output_prefix string the prefix of output files

Output

<output_prefix>.reveel

Usage example

bash prep.sh HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam chrom20.fasta 20 43000000 48000000 HG00096

Population genotyper

We provides three modes: regular Reveel, Reveel-lite, and Reveel with Beagle. Reveel-lite is the most time efficient mode; Reveel with Beagle is the most accurate mode; regular Reveel balances the time efficiency and genotyping accuracy. To give you a flavor of how these three modes compare with each other, here we show their performance on a simulated data set with 1,000 samples.

No matter which mode you choose, we wrap the whole pipeline into a single bash file: reveel_genotyper.sh to give users a friendly interface. In case you need more information, the explanation of our population genotyper reveel_caller, which is wrapped in reveel_genotyper.sh, can be found at the end of this section.

Mode Genotyping accuracy Running time
regular Reveel 99.9477% 33.7 minutes
Reveel-lite 99.9365% 19.7 minutes
Reveel with Beagle 99.9644% 47.0 minutes

Prerequisite

beagle.jar and java (if choose Reveel with Beagle mode)

Usage

Command line is

bash reveel_genotyper.sh <reveels> <samples> <region> <mode> <block> <threshold> <error_rate> <repick_iter> <chr> <offset> <output_prefix>

Arguments

Argument name Type Details
reveels string the *.reveels file prepared in the pre-processing step
samples int number of samples in the data set
region int region length. The number of base pairs within the selected region. Remind that, the pre-processing has two parameters: lo and hi. The parameter region should be hi minus lo.
mode int mode. 0: regular Reveel; 1: Reveel-lite; 2: Reveel with Beagle
block int block size. For each SNP, we look for its similar sites within the block the SNP belongs to. We recommend 500,000 or 1,000,000 for this parameter. Larger block size usually results in better genotyping accuracy but longer CPU time.
threshold float we set a threshold to differentiate likely polymorphic sites from constant sites. If the data set includes thousands of samples, we recommend setting this parameter to be 1. Higher threshold gives better precision but lower recall of SNP calling.
error_rate float sequencing base error rate. Users can give a rough estimation, because the experiments show our tool is not sensitive to the estimation error. In our experiments, we used 0.001. Users can try this value to begin with.
repick_iter int how many times the genotyper re-picks k-nearest neighbors. The current version supports 1 to 5 times. We recommend 3 or 5.
chr string chromosome
offset int This is just for output. Write lo here.
output_prefix string the prefix of output files

Output

<output_prefix>.out      Each line is a polymorphic site. Column 1, chromosome. Column 2, position. Column 3, major allele. Column 4, minor allele. Column 5, minor allele frequency. Column 6-end, genotypes of samples, in the order of users concatenating the reveel files.

Usage example

bash reveel_genotyper.sh chr20.43000000-48000000.reveels 100 5000000 0 500000 0.5 0.001 3 chr20 43000000 chr20.43000000-48000000

Options of command  reveel_caller shortlist

Option name Type Details
-b int block size [default: 1000000]
-s int sample number [default: 2000]
-r int region length [default: 1000000]
-t float candidate threshold [default: 0.5]
-p float per base error rate [default: 0.001]
-i int refinement iteration [default: 10]
-k int k-nearest neighbor selection iteration [default: 5]
-M int mode. 0: regular, 1: lite, 2: with beagle [default: 0]

Caveats

  • Suppose the samples are from different populations, our experiments show that running the reveel genotyper on each population separately could give better results.

  • The prep.sh script uses samtools' default minimum mapping quality for an alignment to be used and the default minimum base quality for a base to be considered. If you have a desired setting, please edit prep.sh.

  • If launching multiple jobs on the same machine, please run the jobs within containers.