Manual of the intial release

Please contact Lin Huang <linhuang@cs.stanford.edu> for questions or comments

Prepare binary input files

The bash script prep.sh converts a bam file and a bai file of a sample into a binary format *.reveel. We provide this script for two reasons: (i) to save space. The created binary file costs much smaller space than the bam and bai files. (ii) to facilitate the parallelism of the main tool. We allow users to partition a chromosome into a series of segments; the reveel genotyper can be applied on these segments separately. To do so, please specify an interested segment in the command line when running bash prep.sh.

Once the *.reveel files of all the samples for this segment are created, simply use the "cat" command in Unix to concatenate all the *.reveel files for this segment and name the output as *.reveels. This *.reveels file will be used as an input of the genotyper.

Prerequisite

SAMtools, reference fasta

Usage

Command line is

bash prep.sh <bam> <reference.fasta> <chr> <lo> <hi> <output_prefix>

Arguments

Argument name	Type	Details
bam	string	the path + filename of the bam file. The bai file is not explicitly specified in the command; it should be named as bam_filename+".bai".
reference.fasta	string	the path + filename of the reference fasta file
chr	string	chromosome
lo	int	the start position of a segment (inclusive)
hi	int	the end position of a segment (exclusive)
output_prefix	string	the prefix of output files

Output

<output_prefix>.reveel

Usage example

bash prep.sh HG00096.chrom20.ILLUMINA.bwa.GBR.low_coverage.20120522.bam chrom20.fasta 20 43000000 48000000 HG00096

Population genotyper

We provides three modes: regular Reveel, Reveel-lite, and Reveel with Beagle. Reveel-lite is the most time efficient mode; Reveel with Beagle is the most accurate mode; regular Reveel balances the time efficiency and genotyping accuracy. To give you a flavor of how these three modes compare with each other, here we show their performance on a simulated data set with 1,000 samples.

No matter which mode you choose, we wrap the whole pipeline into a single bash file: reveel_genotyper.sh to give users a friendly interface. In case you need more information, the explanation of our population genotyper reveel_caller, which is wrapped in reveel_genotyper.sh, can be found at the end of this section.

Mode	Genotyping accuracy	Running time
regular Reveel	99.9477%	33.7 minutes
Reveel-lite	99.9365%	19.7 minutes
Reveel with Beagle	99.9644%	47.0 minutes

Prerequisite

beagle.jar and java (if choose Reveel with Beagle mode)

Usage

Command line is

bash reveel_genotyper.sh <reveels> <samples> <region> <mode> <block> <threshold> <error_rate> <repick_iter> <chr> <offset> <output_prefix>

Arguments

Argument name	Type	Details
reveels	string	the *.reveels file prepared in the pre-processing step
samples	int	number of samples in the data set
region	int	region length. The number of base pairs within the selected region. Remind that, the pre-processing has two parameters: lo and hi. The parameter region should be hi minus lo.
mode	int	mode. 0: regular Reveel; 1: Reveel-lite; 2: Reveel with Beagle
block	int	block size. For each SNP, we look for its similar sites within the block the SNP belongs to. We recommend 500,000 or 1,000,000 for this parameter. Larger block size usually results in better genotyping accuracy but longer CPU time.
threshold	float	we set a threshold to differentiate likely polymorphic sites from constant sites. If the data set includes thousands of samples, we recommend setting this parameter to be 1. Higher threshold gives better precision but lower recall of SNP calling.
error_rate	float	sequencing base error rate. Users can give a rough estimation, because the experiments show our tool is not sensitive to the estimation error. In our experiments, we used 0.001. Users can try this value to begin with.
repick_iter	int	how many times the genotyper re-picks k-nearest neighbors. The current version supports 1 to 5 times. We recommend 3 or 5.
chr	string	chromosome
offset	int	This is just for output. Write lo here.
output_prefix	string	the prefix of output files

Output

<output_prefix>.out Each line is a polymorphic site. Column 1, chromosome. Column 2, position. Column 3, major allele. Column 4, minor allele. Column 5, minor allele frequency. Column 6-end, genotypes of samples, in the order of users concatenating the reveel files.

Usage example

bash reveel_genotyper.sh chr20.43000000-48000000.reveels 100 5000000 0 500000 0.5 0.001 3 chr20 43000000 chr20.43000000-48000000

Options of command reveel_caller shortlist

Option name	Type	Details
-b	int	block size [default: 1000000]
-s	int	sample number [default: 2000]
-r	int	region length [default: 1000000]
-t	float	candidate threshold [default: 0.5]
-p	float	per base error rate [default: 0.001]
-i	int	refinement iteration [default: 10]
-k	int	k-nearest neighbor selection iteration [default: 5]
-M	int	mode. 0: regular, 1: lite, 2: with beagle [default: 0]

Caveats

Suppose the samples are from different populations, our experiments show that running the reveel genotyper on each population separately could give better results.
The prep.sh script uses samtools' default minimum mapping quality for an alignment to be used and the default minimum base quality for a base to be considered. If you have a desired setting, please edit prep.sh.
If launching multiple jobs on the same machine, please run the jobs within containers.