CRaVe Download and Instructions

CRaVe

Section 0:

To download a beta-version of CRaVe 0.0.2, please choose

	UNIX:	CRaVe
	R-library (Windows 7):	CRaVe
	R-library (UNIX):	CRaVe

To install CRaVe, follow these simple steps: setup

Our goal is to make this software as user-friendly as possible and therefore we encourage our users to provide feedback. First, as this is a beta-version, we do expect there to be some bugs, so please email us if you encounter any problems: contact DCEG. However, we also encourage you to let us know if we omitted any options, features, or methods that could make CRaVe more useful.

Section 1: Introduction

CRaVe is a free, open source, software package designed to perform a range of association tests between sets of SNPs and a phenotype. The basic input for CRaVe includes three types of information:

1) Genotypes for a set of individuals at a list of SNPs.

	Subject 1	Subject 2	Subject 3
SNP 1	AT	CC	AC
SNP 2	AA	CC	AA
SNP 3	AA	CG	AC
.	.	.	.

2) The Gene or Group ID for each SNP

		Gene ID
	SNP 1	g1
	SNP 2	g1
	SNP 3	g1
	.	.
	SNP 41	g2
	SNP 42	g2
	.	.

3) Phenotypes for a set of individuals

		Phenotype
	Subject 1	1
	Subject 2	1
	Subject 3	0

The basic output of CRaVe is

1) One or more p-values for each gene. The p-values assess the significance of the relationship between gene and phenotype

Gene	Sum-Test	Hotelling's T-Test
g1	0.03	0.05
g2	0.67	0.82
g3	0.21	0.63

Section 2: File Formats

Here, we discuss the actual file formats needed by CRaVe

1) Genotypes for a set of individuals at a list of SNPs.

The genotype file can be either a variant call format file, which usually ends in .vcf, or a transposed PLINK ped file, which usually ends in .tped.

2) Gene or Group ID for each SNP

Both .tped and .vcf files include chromosome and position for each SNP. Installed with CRaVe is the database hg19/ncbi build 37. The database is partitioned by chromosome into files named chr_xx.txt.gz, where xx is the chromosome number. These files have the specific format as defined by the gene-file option (see link for key details). The gene-file option refers to the directory where the chr_xx.txt.gz files are located. If, say, the hg18/build 36 coordinates are needed, then a user can download them from here, use the command "tar -xzvf hg18.tar.gz" to un-pack the files, and then use the option gene-file when running CRaVe. For instance, if the files for hg18 have been downloaded into folder ./mydata/someFolder3 and un-packed, then an example run might look like:

./crave geno_file=someFile.vcf gene_file=./mydata/someFolder3/hg18 pheno_file=someFile.txt tests=RO,SUM,TH

Note that the folder hg18 gets automatically created when un-packing the file hg18.tar.gz. A user can also specify their own categorization in their own files provided their files are named chr_xx.txt.gz, where xx is the chromosome number, and they all have the format as described in the gene-file option. An (uncompressed) example file can be found here genefile.

3) Phenotypes for a set of individuals

This file should consist of at least 2 columns. By default, CRaVe assumes column 1 is for the subject ids and column 2 is for the response.

Column 1: Subject ID

When the genotype file is a .tped file, the subjects listed in the phenotype file must match their order in the genotype file. When the genotype file is a .vcf file, their order is inconsequential but the subject IDs must match those in the .vcf file.

Column 2: Phenotype Value.

If the phenotype is case/control status, use 0/1 values so that CRaVe recognizes it is looking at a case/control study. Different case/control values can be specified by using the case_value and control_value options.

This file can also contain additional columns containing covariate values for an individual. To adjust for covariates, one must use the covars option. An example of a phenotype file can be found here pheno.
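
As an illustration of the layout described above, the following Python sketch writes a small phenotype file with the subject ID in column 1, the case/control status in column 2, and two covariates after that. The subject IDs, the column names (id, status, age, sex), the covariate values, and the choice of a tab delimiter with a header row are all assumptions made for this example; they are not taken from the CRaVe example file.

```python
import csv
import io

# Hypothetical subjects: ID, case/control status (1 = case, 0 = control),
# followed by two made-up covariates (age, sex).
rows = [
    ("subj1", 1, 54, 0),
    ("subj2", 1, 61, 1),
    ("subj3", 0, 47, 1),
]

def write_pheno(handle, rows, header=("id", "status", "age", "sex")):
    """Write a tab-delimited phenotype table.

    Column 1 is the subject ID, column 2 the phenotype,
    and any remaining columns are covariates.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)

# Write to an in-memory buffer; swap in open("pheno.txt", "w") for a real file.
buffer = io.StringIO()
write_pheno(buffer, rows)
pheno_text = buffer.getvalue()
print(pheno_text)
```

With a file laid out this way, covars=3,4 would select the two covariate columns, and pheno_header=1 would tell CRaVe that row 1 holds column names.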

4) Output files

One output file is specified with the out_gene option and has the following columns:

	Column 1	Gene ID
	Column 2	Chromosome
	Column 3	Gene start position
	Column 4	Gene end position
	Column 5	Number of SNPs
	Column 6	Permutation p-value for the first requested test
	Column 7+	Permutation p-value for additional tests

Another output file is specified with the out_snp option that lists results for individual SNPs:

	Column 1	Chromosome
	Column 2	Gene ID
	Column 3	SNP position
	Column 4	Minor allele frequency
	Column 5	Number of subjects with non-missing genotypes
	Column 6	Effect size of the SNP
	Column 7	Delta value of the SNP
	Column 8	P-value (2-sided using delta as the z-test statistic)

Examples of both output files can be found at gene-file and SNP-file.
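
Column 8 of the SNP file is described as a 2-sided p-value that uses delta as the z-test statistic. A minimal Python sketch of that calculation is given below. It assumes delta is compared against a standard normal distribution, which is our reading of "z-test statistic"; the function name two_sided_p is hypothetical and is not part of CRaVe.

```python
import math

def two_sided_p(delta):
    """Two-sided p-value treating delta as a standard normal z statistic."""
    # For Z ~ N(0, 1): P(|Z| >= |delta|) = erfc(|delta| / sqrt(2)).
    return math.erfc(abs(delta) / math.sqrt(2.0))

# A few illustrative delta values (made up for this example).
for delta in (0.0, 1.96, -2.5):
    print(delta, round(two_sided_p(delta), 4))
```

Because the test is two-sided, the sign of delta does not change the p-value; only its magnitude matters.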

5) Bioinformatic Weight

Prespecified weights can be assigned to each SNP using the weight_file option. These weights should be listed in a file containing one or two columns. Alternatively, a weight of 0 or 1 can be assigned to codons, exons or nonsynonymous SNPs using the weight= option.

Section 3: Statistical Tests

CRaVe can perform a range of association tests between SNPs and a phenotype. Currently, the possible tests include:

	ALL	All non-user defined tests
	BM	Bounded Minimum Test
	CMC	Combined Multivariate and Collapsing Test
	DAS	Data Adaptive Sum Test
	FI	Fisher's Test
	HOI	Hotelling's T² Test assuming independence
	HOT	Hotelling's T² Test
	MDF	Multiple Degrees of Freedom Test
	MDP	Multiple Degrees of Freedom Test using positive deltas
	RO	ROVER
	RV	ROVER-V test that includes the variances in the weights
	STO	Stouffer's Z-score Test
	STP	Stouffer's Positive Z-score Test
	SUM	Sum Test
	TH	Threshold Test
	SKAT	Alias for MDF [Under standard conditions, SKAT [PMID: 21737059] is equivalent to MDF]
	SKATO	Maximum of linear combinations of MDF and SUM statistics [similar to SKAT-O [PMID: 22863193]]
	CAL	Alias for C-alpha [Under standard conditions, the C-alpha statistic [PMID: 21408211] is equivalent to MDF]

Some tests are modified versions of those described by their original authors. The user needs to list the desired tests in the command line with the option tests=. Without including any tests, CRaVe defaults to calculating HOI. Some tests run much faster than others because they never require performing a matrix inversion. These faster tests are: FI, HOI, RO, STP and STO.

Section 4: Command Line

The simplest command, when the user has listed the SNPs by their hg19 positions, only requests ROVER, and does not use bioinformatic weights would be

./crave geno_file=someFile.vcf pheno_file=someFile.txt

However, in general, the user will want to specify multiple options. Most options are specified by their name, followed by an "=" sign, followed by the desired value. Below are a few examples.

./crave geno_file=someFile.vcf pheno_file=someFile.txt tests=RO,SUM,TH
./crave geno_file=someFile.vcf pheno_file=someFile.txt max_miss_rate=0.5
./crave geno_file=someFile.vcf pheno_file=someFile.txt exclude_snp=/temp/remove_SNPs.txt
./crave geno_file=someFile.vcf pheno_file=someFile.txt include_chr=1-5,22
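
When CRaVe is launched from a script, the name=value convention shown above is easy to assemble programmatically. The Python sketch below builds such a command line from keyword arguments. The helper name build_crave_command is hypothetical; the option names and values are the ones used in the examples above.

```python
import shlex

def build_crave_command(executable, **options):
    """Return the argument list for a CRaVe run from name=value options."""
    args = [executable]
    for name, value in options.items():
        # Lists such as tests=RO,SUM,TH are joined with commas.
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        args.append(f"{name}={value}")
    return args

cmd = build_crave_command(
    "./crave",
    geno_file="someFile.vcf",
    pheno_file="someFile.txt",
    tests=["RO", "SUM", "TH"],
    max_miss_rate=0.5,
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting problems with file paths.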

If crave_001.tar.gz was installed in folder /someFolder1 and the genotype and phenotype files are in folder /someFolder2, then the command would look similar to

/someFolder1/crave geno_file=/someFolder2/someFile.vcf pheno_file=/someFolder2/someFile.txt

The list of options include:

	adapt_maxperm={integer}	It is generally inefficient to run the maximum number of permutations for each gene. For those genes with relatively high p-values, only a small set of permutations is needed. We allow CRaVe to stop running permutations once it believes the p-values are sufficiently accurate. When we allow the number of permutations to depend on the p-value, we refer to them as adaptive permutations. However, we still need to define a maximum number of adaptive permutations with this option. Set to 0 for non-adaptive permutations. The default value is 1000000.
	adapt_tol={value}	Stopping tolerance for adaptive permutations. Smaller values guarantee more precise estimates. The algorithm stops if 3.92*sqrt(pval*(1.0-pval)/N)/pval < adapt_tol, where N is the total number of permutations that have been run so far. The default is 0.1.
	addcols={0,1,2}	Option to include additional columns in the out_gene file. If set to 1, then the number of permutations for each test will be included. If set to 2, then the number of permutations and the number of times the permuted test statistic was greater than the observed are included. These columns are useful for breaking up a large number of permutations into smaller jobs on a cluster. The default value is 0.
	allele_miss={character}	Missing value for alleles in a TPED file. The default is 0.
	bm_n={integer}	Number of points to consider in the Bounded Minimum Test. The default is 10.
	case_value={integer}	Value representing cases if response_col is for a case-control study. The default value is 1.
	chr_alias_file={file}	Tab delimited text file with the first column as the chromosome number (1-26) and the second column the chromosome string. This option is only used with VCF genotype files when the chromosome fields are not labeled as integers. An example file is chr_alias-file.
	control_value={integer}	Value representing controls if response_col is for a case-control study. The default value is 0.
	covars={list}	Comma separated list of column numbers or variable names in the phenotype file to use as covariates in the analysis. Example: covars=4,7,9 will use columns 4, 7 and 9 in the phenotype file as covariates. Values of 1 or 2 are generally not used because those columns are usually reserved for the subject ID and the outcome variable. The default is to not use covariates.
	das_alpha={value}	A number between 0 and 1 for the alpha parameter in the Data Adaptive Sum Test. The default is 0.05
	exclude_chr={list}	List of chromosomes to exclude in the analysis. Examples: 1,4,6 1-11,22, 1,3,5-10,X
	exclude_gene={file}	File containing 1 column of the genes to exclude in the analysis.
	exclude_snp={file}	File containing 1 column of the SNPs to exclude in the analysis.
	exclude_sub={file}	File containing 1 column of the subject ids to exclude in the analysis.
	gene_file={file}	Directory containing the gene information files partitioned by chromosome. The default value is hg19. More details at gene_file [Note that geno_file and gene_file define two different types of files.]
	gene_kb={integer}	Number of kilobases upstream and downstream to expand the gene regions from the gene tables. The default is 10.
	geno_file={file}	Uncompressed VCF or TPED file. No default value, must be specified. [Note that geno_file and gene_file define two different types of files.]
	geno_sep={character}	File delimiter for the genotype file. Use 't' for tab-delimited, ' ' for space delimited, and ',' for comma delimited files. The default is determined from the file.
	id={string}	Variable name or column number in pheno_file for the subject ids. The default value is 1
	include_chr={list}	List of chromosomes to include in the analysis. Examples: 1,4,6 1-11,22, 1,3,5-10,X The default value is 1-26.
	include_gene={file}	File containing 1 column of the genes to include in the analysis.
	include_snp={file}	File containing 1 column of the SNPs to include in the analysis.
	include_sub={file}	File containing 1 column of the subject ids to include in the analysis.
	max_maf={value}	Maximum MAF to use for which SNPs to include in the analysis. Any SNP with MAF > max_maf will be ignored. The default value is 0.5
	max_miss_rate={value}	Maximum missing rate for any SNP to be included in the analysis. The default value is 0.8.
	min_maf={value}	Minimum MAF to use for which SNPs to include in the analysis. Any SNP with MAF < min_maf will be ignored. The default value is 0
	nperm={integer}	For adaptive permutations, nperm is the number of permutations (per round) performed before checking the stopping criteria. (see option adapt_maxperm for definition of adaptive permutations). For non-adaptive permutations, nperm is the total number of permutations. The default value is 500.
	out_gene={file}	Output file for the genes. The default value is ./out_gene.txt.
	out_snp={file}	Output file for the SNPs. The default value is ./out_snp.txt.
	pheno_file={file}	Uncompressed file containing the response and covariates. No default value, must be specified.
	pheno_header={0 or 1}	Set to 1 if row 1 of pheno_file contains column names. The default value is determined from the file.
	pheno_miss={character}	Missing value for the phenotype file. The default value is '.'
	pheno_sep={character}	One character value for the phenotype file delimiter. Use 't' for tab-delimited, ' ' for space delimited, and ',' for comma delimited files. The default value is determined from the file.
	print={integer}	The larger the integer, the more information will be written to the console. This information is useful to see summary information, the genes that have been processed, and is critical for debugging. The default is 0 (no printing).
	reflink.geneName={string}	Name of the variable in the INFO field of the VCF file that gives the gene name. The default value is reflink.geneName.
	response={string}	Variable name or column number in pheno_file for the response. The default value is 2.
	rover_t={value}	Positive number t to define the weights (w = sqrt(1-exp(-t*delta*delta))) in the ROVER test. The default is 0.2.
	seed={integer}	Initial seed for the random permutations. The default is the returned value from the time() function.
	sortedByLoc={0 or 1}	Set to 0 if the genotype data is not sorted by SNP location within each chromosome. The default is 1.
	store_perms={0 or 1}	Set to 1 to store the permutations for efficiency. Due to memory issues, this option should only be used for a small sample size and number of permutations. The default is 0.
	strata={string}	strata can be set to the name of a variable in the phenotype file. When set, the outcome is permuted among individuals with the same value of the strata variable and the named variable is automatically included as a covariate.
	tests={List of strings}	Comma delimited list of tests to use. The valid tests are: BM, CMC, DAS, FI, HOI, HOT, MDF, MDP, RO, RV, STO, STP, SUM, TH. The strings are not case-sensitive. Example: tests=sto,bm,sum will compute the Stouffer, BM and SUM tests. Example: tests=c1,th will compute the user defined test c1 and the threshold test. Use tests=all to compute all non user defined tests. The default is MDF.
	threshold_alpha={value}	A number between 0 and 1 for the alpha parameter in the Threshold Test. The default is 0.05.
	update_cor={value}	The value such that the correlation matrix will be updated if the relative difference in the observed and permuted Hotelling test statistics is less than update_cor. This option is only valid for case-control data and when the option --useAllSubsForV is not used. The default is 0.3
	vs={value}	The value of vs, where vs abbreviates variance stabilization, is the value added to the estimated variance of each SNP. The default is 0.005
	weight={c, e, or n}	weight=e assigns a weight of 1 to all exomic SNPs and 0 to other SNPs. weight=g assigns a weight of 1 to SNPs between the start and stop codon and 0 to other SNPs. weight=n assigns a weight of 1 to nonsynonymous exomic SNPs and 0 to other SNPs. The default is that a weight of 1 is assigned to all SNPs. This option cannot be used with the weight_file= option, and weight=n is only valid for hg18 and hg19.
	weight_file={file}	Uncompressed file containing the SNP weights. This file may have 1 or 2 columns only. For a 1 column file the column of weights must match the order of the SNPs in the genotype file. For a 2 column file, the first column is the SNP id and the second column the weight. This option cannot be used with the weight= option.
	weight_header={0 or 1}	Set to 1 if row 1 of the weight file contains column names. The default value is determined from the file.
	weight_sep={character}	One character value for the weight file delimiter.
	--help	Display this information and exit.
	--notGroupedByChr	Use this flag if the genotype file is a TPED file that is not grouped by chromosome.
	--no_gene_file	Do not use the default gene table file for the gene names. Instead use the reflink.geneName value in the VCF file. This option is only valid if geno_file is a VCF file.
	--version	Display the version number and exit.
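
To give a feel for the rover_t option, the Python sketch below evaluates the weight formula from the table, read here as w = sqrt(1 - exp(-t * delta^2)). The function name rover_weight and the delta values are hypothetical and chosen only for illustration; this is not CRaVe's own code.

```python
import math

def rover_weight(delta, t=0.2):
    """Weight w = sqrt(1 - exp(-t * delta**2)) for a SNP with statistic delta."""
    if t <= 0:
        raise ValueError("t must be positive")
    return math.sqrt(1.0 - math.exp(-t * delta * delta))

# Illustrative delta values with the default t = 0.2.
for delta in (0.0, 1.0, 2.0, 4.0):
    print(f"delta={delta:.1f}  w={rover_weight(delta):.3f}")
```

Under this reading the weight is 0 when delta is 0 and approaches 1 as the magnitude of delta grows, and a larger t makes the weight rise toward 1 more quickly.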

!!!!! IMPORTANT !!!!! For p-values below 10^-5, we strongly suggest increasing the maximum number of permutations.
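
The reason for this advice can be seen from the adapt_tol stopping rule. The Python sketch below implements that rule, read as 3.92*sqrt(pval*(1-pval)/N)/pval < adapt_tol, and solves it for the number of permutations N required at a given p-value. The function names are hypothetical, and the handling of a p-value of 0 is an assumption made for this example rather than documented CRaVe behavior.

```python
import math

def should_stop(pval, n_perm, adapt_tol=0.1):
    """Apply the stopping rule to the current p-value estimate."""
    if pval <= 0.0:
        # Assumption: with no exceedances yet, precision cannot be assessed.
        return False
    rel_width = 3.92 * math.sqrt(pval * (1.0 - pval) / n_perm) / pval
    return rel_width < adapt_tol

def perms_needed(pval, adapt_tol=0.1):
    """Smallest N meeting the rule, from N > (3.92/adapt_tol)**2 * (1-pval)/pval."""
    bound = (3.92 / adapt_tol) ** 2 * (1.0 - pval) / pval
    return math.floor(bound) + 1

for p in (0.5, 0.05, 1e-3, 1e-5):
    print(f"p={p:g}: about {perms_needed(p):,} permutations")
```

With the default adapt_tol of 0.1, a p-value near 10^-5 needs on the order of 10^8 permutations to satisfy the rule, far more than the default adapt_maxperm of 1000000, so the run stops at the maximum before the estimate reaches the requested precision.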

Section 5: Frequently Asked Questions

1) How does CRaVe handle highly correlated genotypes?

Correlation only poses a problem for statistics that include terms from the inverse of the correlation matrix s. When s is non-invertible, we invert s+cI, where I is the identity matrix and c is a small value.
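
The effect of adding cI can be seen on a small example. The Python sketch below takes the correlation matrix of two perfectly correlated SNPs, which is singular, and shows that s+cI can be inverted. The value c = 0.01 and the helper names are assumptions for illustration; the constant CRaVe actually uses is not stated here.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as nested lists; raise if it is singular."""
    (m00, m01), (m10, m11) = m
    det = m00 * m11 - m01 * m10
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [[m11 / det, -m01 / det], [-m10 / det, m00 / det]]

def add_ridge(m, c):
    """Return m + c*I for a square matrix m given as nested lists."""
    n = len(m)
    return [[m[i][j] + (c if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# Two perfectly correlated SNPs give a singular correlation matrix.
s = [[1.0, 1.0], [1.0, 1.0]]
c = 0.01  # hypothetical small constant
inverse = invert_2x2(add_ridge(s, c))
print(inverse)
```

Calling invert_2x2(s) directly raises an error because the determinant is zero, whereas the shifted matrix has determinant (1+c)^2 - 1 > 0 for any c > 0.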

2) Can I define my own test statistic?

Yes. A user can define a test statistic as an R function or a C function. For those accustomed to R, CRaVe greatly speeds up the calculation of the permuted p-value. For more information see user-defined.

Updated: 6/3/2013