
CRaVe Download and Instructions

CRaVe 

Section 0:

To download a beta-version of CRaVe 0.0.2, please choose

 

 UNIX:CRaVe
 
 R-library (Windows 7):CRaVe
 
 R-library (UNIX):CRaVe

 

To install CRaVe, follow these simple steps: setup

Our goal is to make this software as user-friendly as possible, and we therefore encourage users to provide feedback. Because this is a beta version, we expect some bugs; please email us at joshua.sampson@nih.gov if you encounter any problems. We also encourage you to let us know if we omitted any options, features, or methods that could make CRaVe more useful.

Section 1: Introduction

CRaVe is a free, open-source software package designed to perform a range of association tests between sets of SNPs and a phenotype. The basic input for CRaVe includes three types of information:

1) Genotypes for a set of individuals at a list of SNPs. 

 

         Subject 1   Subject 2   Subject 3
SNP 1    AT          CC          AC
SNP 2    AA          CC          AA
SNP 3    AA          CG          AC
...

 

2) The Gene or Group ID for each SNP 

 

         Gene ID
SNP 1    g1
SNP 2    g1
SNP 3    g1
...
SNP 41   g2
SNP 42   g2
...

 

3) Phenotypes for a set of individuals 

 

            Phenotype
Subject 1   1
Subject 2   1
Subject 3   0

 

The basic output of CRaVe is

1) One or more p-values for each gene. The p-values assess the significance of the relationship between the gene and the phenotype.

 

Gene   Sum-Test   Hotelling's T-Test
g1     0.03       0.05
g2     0.67       0.82
g3     0.21       0.63

Section 2: File Formats

Here, we discuss the actual file formats needed by CRaVe.

1) Genotypes for a set of individuals at a list of SNPs. 

The genotype file can be either in Variant Call Format (file names usually end in .vcf) or in transposed PLINK ped format (file names usually end in .tped). Details about each type of file can be found at .vcf and .tped and examples can be found at VCF and TPED.

2) Gene or Group ID for each SNP 

Both .tped and .vcf files include the chromosome and position for each SNP. CRaVe is installed with the hg19/NCBI build 37 database. The database is partitioned by chromosome into files named chr_xx.txt.gz, where xx is the chromosome number. These files have the specific format defined by the gene_file option (see link for key details). The gene_file option refers to the directory where the chr_xx.txt.gz files are located. If, say, the hg18/build 36 coordinates are needed, then a user can download them from here, use the command "tar -xzvf hg18.tar.gz" to unpack the files, and then use the gene_file option when running CRaVe. For instance, if the files for hg18 have been downloaded into the folder ./mydata/someFolder3 and unpacked, then an example run might look like:

./crave geno_file=someFile.vcf gene_file=./mydata/someFolder3/hg18 pheno_file=someFile.txt tests=RO,SUM,TH

Note that the folder hg18 gets automatically created when un-packing the file hg18.tar.gz. A user can also specify their own categorization in their own files provided their files are named chr_xx.txt.gz, where xx is the chromosome number, and they all have the format as described in the gene_file option. An (uncompressed) example file can be found here geneFile.

3) Phenotypes for a set of individuals 

This file should consist of at least 2 columns. By default, CRaVe assumes column 1 is for the subject IDs and column 2 is for the response.

Column 1: Subject ID

When the genotype file is a .tped file, the subjects listed in the phenotype file must match their order in the genotype file. When the genotype file is a .vcf file, their order is inconsequential but the subject IDs must match those in the .vcf file.

Column 2: Phenotype Value.

For case/control status, use 0/1 values so that CRaVe recognizes the data as a case/control study. Different case/control values can be specified by using the case_value and control_value options.

This file can also contain additional columns containing covariate values for an individual. To adjust for covariates, one must use the covars option. An example of a phenotype file can be found here pheno.
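
As a rough illustration of the layout described above, the following Python sketch writes a small tab-delimited phenotype file with a subject ID column, a 0/1 case/control column, and one covariate column. The file name, subject IDs, column names, and the age covariate are made up for the example; only the column order, the 0/1 coding, and the '.' missing-value code are taken from the defaults described in this document.

```python
import csv
import os
import tempfile

# Hypothetical subjects: (subject ID, case/control status, age covariate).
# None marks a missing value, written as '.' (the documented pheno_miss default).
rows = [
    ("Subject1", 1, 54),
    ("Subject2", 1, None),
    ("Subject3", 0, 61),
]

def write_pheno_file(path, rows, missing="."):
    """Write a tab-delimited phenotype file: ID, phenotype, then covariates."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["ID", "PHENO", "AGE"])  # header row with column names
        for row in rows:
            writer.writerow([missing if value is None else value for value in row])

# Write to a temporary directory so the example leaves no files behind.
pheno_path = os.path.join(tempfile.mkdtemp(), "pheno_example.txt")
write_pheno_file(pheno_path, rows)

with open(pheno_path) as handle:
    pheno_lines = handle.read().splitlines()
print("\n".join(pheno_lines))
```

With a file like this, the covariate could be requested by column number (covars=3) or, since the file has a header row, by name (covars=AGE).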

4) Output files 

One output file is specified with the out_gene option and has the following columns:

 

Column 1    Gene ID
Column 2    Chromosome
Column 3    Gene start position
Column 4    Gene end position
Column 5    Number of SNPs
Column 6    Permutation p-value for the first requested test
Column 7+   Permutation p-value for additional tests
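
The addcols option (see Section 4) adds permutation counts to this file so that a large run can be split into smaller jobs. The Python sketch below shows one way such counts might be pooled afterwards. It assumes that the permutation p-value is simply the number of permuted statistics exceeding the observed one divided by the number of permutations, that each job used a fixed number of permutations, and that each job used a different seed. These are assumptions, not documented behaviour, and the function name and the counts are hypothetical.

```python
def pooled_pvalue(jobs):
    """Pool (n_permutations, n_exceedances) pairs from independent jobs."""
    total_perms = sum(n_perm for n_perm, _ in jobs)
    total_exceed = sum(n_exceed for _, n_exceed in jobs)
    if total_perms == 0:
        raise ValueError("no permutations were run")
    return total_exceed / total_perms

# Three hypothetical jobs for the same gene and the same test.
jobs = [(100000, 12), (100000, 9), (50000, 4)]
p_pooled = pooled_pvalue(jobs)
print(f"pooled p-value: {p_pooled:.6f}")
```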

 

Another output file is specified with the out_snp option that lists results for individual SNPs:

 

Column 1    Chromosome
Column 2    Gene ID
Column 3    SNP position
Column 4    Minor allele frequency
Column 5    Number of subjects with non-missing genotypes
Column 6    Effect size of the SNP
Column 7    Delta value of the SNP
Column 8    P-value (2-sided using delta as the z-test statistic)

 

Examples of both output files can be found at gene-file and SNP-file.
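
The relationship between columns 7 and 8 of the SNP output can be sketched in Python as follows, assuming that "2-sided using delta as the z-test statistic" means the usual two-sided tail probability of a standard normal distribution. The function name and the example delta values are hypothetical.

```python
import math

def two_sided_p(delta):
    """Two-sided p-value treating delta as a standard normal z statistic."""
    # P(|Z| >= |delta|) = erfc(|delta| / sqrt(2)) for Z ~ N(0, 1).
    return math.erfc(abs(delta) / math.sqrt(2.0))

for delta in (0.0, 1.96, -2.58):
    print(f"delta = {delta:+.2f}  p = {two_sided_p(delta):.4f}")
```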

5) Bioinformatic Weight 

Prespecified weights can be assigned to each SNP using the weight_file option. These weights should be listed in a file containing one or two columns. Alternatively, a weight of 0 or 1 can be assigned to codons, exons or non-synonymous SNPs using the weight= option.

Section 3: Statistical Tests

CRaVe can perform a range of association tests between SNPs and a phenotype. Currently, the possible tests include:

 

ALL     All non-user defined tests
BM      Bounded Minimum Test
CMC     Combined Multivariate and Collapsing Test
DAS     Data Adaptive Sum Test
FI      Fisher's Test
HOI     Hotelling's T2 Test assuming independence
HOT     Hotelling's T2 Test
MDF     Multiple Degrees of Freedom Test
MDP     Multiple Degrees of Freedom Test using positive deltas
RO      ROVER
RV      ROVER-V test that includes the variances in the weights
STO     Stouffer's Z-score Test
STP     Stouffer's Positive Z-score Test
SUM     Sum Test
TH      Threshold Test
SKAT    Alias for MDF [Under standard conditions, SKAT [PMID: 21737059] is equivalent to MDF]
SKATO   Maximum of linear combinations of MDF and SUM statistics [similar to SKAT-O [PMID: 22863193]]
CAL     Alias for C-alpha [Under standard conditions, the C-alpha statistic [PMID: 21408211] is equivalent to MDF]

 

Some tests are modified versions of those described by their original authors. The user needs to list the desired tests in the command line with the option tests=. Without including any tests, CRaVe defaults to calculating HOI. Some tests run much faster than others because they never require performing a matrix inversion. These faster tests are: FI, HOI, RO, STP and STO.
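
For the ROVER test, the rover_t option (see Section 4) defines SNP weights as w = sqrt(1-exp(-t*delta*delta)). The Python sketch below only evaluates that formula for a few hypothetical delta values; how CRaVe combines the weights into the test statistic is not shown here, and the function name is an assumption made for the example.

```python
import math

def rover_weight(delta, t=0.2):
    """Weight w = sqrt(1 - exp(-t * delta^2)); t=0.2 is the documented default."""
    if t <= 0:
        raise ValueError("t must be positive")
    return math.sqrt(1.0 - math.exp(-t * delta * delta))

# The weight is 0 at delta = 0 and approaches 1 as |delta| grows.
for delta in (0.0, 1.0, 2.0, 4.0):
    print(f"delta = {delta:.1f}  w = {rover_weight(delta):.3f}")
```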

Section 4: Command Line

The simplest command, when the user has listed the SNPs by their hg19 positions, only requests ROVER, and does not use bioinformatic weights would be

./crave geno_file=someFile.vcf pheno_file=someFile.txt

However, in general, the user will want to specify multiple options. Most options are specified by their name, followed by an "=" sign, followed by the desired value. Below are a few examples.

./crave geno_file=someFile.vcf pheno_file=someFile.txt tests=RO,SUM,TH
./crave geno_file=someFile.vcf pheno_file=someFile.txt max_miss_rate=0.5
./crave geno_file=someFile.vcf pheno_file=someFile.txt exclude_snp=/temp/remove_SNPs.txt
./crave geno_file=someFile.vcf pheno_file=someFile.txt include_chr=1-5,22

If crave_001.tar.gz was installed in folder /someFolder1 and the genotype and phenotype files are in folder /someFolder2, then the command would look similar to

/someFolder1/crave geno_file=/someFolder2/someFile.vcf pheno_file=/someFolder2/someFile.txt
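
When CRaVe is launched from a script, the name=value convention above can be assembled programmatically. The following Python sketch builds an argument list in that form; the helper name build_crave_command is hypothetical, and the paths are the placeholder paths used in this section.

```python
def build_crave_command(executable, options=None, flags=()):
    """Return an argument list: executable, name=value pairs, then --flags."""
    args = [executable]
    for name, value in (options or {}).items():
        if isinstance(value, (list, tuple)):
            # Lists become comma-separated values, e.g. tests=RO,SUM,TH.
            value = ",".join(str(item) for item in value)
        args.append(f"{name}={value}")
    args.extend(f"--{flag}" for flag in flags)
    return args

cmd = build_crave_command(
    "/someFolder1/crave",
    {
        "geno_file": "/someFolder2/someFile.vcf",
        "pheno_file": "/someFolder2/someFile.txt",
        "tests": ["RO", "SUM", "TH"],
    },
)
print(" ".join(cmd))
```

On a machine where CRaVe is installed at that path, the resulting list could be passed to subprocess.run(cmd, check=True).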

The list of options includes:

 

adapt_maxperm={integer}: It is generally inefficient to run the maximum number of permutations for each gene. For genes with relatively high p-values, only a small set of permutations is needed. CRaVe can stop running permutations once it judges the p-values to be sufficiently accurate. When the number of permutations depends on the p-value, we refer to them as adaptive permutations. This option defines the maximum number of adaptive permutations. Set to 0 for non-adaptive permutations. The default value is 1000000.
adapt_tol={value}: Stopping tolerance for adaptive permutations. Smaller values guarantee more precise estimates. The algorithm stops if 3.92*sqrt(pval*(1.0-pval)/N)/pval < adapt_tol, where N is the total number of permutations that have been run so far. The default is 0.1.
addcols={0,1,2}: Option to include additional columns in the out_gene file. If set to 1, then the number of permutations for each test will be included. If set to 2, then the number of permutations and the number of times the permuted test statistic was greater than the observed one are included. These columns are useful for breaking up a large number of permutations into smaller jobs on a cluster. The default value is 0.
allele_miss={character}: Missing value for alleles in a TPED file. The default is 0.
bm_n={integer}: Number of points to consider in the Bounded Minimum Test. The default is 10.
case_value={integer}: Value representing cases if response_col is for a case-control study. The default value is 1.
chr_alias_file={file}: Tab-delimited text file with the chromosome number (1-26) in the first column and the chromosome string in the second column. This option is only used with VCF genotype files when the chromosome fields are not labeled as integers. An example file is chr_alias-file.
control_value={integer}: Value representing controls if response_col is for a case-control study. The default value is 0.
covars={list}: Comma-separated list of column numbers or variable names in the phenotype file to use as covariates in the analysis. Example: covars=4,7,9 will use columns 4, 7 and 9 in the phenotype file as covariates. Values of 1 or 2 are generally not used because those columns are generally reserved for the subject ID and the outcome variable. The default is to not use covariates.
das_alpha={value}: A number between 0 and 1 for the alpha parameter in the Data Adaptive Sum Test. The default is 0.05.
exclude_chr={list}: List of chromosomes to exclude from the analysis. Examples: 1,4,6 1-11,22, 1,3,5-10,X
exclude_gene={file}: File containing 1 column of the genes to exclude from the analysis.
exclude_snp={file}: File containing 1 column of the SNPs to exclude from the analysis.
exclude_sub={file}: File containing 1 column of the subject IDs to exclude from the analysis.
gene_file={file}: Directory containing the gene information files partitioned by chromosome. The default value is hg19. More details at gene_file [Note that geno_file and gene_file define two different types of files.]
gene_kb={integer}: Number of kilobases upstream and downstream to expand the gene regions from the gene tables. The default is 10.
geno_file={file}: Uncompressed VCF or TPED file. No default value, must be specified. [Note that geno_file and gene_file define two different types of files.]
geno_sep={character}: File delimiter for the genotype file. Use 't' for tab-delimited, ' ' for space-delimited, and ',' for comma-delimited files. The default is determined from the file.
id={string}: Variable name or column number in pheno_file for the subject IDs. The default value is 1.
include_chr={list}: List of chromosomes to include in the analysis. Examples: 1,4,6 1-11,22, 1,3,5-10,X The default value is 1-26.
include_gene={file}: File containing 1 column of the genes to include in the analysis.
include_snp={file}: File containing 1 column of the SNPs to include in the analysis.
include_sub={file}: File containing 1 column of the subject IDs to include in the analysis.
max_maf={value}: Maximum MAF for a SNP to be included in the analysis. Any SNP with MAF > max_maf will be ignored. The default value is 0.5.
max_miss_rate={value}: Maximum missing rate for any SNP to be included in the analysis. The default value is 0.8.
min_maf={value}: Minimum MAF for a SNP to be included in the analysis. Any SNP with MAF < min_maf will be ignored. The default value is 0.
nperm={integer}: For adaptive permutations, nperm is the number of permutations (per round) performed before checking the stopping criteria (see option adapt_maxperm for the definition of adaptive permutations). For non-adaptive permutations, nperm is the total number of permutations. The default value is 500.
out_gene={file}: Output file for the genes. The default value is ./out_gene.txt.
out_snp={file}: Output file for the SNPs. The default value is ./out_snp.txt.
pheno_file={file}: Uncompressed file containing the response and covariates. No default value, must be specified.
pheno_header={0 or 1}: Set to 1 if row 1 of pheno_file contains column names. The default value is determined from the file.
pheno_miss={character}: Missing value for the phenotype file. The default value is '.'
pheno_sep={character}: One-character value for the phenotype file delimiter. Use 't' for tab-delimited, ' ' for space-delimited, and ',' for comma-delimited files. The default value is determined from the file.
print={integer}: The larger the integer, the more information will be written to the console. This information is useful for viewing summary information and the genes that have been processed, and is critical for debugging. The default is 0 (no printing).
reflink.geneName={string}: Name of the variable in the INFO field of the VCF file that gives the gene name. The default value is reflink.geneName.
response={string}: Variable name or column number in pheno_file for the response. The default value is 2.
rover_t={value}: Positive number t to define the weights (w = sqrt(1-exp(-t*delta*delta))) in the ROVER test. The default is 0.2.
seed={integer}: Initial seed for the random permutations. The default is the returned value from the time() function.
sortedByLoc={0 or 1}: Set to 0 if the genotype data is not sorted by SNP location within each chromosome. The default is 1.
store_perms={0 or 1}: Set to 1 to store the permutations for efficiency. Due to memory issues, this option should only be used for a small sample size and a small number of permutations. The default is 0.
strata={string}: strata can be set to the name of a variable in the phenotype file. When set, the outcome is permuted among individuals with the same value of the strata variable and the named variable is automatically included as a covariate.
tests={List of strings}: Comma-delimited list of tests to use. The valid tests are: BM, CMC, DAS, FI, HOI, HOT, MDF, MDP, RO, RV, STO, STP, SUM, TH. The strings are not case-sensitive. Example: tests=sto,bm,sum will compute the Stouffer, BM and SUM tests. Example: tests=c1,th will compute the user defined test c1 and the threshold test. Use tests=all to compute all non-user defined tests. The default is MDF.
threshold_alpha={value}: A number between 0 and 1 for the alpha parameter in the Threshold Test. The default is 0.05.
update_cor={value}: The value such that the correlation matrix will be updated if the relative difference in the observed and permuted Hotelling test statistics is less than update_cor. This option is only valid for case-control data and when the option --useAllSubsForV is not used. The default is 0.3.
vs={value}: The value of vs, where vs abbreviates variance stabilization, is the value added to the estimated variance of each SNP. The default is 0.005.
weight={c, e, or n}: weight=e assigns a weight of 1 to all exomic SNPs and 0 to other SNPs. weight=g assigns a weight of 1 to SNPs between the start and stop codon and 0 to other SNPs. weight=n assigns a weight of 1 to non-synonymous exomic SNPs and 0 to other SNPs. The default is that a weight of 1 is assigned to all SNPs. This option cannot be used with the weight_file= option, and weight=n is only valid for hg18 and hg19.
weight_file={file}: Uncompressed file containing the SNP weights. This file may have 1 or 2 columns only. For a 1-column file, the column of weights must match the order of the SNPs in the genotype file. For a 2-column file, the first column is the SNP ID and the second column is the weight. This option cannot be used with the weight= option.
weight_header={0 or 1}: Set to 1 if row 1 of the weight file contains column names. The default value is determined from the file.
weight_sep={character}: One-character value for the weight file delimiter.
--help: Display this information and exit.
--notGroupedByChr: Use if the genotype file is a TPED file and it is not grouped by chromosome.
--no_gene_file: Do not use the default gene table file for the gene names. Instead use the reflink.geneName value in the VCF file. This option is only valid if geno_file is a VCF file.
--version: Display the version number and exit.
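
The stopping rule given for adapt_tol can be written out as a short Python check. Since 3.92 is twice 1.96, the left-hand side is the full width of an approximate 95% normal confidence interval for the p-value, divided by the p-value itself. In this sketch the p-value estimate is taken to be the exceedance count divided by the number of permutations, and a count of zero is treated as "keep permuting"; both choices are assumptions made for the example, as is the function name.

```python
import math

def should_stop(n_exceed, n_perm, adapt_tol=0.1):
    """Return True if 3.92*sqrt(p*(1-p)/N)/p < adapt_tol."""
    if n_perm <= 0 or n_exceed <= 0:
        return False  # estimate is 0 or undefined, so keep permuting
    pval = n_exceed / n_perm
    rel_width = 3.92 * math.sqrt(pval * (1.0 - pval) / n_perm) / pval
    return rel_width < adapt_tol

print(should_stop(1000, 2000))      # estimated p = 0.5
print(should_stop(5, 500))          # estimated p = 0.01, few permutations
print(should_stop(2000, 200000))    # estimated p = 0.01, many permutations
```

The example illustrates why genes with large p-values need far fewer permutations than genes with small p-values to reach the same relative precision.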

!!!!! IMPORTANT !!!!! For p-values below 10^-5, we strongly suggest increasing the maximum number of permutations.
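
To get a feel for how many permutations a small p-value calls for, the adapt_tol stopping rule 3.92*sqrt(pval*(1.0-pval)/N)/pval < adapt_tol can be rearranged to N > 3.92^2*(1-pval)/(pval*adapt_tol^2). The Python sketch below evaluates that bound. It is only a back-of-the-envelope aid under the assumption that the true p-value is known, and the function name is hypothetical.

```python
import math

def permutations_needed(pval, adapt_tol=0.1):
    """Smallest integer N with 3.92*sqrt(p*(1-p)/N)/p < adapt_tol."""
    if not 0.0 < pval < 1.0:
        raise ValueError("pval must be strictly between 0 and 1")
    bound = 3.92 ** 2 * (1.0 - pval) / (pval * adapt_tol ** 2)
    return math.floor(bound) + 1  # strict inequality: N must exceed the bound

for p in (1e-3, 1e-5, 1e-6):
    print(f"p = {p:g}: about {permutations_needed(p):,} permutations")
```

At the default adapt_tol of 0.1, a p-value near 10^-5 would need on the order of 10^8 permutations, well above the default adapt_maxperm of 1000000.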

Section 5: Frequently Asked Questions

1) How does CRaVe handle highly correlated genotypes? 

Correlation only poses a problem for statistics that include terms from the inverse of the correlation matrix s. When s is non-invertible, we invert s+cI, where I is the identity matrix and c is a small value.
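
The idea of inverting s+cI instead of s can be illustrated with a two-SNP example in plain Python. The value c = 0.01 and the helper names are hypothetical; the value CRaVe actually uses for c, and how it detects a non-invertible s, are not specified here.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix given as nested lists; raise if singular."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0:
        raise ZeroDivisionError("matrix is singular")
    return [
        [m[1][1] / det, -m[0][1] / det],
        [-m[1][0] / det, m[0][0] / det],
    ]

def ridge_inverse(s, c=0.01):
    """Invert s + c*I, where c is a small (here hypothetical) constant."""
    regularized = [[s[0][0] + c, s[0][1]], [s[1][0], s[1][1] + c]]
    return invert_2x2(regularized)

# Two perfectly correlated SNPs: s itself has determinant 0.
s = [[1.0, 1.0], [1.0, 1.0]]
inv = ridge_inverse(s)
print(inv)
```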

2) Can I define my own test statistic? 

Yes. A user can define a test statistic as an R function or a C function. For those accustomed to R, CRaVe greatly speeds up the calculation of the permuted p-value. For more information see user-defined.

Updated: 6/3/2013