Discovering the causes of cancer and the means of prevention

Joshua Sampson, Ph.D.

Senior Investigator

Information for Journalists

To request an interview with a DCEG investigator, contact the NCI Office of Media Relations:


Phone: 240-760-6600

Spotlight on Investigator

Read Joshua Sampson's profile in the March 2013 Linkage newsletter.

Organization: National Cancer Institute
Division of Cancer Epidemiology & Genetics, Biostatistics Branch
Address: NCI Shady Grove
Room 7E594


Joshua Sampson received a B.A. in mathematics and chemistry from Pomona College and an M.S. in biophysical chemistry from Stanford University. He then received his Ph.D. from the Biostatistics Department at the University of Washington in 2007 and spent two years as a postdoctoral fellow in the Yale School of Medicine. He joined the NCI in 2009 as a principal investigator, and was awarded scientific tenure and promoted to senior investigator in December 2016.

Research Interests

Genome-wide Association Studies and Genome-wide Linkage Studies

In genome-wide association studies (GWAS), we survey the genome for single nucleotide polymorphisms (SNPs) and/or copy number variations (CNVs) associated with a phenotype. Traditionally, the first step in such studies is to call, or assign a genotype to, each SNP in each subject based on a statistic summarizing fluorescent measurements. This facilitates simple analyses, such as comparing the percentages of the 'A' allele in the cases and controls through a chi-squared test. Unfortunately, genotypes are often assigned incorrectly. These mistakes can lead to erroneous conclusions about our data, and are partially responsible for the high false positive rates. I have been interested in exploring the origin and the effects of genotyping error. Moreover, I am interested in developing new methods for examining the significance of a SNP that are based on the underlying fluorescent data and can skip the genotyping step.
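
As a minimal illustration of the allele-count comparison mentioned above, the sketch below computes a Pearson chi-squared statistic (one degree of freedom, no continuity correction) for a 2x2 table of allele counts in cases and controls. The function name `allele_chi2` and the example counts are hypothetical and are not taken from any study described here.

```python
import math


def allele_chi2(case_a, case_other, ctrl_a, ctrl_other):
    """Chi-squared test (1 df) comparing 'A' allele counts in cases vs. controls.

    Arguments are allele counts, not subject counts. Returns (statistic, p_value).
    """
    n = case_a + case_other + ctrl_a + ctrl_other
    row_case = case_a + case_other
    row_ctrl = ctrl_a + ctrl_other
    col_a = case_a + ctrl_a
    col_other = case_other + ctrl_other
    denom = row_case * row_ctrl * col_a * col_other
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    # Shortcut formula for the Pearson statistic of a 2x2 table.
    stat = n * (case_a * ctrl_other - case_other * ctrl_a) ** 2 / denom
    # Upper tail of a chi-squared variable with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 60 of 100 case alleles are 'A', 40 of 100 control alleles are 'A'.
stat, p_value = allele_chi2(60, 40, 40, 60)
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```

A test of this kind takes the called genotypes as given, so any miscalled genotypes feed directly into the counts and the resulting p-value.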

In genetical genomics studies, gene expression levels are treated as phenotypes. Therefore, these studies measure thousands of phenotypes simultaneously. By combining information from multiple phenotypes, we can increase our power to detect linked quantitative trait loci (eQTL) and more accurately estimate their locations. However, before performing joint mapping, we need to first identify coregulated expression levels. I have developed a method for identifying coregulated genotypes based on the correlation coefficient between linkage profiles. We are currently interested in studying the behavior of this LOD score correlation coefficient.
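
To make the idea of a correlation between linkage profiles concrete, the sketch below computes a plain Pearson correlation between two LOD score profiles evaluated at the same marker positions. This is only an assumption about the form such a coefficient might take, not the method referred to above; the function name `lod_profile_correlation` and the profile values are hypothetical.

```python
import math


def lod_profile_correlation(profile_a, profile_b):
    """Pearson correlation between two LOD profiles on a shared marker grid."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must be evaluated at the same markers")
    n = len(profile_a)
    mean_a = sum(profile_a) / n
    mean_b = sum(profile_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(profile_a, profile_b))
    var_a = sum((a - mean_a) ** 2 for a in profile_a)
    var_b = sum((b - mean_b) ** 2 for b in profile_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical LOD scores for two expression traits at four markers.
trait_1 = [0.1, 2.5, 4.0, 1.2]
trait_2 = [0.3, 2.1, 3.6, 0.9]
print(round(lod_profile_correlation(trait_1, trait_2), 3))
```

Two traits whose profiles peak at the same locus give a coefficient near 1, which is the kind of signal that could flag them as candidates for joint mapping.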

Ancestry Informative Markers

With the goal of understanding worldwide genetic diversity, the Human Genome Diversity Project (HGDP) has now genotyped nearly 1,000 individuals across 52 populations at more than 300,000 SNPs. Using the resulting information, it is possible to accurately predict an individual's ethnicity or ancestry from a genetic sample. In fact, it is possible to accurately predict ethnicity using only a small subset of the SNPs originally genotyped. Because of its potential use in forensic applications and in identifying population substructure in future genetic studies, our goal is to select a minimal number of SNPs with maximal predictive accuracy. This goal requires separating truly informative SNPs from those that, by random chance, appear to be informative, and can be framed as a standard variable selection problem. Therefore, following a greedy algorithm, SNPs can be selected to minimize the expected error rate. My interest is in estimating the error rate for a given group of SNPs in this high dimensional problem.
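
The greedy selection described above can be sketched as forward selection: at each step, add the SNP that most reduces an estimated error rate. The sketch below assumes genotypes coded 0/1/2, a nearest-centroid classifier, and a leave-one-out error estimate. All three are illustrative choices rather than the approach used in this work, and the names `loo_centroid_error` and `greedy_select` are hypothetical.

```python
def loo_centroid_error(X, y, snps):
    """Leave-one-out error of a nearest-centroid rule restricted to `snps`."""
    n = len(X)
    errors = 0
    for i in range(n):
        sums, counts = {}, {}
        # Build per-population centroids from every subject except i.
        for j in range(n):
            if j == i:
                continue
            s = sums.setdefault(y[j], [0.0] * len(snps))
            for t, snp in enumerate(snps):
                s[t] += X[j][snp]
            counts[y[j]] = counts.get(y[j], 0) + 1

        def sq_dist(label):
            return sum(
                (X[i][snp] - sums[label][t] / counts[label]) ** 2
                for t, snp in enumerate(snps)
            )

        predicted = min(sums, key=sq_dist)
        errors += predicted != y[i]
    return errors / n


def greedy_select(X, y, k):
    """Forward selection: repeatedly add the SNP giving the lowest estimated error."""
    chosen = []
    remaining = list(range(len(X[0])))
    for _ in range(k):
        best = min(remaining, key=lambda s: loo_centroid_error(X, y, chosen + [s]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical genotypes (rows: subjects, columns: SNPs) and population labels.
X = [[0, 0, 1], [1, 0, 1], [2, 0, 0], [0, 2, 1], [1, 2, 0], [2, 2, 1]]
y = ["A", "A", "A", "B", "B", "B"]
print(greedy_select(X, y, 1))
```

Because the same data drive both the selection and the error estimate, the error reported for the chosen SNPs tends to be optimistic, which is one reason estimating the error rate for a selected group of SNPs is not straightforward.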

Estimating Prediction Error

Genotypes are often used to predict a characteristic of an individual. One example is the aforementioned goal of predicting ancestry. Other examples are predicting whether an individual has a specific illness or will react to a specific treatment. There are numerous methods for developing prediction rules and then assessing their accuracy. As there is no optimal method for estimating error that works for all of these prediction rules, I am interested in better understanding specific features of these methods and when they perform well. Additionally, I am interested in developing new methods for assessing prediction error in the case of high dimensional data and/or low event rates.
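
One widely used way to assess the accuracy of a prediction rule is k-fold cross-validation, sketched below. It is shown only as an example of the kind of estimator discussed above, not as a recommended or optimal choice. The helper names (`k_fold_error`, `fit_majority`) and the toy labels are hypothetical, and folds are assigned deterministically by index to keep the example reproducible.

```python
from collections import Counter


def k_fold_error(X, y, fit, k):
    """Estimate misclassification error by k-fold cross-validation.

    `fit(train_X, train_y)` must return a function mapping one sample to a label.
    Subject i is placed in fold i % k.
    """
    n = len(X)
    errors = 0
    for fold in range(k):
        train_idx = [i for i in range(n) if i % k != fold]
        test_idx = [i for i in range(n) if i % k == fold]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        errors += sum(predict(X[i]) != y[i] for i in test_idx)
    return errors / n


def fit_majority(train_X, train_y):
    """Toy prediction rule: always predict the most common training label."""
    majority = Counter(train_y).most_common(1)[0][0]
    return lambda sample: majority


# Hypothetical data: 6 subjects with label "A", 3 with label "B".
X = [[g] for g in range(9)]
y = ["A"] * 6 + ["B"] * 3
print(k_fold_error(X, y, fit_majority, k=3))
```

With a low event rate, a rule like `fit_majority` can show a small error while never predicting the rare outcome, which illustrates why a single error figure may be a poor summary in that setting.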
