2021 Informatics Tool Challenge Winners Announced
by DCEG Staff
Four projects were funded through the 2021 DCEG Informatics Tool Challenge. Since its establishment in 2014, the competitive program has provided support for innovative approaches to epidemiological methods, data collection, analysis, and other research efforts using modern technology and informatics.
Microbiome taxonomy and human ancestry from shotgun sequence metagenomic analyses
Christian Abnet (MEB), Leonardo Mariño-Ramírez (NIMHD), Emily Vogtmann (MEB), Jianxin Shi (BB), Andres Gutierrez (MEB)
Population stratification is a major challenge in human microbiome studies—knowing the genetic ancestry of study participants would make it possible to address this important concern. Characterizing the human microbiome requires high-throughput sequencing to generate DNA sequences from the bacterial genomes. When sequencing samples from the mouth for a microbiome study, such as saliva or oral wash collected using mouthwash, a substantial fraction of the extracted DNA is from the human genome.
Although these human sequences cannot be used in microbiome analyses, this tool will establish a workflow that analyzes them to determine the genetic ancestry of the host. This work will be done in collaboration with Dr. Mariño-Ramírez from the National Institute on Minority Health and Health Disparities (NIMHD).
GWAStarget: A comprehensive resource and web tool for identification of target genes and pathways from genome-wide association study (GWAS) data
Charles Breeze and Sonja Berndt (OEEB), Sue Pan and Mei Liu (CBIIT)
GWAStarget is a comprehensive resource and web tool for identification of target genes and pathways from genome-wide association study (GWAS) data. By cataloguing target genes for more than 37 million variants in the human genome, in the context of epigenomic mapping information and chromatin conformation capture data, it will give researchers insight into the pathways underlying GWAS associations. This tool can save substantial time for researchers performing systems-level analyses and experiments aimed at uncovering key disease-associated mechanisms.
ICDgenie
Shu-Hong Lin (ITEB), Mustapha Abubakar (ITEB), Sairah Khan (ITEB), Montserrat Garcia-Closas (TDRP), Mitchell Machiela (ITEB)
Accurate histological classification is important for facilitating studies of cancer epidemiology and etiologic heterogeneity. ICDgenie is a web-based tool that helps epidemiologists, pathologists, research assistants, and data scientists more easily access, translate, and validate codes and text descriptions from the International Classification of Diseases, 10th Revision (ICD-10) and the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3). By improving accessibility and making existing cancer classification and coding schemes more readily understandable and searchable, ICDgenie will help accelerate descriptive and molecular epidemiological studies of cancer.
Use of a Common Data Model to standardize data query and analysis across multi-omics platforms
Maryam Rafati (CGB), Shahinaz Gadalla (CGB), Sharon Savage (CGB), Lisa McReynolds (CGB), Jonas De Almeida (OD), Bin Zhu (CGR), Vojtech Huser (NLM)
The structure of data collection and storage is highly dependent on the investigator creating the database and is usually based on the aims or specific needs of a project. Over time, heterogeneously structured downstream omics data effectively “pile up,” and their diverse formats make them more difficult to retrieve, merge, analyze, or pool with data from other projects. This tool explores converting clinical, genomic, and molecular data into a common data model, harmonizing them to a single standard so that multiple data sources can be accessed and analyzed concurrently. The model is platform-independent and supports querying different databases simultaneously, giving researchers a holistic view of the omics findings in each sample set.
Done in collaboration with the National Library of Medicine, this effort will provide the opportunity to test the usability of a common data model in data retrieval across projects and explore the development of Extract-Transform-Load (ETL) code for available genomic and molecular data and associated clinical data.
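The Extract-Transform-Load idea described above can be illustrated with a minimal Python sketch. It is not the project's code: the `CommonRecord` schema, the two source table layouts, and every field name and value below are hypothetical, chosen only to show how differently structured omics tables might be mapped into one shape and then queried together.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class CommonRecord:
    """One row in a hypothetical common data model (illustrative schema only)."""
    sample_id: str
    assay: str
    feature: str
    value: float


Row = Dict[str, str]


def transform_expression(row: Row) -> CommonRecord:
    # Hypothetical source A: columns "SampleID", "Gene", "TPM".
    return CommonRecord(
        sample_id=row["SampleID"].strip().upper(),
        assay="expression",
        feature=row["Gene"],
        value=float(row["TPM"]),
    )


def transform_methylation(row: Row) -> CommonRecord:
    # Hypothetical source B: columns "sample", "probe", "beta".
    return CommonRecord(
        sample_id=row["sample"].strip().upper(),
        assay="methylation",
        feature=row["probe"],
        value=float(row["beta"]),
    )


def etl(
    sources: Iterable[Tuple[List[Row], Callable[[Row], CommonRecord]]]
) -> List[CommonRecord]:
    """Extract rows from each source, transform them, and load them into one list."""
    records: List[CommonRecord] = []
    for rows, transform in sources:
        records.extend(transform(row) for row in rows)
    return records


def query_by_sample(records: List[CommonRecord], sample_id: str) -> List[CommonRecord]:
    """Return every harmonized record for one sample, across all assays."""
    wanted = sample_id.strip().upper()
    return [r for r in records if r.sample_id == wanted]


# Made-up example rows; the identifiers and values are illustrative only.
expression_rows = [{"SampleID": "s001 ", "Gene": "TP53", "TPM": "12.5"}]
methylation_rows = [
    {"sample": "S001", "probe": "cg00000029", "beta": "0.43"},
    {"sample": "S002", "probe": "cg00000029", "beta": "0.91"},
]

harmonized = etl([
    (expression_rows, transform_expression),
    (methylation_rows, transform_methylation),
])

for record in query_by_sample(harmonized, "S001"):
    print(record)
```

Once each source has its own small transform function, a single query can return expression and methylation results for the same sample without the caller needing to know how either source table was originally laid out.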