Matthew Weirauch, PhD, a computational biologist with Cincinnati Children’s Center for Autoimmune and Genomic Etiology and the Divisions of Biomedical Informatics and Developmental Biology, discusses here the informatics effort behind a new study published in Nature Genetics.
In the study, scientists at Cincinnati Children’s Hospital Medical Center report that the Epstein-Barr virus (EBV)—best known for causing mononucleosis and some forms of cancer—also might contribute to the risk of some people developing seven other major diseases. These include systemic lupus erythematosus, multiple sclerosis, rheumatoid arthritis, juvenile idiopathic arthritis, inflammatory bowel disease, celiac disease, and type 1 diabetes. (Read the press release.)
Combined, these seven diseases affect nearly 8 million people in the United States. The project was led by three scientists: John Harley, MD, PhD, Leah Kottyan, PhD and Matthew Weirauch, PhD. Critical contributions were also provided by Xiaoting Chen, PhD, and Mario Pujato, PhD.
We started this project about four years ago, and teamwork was hugely important throughout. It involved gathering massive sets of genetic data, then applying a novel computational method for discovering disease-driving mechanisms to analyze every genetic change that might affect the activity of particular viral proteins.
To support our data analysis, my team created two new informatics tools: Regulatory Element Locus Intersection (RELI), written by Xiaoting Chen, and Measurement of Allelic Ratios Informatics Operator (MARIO), written by Mario Pujato.
The major discovery we made is that up to half of the genetic loci associated with a set of seven autoimmune diseases are occupied by a viral protein, the Epstein-Barr virus EBNA2 protein, along with dozens of host (human) transcription factors. There are many studies implicating EBV in the disease process for several of these diseases, but the molecular mechanisms underlying these associations were unknown.
We have made our novel computer code available on the Weirauch Lab GitHub page (https://github.com/WeirauchLab), along with the study data and results. We think it’s an interesting approach that yielded important results not just for these seven diseases but for dozens of others, and those results could have specific implications for many of them. We are now contacting experts on the various diseases, sharing the results, and exploring possible future, more in-depth evaluations.
New Computational Methods
The study is based first on the results of applying the new tool RELI to huge compendiums of publicly available genome-wide association study (GWAS) and ChIP-seq data. We initially compiled and curated a set of 99,733 variants associated with or in strong linkage with 213 human phenotypes and diseases. We also collected a set of 2,511 functional genomics datasets (e.g., from chromatin immunoprecipitation followed by next-generation sequencing, ChIP-seq) from a variety of sources.
A genome-wide association study (GWAS) examines a genome-wide set of genetic variants in different individuals to see if any variant is associated with a given trait. ChIP-seq is a method used to analyze protein interactions with DNA. It combines chromatin immunoprecipitation with DNA sequencing to identify the binding sites of DNA-associated proteins such as transcription factors, which interact with the genome and regulate gene expression levels.
First, we used RELI to systematically estimate the significance of the relationship between the variants associated with a given phenotype and a given ChIP-seq dataset – this let us identify particular proteins occupying disease risk loci. The MARIO pipeline was then instrumental in demonstrating examples of a possible genetic (allele-dependent) role for the transcription factors that we found to be linked to the various diseases.
RELI (Regulatory Element Locus Intersection)
Regulatory Element Locus Intersection (RELI) is an algorithm for discovering transcription factors that bind a significant number of loci associated with a given disease or phenotype. The major data components are:
- An input set of disease or phenotype-associated genetic variants
- An internal “library” consisting of many ChIP-seq dataset peaks (in the form of genomic coordinates)
- An internal file containing information on genetic variant allele frequencies, etc.
The tool evaluates the significance of the intersection between plausibly disease-causing genetic variants and any gene regulatory protein, by overlapping their genomic locations or positions in the DNA strand. This allows the identification of potential key regulatory players in disease.
The intersection is compared to a background null distribution generated by shuffling input positions while maintaining the genetic structure of the probable disease-causing genetic variants, such as the number of variants within a linkage disequilibrium (LD) block and the LD block structure.
RELI is a powerful technique, general enough to measure the significance of intersections between plausibly disease-causing genetic variants and any presented genomic feature, such as chromatin marks, genetic promoters and enhancers, and more.
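The intersect-then-shuffle idea described above can be illustrated with a minimal Python sketch. This is not the RELI code itself, which is available from the lab's GitHub page. It assumes a single toy chromosome, uniform random relocation of each LD block, and an empirical p-value, and every function name and data value below is hypothetical. The real tool also draws on allele-frequency information, which this sketch ignores.

```python
import random
from statistics import mean


def in_any_peak(position, peaks):
    """Return True if a position falls inside any half-open (start, end) peak."""
    return any(start <= position < end for start, end in peaks)


def count_hit_loci(loci, peaks):
    """Count loci (LD blocks) that have at least one variant inside a peak."""
    return sum(any(in_any_peak(pos, peaks) for pos in block) for block in loci)


def shuffle_loci(loci, genome_length, rng):
    """Relocate each LD block at random, preserving the spacing of its variants."""
    shuffled = []
    for block in loci:
        first = min(block)
        span = max(block) - first
        new_first = rng.randrange(0, genome_length - span)
        shuffled.append([new_first + (pos - first) for pos in block])
    return shuffled


def intersection_test(loci, peaks, genome_length, n_iter=1000, seed=0):
    """Compare the observed overlap with a null built from shuffled loci."""
    rng = random.Random(seed)
    observed = count_hit_loci(loci, peaks)
    null_counts = [
        count_hit_loci(shuffle_loci(loci, genome_length, rng), peaks)
        for _ in range(n_iter)
    ]
    # Empirical p-value with a +1 correction so it is never exactly zero.
    exceed = sum(count >= observed for count in null_counts)
    p_value = (exceed + 1) / (n_iter + 1)
    return observed, mean(null_counts), p_value


# Hypothetical toy data: three LD blocks of variants and three ChIP-seq peaks.
toy_loci = [[100, 150], [5000], [9000, 9050, 9100]]
toy_peaks = [(90, 160), (4990, 5010), (9040, 9060)]
observed, null_mean, p_value = intersection_test(toy_loci, toy_peaks, 1_000_000)
print(observed, null_mean, p_value)
```

Counting each LD block at most once, rather than counting every overlapping variant, keeps a block with many linked variants from inflating the score. That is the reason the shuffle moves whole blocks instead of individual positions.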
MARIO (Measurement of Allelic Ratios Informatics Operator)
The MARIO (Measurement of Allelic Ratios Informatics Operator) pipeline was designed to identify allele-dependent behavior within a sequencing experiment at heterozygous positions identified through genotyping data.
The main goal of the MARIO pipeline is to gain insight into the molecular mechanisms of disease. It mainly exploits two complementary standard experimental procedures, both based on next-generation sequencing: chromatin immunoprecipitation (ChIP-seq) and RNA quantification (RNA-seq). ChIP-seq identifies the location, or address, in the genome where specific gene regulators (regulatory proteins such as transcription factors) position themselves on the DNA. RNA-seq measures how much RNA each gene produces as a result of that regulation.
Humans carry two copies, or alleles, of each gene. When the investigator knows exactly where in the DNA the two alleles differ (the subject’s genetic makeup), and one of the variants is potentially causal for a disease, the MARIO pipeline can quantify that variant’s effect on gene regulation. For example, MARIO might detect that gene regulator X is unable to recognize the potentially causal variant in the DNA (from ChIP-seq information), leading to an insufficiency in gene Y product (RNA-seq information), which might be required for a cell’s normal functioning.
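To make the idea of allele-dependent behavior concrete, here is a minimal Python sketch of one common way to test for allelic imbalance at a single heterozygous position: an exact two-sided binomial test on the reads carrying each allele. This is a stand-in for illustration, not MARIO's own scoring method, which is not described here. The function names and read counts are hypothetical, and the sketch assumes that without any allelic effect each allele would receive half of the reads.

```python
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k under Binomial(n, p).
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # small tolerance for float comparison
    return min(1.0, sum(x for x in pmf if x <= threshold))


def allelic_imbalance(ref_reads, alt_reads):
    """Return (reference-allele read fraction, p-value) for one position."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    ratio = ref_reads / total
    return ratio, binom_two_sided_p(ref_reads, total)


# Hypothetical ChIP-seq read counts at two heterozygous positions.
skewed_ratio, skewed_p = allelic_imbalance(45, 5)      # strongly favors one allele
balanced_ratio, balanced_p = allelic_imbalance(10, 10)  # no preference
print(skewed_ratio, skewed_p)
print(balanced_ratio, balanced_p)
```

A small p-value at a position suggests that the regulator binds one allele preferentially. In practice such calls would also need to account for sequencing depth, mapping bias toward the reference allele, and multiple testing across positions.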
In the future, as more experiments become available and as we apply the pipeline to new datasets, the MARIO pipeline could be instrumental in generating mechanistic knowledge of the impact of genetic predisposition to many diseases.