Researchers at Cincinnati Children’s Hospital Medical Center report in Genetics in Medicine two new computational methods to aid in computerized diagnosis of rare genetic diseases. Both methods are now freely available in a web application called GDDP, Genetic Disease Diagnosis based on Phenotypes.
Here, developer and lead study author Jing Chen, PhD, a bioinformatician in the Division of Biomedical Informatics, describes the work and shares next steps.
Precision medicine has long been a goal. Today, we can sequence an individual’s exome or even whole genome fairly easily. Numerous bioinformatics tools and commercial software have been developed to store and analyze these large-scale genomic sequence data.
Clinical phenotype (symptom or characteristic) data, on the other hand, are less quantitative and less utilized by computational methods. This study was designed as a proof of concept to utilize phenotype information from clinical data, such as electronic health records (EHRs), as a step toward developing a full-fledged computer program that can help diagnose genetic disease with high accuracy.
A mature and certified computerized diagnosis tool could help guide the genetic analysis efforts for patients struggling with undiagnosed conditions. For example, clinicians could use a clinical phenotype analysis system’s predicted diagnoses to look for mutations in genes that correspond to those diagnoses through whole exome sequencing, perhaps shortening the time patients spend waiting for an accurate diagnosis.
Algorithms designed to aid in computerized diagnosis of rare genetic diseases based on a patient’s clinical phenotype data typically consist of two parts: (1) a reference disease database that uses standardized language to describe the phenotypic traits of different diseases; and (2) a computational or statistical method that predicts diagnoses by searching the disease database for the best match to what the patient is experiencing. We developed and tested several computational methods for accomplishing this task.
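The two-part structure described above can be sketched in a few lines of Python. This is only an illustration and not one of the methods from the study: the disease names and phenotype labels are hypothetical placeholders, and plain Jaccard set overlap stands in for the matching step.

```python
# Part 1: a toy reference database.
# Assumption: disease names and phenotype labels are made-up placeholders.
REFERENCE_DB = {
    "disease_A": {"seizure", "microcephaly", "hypotonia"},
    "disease_B": {"short stature", "hearing loss"},
    "disease_C": {"seizure", "hearing loss", "ataxia"},
}


def jaccard(a: set, b: set) -> float:
    """Shared terms divided by all distinct terms; 0.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_diseases(patient_phenotypes, reference_db=REFERENCE_DB):
    """Part 2: score every disease against the patient and sort best-first."""
    query = set(patient_phenotypes)
    scored = [(name, jaccard(query, terms)) for name, terms in reference_db.items()]
    # Sort by descending score; break ties by name so the order is deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for name, score in rank_diseases(["seizure", "hypotonia"]):
        print(f"{name}: {score:.2f}")
```

A real system would replace the toy dictionary with a curated disease database and the Jaccard score with a method that accounts for how specific each phenotype is and how phenotypes relate to one another.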
I’m happy to report that our testing results were surprisingly good. We showed that phenotype information from an EHR was valuable to genetic disease diagnosis, and that computational methods could effectively utilize this information for such purposes.
Working Toward Computerized Diagnosis
When working with clinical labs in human genetics, I realized that diagnosing genetic disease is a very challenging task. Patients often display multiple and overlapping phenotypic symptoms of varying specificity, and may have co-occurring syndromes. Even with the development of genotyping and high-throughput sequencing technology and associated computational tools and methods, it’s still a very complicated process.
Among the many requirements in diagnostic guidelines, genetic analysts need to match the phenotypes manifested in patients with the phenotypes of a known genetic disease. This process involves a lot of literature review and laborious reference checking.
As a bioinformatician, I wanted to develop a computational method and tool that can help with this process.
About a year and a half ago, I started this project with a good friend of mine, statistical geneticist Ge Zhang, MD, PhD, of the Division of Human Genetics. Peter White, PhD, chair of the Division of Biomedical Informatics, has been very supportive since then.
To represent phenotypes and quantify their relations, we used Human Phenotype Ontology (HPO) terms. For the reference genetic disease database, we used Online Mendelian Inheritance in Man (OMIM), the most comprehensive genetic disease phenotype database. By combining OMIM and HPO, we were able to represent more than 7,000 genetic diseases with almost 8,000 distinct phenotypes.
The next step was to develop the computational algorithm to match the patient’s phenotypes with the known genetic diseases. We created and tested many different algorithms, and two of them worked well. The first method is called integrated semantic similarity. The second is called the weighted overlapping test. For simplicity, we refer to them as Method 1 and Method 2. Full details on the methods are available in our paper.
Testing the Methods
While we can always create new computational methods to solve a problem, what really matters is how well the new methods work in practice.
We first tested both methods with more than 20,000 simulated patients. Each patient was simulated from one of the thousands of genetic diseases. For each patient, we gave the program the patient’s phenotypes as the input query. The program would then rank all of the more than 7,000 genetic diseases based on their similarity to the query.
This works like a Google search: Given a set of keywords, Google returns a list of websites with the most relevant website ranked at the top. Just as we expect from Google, given a set of phenotypes of the patient, we hope to see the “true diagnosis” ranked at the top of the results.
In our test, we counted how many times the “true diagnosis” occurred in the top 10 results returned. This corresponds to the sensitivity of the method at a fixed, extremely high specificity (99.8%). In this test, both Method 1 and Method 2 achieved a sensitivity above 60%.
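The link between a top 10 cutoff and these two numbers can be made concrete with a short Python sketch. It rests on assumptions not spelled out in this article: that the 10 highest-ranked diseases are treated as positive calls for each patient, that exactly one disease is the true diagnosis, and that the database holds roughly 7,000 diseases. The function names are hypothetical, and the paper may define specificity differently.

```python
def top_k_hit_rate(ranked_lists, true_diagnoses, k=10):
    """Fraction of patients whose true diagnosis is among the first k results."""
    if len(ranked_lists) != len(true_diagnoses):
        raise ValueError("need one true diagnosis per ranked list")
    if not ranked_lists:
        return 0.0
    hits = sum(
        truth in ranked[:k] for ranked, truth in zip(ranked_lists, true_diagnoses)
    )
    return hits / len(ranked_lists)


def worst_case_specificity(n_diseases, k=10):
    """Lowest per-patient specificity when the top k results are called positive.

    Exactly one disease is the true diagnosis, so n_diseases - 1 are negatives.
    At worst all k flagged diseases are wrong, giving k false positives.
    """
    negatives = n_diseases - 1
    return (negatives - k) / negatives


if __name__ == "__main__":
    # Three toy patients, all with true diagnosis "d1", evaluated at k=2.
    rankings = [["d1", "d2", "d3"], ["d2", "d3", "d1"], ["d3", "d1", "d2"]]
    truths = ["d1", "d1", "d1"]
    print(top_k_hit_rate(rankings, truths, k=2))
    print(worst_case_specificity(7000, k=10))
```

Under these assumptions, the worst case for 7,000 diseases is 6,989 of 6,999 negatives correctly left out of the top 10, or about 99.86%, which is in line with the 99.8% figure quoted above.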
We then tested our methods with more realistic and complicated patient data. From the i2b2 (Informatics for Integrating Biology & the Bedside) database, which is built from EHRs, we obtained the diagnoses and phenotypes of 462 patients across 10 different genetic diseases.
This test was very challenging because the phenotypes from EHR data were expected to contain a lot of noise. The noise may come from comorbidities, adverse effects of treatment, information loss in data conversion, or human error.
Not surprisingly, the performance of both methods dropped. For these 462 patients, our methods could rank the “true diagnosis” in the top 10 results for only 30% of cases.
Results and Next Steps
When comparing our methods with the existing published approaches, our methods achieved a significantly higher sensitivity at the same specificity level. We also found that the performance of our methods remained significantly more stable than the existing methods as the number of phenotypes in the query increased.
Finally, we implemented our methods as a web-based application which is publicly and freely available at https://gddp.research.cchmc.org/. This application includes a natural language processing (NLP) unit, so that HPO phenotypes can be automatically recognized in a free text query. By using the R computing platform and in-memory data processing, it can generate the predicted ranked diagnoses in a couple of seconds for any query.
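To give a feel for what recognizing HPO phenotypes in free text involves, here is a deliberately naive Python sketch based on dictionary lookup. It is not how GDDP's NLP unit works (that implementation is not described here, and the application itself runs on R). The lexicon is a tiny hand-picked sample, the function name is hypothetical, and the IDs should be checked against the current HPO release.

```python
import re

# Tiny illustrative lexicon: lowercase phrase -> HPO ID.
# Assumption: hand-picked sample; verify IDs against the current HPO release.
LEXICON = {
    "seizure": "HP:0001250",
    "seizures": "HP:0001250",
    "microcephaly": "HP:0000252",
    "global developmental delay": "HP:0001263",
}


def extract_hpo_terms(text, lexicon=LEXICON):
    """Return the sorted HPO IDs whose phrases appear as whole words in text."""
    lowered = text.lower()
    found = set()
    for phrase, hpo_id in lexicon.items():
        # \b keeps "seizure" from matching inside an unrelated longer word.
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(hpo_id)
    return sorted(found)


if __name__ == "__main__":
    note = "Patient has seizures and global developmental delay."
    print(extract_hpo_terms(note))
```

A lookup like this ignores negation ("no seizures"), misspellings, and synonyms outside the lexicon, which is part of why phenotypes extracted from clinical text tend to be noisy.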
In conclusion, our new computational methods performed significantly better than existing ones. At the same time, we see huge potential for improvements in our methods.
Currently, we are only using clinical phenotypes, a small subset of all the clinical information we collect from a patient. In the future, I hope to collaborate with clinical labs to improve our methods and include more clinical information, such as the patient’s genetic variants from clinical sequencing, family history, and lab test results, to make the predictions even better.
For more information, contact Jing Chen at Jing.Chen2@cchmc.org.