NLP SYSTEM FOR DETERMINING HOW GENE MUTATIONS AFFECT DISEASES
About the Client
The client is a healthcare startup that develops IT solutions for personalized diagnostics and treatment.
Almost any information can be found on the Internet using Google or Bing. However, general-purpose search engines are poorly suited to more complex queries: they simply cannot cope with the amount of information available online. Such tasks are solved with custom NLP-based data analysis solutions. These systems can retrieve information from specified sources, search for relationships in context, and classify information.
Our client wanted a system that would analyze abstracts from the PubMed library and, based on them, establish connections between diseases and mutations to facilitate diagnosis.
The Datamaister team developed an NLP-based system that reads scientific abstracts from the PubMed library, analyzes their content, and determines the effect of genetic mutations on a disease. The solution works through API requests: the user submits a list of diseases and mutations, optionally with MeSH and OMIM codes, and receives a list of disease-mutation connection states:
- Mutation causes the disease
- Mutation doesn’t cause the disease
- Mutation reduces the risk of disease
This information is highly valuable for medical staff when examining patients and diagnosing their health status.
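The request and response described above can be sketched as plain data structures. This is a minimal illustration, not the client's actual API: the names ConnectionState, ConnectionQuery, ConnectionResult, and to_json, as well as every field name, are assumptions introduced here, and the values in the usage note are placeholders.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class ConnectionState(Enum):
    """The three disease-mutation connection states the system reports."""
    CAUSES = "causes"
    DOES_NOT_CAUSE = "does_not_cause"
    REDUCES_RISK = "reduces_risk"


@dataclass
class ConnectionQuery:
    """Hypothetical request body; all field names are assumptions."""
    diseases: list = field(default_factory=list)
    mutations: list = field(default_factory=list)
    mesh_codes: list = field(default_factory=list)
    omim_codes: list = field(default_factory=list)


@dataclass
class ConnectionResult:
    """Hypothetical response item: one disease-mutation pair and its state."""
    disease: str
    mutation: str
    state: ConnectionState


def to_json(results):
    """Serialize results to a JSON array, writing each state as its string value."""
    return json.dumps(
        [
            {"disease": r.disease, "mutation": r.mutation, "state": r.state.value}
            for r in results
        ]
    )
```

For example, a result built as ConnectionResult("disease A", "mutation X", ConnectionState.REDUCES_RISK) serializes to a single JSON object whose "state" field is "reduces_risk". Keeping the states in an enum means a client cannot receive a status outside the three listed above.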
The first step in developing the solution was to create a model for processing abstracts. For this we used BioBERT, an open-source model pre-trained on biomedical text. We then collected and labeled a dataset of 6000 training and 2000 test mutation-disease pairs. These come from about 600 abstracts, since one paper can describe several such connections. With this dataset, we trained the model to identify disease-mutation connections.
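Because one abstract can yield several labeled examples, each example has to tell the classifier which mutation and which disease it is about. One common way to do this in relation classification is to replace the two mentions with placeholder tags before the text reaches the model. The sketch below illustrates that idea only; the source does not say how the inputs were prepared, so the tag strings and the functions mark_pair and pairs_for_abstract are assumptions.

```python
# Placeholder tags; the exact strings are an assumption made for this sketch.
MUTATION_TAG = "@MUTATION$"
DISEASE_TAG = "@DISEASE$"


def mark_pair(text, mutation_span, disease_span):
    """Return text with one mutation mention and one disease mention replaced by tags.

    Each span is a (start, end) pair of character offsets, end exclusive.
    """
    (m_start, m_end), (d_start, d_end) = mutation_span, disease_span
    if m_start < d_end and d_start < m_end:
        raise ValueError("mutation and disease spans overlap")
    # Replace the later span first so the earlier offsets stay valid.
    edits = sorted(
        [(m_start, m_end, MUTATION_TAG), (d_start, d_end, DISEASE_TAG)],
        reverse=True,
    )
    for start, end, tag in edits:
        text = text[:start] + tag + text[end:]
    return text


def pairs_for_abstract(text, mutation_spans, disease_spans):
    """Build one tagged copy of the text for every mutation-disease combination."""
    return [
        mark_pair(text, m_span, d_span)
        for m_span in mutation_spans
        for d_span in disease_spans
    ]
```

An abstract with two mutation mentions and three disease mentions produces six tagged inputs, each of which would receive its own connection label.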
In this particular case, it was also important for medical staff to know the ethnicity of the patients studied in each scientific work. To capture this, we added a spaCy model that finds noun tokens in the “mutation-disease” clause and returns them together with the connection status.
After that, we built the infrastructure for storing and processing scientific articles. We made a tool that downloads abstracts, periodically checks for new publications, and passes them to the BioBERT model. All abstracts and the connections found in them are stored in Elasticsearch.
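The periodic check and the storage step can be sketched with the standard library alone. This is an illustration under stated assumptions, not the tool that was built: it assumes publications are found through the NCBI E-utilities esearch endpoint and written with the Elasticsearch bulk format, and the function names, search term, index name, and document fields are all hypothetical. The sketch only builds the request URL and the bulk payload; sending them is left out.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term, last_days, max_results=100):
    """Build an esearch URL for PubMed records added within the last `last_days` days."""
    params = {
        "db": "pubmed",
        "term": term,
        "reldate": last_days,   # restrict to the most recent N days
        "datetype": "edat",     # filter on the date the record entered PubMed
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def new_ids(search_response, seen_ids):
    """Return PubMed IDs from a parsed esearch JSON response that are not in seen_ids."""
    ids = search_response["esearchresult"]["idlist"]
    return [pmid for pmid in ids if pmid not in seen_ids]


def bulk_payload(index_name, documents):
    """Build a newline-delimited body for the Elasticsearch bulk API.

    Each document is a dict with a "pmid" key (a field name assumed here),
    which is reused as the document ID so re-indexing overwrites, not duplicates.
    """
    if not documents:
        return ""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["pmid"]}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

Filtering against the set of already stored IDs keeps each scheduled run from reprocessing abstracts the model has already seen.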
Finally, we created an API for integrating the solution into the client’s system.
The project was developed mainly in Python. We used the BioBERT model, trained on our dataset, to identify connections between diseases and mutations, and a spaCy model for ethnicity identification. All abstracts are stored in Elasticsearch. The solution was deployed on Amazon Web Services.