BLOOD CANCER TYPE CLASSIFICATION SOLUTION
About the Client
A company that uses predictive analytics and machine learning to deliver cost-effective solutions within the healthcare industry, helping private and governmental institutions deliver better results in diagnostics.
Today, identifying and confirming a blood cancer type takes time and effort from both the doctor and the patient, and the wait for confirmed test results causes the patient physical and emotional distress. Our client wanted to help doctors predict blood cancer types more accurately based on the patient’s historical data.
The three most common types of blood cancer are leukemia, lymphoma, and myeloma; there are also about nine less common blood disorders.
To identify which blood cancer type a person has, a biopsy is performed. The biopsy sample is then sent to a pathology lab for testing, together with the doctor’s note on which cancer type to test for. Unfortunately, almost 30% of doctors fail to identify the right type at the biopsy stage, and the biopsy then has to be repeated, because one biopsy yields enough material for only one laboratory test.
The goal of the project was to develop an AI solution, running on Windows 10, that assists medical staff in classifying the blood cancer type using the patient’s historical data as well as other data.
Datamaister developed a solution based on an AI model trained on historical diagnostic data from many blood cancer cases. It classifies the type of cancer using the following data:
- Patient demographics
- Detailed CBC test results
- Detailed chemistry test results
- Patient history data
To get a result, the user runs the script with a tabular file in a predefined format, containing patients’ historical data, as input.
The solution processes the file and returns the cancer classification for each patient from the input data.
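The file-in, file-out flow described above can be illustrated with a minimal sketch. It is not the delivered script: the column names (`patient_id`, `age`, `wbc`), the output column `predicted_cancer_type`, and the `placeholder_predict` function are all hypothetical stand-ins, and the real solution would call the trained model where the placeholder is used.

```python
import csv
import io
from typing import Callable, Dict, Iterable, TextIO

Row = Dict[str, str]


def classify_file(
    src: Iterable[str],
    dst: TextIO,
    predict: Callable[[Row], str],
    out_column: str = "predicted_cancer_type",  # hypothetical column name
) -> int:
    """Copy every patient row from src to dst, appending a predicted label."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + [out_column]
    writer = csv.DictWriter(dst, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in reader:
        row[out_column] = predict(row)
        writer.writerow(row)
        count += 1
    return count


def placeholder_predict(row: Row) -> str:
    # Stand-in for the trained model (assumption): always returns the same label.
    return "unclassified"


# Tiny in-memory example; a real run would open files on disk instead.
sample = "patient_id,age,wbc\n1,54,11.2\n2,37,4.8\n"
out = io.StringIO()
n_rows = classify_file(io.StringIO(sample), out, placeholder_predict)
result = out.getvalue()
print(result)
```

Passing the prediction function as a parameter keeps the file handling separate from the model, so the same wrapper could be reused if the model is retrained or replaced.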
The project was carried out in three consecutive stages, arranged in line with CRISP-DM recommendations.
Stage 1. Data cleaning and preprocessing
This stage consisted of data manipulations that produced a clean dataset ready for model training in the next stage.
Stage 2. Modeling and evaluation
The next stage involved experimenting with different types of models to identify the best one for blood cancer classification. The main deliverable of this stage was a trained AI model ready to be used for cancer type classification.
Stage 3. Integration
The final stage focused on developing the set of scripts that end users need to run the solution on their local computers. A Docker image was built to ease deployment on Windows 10. The key deliverable of this stage is a command file that takes an input file with patients’ data, runs the classification model on it, and adds cancer type classification columns to the file.
During the project, we tried different classification models, such as Logistic Regression, Random Forest, XGBoost, and others. After a thorough examination, we chose an XGBoost model with tuned hyperparameters, which, after feature engineering, gave the best F1 score. We chose the F1 score as the evaluation metric, and it turned out to be a good choice for an imbalanced dataset.
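To show why an F1-based metric suits imbalanced classes better than plain accuracy, here is a small illustrative sketch. The labels are made-up toy data, not project results, and macro averaging over the classes present in the true labels is an assumption: the text above does not say which averaging was used.

```python
from typing import Sequence


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes found in y_true."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy imbalanced data (invented for illustration): 8 of 10 cases share one class.
y_true = ["leukemia"] * 8 + ["lymphoma", "myeloma"]
# A degenerate model that always predicts the majority class.
y_pred = ["leukemia"] * 10

acc = accuracy(y_true, y_pred)
score = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} macro_f1={score:.2f}")
```

On this toy data the always-majority model keeps a high accuracy while its macro F1 is low, because the two minority classes each contribute an F1 of zero. That gap is the reason an F1-style metric is often preferred when classes are imbalanced.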
The system was wrapped in a Docker image so that it can be easily set up on any machine and delivered to the end user.
In addition, DVC (Data Version Control) was used to make the research reproducible and to make experiments, data, and code easy to track.