Abstract
Machine learning is a branch of artificial intelligence (AI) that uses statistical algorithms to learn patterns from large amounts of data and make predictions. Over the last decade it has come into widespread use in everyday life, from email spam classification and fraud detection to self-driving cars and automated screening for various types of cancer. This work shows how machine learning can be applied to two real-world problems: a) identifying phenotypes of chronic obstructive pulmonary disease (COPD) and b) differentiating COVID-19, the disease caused by the novel coronavirus, from other viral pneumonia and healthy lungs on chest X-ray images.
For the first problem, I carried out a systematic literature review of studies that used machine learning methods to identify COPD phenotypes over the last decade. Here, I describe the data reduction and clustering methods used to integrate clinical characteristics, such as symptom intensity and COPD exacerbations, with additional patient characteristics including comorbidities, biomarkers, and genetic information. This combination of clinical and patient characteristics is of high clinical importance, as it allows researchers to identify new COPD phenotypes and better characterise existing ones, with the aim of improving diagnosis and developing new treatments. Furthermore, I describe the strengths and weaknesses of those methods, identify gaps in the literature and provide methodological recommendations for best practice in clinical research.
I then carried out two original research studies in which I put those machine learning methods into practice. In the first study, I applied machine learning tools to identify COPD phenotypes among 13260 patients from the UK Royal College of General Practitioners and Surveillance Centre database. Three phenotypes were identified prior to COPD diagnosis (training dataset) and reproduced after COPD diagnosis (validation dataset) with 80% accuracy. Of those phenotypes, the “fast decliner” was the most common; it was characterised by younger patients with lung function loss and an increased number of COPD exacerbations. I then applied several machine learning models, including a decision tree, a gradient boosting machine and a linear regression, as well as two ensembles of these models (one linear and one random forest), to predict lung function decline after COPD diagnosis. Of those, the linear regression was used to identify the most important risk factors for predicting lung function decline.
In the second study, I analysed a 4-year observational cohort of 6883 UK patients who were ultimately diagnosed with COPD and at least one cardiovascular comorbidity. This cohort was also extracted from the UK Royal College of General Practitioners and Surveillance Centre database. Using machine learning clustering methods, I identified three subtypes of the COPD cardiovascular phenotype prior to COPD diagnosis and reproduced them on an independent dataset after diagnosis with 92% accuracy. I then developed four machine learning models (multinomial logistic regression, decision tree, random forest and gradient boosting machine) to predict cardiovascular comorbidities after diagnosis. Of the four models, the random forest classifier was the most accurate at predicting cardiovascular comorbidities in COPD patients with the cardiovascular phenotype. These findings are of substantial clinical importance, as early diagnosis may allow for early intervention and improve disease management and care.
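The abstract does not state how the reproduction accuracy of the phenotypes was computed. As a minimal sketch of one common approach, the example below assigns each validation patient to the nearest cluster centroid learned on the training data and reports the fraction of patients whose assignment matches their own cluster label. The centroids, feature values and labels are hypothetical toy data, and the sketch assumes that cluster labels have already been aligned between the two datasets.

```python
from math import dist


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))


def reproduction_accuracy(centroids, validation_points, validation_labels):
    """Fraction of validation patients whose nearest training centroid
    matches the cluster label they received in the validation dataset."""
    predicted = [nearest_centroid(p, centroids) for p in validation_points]
    matches = sum(p == t for p, t in zip(predicted, validation_labels))
    return matches / len(validation_labels)


# Hypothetical toy data: two standardised features per patient.
training_centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
validation_points = [(0.2, -0.1), (4.8, 5.3), (0.1, 4.7), (2.6, 2.6), (0.3, 0.4)]
# Hypothetical labels from clustering the validation data independently.
validation_labels = [0, 1, 2, 0, 0]

accuracy = reproduction_accuracy(training_centroids, validation_points, validation_labels)
print(f"Reproduction accuracy: {accuracy:.0%}")
```

In this toy example the fourth patient lies between clusters and is assigned to a different centroid than its own label, which is the kind of disagreement that lowers the reported accuracy.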
For the second problem, I used the public COVID-19 radiography database, consisting of 15153 X-ray images, to develop a novel model based on deep learning, an advanced offshoot of machine learning, that differentiates COVID-19 from other viral pneumonia and normal lungs with high accuracy. The novelty of my model lies in the addition of a dense layer on top of a pre-trained convolutional neural network, which yields high accuracy with fewer parameters and less training time than more complex models of similar accuracy. This may assist clinicians in making more accurate diagnostic decisions and support chest X-rays as a valuable screening tool for the early and rapid diagnosis of COVID-19.
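To illustrate why a single dense layer on top of a pre-trained network keeps the number of trainable parameters small, the sketch below implements a dense layer with a softmax output in plain Python and counts its parameters. It is not the architecture used in this work: the feature size of 1280, the random weights and the stand-in feature vector are all hypothetical, and the abstract does not name the pre-trained network or the width of the added layer. In practice the pre-trained network would be loaded and trained with a deep learning framework.

```python
import math
import random


def dense_softmax(features, weights, biases):
    """Apply one dense layer followed by softmax; returns class probabilities."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense_param_count(n_features, n_classes):
    """Trainable parameters of a dense layer: one weight per input per class, plus biases."""
    return n_features * n_classes + n_classes


N_FEATURES = 1280  # hypothetical size of the pre-trained network's output vector
CLASSES = ["COVID-19", "viral pneumonia", "normal"]

rng = random.Random(0)
# Hypothetical untrained weights; only these would be updated during training
# while the pre-trained network stays frozen.
weights = [[rng.gauss(0.0, 0.01) for _ in range(N_FEATURES)] for _ in CLASSES]
biases = [0.0] * len(CLASSES)

# Stand-in for the feature vector the pre-trained network would produce for one X-ray.
features = [rng.random() for _ in range(N_FEATURES)]

probabilities = dense_softmax(features, weights, biases)
trainable = dense_param_count(N_FEATURES, len(CLASSES))
print(dict(zip(CLASSES, probabilities)))
print(f"Trainable parameters in the added layer: {trainable}")
```

Under these assumptions the added layer has only a few thousand trainable parameters, whereas training a full convolutional network from scratch typically involves millions.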