Problem. Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease that has no curative treatment and, because of its slow onset, lacks clear markers for early diagnosis. Genetic biomarkers of ALS could revolutionize drug discovery for the disease and help clinicians diagnose it earlier, leading to much better outcomes for patients. For these reasons, I investigated whether gene expression levels measured in patients' blood samples could accurately classify ALS patients, which, if true, could be very useful for early diagnosis. I was also interested in whether the genes most predictive of an ALS diagnosis could give insight into the biology underlying the disease.
Data. My dataset came from an ALS research study obtained from the National Institutes of Health and submitted by the University Medical Center Utrecht in the Netherlands. The dataset contained clinical and genetic data on over 700 patients. The explanatory variables were the expression levels of all genes detected in the patients' blood samples. This amounted to around 29,000 genes, which I reduced to 2,500 by keeping the genes most likely to differ in expression between the classes. The primary response variable was patient diagnosis, which had two classes: patients diagnosed with ALS and control patients.
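The reduction from roughly 29,000 genes to 2,500 can be illustrated with a simple filter. The sketch below is only one plausible way to do it: it assumes genes are ranked by the absolute value of Welch's t-statistic between the ALS and control groups, which is not necessarily the criterion used in the study. The function names (`welch_t`, `top_k_genes`) and the toy data are hypothetical and synthetic.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def top_k_genes(expr, labels, k):
    """Return the k gene names with the largest |t| between ALS and control.

    expr   : dict mapping gene name -> list of expression values (one per patient)
    labels : list of "ALS" / "control", aligned with the expression lists
    """
    scores = {}
    for gene, values in expr.items():
        als = [v for v, y in zip(values, labels) if y == "ALS"]
        ctrl = [v for v, y in zip(values, labels) if y == "control"]
        scores[gene] = abs(welch_t(als, ctrl))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Synthetic toy data (NOT from the study): 20 patients, 50 genes,
# with only the first 5 genes shifted upward in the ALS group.
rng = random.Random(0)
labels = ["ALS"] * 10 + ["control"] * 10
expr = {}
for g in range(50):
    shift = 5.0 if g < 5 else 0.0
    expr[f"gene_{g}"] = [
        rng.gauss(shift if y == "ALS" else 0.0, 1.0) for y in labels
    ]

selected = top_k_genes(expr, labels, k=5)
print(selected)
```

In the setting described above, `k` would be 2,500 and `expr` would hold about 29,000 genes. A filter like this is usually fitted on the training data only, so that the test set plays no part in choosing features.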
Analysis. I first split my data into a training set and a test set. I then explored the training data by assessing the characteristics of the feature variables, which were gene expression levels, and the response variable, which was the diagnosis of the patient. A key aspect of this exploratory analysis was assessing how gene expression levels differed between the ALS class and the control class. After exploring the data, I tuned and tested five cross-validated classification models: logistic regression, lasso regression, ridge regression, elastic net regression, and random forest. Overall, the lasso regression performed the best, with a test misclassification error of 13.5%. The tree-based random forest model had a test misclassification error of 16%.
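The evaluation workflow described above (a held-out test set, cross-validation folds for tuning, and misclassification error as the metric) can be sketched without committing to any particular model. The code below is a minimal illustration using only the standard library; the helper names, the 20% test fraction, the 60/40 class balance, and the fold count are assumptions for the example, not details taken from the study, and the classifiers themselves are left out.

```python
import random

def stratified_split(labels, test_fraction, seed=0):
    """Split sample indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

def misclassification_error(y_true, y_pred):
    """Fraction of samples whose predicted class differs from the true class."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

# Hypothetical label vector: 60 ALS patients and 40 controls.
labels = ["ALS"] * 60 + ["control"] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 80 20

# Folds are drawn from the training set only; the test set stays untouched
# until the final evaluation.
folds = list(k_fold_indices(len(train_idx), k=5))
print(len(folds))  # 5

# One wrong prediction out of four gives an error of 0.25.
print(misclassification_error(
    ["ALS", "ALS", "control", "control"],
    ["ALS", "control", "control", "control"],
))  # 0.25
```

On this scale, the reported test errors of 13.5% and 16% correspond to values of 0.135 and 0.16 from `misclassification_error` on the held-out test set.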
Conclusions. I found that, using solely genetic data that was not used for the original diagnosis, penalized regressions and random forests were able to accurately distinguish ALS patients from controls ~85% of the time. This is well above chance and suggests that gene expression measured in easily accessible blood samples could be an important supplement in diagnosing ALS. Furthermore, two features were present in the top 10 predictive features of the best-performing models. These features were the Atp5a1 gene and the AIF1 gene, both of which I found to be reported in scientific studies as crucial in the onset or progression of ALS. I hope that these results provide further evidence that genetic data can be an important diagnostic indicator, as well as an important tool for revealing the mechanisms underlying ALS pathology.
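Finding features shared across the top-ranked lists of several models amounts to a set intersection. The sketch below assumes each model exposes one importance score per gene (for example, a coefficient for a penalized regression or an importance value for a random forest) and ranks by absolute value. The gene names and scores are placeholders invented for illustration, not results from the study, and `k=3` stands in for the top 10 used above.

```python
def top_features(importance, k=10):
    """Return the k feature names with the largest absolute importance."""
    ranked = sorted(importance, key=lambda g: abs(importance[g]), reverse=True)
    return ranked[:k]

def shared_top_features(models, k=10):
    """Return the features that appear in the top-k list of every model."""
    top_sets = [set(top_features(imp, k)) for imp in models.values()]
    return set.intersection(*top_sets)

# Placeholder scores (invented for this example, NOT study results).
lasso_coefs = {
    "gene_1": -0.9, "gene_2": 0.7, "gene_3": 0.4, "gene_4": 0.0, "gene_5": -0.1,
}
forest_importance = {
    "gene_1": 0.20, "gene_2": 0.15, "gene_3": 0.01, "gene_4": 0.12, "gene_5": 0.02,
}

shared = shared_top_features(
    {"lasso": lasso_coefs, "random_forest": forest_importance}, k=3
)
print(sorted(shared))  # ['gene_1', 'gene_2']
```

Taking absolute values matters for the regression coefficients, since a strongly negative coefficient is as predictive as a strongly positive one.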