The volume of genomic data generated by researchers has grown exponentially. This growth demands better tools for deriving insights from the data, including combining it with other data for better inference and decision-making. Machine learning, deep learning, and artificial intelligence have matured into powerful tools that can be applied in genomics. However, in Africa, there is still a skills gap among bioinformatics students in these technologies. In this course, we introduce the basics of machine learning, including practical skills in transforming genomic data for machine learning modelling. Although the MSc Bioinformatics curricula contain the course, it is not being taught, which puts students at a disadvantage as bioinformatics leans more towards data science.
In this short course, we intend to impart knowledge and skills in the following competencies (See ISCB Competencies):
- Knowledge and skills: Details of the scientific discovery process and the role of bioinformatics in it.
- Knowledge, comprehension, and application: Statistical, machine learning, and data science research methods in the context of molecular biology, genomics, medical or population genetics research.
- Knowledge and application: Command line and scripting-based computing skills appropriate to the discipline.
- Knowledge and skills: Data management
To attain the above competencies, the workshop participants should be able to:
- Describe the application of machine learning in genomics
- Explain the various machine learning principles and how they can be applied to genomics
- Explain the research design approaches as applied to machine learning for genomics
- Know the various open science tools (Jupyter Notebooks, Pandas, Conda) and how they support reproducible bioinformatics research
- Know the various machine learning frameworks in Python
From the above objectives, the workshop participant should acquire the following skills:
- Be able to set up Jupyter and Conda environments for a genomics machine learning project to ensure reproducibility
- Be able to transform genomic data for machine learning modelling
- Be able to perform exploratory analysis on genomic data, feature engineering, and parameter selection
- Be able to develop and validate machine learning models using genomic data
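As a rough illustration of what transforming genomic data for modelling can mean, the sketch below one-hot encodes a DNA sequence in plain Python. It is not taken from the course notebooks: the names `one_hot_encode` and `flatten` are made up for this example, and treating unknown bases such as N as all-zero rows is only one possible convention.

```python
# Illustrative only: one common way to turn DNA sequences into numeric features.
BASES = "ACGT"


def one_hot_encode(sequence):
    """Return one 4-element row per base, in A, C, G, T order.

    Ambiguous or unknown bases (e.g. N) become an all-zero row.
    """
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == b else 0 for b in BASES])
    return rows


def flatten(rows):
    """Concatenate the per-base rows into a single feature vector."""
    return [value for row in rows for value in row]


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

Sequences of equal length encoded this way give feature vectors of equal length, which is what most machine learning libraries expect as input.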
- Caleb Kibet
EANBiT Fellows
This course is broken up into several notebooks (lectures).
- Notebook_01 Machine learning Concepts
  - Module_01_Slides Introduction to machine learning
  - Module_02_Slides Machine Learning Deep Dive
- Notebook_02 Linear regression
  - Module_03_Slides Introduction to Linear Regression
- Notebook_03 Random Forest and Decision Trees
- Notebook_04 Feature Engineering in Genomics
  - Module_04_Slides Decision Trees
- Notebook_05 Feature Engineering Example using NLP
- Notebook_06 Machine Learning Using VCF output: Dimensionality Reduction
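The last notebook starts from VCF output. As a hedged sketch of what that can involve (not the notebook's own code), the snippet below converts the genotype calls on one VCF data line into alternate-allele counts per sample. The function names and the example record are made up for illustration, and the sketch assumes GT is the first key in the FORMAT column.

```python
def genotype_to_alt_count(genotype_field):
    """Convert a VCF sample field such as '0/1:35' to an alternate-allele count.

    Returns None when the call is missing (e.g. './.').
    Any non-reference allele (1, 2, ...) is counted as alternate.
    """
    gt = genotype_field.split(":")[0]          # assumes GT is the first FORMAT key
    alleles = gt.replace("|", "/").split("/")  # handle phased and unphased calls
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def vcf_line_to_features(line):
    """Return (variant_id, [alt counts per sample]) for one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos = fields[0], fields[1]
    samples = fields[9:]                       # sample columns follow the 9 fixed columns
    return f"{chrom}:{pos}", [genotype_to_alt_count(s) for s in samples]


if __name__ == "__main__":
    # Made-up record with three samples, for illustration only.
    record = "1\t12345\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0|1\t1/1"
    print(vcf_line_to_features(record))
```

Stacking these per-variant lists gives a matrix of counts, which is the kind of numeric input that dimensionality reduction methods generally work on.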
Throughout this course, we will be using Jupyter Notebooks.
The Jupyter Notebook is an interactive computing environment that enables users to author notebooks, which contain a complete and self-contained record of a computation. Because everything is kept in one document, notebooks are easy to share. The notebooks may contain:
- Live code
- Interactive widgets
- Plots
- Narrative text
- Equations
- Images
- Video
Note that "Jupyter" is a loose acronym for Julia, Python, and R, the primary languages supported by Jupyter.
The notebook allows a computational researcher to create reproducible documentation of their research. As bioinformatics is data-centric, using Jupyter Notebooks increases research transparency and thereby promotes open science.
This course assumes familiarity with Python and Pandas. Please have a look at the Python4Bioinformatics training materials for a refresher.
- Download Miniconda for your specific OS to your home directory
  - Linux:
    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - Mac:
    curl -LO https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
- Run the installer you downloaded:
  - Linux:
    bash Miniconda3-latest-Linux-x86_64.sh
  - Mac:
    bash Miniconda3-latest-MacOSX-x86_64.sh
- Follow all the prompts: if unsure, accept defaults
- Close and re-open your terminal
- If the installation is successful, you should see a list of installed packages with
  conda list
If the command cannot be found, you can add the Miniconda bin directory to your path using:
export PATH=~/miniconda3/bin:$PATH
For reproducible analysis, you can create a conda environment that contains all the Python packages you use.
`conda create --name ml_genomics python jupyter`
To activate the conda environment:
`conda activate ml_genomics`
(on conda releases older than 4.4, use `source activate ml_genomics` instead)
Having set up the conda environment, you can install JupyterLab using conda:
conda install -c conda-forge jupyterlab
or by using pip:
pip3 install jupyterlab
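To confirm from Python that the active environment has what you expect, and to record versions for reproducibility, something like the following sketch can help. The package list is an assumption (adjust it to whatever the notebooks import), and `installed_versions` is a helper name invented here. It relies only on the standard library's `importlib.metadata`, available from Python 3.8.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed package list for this course; adjust to what the notebooks import.
COURSE_PACKAGES = ["jupyterlab", "pandas"]

if __name__ == "__main__":
    for name, version in installed_versions(COURSE_PACKAGES).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it with the ml_genomics environment activated. Any package reported as not installed can then be added with conda or pip.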
Download all the notebooks from MachineLearning4Genomics. The easiest way is to clone the GitHub repository into your working directory:
git clone https://github.com/mbbu/MachineLearning4Genomics.git
cd MachineLearning4Genomics
or to download and unpack the zip archive:
wget https://github.com/mbbu/MachineLearning4Genomics/archive/master.zip
unzip master.zip
rm master.zip
cd MachineLearning4Genomics-master
Then you can launch JupyterLab using:
jupyter lab
NB: We will use JupyterLab for the training.
A Jupyter notebook is made up of many cells. Each cell can contain Python code. You can execute a cell by clicking on it and pressing Shift-Enter (run and move to the next cell) or Ctrl-Enter (run without moving to the next cell).
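For instance, a code cell might look like the sketch below. The `gc_content` function is a made-up example and not part of the course notebooks.

```python
def gc_content(sequence):
    """Fraction of bases in the sequence that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# In a notebook, the value of the last expression in a cell is displayed below it.
gc_content("ATGCGC")
```

Functions defined in one cell stay available to later cells for as long as the kernel is running, so cells are normally executed from top to bottom.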
- Machine Learning in Bioinformatics: Genome Geography: From raw sequencing reads to a machine learning model that infers an individual's geographical origin based on their genomic variation.
- Machine Learning for Genomics: How to transform your genomics data to fit into machine learning models.
To find datasets and take your learning further, use Kaggle.
To contribute, fork the repository, make some updates and send me a pull request.
Alternatively, you can open an issue.
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/