Machine Learning For Genomics

Introduction

The volume of genomic data generated by researchers has grown exponentially. This growth demands better tools for deriving insights from the data, including using it to augment other data for better inference and decision-making. Machine learning, deep learning, and artificial intelligence have matured into powerful tools that can be applied in genomics. However, in Africa, there is still a skills gap in these technologies among bioinformatics students. In this course, we introduce the basics of machine learning, including practical skills in transforming genomic data for machine learning modelling. Although the MSc Bioinformatics curricula contain the course, it is not being taught, which puts students at a disadvantage as bioinformatics leans more and more towards data science.

Competencies

In this short course, we intend to impart knowledge and skills in the following competencies (see ISCB Competencies):

Knowledge and skills: Details of the scientific discovery process and the role of bioinformatics in it.
Knowledge, comprehension, and application: Statistical, machine learning, and data science research methods in the context of molecular biology, genomics, medical or population genetics research.
Knowledge and application: Command-line and scripting-based computing skills appropriate to the discipline.
Knowledge and skills: Data management.

Learning Objectives

To attain the above competencies, the workshop participants should be able to:

Describe the application of machine learning in genomics
Explain the various machine learning principles and how they can be applied to genomics
Explain the research design approaches as applied to machine learning for genomics
Know the various open science tools (Jupyter Notebooks, Pandas, Conda) and how they support reproducible bioinformatics research
Know the various machine learning frameworks in Python

Learning Outcomes

From the above objectives, the workshop participant should acquire the following skills:

Be able to set up Jupyter and Conda environments for a genomics machine learning project to ensure reproducibility
Be able to transform genomic data for machine learning modelling
Be able to perform exploratory analysis on genomic data, feature engineering, and parameter selection
Be able to develop and validate machine learning models using genomic data
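
As an illustration of what "transforming genomic data for machine learning modelling" can mean in practice, here is a minimal sketch of one-hot encoding a DNA sequence in plain Python. The function name `one_hot_encode`, the base order A, C, G, T, and the choice to encode ambiguous bases such as N as an all-zero row are assumptions made for this example; they are not taken from the course notebooks.

```python
# Minimal sketch: one-hot encode a DNA sequence using only the standard library.
# Assumptions (illustrative only): base order is A, C, G, T, and any other
# character (for example N) becomes a row of zeros.

BASES = "ACGT"


def one_hot_encode(sequence):
    """Return one list of four 0/1 values per base in `sequence`."""
    encoded = []
    for base in sequence.upper():
        # Exactly one position is 1 for a known base; none for an unknown one.
        row = [1 if base == known else 0 for known in BASES]
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    # Print each base next to its encoded row.
    for base, row in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, row)
```

Each sequence becomes a table with one row per base and four columns, which is a shape most machine learning libraries can consume once it is converted to an array.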

Instructors

Caleb Kibet

Who should attend?

EANBiT Fellows

Quick Introduction to Jupyter Notebooks

Throughout this course, we will be using Jupyter Notebooks.

Introduction

The Jupyter Notebook is an interactive computing environment that enables users to author notebooks, which contain a complete and self-contained record of a computation. Because everything lives in a single document, these notebooks are easy to share. The notebooks may contain:

Live code
Interactive widgets
Plots
Narrative text
Equations
Images
Video

Note that "Jupyter" is a loose acronym for Julia, Python, and R, the primary languages supported by Jupyter.

The notebook allows a computational researcher to create reproducible documentation of their research. As bioinformatics is data-centric, the use of Jupyter Notebooks increases research transparency and so promotes open science.

Pre-requisites

Machine learning for genomics assumes familiarity with Python and Pandas. Please have a look at the Python4Bioinformatics training materials for a refresher.

First Steps

Installation

Download the Miniconda installer for your specific OS to your home directory:
- Linux: wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
- Mac: curl -LO https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
Run the installer you downloaded:
- Linux: bash Miniconda3-latest-Linux-x86_64.sh
- Mac: bash Miniconda3-latest-MacOSX-x86_64.sh
Follow all the prompts; if unsure, accept the defaults.
Close and re-open your terminal.
If the installation was successful, the following command prints a list of installed packages:
- conda list

If the command cannot be found, add the Miniconda bin directory to your path using: `export PATH=~/miniconda3/bin:$PATH`

For reproducible analysis, create a conda environment that contains all the Python packages you will use.

`conda create --name ml_genomics python jupyter`

To activate the conda environment (on conda versions older than 4.4, use `source activate ml_genomics` instead):

`conda activate ml_genomics`

Having set up the conda environment, you can install JupyterLab using conda:

conda install -c conda-forge jupyterlab

or using pip:

pip3 install jupyterlab
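
To confirm that the environment is active and that the packages are installed, you can run a small check from Python. This is a hedged sketch: the function name `describe_environment` and the default package list are assumptions for illustration. It relies on `CONDA_DEFAULT_ENV`, an environment variable that conda sets when an environment is activated, and on `importlib.util.find_spec` from the standard library.

```python
import importlib.util
import os
import sys


def describe_environment(packages=("jupyterlab", "pandas")):
    """Report the Python interpreter, the active conda env, and package availability."""
    report = {
        # Path of the interpreter running this code.
        "python": sys.executable,
        # Name of the active conda environment, or None if no env is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }
    for name in packages:
        # find_spec returns None when a top-level package cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `conda_env` is not `ml_genomics`, or a package is reported as `False`, activate the environment and repeat the installation step before continuing.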

How to learn from this resource?

Download all the notebooks from MachineLearning4Genomics. The easiest way is to clone the GitHub repository into your working directory, or to download it as a zip archive, using either of the following sets of commands:

git clone https://github.com/mbbu/MachineLearning4Genomics.git

cd MachineLearning4Genomics

or

wget https://github.com/mbbu/MachineLearning4Genomics/archive/master.zip

unzip master.zip

rm master.zip

cd MachineLearning4Genomics-master

Then you can launch JupyterLab using:

jupyter lab
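
To see which notebooks the downloaded repository contains, a short Python sketch like the one below can list them. The function name `list_notebooks` is hypothetical, and the sketch assumes only that the notebooks use the standard `.ipynb` extension; it skips the `.ipynb_checkpoints` folders that Jupyter creates for autosaves.

```python
from pathlib import Path


def list_notebooks(root="."):
    """Return notebook paths under `root`, relative to it and sorted."""
    root = Path(root)
    return sorted(
        path.relative_to(root)
        for path in root.rglob("*.ipynb")
        # Ignore Jupyter's autosave copies.
        if ".ipynb_checkpoints" not in path.parts
    )


if __name__ == "__main__":
    # Run from inside the cloned or unzipped repository directory.
    for notebook in list_notebooks():
        print(notebook)
```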

NB: We will use JupyterLab for the training. A Jupyter notebook is made up of cells, and each cell can contain Python code. You can execute a cell by clicking on it and pressing Shift-Enter (run and move to the next cell) or Ctrl-Enter (run without moving to the next cell).
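
To try this out, you could paste something like the following into an empty cell and run it. The function `gc_content` is an illustrative example written for this README, not part of the course notebooks; it computes the fraction of G and C bases in a DNA sequence.

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        # Avoid dividing by zero for an empty sequence.
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


print(gc_content("ATGC"))  # 0.5
```

The printed value appears directly below the cell, which is how a notebook keeps code and results together.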

Resources to use:

Encoding DNA
Machine Learning in Bioinformatics: Genome Geography: From raw sequencing reads to a machine learning model that infers an individual's geographical origin based on their genomic variation.
Deep Learning for Genomics
Machine Learning for Genomics. How to transform your genomics data to fit into machine learning models.
Machine Learning For Good
Machine Learning in Bioinformatics
Feature Engineering in Genomics - Variant calling
Machine learning for genomic classification
Support Vector Machines
Mathematics For Machine Learning
Deep Learning Book - Machine Learning Chapter

https://github.com/nageshsinghc4/DNA-Sequence-Machine-learning

To find datasets and take your learning further, use Kaggle.

How to Contribute

To contribute, fork the repository, make some updates and send me a pull request.

Alternatively, you can open an issue.

License

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
