The goal of this project is to enable the development of machine learning algorithms that better predict the self-cleavage and ligation efficiency of a given RNA sequence for the Hepatitis Delta Virus (HDV) ribozyme. The project also includes a dataset of 16384 experimental sequences that were folded using the SPOT-RNA algorithm [1].

vincbeaulieu/HDV-LIG14

HDV Ribozyme Auto-Cleavage and Ligation Prediction Using Machine Learning

HDV / LIG - Lib14

The Lib14 dataset encompasses 3 datasets of 16384 folded sequences (~107 GB) generated using the SPOT-RNA algorithm [1]. The goal of this data is to enable the development of machine learning algorithms that better predict the self-cleavage efficiency of a given RNA sequence for HDV-Lib14 and the ligation efficiency for LIG-Lib14.

Each Lib14 sequence has 14 specific nucleotides that were experimentally modified. Both whole sequences use the IUPAC nucleotide code to denote these non-singleton (degenerate) positions [2].

HDV-Lib14 (whole sequence)

GGACCATTCGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYYCTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC

LIG-Lib14 (whole sequence)

GGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYYCTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC

The 14 modified IUPAC nucleotides and their positions (0-based, counted from the first nucleotide) in the HDV and LIG whole sequences [2]:

nt modifications   M   R   K   S   S   K   S   Y   Y   Y   R   S   M   S
HDV nt positions  11  21  26  36  50  53  63  66  70  71  76  81  83  84
LIG nt positions   3  13  18  28  42  45  55  58  62  63  68  73  75  76
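
The positions in the table can be checked directly against the two whole sequences. The sketch below is not part of the repository; the function names (`degenerate_positions`, `count_variants`, `expand`) are hypothetical and chosen for illustration. It relies only on the standard IUPAC meanings of the five codes that appear here (M = A or C, R = A or G, K = G or T, S = C or G, Y = C or T). Because each of the 14 positions allows two bases, each library expands to 2^14 = 16384 concrete sequences, which is consistent with the dataset size quoted above.

```python
from itertools import product

# Two-base IUPAC codes used in the Lib14 sequences (other IUPAC codes omitted).
IUPAC = {"M": "AC", "R": "AG", "K": "GT", "S": "CG", "Y": "CT"}

HDV_LIB14 = (
    "GGACCATTCGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAG"
    "YCTTYYCTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC"
)
LIG_LIB14 = (
    "GGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAG"
    "YCTTYYCTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC"
)


def degenerate_positions(seq):
    """Return (0-based position, IUPAC code) for every degenerate nucleotide."""
    return [(i, nt) for i, nt in enumerate(seq) if nt in IUPAC]


def count_variants(seq):
    """Number of concrete sequences encoded by a degenerate sequence."""
    total = 1
    for nt in seq:
        total *= len(IUPAC.get(nt, nt))  # plain A/C/G/T contribute a factor of 1
    return total


def expand(seq):
    """Lazily yield every concrete sequence encoded by a degenerate sequence."""
    choices = [IUPAC.get(nt, nt) for nt in seq]
    for combo in product(*choices):
        yield "".join(combo)


if __name__ == "__main__":
    for name, seq in (("HDV", HDV_LIB14), ("LIG", LIG_LIB14)):
        positions = [i for i, _ in degenerate_positions(seq)]
        print(name, positions, count_variants(seq))
```

`expand` is a generator, so variants can be streamed one at a time instead of holding all of them in memory. The order in which it yields variants is an artifact of `itertools.product` and is not claimed to match the `SEQUENCE_<n>` numbering used in the dataset folders.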

Preprocessing Results (PCA)

PCA results

HDV-Lib14 Machine Learning Outputs:

LIG-Lib14 Machine Learning Outputs:

The ML graphs are generated from each model's testing set, that is, the data that was not used to train the machine learning (ML) model. On each graph, the X-axis represents the output estimated by the ML model and the Y-axis represents the true output taken from experimental data. An exact prediction lies on the diagonal line Y = X, so the closer a point is to the diagonal, the more accurate the prediction. The machine learning models are saved as "pickle" files in the "pkl" folder under their respective model names, "lig_nt_MachineLearning.pkl" and "hdv_nt_MachineLearning.pkl".
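
As a rough illustration of how such a graph can be summarised numerically, the sketch below measures how far a set of (predicted, true) points falls from the line Y = X. It is not taken from the repository: `load_model` and `parity_metrics` are hypothetical names, the sample values are made up, and `load_model` assumes the ".pkl" files are ordinary Python pickles whose model class comes from a library installed in the current environment.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Unpickle a saved model. Only load pickle files from a trusted source."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def parity_metrics(predicted, true):
    """Summarise the distance of (x=predicted, y=true) points from Y = X.

    Returns the mean absolute error and the coefficient of determination
    (R^2) of the true values with respect to the predictions.
    """
    if len(predicted) != len(true) or len(true) == 0:
        raise ValueError("predicted and true must be non-empty and equally long")
    n = len(true)
    residuals = [y - x for x, y in zip(predicted, true)]
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(true) / n
    ss_res = sum(r * r for r in residuals)             # squared distance from Y = X
    ss_tot = sum((y - mean_true) ** 2 for y in true)   # spread of the true values
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "r2": r2}


if __name__ == "__main__":
    # Made-up values for illustration only; these are not results from the project.
    predicted = [0.10, 0.40, 0.75, 0.90]
    true = [0.12, 0.35, 0.80, 0.88]
    print(parity_metrics(predicted, true))
```

Points that all lie exactly on the diagonal give a mean absolute error of 0 and an R^2 of 1. A negative R^2 means the predictions fit worse than always predicting the mean of the true values.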

Example of folded HDV RNA

Figure 3: Datasets/HDV/radiate/SEQUENCE_10343_radiate.png

REFERENCES:

[1] https://github.com/jaswindersingh2/SPOT-RNA
[2] https://www.bioinformatics.org/sms/iupac.html
