HDV Ribozyme Auto-Cleavage and Ligation Prediction Using Machine Learning

HDV / LIG - Lib14

The Lib14 dataset comprises 3 datasets of 16384 folded sequences (~107 GB), generated using the SPOT-RNA algorithm [1]. The data is intended to support the development of machine learning algorithms that better predict the self-cleavage efficiency of a given RNA sequence for HDV-Lib14 and the ligation efficiency for LIG-Lib14.

Each Lib14 sequence has 14 specific nucleotide positions that were experimentally modified. Both whole sequences below use the IUPAC nucleotide code to denote the positions where more than one nucleotide is possible [2].

HDV-Lib14 (whole sequence)

GGACCATTCGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYYCTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC

LIG-Lib14 (whole sequence)

GGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYYCTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC

The 14 modified IUPAC nucleotides and their respective positions in the HDV and LIG whole sequences [2]. Positions are zero-based, so the first nucleotide of each sequence is position 0.

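| nt modifications | M | R | K | S | S | K | S | Y | Y | Y | R | S | M | S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HDV nt positions | 11 | 21 | 26 | 36 | 50 | 53 | 63 | 66 | 70 | 71 | 76 | 81 | 83 | 84 |
| LIG nt positions | 3 | 13 | 18 | 28 | 42 | 45 | 55 | 58 | 62 | 63 | 68 | 73 | 75 | 76 |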
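
The positions listed above can be recovered directly from the two whole sequences. The following is a minimal sketch, not code from this repository: the names `IUPAC_TWO_BASE`, `degenerate_positions`, `count_variants` and `expand` are hypothetical, and the mapping covers only the five two-base IUPAC codes that occur in the Lib14 templates [2]. Since each of the 14 positions allows two nucleotides, a template encodes 2^14 = 16384 concrete sequences, which matches the dataset size quoted above if each dataset holds one entry per variant (an assumption).

```python
from itertools import product

# Two-base IUPAC ambiguity codes that occur in the Lib14 templates.
IUPAC_TWO_BASE = {
    "M": "AC",
    "R": "AG",
    "K": "GT",
    "S": "CG",
    "Y": "CT",
}

HDV_LIB14 = (
    "GGACCATTCGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYY"
    "CTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC"
)
LIG_LIB14 = (
    "GGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYY"
    "CTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC"
)


def degenerate_positions(template):
    """Return the zero-based indices of ambiguity codes in the template."""
    return [i for i, nt in enumerate(template) if nt in IUPAC_TWO_BASE]


def count_variants(template):
    """Count the concrete sequences encoded by the template."""
    total = 1
    for nt in template:
        # Plain A/C/G/T contribute a factor of 1, ambiguity codes a factor of 2.
        total *= len(IUPAC_TWO_BASE.get(nt, nt))
    return total


def expand(template):
    """Lazily yield every concrete sequence encoded by the template."""
    choices = [IUPAC_TWO_BASE.get(nt, nt) for nt in template]
    for combo in product(*choices):
        yield "".join(combo)


print(degenerate_positions(HDV_LIB14))
print(degenerate_positions(LIG_LIB14))
print(count_variants(HDV_LIB14))
```

Because `expand` is a generator, the 16384 variants of a template can be iterated one at a time without holding them all in memory.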

Preprocessing Results (PCA)

PCA results

HDV-Lib14 Machine Learning Outputs:

LIG-Lib14 Machine Learning Outputs:

The ML graphs are generated from each model's testing set, that is, the data that was not used to train the machine learning (ML) model. The X-axis represents the output estimated by the ML model, and the Y-axis represents the true output taken from experimental data. An exact prediction therefore lies on the diagonal line Y = X, and the closer a point lies to this diagonal, the more accurate the prediction. The machine learning models are saved as "pickle" files in the "pkl" folder under their respective model names, "lig_nt_MachineLearning.pkl" and "hdv_nt_MachineLearning.pkl".
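
As a rough illustration of how agreement with the Y = X diagonal could be quantified, the sketch below computes the mean absolute error and the coefficient of determination between predicted and true values. It uses only the Python standard library; the function name `diagonal_agreement` and the sample values are hypothetical and do not come from this repository or its results.

```python
from statistics import mean


def diagonal_agreement(predicted, true):
    """Summarise how close the (predicted, true) points lie to the line Y = X."""
    if len(predicted) != len(true) or not true:
        raise ValueError("inputs must be non-empty and of equal length")

    # Vertical distance of each point from the diagonal Y = X.
    residuals = [t - p for p, t in zip(predicted, true)]
    mae = mean(abs(r) for r in residuals)

    # Coefficient of determination relative to the mean of the true values.
    true_mean = mean(true)
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - true_mean) ** 2 for t in true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    return {"mae": mae, "r2": r2}


# Made-up values, for illustration only.
print(diagonal_agreement([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))
print(diagonal_agreement([1, 2, 3], [2, 3, 4]))
```

Points that sit exactly on the diagonal give a mean absolute error of 0 and a coefficient of determination of 1; a constant offset from the diagonal raises the error and lowers the coefficient.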

Example of folded HDV RNA

Figure 3: Datasets/HDV/radiate/SEQUENCE_10343_radiate.png

REFERENCES:

[1] https://github.com/jaswindersingh2/SPOT-RNA
[2] https://www.bioinformatics.org/sms/iupac.html