The Lib14 dataset comprises 3 datasets of 16384 folded sequences (~107 GB) generated with the SPOT-RNA algorithm [1]. The data are intended to support the development of machine learning algorithms that predict the self-cleavage efficiency (HDV-Lib14) and the ligation efficiency (LIG-Lib14) of a given RNA sequence.
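Each of the two Lib14 sequences contains 14 specific nucleotide positions that were experimentally modified. Both full sequences are written with the IUPAC nucleotide code, in which ambiguity codes mark the modified (non-singleton) positions [2]. Every ambiguity code used here (M, R, K, S, Y) stands for two possible nucleotides, so 14 such positions encode 2^14 = 16384 distinct sequences.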
HDV-Lib14 sequence:
GGACCATTCGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYYCTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC

LIG-Lib14 sequence:
GGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYYCTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC

nt modifications | M | R | K | S | S | K | S | Y | Y | Y | R | S | M | S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HDV nt positions (0-based) | 11 | 21 | 26 | 36 | 50 | 53 | 63 | 66 | 70 | 71 | 76 | 81 | 83 | 84 |
LIG nt positions (0-based) | 3 | 13 | 18 | 28 | 42 | 45 | 55 | 58 | 62 | 63 | 68 | 73 | 75 | 76 |
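
The positions in the table can be recomputed from the two template sequences. The following is a minimal sketch rather than part of the dataset tooling: the names `IUPAC_TWO_FOLD`, `degenerate_positions` and `expand` are hypothetical, and indices are counted from 0, which is the convention that reproduces the table. It also enumerates every concrete sequence encoded by a template.

```python
from itertools import product

# Two-fold IUPAC ambiguity codes that occur in the Lib14 templates.
IUPAC_TWO_FOLD = {"M": "AC", "R": "AG", "K": "GT", "S": "CG", "Y": "CT"}

HDV_LIB14 = (
    "GGACCATTCGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYY"
    "CTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC"
)
LIG_LIB14 = (
    "GGAMTCCCATTAGRCTGGKCCGCCTCCTSGCGGCGGGAGTTGSGCKAGGGAGGAASAGYCTTYY"
    "CTAGRCTAASGMSCATCGATCCGGTTCGCCGGATCCAAATCGGGCTTCGGTCCGGTTC"
)


def degenerate_positions(seq):
    """Return (0-based index, IUPAC code) for every ambiguity code in seq."""
    return [(i, nt) for i, nt in enumerate(seq) if nt in IUPAC_TWO_FOLD]


def expand(seq):
    """Yield every concrete sequence encoded by a degenerate template."""
    # Unambiguous nucleotides contribute a single choice (themselves).
    choices = [IUPAC_TWO_FOLD.get(nt, nt) for nt in seq]
    for combo in product(*choices):
        yield "".join(combo)


if __name__ == "__main__":
    for name, seq in (("HDV", HDV_LIB14), ("LIG", LIG_LIB14)):
        print(name, "positions:", [i for i, _ in degenerate_positions(seq)])
        print(name, "variants:", sum(1 for _ in expand(seq)))
```

Because `expand` is a generator, the variants can be streamed one at a time instead of being held in memory all at once.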
The ML graphs are generated from each machine learning (ML) model's testing set, that is, the data that was not used for training. On each graph, the X-axis represents the output estimated by the ML model and the Y-axis represents the true output taken from experimental data. An exact prediction therefore lies on the diagonal line Y = X, and the farther a point lies from the diagonal, the larger the prediction error. The machine learning models are saved as "pickle" files in the "pkl" folder, under their respective model names, "lig_nt_MachineLearning.pkl" and "hdv_nt_MachineLearning.pkl".
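
As a rough illustration of how a saved model could be reloaded and how the distance from the Y = X diagonal can be quantified, here is a minimal standard-library sketch. The helper names `load_model`, `diagonal_residuals` and `mean_absolute_error` are hypothetical. The sketch assumes nothing about the type of the pickled object: unpickling generally requires the library that created the model to be installed, and pickle files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Unpickle a saved model. Only load pickle files that you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def diagonal_residuals(predicted, observed):
    """Vertical offset of each (X=predicted, Y=observed) point from Y = X."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return [y - x for x, y in zip(predicted, observed)]


def mean_absolute_error(predicted, observed):
    """Average absolute offset from the diagonal; 0 means every point is on Y = X."""
    residuals = diagonal_residuals(predicted, observed)
    if not residuals:
        raise ValueError("at least one prediction is required")
    return sum(abs(r) for r in residuals) / len(residuals)


# Example call (path assumed to be relative to the repository root):
# model = load_model("pkl/hdv_nt_MachineLearning.pkl")
```

A residual of zero corresponds to a point exactly on the diagonal. Positive residuals lie above it (the model underestimated the experimental value) and negative residuals lie below it.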
Figure 3: Datasets/HDV/radiate/SEQUENCE_10343_radiate.png
[1] https://github.com/jaswindersingh2/SPOT-RNA
[2] https://www.bioinformatics.org/sms/iupac.html