<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="description" content="RNA-GPT: Multimodal Generative System for RNA Sequence Understanding">
<meta name="keywords" content="RNA, GPT, Multimodal, Generative System, RNA Sequence Understanding">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>RNA-GPT: Multimodal Generative System for RNA Sequence Understanding</title>
<script async src="https://www.googletagmanager.com/gtag/js?id=G-PYVRSFMDRL"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() { dataLayer.push(arguments); }
gtag('js', new Date());
gtag('config', 'G-PYVRSFMDRL');
</script>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="./static/css/bulma.min.css">
<link rel="stylesheet" href="./static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="./static/css/bulma-slider.min.css">
<link rel="stylesheet" href="./static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="./static/css/index.css">
<link rel="icon" href="./static/images/RNA_GPT.png">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script defer src="./static/js/fontawesome.all.min.js"></script>
<script src="./static/js/bulma-carousel.min.js"></script>
<script src="./static/js/bulma-slider.min.js"></script>
<script src="./static/js/index.js"></script>
</head>
<body>
<nav class="navbar" role="navigation" aria-label="main navigation">
<div class="navbar-brand">
<a role="button" class="navbar-burger" aria-label="menu" aria-expanded="false">
<span aria-hidden="true"></span>
<span aria-hidden="true"></span>
<span aria-hidden="true"></span>
</a>
</div>
<div class="navbar-menu">
<div class="navbar-start" style="flex-grow: 1; justify-content: center;">
<a class="navbar-item" href="https://yijia-xiao.github.io/">
<span class="icon">
<i class="fas fa-home"></i>
</span>
</a>
<div class="navbar-item has-dropdown is-hoverable">
<a class="navbar-link">More Research</a>
<div class="navbar-dropdown">
<a class="navbar-item" href="https://arxiv.org/abs/2408.11363">ProteinGPT</a>
<a class="navbar-item" href="https://arxiv.org/abs/2310.02469">PrivacyMind</a>
<a class="navbar-item" href="https://arxiv.org/abs/2411.08900">RNA-GPT</a>
</div>
</div>
</div>
</div>
</nav>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">RNA-GPT: Multimodal Generative System for RNA Sequence Understanding</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">Yijia Xiao<sup>1</sup>,</span>
<span class="author-block">Edward Sun<sup>1</sup>,</span>
<span class="author-block">Yiqiao Jin<sup>2</sup>,</span>
<span class="author-block">Wei Wang<sup>1</sup></span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>University of California, Los Angeles,</span>
<span class="author-block"><sup>2</sup>Georgia Institute of Technology</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block"><a href="https://arxiv.org/abs/2411.08900" class="external-link button is-normal is-rounded is-dark"><span class="icon"><i class="fas fa-file-pdf"></i></span><span>Paper</span></a></span>
<span class="link-block"><a href="https://github.com/Yijia-Xiao/RNA-GPT" class="external-link button is-normal is-rounded is-dark"><span class="icon"><i class="fab fa-github"></i></span><span>Code</span></a></span>
<!-- Add more links if available -->
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>RNAs are vital molecules that carry genetic information essential for life, with significant implications for drug development and biotechnology. However, RNA research is often slowed by the vast amount of literature. To address this, we introduce <strong>RNA-GPT</strong>, a multi-modal RNA chat model that simplifies RNA discovery by leveraging extensive RNA literature.</p>
<p>RNA-GPT combines RNA sequence encoders with linear projection layers and state-of-the-art large language models (LLMs) for precise representation alignment. This enables it to process user-uploaded RNA sequences and provide concise, accurate responses. Our scalable training pipeline, powered by RNA-QA, automatically gathers RNA annotations from RNACentral using a divide-and-conquer approach with GPT-4o and latent Dirichlet allocation (LDA) to handle large datasets and generate instruction tuning samples.</p>
<p>Experiments show RNA-GPT effectively handles complex RNA queries, streamlining RNA research. We also introduce RNA-QA, a dataset of 407,616 RNA sequences with annotations for modality alignment and instruction tuning.</p>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Introduction</h2>
<div class="content has-text-justified">
<p>Large language models (LLMs) trained on internet-scale corpora have been shown to perform extraordinarily well on a large array of tasks from Olympiad-level mathematical and scientific reasoning to planning long-term tasks for robotic systems. Recent advances in the biological and medical fields have enabled the adaptation of powerful models to accelerate research, significantly reducing reliance on traditional experiments.</p>
<p>Because proteins, RNAs, and DNAs can be represented as character strings, and because vast amounts of sequenced data are readily available, an ideal environment has emerged for training language models to predict and generate protein, DNA, and RNA structures and sequences. Protein language models like ESM have successfully encoded protein sequence and structure information, inspiring works such as ProteinGPT and ProtST, which adapt protein representations into a language-based format, enabling natural language querying of protein data.</p>
<p>Similar to ESM-2, works like RiNALMo and RNA-FM have utilized the flexible capabilities of language models to learn and predict RNA structure and function. Much as proteins can be represented as strings of characters, RNAs, with their sequences of four nucleotide bases, have also sparked interest in computational RNA and DNA research using large language models (LLMs).</p>
<p>While models like ProteinGPT, ProtST, ProteinChat, and ProteinCLIP have made significant progress in aligning protein sequences and structures with textual descriptions, the DNA and RNA domains lag far behind. Previous efforts, such as RiNALMo and RNA-FM, have mainly focused on specific tasks like promoter or enhancer prediction and structure and function analysis. ChatNT is among the few models striving to bridge the gap between RNA comprehension and natural language. However, its emphasis is on performing biological tasks as a conversational agent rather than providing deep RNA understanding and comprehensive dialogue.</p>
<p>As a result, there is a notable gap in RNA chat models that offer in-depth knowledge. However, applying multimodal LLMs to RNA modeling presents unique challenges, especially in integrating diverse modalities such as textual descriptions, RNA sequences, and structural data.</p>
<p>To overcome these challenges, we propose a two-step approach to RNA-GPT. First, we use the RNA-FM sequence encoder to embed RNA sequences, then align these sequence representations with natural language through a large, automatically curated QA dataset from RNACentral. Second, to ensure our model generates concise and accurate responses, we break down RNA-QA's abstract summaries into individual QA pairs for instruction tuning, enhancing the model's ability to deliver clear and relevant answers. We use Meta AI's flagship Llama-3 8B Instruct as our backbone LLM to provide solid general language understanding.</p>
<p>More specifically, our contributions are as follows:</p>
<ul>
<li><strong>Novel Framework:</strong> RNA-GPT is one of the first multi-modal RNA sequence chat models that enables deep, interactive RNA-focused conversations, significantly enhancing the understanding of RNAs for biological research.</li>
<li><strong>Large-scale Dataset and Collection Pipeline:</strong> We introduce RNA-QA, a QA dataset derived from the RNA Central Database for modality alignment instruction tuning of RNA chat models. We also present our highly scalable collection pipeline that automates the scraping and summarizing of relevant literature on RNA. Using a divide-and-conquer summarization strategy, we ensure that research details are preserved effectively.</li>
</ul>
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<!-- Center the entire content within the container -->
<div class="columns is-centered">
<div class="column is-full">
<h2 class="title is-3">Methodology</h2>
<div class="content has-text-justified">
<p>RNA-GPT uses the pre-trained RNA-FM sequence encoder to embed RNA sequences, which are then passed through a linear projection layer. This layer learns to map the RNA embeddings to a shared representation space with natural language, enabling alignment with a backbone LLM, for which we chose Meta’s Llama-3 8B model. The training process is divided into two stages:</p>
<ol>
<li><strong>Sequence and Modality Alignment:</strong> RNA and natural language representations are aligned.</li>
<li><strong>Instruction Tuning:</strong> The model is fine-tuned for task-specific QA generation.</li>
</ol>
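<p>Conceptually, the stage-1 setup is a frozen encoder feeding a trainable linear projection. The sketch below is illustrative only: the 640-dimensional RNA-FM embedding size and 4096-dimensional Llama-3 8B hidden size are assumptions, and both the encoder and the LLM are stubbed out.</p>

```python
import torch
import torch.nn as nn

class RnaProjector(nn.Module):
    """Maps frozen RNA encoder embeddings into the LLM's representation space."""
    def __init__(self, rna_dim: int = 640, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(rna_dim, llm_dim)  # the only trainable part in stage 1

    def forward(self, rna_embeddings: torch.Tensor) -> torch.Tensor:
        # rna_embeddings: (batch, seq_len, rna_dim) from the frozen encoder
        return self.proj(rna_embeddings)

# Stand-in for the frozen RNA-FM encoder (the real one is a 12-layer transformer).
encoder = nn.Linear(4, 640)
for p in encoder.parameters():
    p.requires_grad = False  # stage 1 freezes the sequence encoder

projector = RnaProjector()
one_hot = torch.eye(4)[torch.tensor([0, 2, 1, 3])]   # toy one-hot A/C/G/U sequence
tokens = projector(encoder(one_hot.unsqueeze(0)))    # soft tokens for the LLM
print(tuple(tokens.shape))                           # (1, 4, 4096)
```

<p>In stage 2, the same projected tokens are kept while the instruction-tuning objective is applied; only the projection (and optionally the LLM) receives gradients.</p>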
<!-- Figure 1 -->
<div class="columns is-centered">
<div class="column is-12">
<figure class="image">
<img src="./static/images/MA.png" alt="Modality Alignment Stage">
</figure>
<p class="has-text-centered"><strong>Figure 1:</strong> RNA-GPT Modality Fusion &amp; Alignment Stage: we freeze the sequence encoder block and train the linear projection layer to align RNA sequence representations with text. In the alignment stage, the input is only the projected RNA representation; no text prompts are incorporated at this stage.</p>
</div>
</div>
<p><strong>Modality Alignment Stage (Stage 1):</strong> RNA sequences, given as strings, are first fed into the pre-trained sequence encoder, which features 12 transformer layers trained on 23 million RNAs from the RNACentral database via self-supervised learning. We use a specialized token <code>&lt;RNAHere&gt;</code> for RNA-text modality alignment.</p>
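<p>To make the role of the alignment token concrete, here is a minimal sketch of how a <code>&lt;RNAHere&gt;</code> placeholder can be spliced into a prompt; the template wording and function names are our assumptions, not the paper's exact code. Text on either side of the placeholder is embedded normally, and the projected RNA representation is inserted in between.</p>

```python
RNA_TOKEN = "<RNAHere>"

def build_alignment_prompt(description: str) -> str:
    # In stage 1 the model input is essentially just the projected RNA
    # representation; the placeholder marks where those soft tokens go.
    return f"{RNA_TOKEN} A description of this RNA: {description}"

def split_on_rna_token(prompt: str):
    # Each side is tokenized and embedded as text; the RNA soft tokens are
    # concatenated between the two halves before being fed to the LLM.
    before, after = prompt.split(RNA_TOKEN)
    return before, after

prompt = build_alignment_prompt("a small nucleolar RNA")
before, after = split_on_rna_token(prompt)
print(repr(before))  # '' -- the RNA tokens lead the sequence in this template
```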
<p><strong>Instruction Tuning Stage (Stage 2):</strong> In stage 2, we instruction-tune the model using our curated RNA-QA dataset. We break down the full annotations into targeted QA samples with concise answers to specific questions as prediction targets. This allows the chat model to provide more relevant and accurate responses.</p>
<!-- Figure 2 -->
<div class="columns is-centered">
<div class="column is-12">
<figure class="image">
<img src="./static/images/IT.png" alt="Instruction Tuning Stage">
</figure>
<p class="has-text-centered"><strong>Figure 2:</strong> RNA-GPT Instruction Tuning Stage: we use the RNA representation from the alignment stage and combine it with question prompts for instruction tuning. The model generates answers that are concise and relevant to the questions.</p>
</div>
</div>
<h3>RNA-QA Dataset</h3>
<p>To achieve modality alignment, we constructed a large-scale dataset from the RNA Central database, comprising 407,616 RNA sequences paired with abstract descriptions.</p>
<p><strong>Divide and Conquer RNA Literature Summarization:</strong> We begin by filtering RNA sequences from RNA Central, focusing on those indexed with "Lit Scan," yielding around 420,000 RNAs with associated research papers. For the remaining 407,616 RNAs, we scrape and extract abstracts from all relevant literature. We apply LDA topic modeling to group papers by topic, summarizing each group individually. This ensures each summarization focuses on a narrower, cohesive subject area, minimizing information loss.</p>
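<p>The grouping step can be sketched with an off-the-shelf LDA implementation. The snippet below is a toy illustration under assumptions (four hand-written abstracts, two topics, scikit-learn's LDA); in the real pipeline each resulting group would then be summarized separately by GPT-4o-mini, a call omitted here.</p>

```python
from collections import defaultdict

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

abstracts = [
    "snoRNA guides ribosomal RNA methylation in the nucleolus",
    "methylation of rRNA is directed by box C/D snoRNAs",
    "microRNA expression correlates with tumor progression",
    "miRNA dysregulation is a biomarker in several cancers",
]

# Bag-of-words counts, then LDA to assign each abstract a dominant topic.
counts = CountVectorizer(stop_words="english").fit_transform(abstracts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
topics = lda.transform(counts).argmax(axis=1)

groups = defaultdict(list)
for abstract, topic in zip(abstracts, topics):
    groups[int(topic)].append(abstract)  # one summarization call per group

print({t: len(g) for t, g in groups.items()})
```

<p>Summarizing each topic group on its own keeps every summarization prompt focused on a narrow subject, which is what limits information loss when the per-group summaries are later merged.</p>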
<!-- Figure 3 -->
<div class="columns is-centered">
<div class="column is-12">
<figure class="image">
<img src="./static/images/LDA.png" alt="RNA-QA Dataset Pipeline">
</figure>
<p class="has-text-centered"><strong>Figure 3:</strong> RNA-QA uses an automated pipeline to scrape and summarize existing RNA literature. We apply latent Dirichlet allocation (LDA) to group the vast literature on each RNA, and then we summarize each group individually using GPT-4o-mini. These summaries are then combined and refined to produce the final RNA annotation.</p>
</div>
</div>
<p><strong>Data Augmentation:</strong> Using GPT-4o-mini, RNA-GPT decomposes the rich RNA annotations of RNA-QA into more specific QA pairs for instruction tuning, so that user instructions can be answered concisely.</p>
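<p>As a hedged illustration of this augmentation step, the sketch below builds a decomposition prompt for an LLM (GPT-4o-mini in our pipeline) and parses its output into QA pairs. The template wording and the <code>Q: ... | A: ...</code> output format are assumptions made for the example, and the API call itself is mocked.</p>

```python
DECOMPOSE_TEMPLATE = (
    "Break the following RNA annotation into standalone question-answer "
    "pairs, one per line, formatted as 'Q: ... | A: ...'.\n\n"
    "Annotation:\n{annotation}"
)

def build_decomposition_prompt(annotation: str) -> str:
    return DECOMPOSE_TEMPLATE.format(annotation=annotation)

def parse_qa_lines(llm_output: str):
    # Parse 'Q: ... | A: ...' lines into (question, answer) tuples.
    pairs = []
    for line in llm_output.splitlines():
        if line.startswith("Q:") and "| A:" in line:
            q, a = line.split("| A:", 1)
            pairs.append((q[2:].strip(), a.strip()))
    return pairs

# Mocked LLM response standing in for a GPT-4o-mini call:
mock = "Q: What does this snoRNA guide? | A: 2'-O-methylation of rRNA."
print(parse_qa_lines(mock))
```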
</div>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full-width">
<h2 class="title is-3">Experiments</h2>
<div class="content has-text-justified">
<p>We trained <strong>RNA-GPT</strong> on the Llama-3 8B architecture, using a smaller subset of 5,000 RNAs and 121,000 QA samples for our initial model. We are in the process of training a larger RNA-GPT that uses all 407,616 RNAs of the RNA-QA dataset, with millions of QA samples.</p>
<!-- Table 1 -->
<div class="columns is-centered">
<div class="column is-10">
<div class="has-text-centered">
<table class="table is-striped is-fullwidth is-centered">
<thead>
<tr>
<th>Metric</th>
<th colspan="3">RNA Sequence</th>
<th colspan="3">Modality Fusion</th>
<th colspan="3">RNA-GPT</th>
</tr>
<tr>
<th></th>
<th>S<sub>BERT</sub></th>
<th>S<sub>Pub</sub></th>
<th>S<sub>GPT</sub></th>
<th>S<sub>BERT</sub></th>
<th>S<sub>Pub</sub></th>
<th>S<sub>GPT</sub></th>
<th>S<sub>BERT</sub></th>
<th>S<sub>Pub</sub></th>
<th>S<sub>GPT</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Precision</strong></td>
<td>0.7372</td><td>0.5528</td><td>0.5219</td>
<td>0.6929</td><td>0.6507</td><td>0.6655</td>
<td>0.8602</td><td>0.7384</td><td>0.7848</td>
</tr>
<tr>
<td><strong>Recall</strong></td>
<td>0.7496</td><td>0.5270</td><td>0.5474</td>
<td>0.8028</td><td>0.6082</td><td>0.6603</td>
<td>0.8404</td><td>0.7208</td><td>0.7561</td>
</tr>
<tr>
<td><strong>F1 Score</strong></td>
<td>0.7424</td><td>0.5387</td><td>0.5339</td>
<td>0.7403</td><td>0.6283</td><td>0.6627</td>
<td>0.8494</td><td>0.7293</td><td>0.7700</td>
</tr>
</tbody>
</table>
<p><strong>Table 1:</strong> RNA-QA (<strong>AIS</strong>): Comparison of RNA Sequence (left), Modality Fusion (middle), and RNA-GPT (right). Embedding base models are BERT, PubMedBERT, and OpenAI's GPT text-embedding-3-large.</p>
</div>
</div>
</div>
<p>We conducted a series of experiments to assess RNA-GPT's effectiveness both quantitatively and qualitatively, along with ablation studies to gauge the importance of each module at different stages. We compared three variants: the base model (the LLM with the RNA sequence given as plain text), the modality-aligned model, and the final instruction-tuned model.</p>
<!-- Table 2 -->
<div class="columns is-centered">
<div class="column is-10">
<div class="has-text-centered">
<table class="table is-striped is-fullwidth is-centered">
<thead>
<tr>
<th>Metric</th>
<th colspan="3">RNA Sequence</th>
<th colspan="3">Modality Fusion</th>
<th colspan="3">RNA-GPT</th>
</tr>
<tr>
<th></th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ROUGE</strong></td>
<td>0.2364</td><td>0.0935</td><td>0.2037</td>
<td>0.2239</td><td>0.1364</td><td>0.2091</td>
<td>0.5031</td><td>0.3667</td><td>0.4747</td>
</tr>
</tbody>
</table>
<p><strong>Table 2:</strong> RNA-QA (<strong>AIS</strong>): ROUGE Scores for RNA Sequence, Modality Fusion, and RNA-GPT.</p>
</div>
</div>
</div>
<!-- Figures -->
<div class="columns is-centered">
<div class="column is-half has-text-centered">
<figure>
<img src="./static/images/RNAGPT_ROUGE.png" alt="ROUGE Score Comparison" />
<figcaption><strong>Figure 4:</strong> ROUGE Score Comparison</figcaption>
</figure>
</div>
<div class="column is-half has-text-centered">
<figure>
<img src="./static/images/RNAGPT_Semantic.png" alt="Semantic Score Comparison" />
<figcaption><strong>Figure 5:</strong> Semantic Score Comparison</figcaption>
</figure>
</div>
</div>
<!-- Table 3 -->
<div class="columns is-centered">
<div class="column is-10">
<div class="has-text-centered">
<table class="table is-striped is-fullwidth is-centered">
<thead>
<tr>
<th>Metric</th>
<th colspan="3">RNA Sequence</th>
<th colspan="3">Modality Fusion</th>
<th colspan="3">RNA-GPT</th>
</tr>
<tr>
<th></th>
<th>S<sub>BERT</sub></th>
<th>S<sub>Pub</sub></th>
<th>S<sub>GPT</sub></th>
<th>S<sub>BERT</sub></th>
<th>S<sub>Pub</sub></th>
<th>S<sub>GPT</sub></th>
<th>S<sub>BERT</sub></th>
<th>S<sub>Pub</sub></th>
<th>S<sub>GPT</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Precision</strong></td>
<td>0.7612</td><td>0.5498</td><td>0.5479</td>
<td>0.6884</td><td>0.6201</td><td>0.6676</td>
<td>0.8620</td><td>0.7173</td><td>0.7568</td>
</tr>
<tr>
<td><strong>Recall</strong></td>
<td>0.7654</td><td>0.5512</td><td>0.5649</td>
<td>0.8187</td><td>0.5830</td><td>0.6602</td>
<td>0.8623</td><td>0.7161</td><td>0.7554</td>
</tr>
<tr>
<td><strong>F1 Score</strong></td>
<td>0.7625</td><td>0.5501</td><td>0.5561</td>
<td>0.7466</td><td>0.6005</td><td>0.6637</td>
<td>0.8609</td><td>0.7165</td><td>0.7560</td>
</tr>
</tbody>
</table>
<p><strong>Table 3:</strong> RNA-QA (<strong>D&C</strong>): Comparison of RNA Sequence (left), Modality Fusion (middle), and RNA-GPT (right). Embedding base models are BERT, PubMedBERT, and OpenAI's GPT text-embedding-3-large.</p>
</div>
</div>
</div>
<!-- Table 4 -->
<div class="columns is-centered">
<div class="column is-10">
<div class="has-text-centered">
<table class="table is-striped is-fullwidth is-centered">
<thead>
<tr>
<th>Metric</th>
<th colspan="3">RNA Sequence</th>
<th colspan="3">Modality Fusion</th>
<th colspan="3">RNA-GPT</th>
</tr>
<tr>
<th></th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
<th>ROUGE-1</th>
<th>ROUGE-2</th>
<th>ROUGE-L</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>ROUGE</strong></td>
<td>0.2472</td><td>0.0964</td><td>0.2182</td>
<td>0.0922</td><td>0.0393</td><td>0.0799</td>
<td>0.4791</td><td>0.2690</td><td>0.4405</td>
</tr>
</tbody>
</table>
<p><strong>Table 4:</strong> RNA-QA (<strong>D&C</strong>): ROUGE Scores for RNA Sequence, Modality Fusion, and RNA-GPT.</p>
</div>
</div>
</div>
<p>The results demonstrate that <strong>RNA-GPT</strong> significantly outperforms both the original model and the modality fusion model in terms of precision, recall, F1 score, and ROUGE metrics. This indicates the effectiveness of our two-stage training process and the utility of the RNA-QA dataset.</p>
<p>Figures 4 and 5 illustrate the performance improvements of RNA-GPT over the baseline models. The ROUGE score comparison shows a significant increase in ROUGE-1, ROUGE-2, and ROUGE-L scores, indicating better overlap with the reference answers. The semantic score comparison, evaluated using BERT, PubMedBERT, and GPT embeddings, demonstrates enhanced semantic similarity between the generated and reference answers.</p>
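<p>For readers curious how embedding-based precision, recall, and F1 of this kind can be computed, here is a minimal BERTScore-style sketch with toy unit-normalized vectors standing in for BERT, PubMedBERT, or GPT embeddings; it is an assumption-laden illustration, not our exact evaluation code.</p>

```python
import numpy as np

def semantic_prf(cand: np.ndarray, ref: np.ndarray):
    """Greedy cosine matching between candidate and reference token embeddings."""
    sim = cand @ ref.T                  # pairwise cosine similarities (unit vectors)
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

rng = np.random.default_rng(0)
cand = rng.normal(size=(5, 8))
cand /= np.linalg.norm(cand, axis=1, keepdims=True)

p, r, f = semantic_prf(cand, cand)  # identical texts should score perfectly
print(round(float(f), 6))           # 1.0
```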
<p>These experiments validate the effectiveness of our approach in aligning RNA sequences with natural language representations, enabling the model to generate accurate and relevant responses to complex RNA queries.</p>
</div>
</div>
</div>
</div>
</section>
<footer class="footer">
<div class="container">
<div class="content has-text-centered">
<a class="icon-link" href="https://arxiv.org/abs/2411.08900"><i class="fas fa-file-pdf"></i></a>
<a class="icon-link" href="https://github.com/Yijia-Xiao/RNA-GPT"><i class="fab fa-github"></i></a>
</div>
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.</p>
</div>
</div>
</div>
</div>
</footer>
</body>
</html>