Updated docs to commit 9841f0bde2c8b03357889102809c8d7f5dbbbc15.
Circle-CI-website committed Nov 7, 2024
1 parent 17ab620 commit 1e0ba76
Showing 1 changed file: seminar/index.html (4 additions, 4 deletions)
@@ -124,23 +124,23 @@ <h4>How to apply as a speaker</h4>
<p>The seminar is a great opportunity to present your recent work to a large international audience.
If you want to apply as a speaker, please use the contact address given in the registration confirmation email.</p>
<h4>Next seminar</h4>
-<h6> Title: Detecting and avoiding homology-based data leakage in genome-trained sequence models </h6> 6 November 2024 5:30 p.m. - 6:30 p.m. Central European Time
-<p>Speaker: <strong><a href="https://deboer.bme.ubc.ca/people/">Abdul Muntakim Rafi (Rafi) - Carl de Boer lab</a></strong>, The University of British Columbia</p>
+<h6> Title: LegNet: parameter-efficient modeling of gene regulatory regions using modern convolutional neural networks </h6> 4 December 2024 5:30 p.m. - 6:30 p.m. Central European Time
+<p>Speaker: <strong><a href="https://autosome.org">Dmitry Penzar</a></strong>, autosome.org team</p>
<strong>Abstract:</strong>
<p align="justify">
-Models that predict function from sequence have become critical tools in deciphering the functional roles of genomic sequences and genetic variation within them. However, traditional approaches for dividing genomic sequences into training data, used to create the model, and test data, used to determine the model's performance on unseen data, fail to account for the widespread homology that permeates the genome. Using models that predict human gene expression from DNA sequence, we demonstrate that model performance on test sequences varies with their homology to training sequences, consistent with a form of 'data leakage' that inflates model performance by rewarding overfitting of sequences that are also present in the test data. We also show that for test sequences that share high homology with training data, gene expression can be predicted accurately simply by averaging the outputs for the most similar sequences in the training set, underscoring the problem of having homologous sequences across train-test sets. Furthermore, we observe that neural networks fail to generalize when predicting the effects of mutations, with larger expression changes predicted for unseen sequences than for seen sequences. This issue is particularly concerning because many GWAS SNPs have doppelgangers of alternate alleles present elsewhere in the genome, often multiple times, which may be inadvertently included in the training data, compromising the reliability of model-predicted effects of genetic variation. To prevent leakage in genome-trained models, we introduce 'hashFrag', a scalable solution for partitioning data with minimal leakage. Altogether, we address a fundamental challenge in creating appropriate train-test splits for sequence-based models on genomes, and highlight the consequences of failing to do so.
+State-of-the-art genome-scale deep learning (DL) models still struggle to reliably make cell type-specific predictions for gene regulatory regions or to reveal the fine-grained effects of individual genome variants, even when trained on thousands of bulk epigenetic profiles. This is due in particular to the comparably low number and finite diversity of native genomic sequences. These limitations are expected to be overcome by tapping into the increasing flow of uniformly processed data from massively parallel reporter assays (MPRA). Inspired by recent developments in image analysis, we developed LegNet, a new fully convolutional architecture that can efficiently exploit the volume and diversity of MPRA data to learn the underlying grammar of short gene regulatory regions. LegNet won 1st place in the Random Promoter DREAM Challenge 2022, significantly outperforming other approaches in predicting yeast promoter activity. We then adapted LegNet for lentiMPRA, where we compared its performance against SOTA human genome-trained models. Surprisingly, LegNet performed on par with or better than fine-tuned Enformer and Sei, despite having 250-fold fewer parameters, thus significantly reducing the computational cost of training and running DL models of regulatory regions. Another adaptation, SELEX-LegNet, extends its application to even shorter sequences, such as the results of high-throughput SELEX assays of transcription factor binding specificity. Further, we developed RNA-LegNet for MPRAs evaluating the effects of 5' and 3' UTRs on RNA stability and translation efficiency. Last but not least, we have implemented a cold diffusion generative model that produces sequences with desired properties, starting with yeast promoters of desired activity. We then repurposed the model to generate UTRs with cell-specific activity, highlighting LegNet's potential for the development of improved RNA therapeutics. All in all, we consider LegNet to be among the SOTA models for short regulatory sequences and to provide a solid baseline for further application of deep neural networks to deciphering the logic of eukaryotic gene regulation at both the transcriptional and post-transcriptional levels.
</p>

<h4>Upcoming speakers</h4>
<div class="container-fluid">
<ul class="list-unstyled">
-<li>4 December 2024 - <a href="https://scholar.google.ru/citations?user=0f5hVB4AAAAJ&hl=en">Ivan Kulakovskiy, Dmitry Penzar</a>, Vavilov Institute of General Genetics</li>

</ul>
</div>
<h4>Previous speakers</h4>
<div class="container-fluid">
<ul class="list-unstyled">
+<li>6 November 2024 - <a href="https://deboer.bme.ubc.ca/people/">Abdul Muntakim Rafi (Rafi) - Carl de Boer lab</a>, The University of British Columbia</li>
<li>2 October 2024 - <a href="https://avantikalal.github.io/">Avantika Lal</a>, Genentech</li>
<li>4 September 2024 - <a href="https://www.buenrostrolab.com/">Max Horlbeck and Ruochi Zhang (Buenrostro lab)</a>, Harvard University and Broad Institute</li>
<li>3 July 2024 - <a href="https://www.sabetilab.org/sager-gosai/">Sagar Gosai - Sabeti (Broad), Reilly (Yale) & Tewhey lab (Jackson laboratories)</a>, Broad Institute of Harvard and MIT</li>
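An aside on the removed abstract in the diff above: the "averaging the outputs for the most similar training sequences" baseline it describes is simple to sketch. The following is a minimal illustration, not the authors' hashFrag code; the k-mer Jaccard similarity is a crude stand-in for a real alignment-based homology score, and the function names and the top_n parameter are invented for this example.

import numpy as np

def kmers(seq, k=8):
    # Set of all length-k substrings of a DNA sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=8):
    # Jaccard similarity of k-mer sets; a crude proxy for homology.
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / max(len(ka | kb), 1)

def homology_baseline(test_seq, train_seqs, train_expr, top_n=5):
    # Predict expression by averaging the labels of the most similar
    # training sequences; no trained model is involved.
    scores = np.array([similarity(test_seq, s) for s in train_seqs])
    nearest = np.argsort(scores)[-top_n:]
    return float(np.mean(np.asarray(train_expr)[nearest]))

If this baseline approaches a trained model's test accuracy, the train-test split is likely leaking homologous sequences, which is exactly the warning sign the abstract describes.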

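The added abstract presents LegNet as a parameter-efficient, fully convolutional alternative to large genome-trained models. Below is a minimal sketch of that general model shape, assuming PyTorch and one-hot DNA input; it is not the published LegNet architecture (which the authors describe as inspired by modern image-analysis networks), and every layer choice here is illustrative.

import torch
import torch.nn as nn

class TinyFullyConvNet(nn.Module):
    # Toy fully convolutional sequence-to-activity model (hypothetical).
    def __init__(self, channels=64, n_blocks=4):
        super().__init__()
        layers = [nn.Conv1d(4, channels, kernel_size=7, padding=3), nn.SiLU()]
        for _ in range(n_blocks):
            layers += [
                nn.Conv1d(channels, channels, kernel_size=7, padding=3),
                nn.BatchNorm1d(channels),
                nn.SiLU(),
            ]
        self.body = nn.Sequential(*layers)
        self.head = nn.Linear(channels, 1)  # scalar activity per sequence

    def forward(self, x):
        # x: (batch, 4, length) one-hot encoded DNA
        h = self.body(x).mean(dim=-1)  # global average pooling over positions
        return self.head(h).squeeze(-1)

model = TinyFullyConvNet()
activity = model(torch.randn(2, 4, 150))  # e.g. two 150 bp promoter sequences

Global average pooling over positions keeps the network fully convolutional and indifferent to input length, which is one way such models stay small relative to transformer-based alternatives like Enformer.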