Bioinformatics pipeline for the identification and quantification of active, full-length Long INterspersed Elements (LINEs)
Long INterspersed Elements (LINEs) are a class of retrotransposons of transposable elements. The abundant number of LINE sequences in the human genome can create genomic structural variants in human populations and lead to somatic alterations in cancer genomes. Evaluating LINE expression will help to better understand the roles they have across diverse tissues and cell lines. With the advent of long-read sequencing technologies, these platforms offer a promising solution by enabling more accurate characterization of LINEs expression profiles. This repository hosts a collection of scripts designed to quantify the expression of LINEs, particularly L1s, from long-read sequencing data. The scripts facilitate data preprocessing, LINE identification, artifact-based filtering, and quantification of L1 expression.
1. Minimap2 (v2.24)
2. RepeatMasker (version 4.1.5 released March 2023)
3. Python (v3.7.11)
4. Seqtk (1.2-r94)
5. LongReadSum (v1.3.0)
6. Samtools (1.15.1)
7. Bedtools (v2.30.0)
Newer versions for each program may also work, it is recommended to create a conda environment and download these programs within the environment. See this reference for more information and the Anaconda Search engine for the program requirements specifics.
- Reference Genome: hg38
- Comprehensive gene annotation: gencode.v40.annotation (GTF and BED File Formats) (and/or the most recent version)
- Custom LINE Reference Library: UCSC hg38 RepeatMasker track and LINE/L1 consensus sequences from the Repbase RepeatMasker library (provided in the references folder)
- L1Base2 Active, Inactive, and Intact Only in ORF2 Reference Regions
PacBio sequenced Universal Human Reference (UHR) cell line sample. See this reference for additional information.
- Download the dataset:
wget https://downloads.pacbcloud.com/public/dataset/UHR_IsoSeq/FullLengthReads/flnc.bam
- Convert the BAM file into a fastq file, before using the scripts to the pipeline below:
samtools fastq flnc.bam > UHR_PacBio.fastq
The expected results are outlined in the reference publication below.
01_preprocess_input.sh
02_preprocess_mapping.sh
03_L1_detection.sh
04_map_hg38.sh
05_map_qc_LRS.sh
06_read_filter.sh
07_exon_filter.sh
08_L1_loci_filter.sh
09_map_qc_LRS.sh
10_normalization_wgt_avg.sh
The pipeline is designed to take in a single input file, which can be in either FASTQ
or FASTA
format, suitable for both RNA and cDNA sequencing data. If dealing with multiple FASTQ files, it is best to merge the files using the cat
command beforehand.
If the input is in FAST5 format, it is recommended to use Guppy6 (Available online: https://community.nanoporetech.com) to perform basecalling first, with the default basecalling read Q-score filtering threshold of 7, and then convert the FASTQ files for downstream analysis.
Note: Each example command may take some additional time ran as is, for best results, optimize with the HPC (High-Performance Computing) cluster/computer.
This script first initializes the subsequent directories and input files required for analysis and was filtered to retain reads longer than 1kb.
- Sample Name (all one word)
- Full Path to the Input and Reference Files (entire path)
- Input File Format (FASTQ or FASTA)
- Data Type (RNA or cDNA)
./01_preprocess_input.sh sample_name /path/to/input/fasta/file.fasta FASTA RNA
- Folder of your specified sample name (specified by parameter 1, no spaces)
- 4 Sub-folders for downstream analysis
a_dataset
b_repeat_masker_process
c_hg38_mapping_LRS
d_LINE_quantification
- Depending on the input parameters set, the script produces the following:
-
If the input is in FASTQ format (specified by parameter 3), it converts the input to FASTA format and filters out reads shorter than 1kb.
-
If the input data is specified as RNA (parameter 4), it further converts it to cDNA format. If it's already in DNA format, no conversion is performed.
- FASTA File of your input FASTQ
- FASTA File with Reads < 1kb Removed
Next, using a custom LINE reference library as a reference, mapping with Minimap2 was completed as a filtering step to retain reads that contained L1 elements. The custom LINE reference library was constructed by merging the UCSC hg38 RepeatMasker track with the LINE/L1 consensus sequences from the Repbase RepeatMasker library.
- Sample Name (all one word)
- Full Path to the Input and Reference Files (entire path)
./02_preprocess_mapping.sh sample_name
- FASTA (.fa) file of the sequences mapped to the custom LINE reference library
- SAM (.sam) file equivalent
This script first runs RepeatMasker to identify reads with any type of transposable element. The RepeatMasker output was processed to retain reads with less than 10% diverged LINE/L1 elements from the RepeatMasker L1 consensus.
- Sample Name (all one word)
- Full Path to the Input and Reference Files (entire path)
./03_L1_detection.sh sample_name
- Files specific to the RepeatMasker output: RepeatMasker Explaination
- .out.xm
- .cat
- .align
- .out
- .tbl
- .masked
- .ori.out
- Using the
- div10_LINEs.out: Instances from the RepeatMasker output (.out) file with LINE/L1 as the repeat class/family
- div10_readIDs.txt: Read IDs of the read sequences with LINE/L1 as the repeat class/family
- UHR_ONT_test_div10.fa: FASTA file of these Read IDs
Reads with less than 10% diverged LINE/L1 elements were retained for Minimap2 splice-aware mapping to the hg38 reference genome.
- Sample Name (all one word)
- Full Path to the Input and Reference Files (entire path)
./04_map_hg38.sh sample_name
- Mapped SAM file to the hg38 reference genome (.sam)
- Mapped BAM file to the hg38 reference genome (.bam)
- Sorted by position mapped BAM file to the hg38 reference genome (.bam)
- Indexed sorted by position mapped BAM file to the hg38 reference genome (.bai)
LongReadSum was utilized to report mapping statistics for quality check analysis.
- Sample Name (all one word)
- Full Path to the Input and Reference Files (entire path)
./05_map_qc_LRS.sh sample_name
- Log output file (.log)
- Folder with sample_name holding the LongReadSum output
- LongReadSum Output:
- bam_summary.txt
- img (folder)
- st_bam_statistics_dynamic.html
- st_bam_statistics.html
Artifact-Based Filtering: First, we removed reads where less than 90% of the read itself mapped to the L1 region of interest. With the remaining reads, a bedgraph was generated for downstream calculations of L1 expression levels.
- Sample Name (all one word)
- Specify the Reference L1 Region File (Active, Inactive, or Intact Only in ORF2)
- Full Path to the Input and Reference Files (entire path)
06_read_filter.sh sample_name active
- sample_name_L1_regions_reads.bam
- sample_name_read_filter_passed.bam
- sample_name_read_filter_passed.sorted_position.bam
- sample_name_read_filter_passed.sorted_position.bam.bai
- sample_name_bedgraph.bg
- sample_name_bedgraph_clean.bg
- sample_name_bedgraph_sorted.bg
Next, L1 regions with exon overlap using GENCODE.v43 were removed. The exon annotations from GENCODE.v43 were combined into a new annotation file and using bedtools intersect
, L1 loci with overlap were removed.
- Full Path to the Input and Reference Files (entire path)
./07_exon_filter.sh sample_name
- active_no_ORF2_no_INACTIVE.bed: This file is provided with the references folder (../L1Base2_filtered/overlap_removed/active_no_ORF2_no_INACTIVE.bed) using gencode.annotation.v43 reference file. However, this script is the general set up if you would like to filter the L1 regions (active, inactive, or intact only in ORF2) reference file using a different gencode.annotation file to remove any exon overlap from the L1 regions. You would need to edit the gencode.annotation file for the exons only beforehand.
Next, regions with fewer than two mapped reads, inconsistent read start position among the reads located within the L1 region (threshold 100bps), and overall starting positions too far from the L1 region start position (threshold 1.5kb) were removed from analysis to filter out false positives.
- Sample Name (all one word)
- Full Path to the Input and Reference Files (entire path)
The final number of reads for each region is located in the file titled: active_coverage_for_weighted_avg.bed
(for example, if you are calculating over the active L1 regions)
./08_L1_loci_filter.sh sample_name
- sample_name_consistent_passed_regions.bed: Remaining L1 (active, inactive, or intact only in ORF2) regions that have consistent read start or end positions, taking into account the strandness of the regions.
- sample_name_regions_for_coverage.bed: Remaining regions that passed all L1 loci criteria and are used to calculate the L1 expression coverage levels over.
- sample_name_threshold_passed_regions.bed: Remaining L1 (active, inactive, or intact only in ORF2) regions where the starting position of the region falls within a 1.5kb window relative to the average consistent starting position, taking into account the strandness of the regions.
- sample_name_regions_for_coverage.sorted.bed: Sorted remaining regions that passed all L1 loci criteria and are used to calculate the L1 expression coverage levels over.
- raw_coverage_values_mean.txt: Contains the specific L1 regions that passed the previous criteria with their initial coverage values.
- filtered_coverage_values_mean.txt: Regions with coverage values less than 3 (< 2 reads) are filtered out, and the remaining regions are stored in this file.
- active_coverage_for_weighted_avg.bed: The final output of the script, containing all the active, inactive, or intact only in ORF2 regions with the specific regions that passed the previous criteria with their respective coverage values to be used in the next weighted average calculation. If L1 regions also meet the coverage threshold (3 reads, as per the script), their coverage values are included in this file; otherwise, the original reference regions are used.
LongReadSum was utilized to report mapping statistics for quality check analysis.
- Sample Name (all one word)
- Type of L1 Category you are analyzing (active, inactive, ORF2)
- Full Path to the Input and Reference Files (entire path)
./09_map_qc_LRS.sh sample_name
- Log output file (.log)
- Folder with sample_name holding the LongReadSum output
- LongReadSum Output:
- bam_summary.txt
- img (folder)
- st_bam_statistics_dynamic.html
- st_bam_statistics.html
To determine the general expression level of these L1 loci, the calculated coverage values were first normalized by the total number of reads in the sample. Then, a weighted average was computed across all L1 loci in each of the reference L1Base2 categories: active, inactive, and intact only in ORF2, for each sample. This weighted average involved assigning weights based on the length of the reference L1 element with its corresponding expression value, scaled in millions, as displayed below. For i takes the L1 reference category, "Active," "Inactive," and "Intact only in ORF2" to represent the different categories.
- Sample Name (all one word)
- Type of L1 Category you are analyzing (active, inactive, ORF2)
- Full Path to the Input and Reference Files (entire path)
./10_normalization_wgt_avg.sh sample_name
- normalized_active_regions.bed: Using the total number of reads from the LongReadSum summary report output, the coverage values in the last column of the file
active_coverage_for_weighted_avg
were normalized. - coverage_weighted_avg.bed: Using the weighted average equation displayed above, the values in the last column of
normalized_active_regions.bed
were weighted together to output one value to represent the overall L1 expression value for the specific L1 reference region, active, inactive, or intact only in ORF2.
If you have any questions/issues/bugs, please post them on GitHub.
Rybacki, Karleena, et al. "Assessing the Expression of Long INterspersed Elements (LINEs) via Long-Read Sequencing in Diverse Human Tissues and Cell Lines." Genes 14.10 (2023): 1893.