DE-kupl is a pipeline that finds differentially expressed k-mers between RNA-Seq datasets. It is distributed under the MIT License.
Dekupl-run handles the first part of the DE-kupl pipeline from raw FASTQ to the production of contigs from differentially expressed k-mers.
Before using Dekupl-run, install these dependencies:
- Snakemake
- jellyfish
- pigz
- CMake
- boost
- R:
  - DESeq2: open R and execute:

        > source("https://bioconductor.org/biocLite.R")
        > biocLite("DESeq2")

  - RColorBrewer
  - pheatmap
  - foreach
  - doParallel
- Python:
  - rpy2:

        pip3 install rpy2
Either use the Docker container (updated daily, https://hub.docker.com/r/ebio/dekupl/), or:
- Clone this repository including submodules :
git clone --recursive https://github.com/Transipedia/dekupl-run.git
- Install dependencies above
- Edit the config.json file to add the list of your samples, their conditions and the location of their FASTQ files. See the next section for a description of the parameters.
- Run the pipeline with the following command:

      snakemake -jNB_THREADS --resources ram=MAX_MEMORY -p

  Replace `NB_THREADS` with the number of threads and `MAX_MEMORY` with the maximum memory (in megabytes) you want DE-kupl to allocate.
- Once Dekupl-run has been fully executed, the DE contigs it produced (under `DEkupl_results/A_vs_B_kmer_counts/merged-diff-counts.tsv.gz`) can be annotated using Dekupl-annotation.
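The run command above can also be assembled from a wrapper script. The following is a minimal sketch, not part of Dekupl-run: the helper name `build_snakemake_command` is hypothetical, and it only uses the flags shown in the command above.

```python
import shlex


def build_snakemake_command(nb_threads, max_memory_mb):
    """Return the snakemake invocation from this README as an argument list.

    Hypothetical helper for illustration; Dekupl-run does not ship it.
    """
    if nb_threads < 1:
        raise ValueError("nb_threads must be at least 1")
    if max_memory_mb < 1:
        raise ValueError("max_memory_mb must be at least 1")
    return [
        "snakemake",
        f"-j{nb_threads}",        # number of threads
        "--resources",
        f"ram={max_memory_mb}",   # maximum memory, in megabytes
        "-p",
    ]


if __name__ == "__main__":
    # Example values only: 8 threads and 16000 MB of memory.
    print(shlex.join(build_snakemake_command(8, 16000)))
```

The returned list can be passed to `subprocess.run(command, check=True)` from the cloned repository directory, which avoids quoting issues that come with building a single shell string.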
- fastq_dir: Location of FASTQ files
- nb_threads: Default number of threads to use (unless specified in the snakemake command line).
- kmer_length: Length of k-mers (default: 31). This value should not exceed 32.
- diff_method: Method used for differential testing (default: DESeq2). Possible choices are 'Ttest' which is fast and 'DESeq2' which is more sensitive but longer to run.
- lib_type: Paired-end library type (default: `rf`). You can specify either `rf` for reverse-forward strand-specific libraries, `fr` for forward-reverse strand-specific libraries, or `unstranded` for unstranded libraries.
- output_dir: Location of DE-kupl results (default: `DEkupl_result`).
- tmp_dir: Temporary directory to use (default: `./`, i.e. the current directory).
- r1_suffix: Suffix to use for the FASTQ with the left mate. Set `r2_suffix` for the second FASTQ.
- dekupl_counter:
  - min_recurrence: Minimum number of samples to support a k-mer.
  - min_recurrence_abundance: Minimum abundance threshold to consider a k-mer in the recurrence filter.
- Ttest:
  - condition: Specify A and B conditions.
  - pvalue_threshold: Adjusted p-value threshold to consider a k-mer as DE. Only DE k-mers are selected for assembly.
  - log2fc_threshold: Minimum log2 fold change to consider a k-mer as DE.
- Samples: An array of samples. Each sample is described by a `name` and a `condition`. The FASTQ files for a sample are located using the following pattern: `fastq_dir/sample_name_{1,2}.fastq.gz`.
- transcript_fasta: The reference transcriptome to be used for masking. By default, Dekupl-run uses the human GENCODE transcriptome for masking. To change this, add the following to the config.json file:

      "transcript_fasta": "my_transcriptome.fa"
For single-end libraries, please specify the following parameters:
- lib_type: Set lib_type to `single` for a single-end strand-specific library, or to `unstranded` for single-end unstranded libraries.
- fragment_length: The estimated fragment length (necessary for kallisto quantification). Default value is `200`.
- fragment_standard_deviation: The estimated standard deviation of fragment length (necessary for kallisto quantification). Default value is `30`.
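The parameters described above can be gathered into a config.json file. The sketch below builds one from Python and checks the constraints this README states. Everything concrete in it is an assumption made for illustration: the sample names, the directory, the threshold values, the `A`/`B` layout under `condition`, and the exact key spelling and nesting, which should be checked against the config.json shipped with the repository. The function `check_config` is hypothetical.

```python
import json

# Choices listed in this README.
ALLOWED_DIFF_METHODS = {"Ttest", "DESeq2"}
ALLOWED_LIB_TYPES = {"rf", "fr", "unstranded", "single"}


def check_config(config):
    """Return a list of problems found in a config dictionary (hypothetical helper)."""
    problems = []
    if config.get("kmer_length", 31) > 32:
        problems.append("kmer_length should not exceed 32")
    if config.get("diff_method", "DESeq2") not in ALLOWED_DIFF_METHODS:
        problems.append("diff_method must be 'Ttest' or 'DESeq2'")
    if config.get("lib_type", "rf") not in ALLOWED_LIB_TYPES:
        problems.append("lib_type must be one of rf, fr, unstranded, single")
    for sample in config.get("Samples", []):
        if "name" not in sample or "condition" not in sample:
            problems.append(f"sample needs a name and a condition: {sample}")
    return problems


# Placeholder values only; key layout is an assumption based on the list above.
example_config = {
    "fastq_dir": "data/fastq",
    "nb_threads": 4,
    "kmer_length": 31,
    "diff_method": "DESeq2",
    "lib_type": "rf",
    "output_dir": "DEkupl_result",
    "tmp_dir": "./",
    "dekupl_counter": {"min_recurrence": 2, "min_recurrence_abundance": 5},
    "Ttest": {
        "condition": {"A": "control", "B": "treated"},
        "pvalue_threshold": 0.05,
        "log2fc_threshold": 1,
    },
    "Samples": [
        {"name": "sample1", "condition": "control"},
        {"name": "sample2", "condition": "control"},
        {"name": "sample3", "condition": "treated"},
        {"name": "sample4", "condition": "treated"},
    ],
}

if __name__ == "__main__":
    issues = check_config(example_config)
    if issues:
        raise SystemExit("\n".join(issues))
    # Written under a different name so an existing config.json is not overwritten.
    with open("config.example.json", "w") as handle:
        json.dump(example_config, handle, indent=2)
```

For a single-end run, the same dictionary would carry `lib_type`, `fragment_length` and `fragment_standard_deviation` as described in the list above.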
Notes :
The fastq files for single-end samples will be located using the following path : {fastq_dir}/{sample_name}.fastq.gz
If present, parameters r1_suffix and r2_suffix will be ignored.
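The file-naming rules for paired-end and single-end samples can be summarized in a few lines of code. This is a sketch under stated assumptions, not the pipeline's own logic: the function name and the `single_end` flag are hypothetical, and the default suffixes `_1.fastq.gz` and `_2.fastq.gz` are inferred from the `sample_name_{1,2}.fastq.gz` pattern given earlier.

```python
def expected_fastq_paths(fastq_dir, sample_name, single_end=False,
                         r1_suffix="_1.fastq.gz", r2_suffix="_2.fastq.gz"):
    """Return the FASTQ paths this README says a sample is read from.

    Hypothetical helper. The default suffixes are an assumption derived
    from the documented paired-end pattern.
    """
    if single_end:
        # Single-end: one file per sample; r1_suffix and r2_suffix are ignored.
        return [f"{fastq_dir}/{sample_name}.fastq.gz"]
    # Paired-end: one file per mate.
    return [
        f"{fastq_dir}/{sample_name}{r1_suffix}",
        f"{fastq_dir}/{sample_name}{r2_suffix}",
    ]


if __name__ == "__main__":
    for path in expected_fastq_paths("data/fastq", "sample1"):
        print(path)
```

Checking that each returned path exists before launching the pipeline is a cheap way to catch a misspelled sample name or suffix.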
The output directory of a DE-kupl run will have the following content:
├── {A}_vs_{B}_kmer_counts
│ ├── diff-counts.tsv.gz
│ ├── merged-diff-counts.tsv.gz
├── gene_expression
│ ├── {A}vs{B}-DEGs.tsv
├── kmer_counts
│ ├── normalization_factors.tsv
│ ├── raw-counts.tsv.gz
│ ├── noGENCODE-counts.tsv.gz
│ ├── {sample}.jf
│ ├── {sample}.txt.gz
│ ├── ...
├── metadata
│ ├── sample_conditions.tsv
│ ├── sample_conditions_full.tsv
The following table describes the output files produced by DE-kupl :
FileName | Description |
---|---|
`diff-counts.tsv.gz` | Contains k-mer counts from `noGENCODE-counts.tsv.gz` that have passed the differential testing. Output format is a TSV with the following columns: `kmer pvalue meanA meanB log2FC [SAMPLES]`. |
`merged-diff-counts.tsv.gz` | Contains assembled k-mers from `diff-counts.tsv.gz`. Output format is a TSV with the following columns: `nb_merged_kmers contig kmer pvalue meanA meanB log2FC [SAMPLES]`. |
`raw-counts.tsv.gz` | Contains raw k-mer counts of all libraries that have been filtered with the recurrence filters. |
`noGENCODE-counts.tsv.gz` | Contains k-mer counts from `raw-counts.tsv.gz` after removing the k-mers found in the reference transcriptome (GENCODE by default). |
`sample_conditions_full.tsv` | Tabulated file with sample names, conditions and normalization factors. `sample_conditions.tsv` is the sample |
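The column layout listed in the table can be read with the Python standard library. The sketch below is an illustration only: it assumes the file is gzip-compressed, tab-separated, and starts with a header line naming the columns as listed above (`kmer`, `pvalue`, `meanA`, `meanB`, `log2FC`, then one column per sample). The function name and the filtering threshold are hypothetical.

```python
import csv
import gzip


def read_diff_kmers(path, min_abs_log2fc=0.0):
    """Yield rows of a diff-counts style file as dictionaries.

    Assumes a header line is present. Rows whose absolute log2FC is below
    min_abs_log2fc are skipped. A non-numeric log2FC value raises ValueError.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if abs(float(row["log2FC"])) >= min_abs_log2fc:
                yield row


if __name__ == "__main__":
    # Path follows the output tree above, with hypothetical conditions A and B.
    path = "DEkupl_result/A_vs_B_kmer_counts/diff-counts.tsv.gz"
    for row in read_diff_kmers(path, min_abs_log2fc=2.0):
        print(row["kmer"], row["pvalue"], row["log2FC"])
```

Reading row by row keeps memory use flat, which matters because k-mer tables can be much larger than gene-level tables.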
If you are interested in running a DE-Kupl-style analysis on whole-genome data, i.e. without using a reference transcriptome, please use this branch.
- If new samples are added to the config.json, make sure to remove the `metadata` folder in order to force Snakemake to re-make all targets that depend on this file.
- Snakemake uses Rscript, not R. If an R module is not installed, type `which Rscript` and `which R` and make sure they point to the same installation of R.
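The `which Rscript` / `which R` comparison can also be scripted. The following is a rough heuristic, not an official check: it assumes that two executables belong to the same R installation when, after resolving symbolic links, they live in the same directory. The function name is hypothetical.

```python
import os
import shutil


def same_r_installation(rscript_path, r_path):
    """Heuristic: True if both executables resolve to the same directory."""
    if rscript_path is None or r_path is None:
        return False
    rscript_dir = os.path.dirname(os.path.realpath(rscript_path))
    r_dir = os.path.dirname(os.path.realpath(r_path))
    return rscript_dir == r_dir


if __name__ == "__main__":
    # shutil.which looks up the executable on PATH, like the `which` command.
    rscript = shutil.which("Rscript")
    r = shutil.which("R")
    print("Rscript:", rscript)
    print("R:", r)
    if not same_r_installation(rscript, r):
        print("Rscript and R may come from different installations.")
```

If the two differ, packages installed from an interactive R session may not be visible to the Rscript that Snakemake calls.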
- Create a dekupl binary with two commands:
  - `dekupl build_index {genome}`: This command will download reference files and create all indexes.
  - `dekupl run {dekupl_index} {config.yml} {output_dir}`: This command will run the dekupl pipeline.