Identify differentially expressed k-mers between RNA-Seq datasets

mikael-s/dekupl-run

 
 

dekupl-run

DE-kupl is a pipeline that finds differentially expressed k-mers between RNA-Seq datasets. It is released under the MIT License.

Dekupl-run handles the first part of the DE-kupl pipeline from raw FASTQ to the production of contigs from differentially expressed k-mers.

Dependencies

Before using Dekupl-run, install these dependencies:

  • Snakemake
  • jellyfish
  • pigz
  • CMake
  • boost
  • R:
    • DESeq2: open R and execute source("https://bioconductor.org/biocLite.R") followed by biocLite("DESeq2")
    • RColorBrewer
    • pheatmap
    • foreach
    • doParallel
  • Python:
    • rpy2 : pip3 install rpy2

Installation and usage

Either use the Docker container (updated daily, https://hub.docker.com/r/ebio/dekupl/), or:

  1. Clone this repository including submodules: git clone --recursive https://github.com/Transipedia/dekupl-run.git
  2. Install the dependencies listed above.
  3. Edit the config.json file to add the list of your samples, their conditions and the location of their FASTQ files. See the next section for a description of the parameters.
  4. Run the pipeline with the command snakemake -jNB_THREADS --resources ram=MAX_MEMORY -p. Replace NB_THREADS with the number of threads and MAX_MEMORY with the maximum memory (in megabytes) that DE-kupl may allocate.
  5. Once Dekupl-run has finished, the DE contigs it produces (under DEkupl_results/A_vs_B_kmer_counts/merged-diff-counts.tsv.gz) can be annotated using Dekupl-annotation.
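The command in step 4 can also be assembled programmatically, for example when launching the pipeline from a wrapper script. The following is a minimal sketch: the function name build_snakemake_command and the example values are hypothetical, and only the flags quoted in step 4 are used.

```python
import shlex


def build_snakemake_command(nb_threads: int, max_memory_mb: int) -> list:
    """Return the snakemake invocation from step 4 as an argument list.

    nb_threads replaces NB_THREADS and max_memory_mb replaces MAX_MEMORY
    (expressed in megabytes, as described in step 4).
    """
    if nb_threads < 1 or max_memory_mb < 1:
        raise ValueError("nb_threads and max_memory_mb must be positive")
    return [
        "snakemake",
        f"-j{nb_threads}",
        "--resources",
        f"ram={max_memory_mb}",
        "-p",
    ]


if __name__ == "__main__":
    # Example values only: 8 threads and 16000 MB of memory.
    cmd = build_snakemake_command(nb_threads=8, max_memory_mb=16000)
    print(shlex.join(cmd))
```

To actually start the run, the list could be passed to subprocess.run(cmd, check=True) from the root of the cloned repository.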

Configuration (config.json)

General configuration parameters

  • fastq_dir: Location of FASTQ files
  • nb_threads: Default number of threads to use (unless specified on the snakemake command line).
  • kmer_length: Length of k-mers (default: 31). This value should not exceed 32.
  • diff_method: Method used for differential testing (default: DESeq2). Possible choices are 'Ttest' which is fast and 'DESeq2' which is more sensitive but longer to run.
  • lib_type: Paired-end library type (default: rf). You can specify either rf for reverse-forward strand-specific libraries, fr for strand-specific forward-reverse, or unstranded for unstranded libraries.
  • output_dir: Location of DE-kupl results (default: DEkupl_result).
  • tmp_dir: Temporary directory to use (default: ./, i.e. the current directory)
  • r1_suffix: Suffix to use for the FASTQ with left mate. Set r2_suffix for the second FASTQ.
  • dekupl_counter:
    • min_recurrence: Minimum number of samples to support a k-mer
    • min_recurrence_abundance: Minimum abundance threshold for a k-mer to be counted by the recurrence filter.
  • Ttest:
    • condition: Specify A and B conditions.
    • pvalue_threshold: Adjusted p-value threshold below which a k-mer is considered DE. Only DE k-mers are selected for assembly.
    • log2fc_threshold: Minimum log2 fold change to consider a k-mer as DE.
  • Samples: An array of samples. Each sample is described by a name and a condition. The FASTQ files for a sample are located using the following path pattern: fastq_dir/sample_name_{1,2}.fastq.gz
  • transcript_fasta: The reference transcriptome to be used for masking. By default Dekupl-run uses the human GENCODE transcriptome for masking. To change this, add to the config.json file: "transcript_fasta": "my_transcriptome.fa"
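As an illustration of how these parameters fit together, the sketch below builds a configuration as a Python dictionary and serializes it to JSON. The parameter names come from the list above. The nesting, the lowercase "samples" key, the "name"/"condition" keys, the A/B layout under "condition" and all threshold values are assumptions made for the example, so compare the result with the config.json shipped in the repository before using it.

```python
import json

# Choices taken from the parameter descriptions above.
ALLOWED_DIFF_METHODS = {"Ttest", "DESeq2"}
ALLOWED_LIB_TYPES = {"rf", "fr", "unstranded", "single"}


def make_config(fastq_dir, samples, condition_a, condition_b,
                kmer_length=31, diff_method="DESeq2", lib_type="rf"):
    """Build a config dictionary; `samples` is a list of (name, condition) pairs."""
    if kmer_length > 32:
        raise ValueError("kmer_length should not exceed 32")
    if diff_method not in ALLOWED_DIFF_METHODS:
        raise ValueError(f"unknown diff_method: {diff_method}")
    if lib_type not in ALLOWED_LIB_TYPES:
        raise ValueError(f"unknown lib_type: {lib_type}")
    return {
        "fastq_dir": fastq_dir,
        "nb_threads": 4,                      # illustrative value
        "kmer_length": kmer_length,
        "diff_method": diff_method,
        "lib_type": lib_type,
        "output_dir": "DEkupl_result",
        "tmp_dir": "./",
        "dekupl_counter": {                   # illustrative thresholds
            "min_recurrence": 2,
            "min_recurrence_abundance": 5,
        },
        "Ttest": {                            # section name as listed above
            "condition": {"A": condition_a, "B": condition_b},  # assumed layout
            "pvalue_threshold": 0.05,         # illustrative
            "log2fc_threshold": 2,            # illustrative
        },
        # Key names "samples", "name" and "condition" are assumptions.
        "samples": [{"name": n, "condition": c} for n, c in samples],
    }


if __name__ == "__main__":
    config = make_config(
        "data",
        [("sample1", "control"), ("sample2", "treated")],
        condition_a="control",
        condition_b="treated",
    )
    print(json.dumps(config, indent=2))
```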

Configuration for single-end libraries

For single-end libraries, please specify the following parameters:

  • lib_type: Set lib_type to single for single-end strand-specific libraries, or to unstranded for single-end unstranded libraries.
  • fragment_length: The estimated fragment length (necessary for kallisto quantification). Default value is 200.
  • fragment_standard_deviation: The estimated standard deviation of the fragment length (necessary for kallisto quantification). Default value is 30.

Note: The FASTQ files for single-end samples are located using the following path: {fastq_dir}/{sample_name}.fastq.gz. If present, the parameters r1_suffix and r2_suffix are ignored.
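To check that files are named as the pipeline expects before starting a run, the two path patterns described in this README can be written out as a small helper. This is a sketch under assumptions: the function name expected_fastq_paths and the single_end flag are hypothetical, and the default suffixes _1.fastq.gz and _2.fastq.gz are inferred from the paired-end pattern fastq_dir/sample_name_{1,2}.fastq.gz.

```python
from pathlib import PurePosixPath


def expected_fastq_paths(fastq_dir, sample_name, single_end=False,
                         r1_suffix="_1.fastq.gz", r2_suffix="_2.fastq.gz"):
    """Return the FASTQ path(s) expected for one sample.

    Paired-end: {fastq_dir}/{sample_name}{r1_suffix} and {r2_suffix}.
    Single-end: {fastq_dir}/{sample_name}.fastq.gz (suffixes are ignored).
    """
    base = PurePosixPath(fastq_dir)
    if single_end:
        return [str(base / f"{sample_name}.fastq.gz")]
    return [
        str(base / f"{sample_name}{r1_suffix}"),
        str(base / f"{sample_name}{r2_suffix}"),
    ]


if __name__ == "__main__":
    for path in expected_fastq_paths("data", "sample1"):
        print(path)
    for path in expected_fastq_paths("data", "sample1", single_end=True):
        print(path)
```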

Output files

The output directory of a DE-kupl run has the following content:

├── {A}_vs_{B}_kmer_counts
│   ├── diff-counts.tsv.gz
│   ├── merged-diff-counts.tsv.gz
├── gene_expression
│   ├── {A}vs{B}-DEGs.tsv
├── kmer_counts
│   ├── normalization_factors.tsv
│   ├── raw-counts.tsv.gz
│   ├── noGENCODE-counts.tsv.gz
│   ├── {sample}.jf
│   ├── {sample}.txt.gz
│   ├── ...
├── metadata
│   ├── sample_conditions.tsv
│   ├── sample_conditions_full.tsv

The following table describes the output files produced by DE-kupl:

| File name | Description |
| --- | --- |
| diff-counts.tsv.gz | Contains the k-mer counts from noGENCODE-counts.tsv.gz that have passed the differential testing. Output format is a tsv with the following columns: kmer pvalue meanA meanB log2FC [SAMPLES]. |
| merged-diff-counts.tsv.gz | Contains the assembled k-mers from diff-counts.tsv.gz. Output format is a tsv with the following columns: nb_merged_kmers contig kmer pvalue meanA meanB log2FC [SAMPLES]. |
| raw-counts.tsv.gz | Contains the raw k-mer counts of all libraries that have passed the recurrence filters. |
| noGENCODE-counts.tsv.gz | Contains the k-mer counts from raw-counts.tsv.gz after removal of the k-mers found in the reference transcriptome (GENCODE by default). |
| sample_conditions_full.tsv | Tab-delimited file with sample names, conditions and normalization factors. sample_conditions.tsv is the sample |
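The contig table can be loaded with the Python standard library for downstream inspection. The sketch below assumes that merged-diff-counts.tsv.gz starts with a header row using the column names listed above; the function name read_de_contigs and the default thresholds are hypothetical. The file is already filtered by the pipeline, so the thresholds here only apply a stricter, post hoc selection.

```python
import csv
import gzip


def read_de_contigs(path, max_pvalue=0.05, min_abs_log2fc=2.0):
    """Return rows (as dicts) whose pvalue and |log2FC| pass both thresholds.

    Assumes a tab-separated, gzip-compressed file with a header row that
    contains at least the columns 'contig', 'pvalue' and 'log2FC'.
    """
    selected = []
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            pvalue = float(row["pvalue"])
            log2fc = float(row["log2FC"])
            if pvalue <= max_pvalue and abs(log2fc) >= min_abs_log2fc:
                selected.append(row)
    return selected


if __name__ == "__main__":
    import sys

    # Usage: python read_contigs.py path/to/merged-diff-counts.tsv.gz
    for row in read_de_contigs(sys.argv[1]):
        print(row["contig"], row["pvalue"], row["log2FC"], sep="\t")
```

Values are kept as strings in the returned dictionaries, so per-sample count columns need an explicit float() conversion before any arithmetic.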

Whole-genome data

If you are interested in running a DE-Kupl-style analysis on whole-genome data, i.e. without using a reference transcriptome, please use this branch.

FAQ

  • If new samples are added to the config.json, make sure to remove the metadata folder in order to force Snakemake to re-make all targets that depend on it.
  • Snakemake uses Rscript, not R. If an R package appears not to be installed, type which Rscript and which R and make sure they point to the same installation of R.
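The Rscript check from the FAQ can be scripted. The sketch below uses a simple heuristic that is an assumption, not part of the pipeline: after resolving symlinks, both executables should live in the same directory. The function name same_r_installation is hypothetical.

```python
import os
import shutil


def same_r_installation(r_path, rscript_path):
    """Heuristic check that R and Rscript come from the same installation.

    Both paths are resolved through symlinks and their parent directories
    are compared. Returns False if either executable was not found.
    """
    if r_path is None or rscript_path is None:
        return False
    r_dir = os.path.dirname(os.path.realpath(r_path))
    rscript_dir = os.path.dirname(os.path.realpath(rscript_path))
    return r_dir == rscript_dir


if __name__ == "__main__":
    # shutil.which mirrors the shell's `which` lookup on the current PATH.
    r_path = shutil.which("R")
    rscript_path = shutil.which("Rscript")
    print("R:", r_path)
    print("Rscript:", rscript_path)
    print("same installation (heuristic):", same_r_installation(r_path, rscript_path))
```

A False result is only a hint to look closer: some installations place wrappers for R and Rscript in different directories even though they share one library.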

TODO

  • Create a dekupl binary with two commands :
    • dekupl build_index {genome}: This command will download reference files and create all indexes
    • dekupl run {dekupl_index} {config.yml} {output_dir}: This command will run the dekupl pipeline

