Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fixing workflow for single-end sequencing #70

Conversation

Wolfffff
Copy link
Contributor

@Wolfffff Wolfffff commented Nov 17, 2023

Summary

This PR fixes minor issues stopping the workflow from working with SE sequences.

Bugs Fixed

  • Fixes path in get_fq to appropriately look in results/trimmed/ instead of trimmed/
  • Changes cutadapt/se wrapper to call adapters not adapters_r1. While the wrapper (found here) seems to want users to separate adapter args from extra args, they appear to be equivalent. This is how the PE workflow works and verifying the STAR commands suggests that this is fine.

@Wolfffff Wolfffff marked this pull request as ready for review November 17, 2023 00:15
@Wolfffff Wolfffff changed the title Fixing workflow for single-end sequencing fix: Fixing workflow for single-end sequencing Nov 17, 2023
@dlaehnemann
Copy link
Contributor

Many thanks for looking into this and providing the fix!

Would it be OK if I hijacked this pull request to improve the automatic testing of the workflow? I'm currently looking into improving the testing data used in multiple workflows, basically looking for real Sacharomyces cerevisiae datasets, as this is a much smaller genome, but still has most of the annotation resources commonly used. I would then try to add one single-end dataset here as well, so that your changes actually get run through a proper test (and we catch any regressions in the future!).

To nevertheless use your fixes already, you can point your snakemake module import to your fork and this branch (or to your latest commit), so it would be something like:

module rna_seq_star_deseq2:
    snakefile:
        github("Wolfffff/rna-seq-star-deseq2", path="workflow/Snakefile", branch="fixing-single-end-sequencing-workflow")

This will use the latest commit on that branch. If you want to fix it to a commit, you could instead change the github() function call to:

        github("Wolfffff/rna-seq-star-deseq2", path="workflow/Snakefile", commit="8b85d76419742a3998e715fd44fd44bb04e1d5f9")

Hope this already helps, and hope we can use this opportunity to improve testing overall!

@Wolfffff
Copy link
Contributor Author

Of course! Feel free to take over this PR for testing!

One other thing to note as far as getting it to be more robust would be adding r-base and icu to the deseq.yaml to avoid relying on having icu and R. The icu fix is most likely not regularly required but having a r-base guarantees we're not relying on the system already having R.

@dlaehnemann
Copy link
Contributor

r-base should always be a requirement to bioconductor-deseq2, so this should not need to be specified explicitly. You can see this for the currently used version of DESeq2 here:
https://github.com/bioconda/bioconda-recipes/blob/56a92812e947276de90f72139a596ff313c13a64/recipes/bioconductor-deseq2/meta.yaml#L32

For icu, do you know which of the packages in the deseq2.yaml file needs this to work properly:
https://github.com/snakemake-workflows/rna-seq-star-deseq2/blob/846693fa78a923c12e397e89c447816c70de097b/workflow/envs/deseq2.yaml

Because then we should fix that package's dependencies to include icu instead of artificially including it here.

@Wolfffff
Copy link
Contributor Author

Awesome! This may be instance specific but I hit problems with bioconductor-deseq2 not downloading R and using the base R on my system.

biomaRt requires RCurl which requires icu. When I used the install directly, conda couldn't resolve libicui18n.so.73 and the install of icu at the env level fixed it.

@dlaehnemann
Copy link
Contributor

Hmm, same for icu and bioconductor-biomart, even though that's a bit harder to find. You can search the tree of dependencies that are pulled in by a conda package with mamba repoquery depends, so with a mamba installation you can do this:

mamba repoquery depends --tree -c conda-forge -c bioconda bioconductor-biomart

This brings up a clear icu dependency fairly quickly (I cut the output to its first occurrence, icu comes up multiple times and the output is A LOT longer):

bioconductor-biomart[2.56.1]
  ├─ r-stringr[1.1.0]
  │  ├─ r-magrittr[1.5]
  │  │  └─ r-base[3.3.2]
  │  │     ├─ libgcc[4.8.4]
  │  │     │  ├─ isl[0.16.1]
  │  │     │  │  └─ gmp[6.1.0]
  │  │     │  ├─ cloog[0.18.0]
  │  │     │  │  ├─ gmp >>> NOT FOUND <<<
  │  │     │  │  └─ isl >>> NOT FOUND <<<
  │  │     │  ├─ gmp[6.3.0]
  │  │     │  │  ├─ libgcc-ng[13.2.0]
  │  │     │  │  │  ├─ _openmp_mutex[4.5]
  │  │     │  │  │  │  ├─ _libgcc_mutex[0.1]
  │  │     │  │  │  │  └─ llvm-openmp[17.0.5]
  │  │     │  │  │  │     ├─ libzlib[1.2.13]
  │  │     │  │  │  │     └─ zstd[1.5.5]
  │  │     │  │  │  │        ├─ libzlib already visited
  │  │     │  │  │  │        └─ libstdcxx-ng[13.2.0]
  │  │     │  │  │  └─ _libgcc_mutex already visited
  │  │     │  │  └─ libstdcxx-ng already visited
  │  │     │  ├─ mpc[1.3.1]
  │  │     │  │  ├─ gmp already visited
  │  │     │  │  ├─ libgcc-ng already visited
  │  │     │  │  └─ mpfr[4.2.1]
  │  │     │  │     ├─ gmp already visited
  │  │     │  │     └─ libgcc-ng already visited
  │  │     │  └─ mpfr already visited
  │  │     ├─ curl[7.52.1]
  │  │     │  ├─ zlib[1.2.13]
  │  │     │  │  ├─ libgcc-ng already visited
  │  │     │  │  └─ libzlib already visited
  │  │     │  └─ openssl[1.0.2u]
  │  │     │     ├─ libgcc-ng already visited
  │  │     │     └─ ca-certificates[2016.2.28]
  │  │     ├─ gsl[1.16]
  │  │     ├─ pcre[8.37]
  │  │     ├─ jpeg[9e]
  │  │     │  └─ libgcc-ng already visited
  │  │     ├─ libxml2[2.9.14]
  │  │     │  ├─ libgcc-ng already visited
  │  │     │  ├─ libzlib already visited
  │  │     │  ├─ zlib[1.2.11]
  │  │     │  ├─ xz[5.2.6]
  │  │     │  │  └─ libgcc-ng already visited
  │  │     │  ├─ libiconv[1.17]
  │  │     │  │  └─ libgcc-ng already visited
  │  │     │  └─ icu[69.1]
  │  │     │     ├─ libgcc-ng already visited
  │  │     │     └─ libstdcxx-ng already visited
  [...]

@Wolfffff
Copy link
Contributor Author

Hm... then leaving as is makes sense. I'm surprised, given this, that it doesn't run through directly out of the box on our cluster.

@dlaehnemann
Copy link
Contributor

So these particular issues with icu and r-base do sound like something instance specific. There are two things that might help:

  1. Use mamba as the drop-in replacement for all things conda: https://github.com/conda-forge/miniforge#install
    This will also speed up your dependency resolution during environment creation / software installation considerably.
  2. Ensure strict channel priorities, for example by following the setup guidelines for bioconda: https://bioconda.github.io/#usage

I hope with those you won't have similar issues, even though I don't currently understand where exactly your problem(s) is/are coming from...

@Wolfffff
Copy link
Contributor Author

Ah no worries! Those are both setup on our instance -- the workaround was just to install icu directly or update the environments as noted above. It's working -- just required those fixes on my end.

@dlaehnemann
Copy link
Contributor

I actually decided to do the improvement of the testing data in a separate pull request. So here I'm just fixing the formatting and hope that the existing tests pass.

Copy link
Contributor

@dlaehnemann dlaehnemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, many thanks again.

@dlaehnemann dlaehnemann merged commit 3380a05 into snakemake-workflows:master Nov 21, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants