ENSEMBL-SEQUENCE does not work for all species #3070

Open
lczech opened this issue Jul 12, 2024 · 4 comments
Labels
bug Something isn't working


lczech commented Jul 12, 2024

Snakemake version
Snakemake: 8.15.2
Wrapper: "v3.13.6/bio/reference/ensembl-sequence"

Describe the bug
The path for downloading has a hard-coded structure in the wrapper:

spec = ("{build}" if int(release) > 75 else "{build}.{release}").format(
    build=build, release=release
)
url_prefix = f"{url}/{branch}release-{release}/fasta/{species}/{datatype}/{species.capitalize()}.{spec}"

The check against 75 is hard-coded. For some species, however, the path structure differs: A. thaliana, for instance, is currently in plants release 59, yet its filename does not include the release number that this check adds to the spec part for releases up to 75.

The correct file name is

Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

but the wrapper only checks for

Arabidopsis_thaliana.TAIR10.59.[dna.primary_assembly.fa.gz|dna.toplevel.fa.gz]

which has the additional 59 that should not be there. Hence, the download fails. I think a simple fix is to avoid the hard-coded 75, and instead check both variants of the path.
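A minimal sketch of that idea, assuming hypothetical helper names (`candidate_specs`, `candidate_urls`) and a hypothetical `suffix` argument that are not part of the wrapper: build both filename variants and let the caller try them in order, keeping the existing > 75 rule only as a hint for which one to try first.

```python
def candidate_specs(build: str, release: str) -> list[str]:
    """Return both 'spec' variants, most likely one first."""
    without_release = build
    with_release = f"{build}.{release}"
    # The existing rule is kept only to decide the order of the attempts.
    if int(release) > 75:
        return [without_release, with_release]
    return [with_release, without_release]


def candidate_urls(url, branch, release, species, datatype, build, suffix):
    """Return one candidate URL per spec variant (suffix is hypothetical)."""
    prefix = (
        f"{url}/{branch}release-{release}/fasta/{species}/{datatype}/"
        f"{species.capitalize()}"
    )
    return [f"{prefix}.{spec}.{suffix}" for spec in candidate_specs(build, release)]
```

For A. thaliana in release 59, the second candidate is the name without the release number, so a loop over the candidates would still find the file.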

@lczech lczech added the bug Something isn't working label Jul 12, 2024

lczech commented Jul 12, 2024

Also, I noticed the following in the wrapper:

    try:
        shell("curl -sSf {url} > /dev/null 2> /dev/null")
    except sp.CalledProcessError:
        continue

    shell("(curl -L {url} | gzip -d >> {snakemake.output[0]}) {log}")

If I understand this correctly, the file is downloaded twice: once by the check and once for the actual download. Is that right? Also, the output is always decompressed, which might be confusing if the output file is specified as .fasta.gz. Not sure if either needs immediate fixing, but I wanted to bring it up.
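Both points could in principle be handled in one pass. The sketch below is an assumption about how that might look, not the wrapper's code: `store_download` is a hypothetical helper that takes the bytes of a single request and decompresses them only when the requested output name does not end in `.gz`.

```python
import gzip
from pathlib import Path


def store_download(payload: bytes, output: str) -> None:
    """Append one gzip-compressed download to `output`.

    The payload stays compressed if the output name ends in .gz;
    otherwise it is decompressed before writing.
    """
    data = payload if output.endswith(".gz") else gzip.decompress(payload)
    # Append mode mirrors the '>>' in the shell command; concatenated
    # gzip members still form a valid gzip file.
    with Path(output).open("ab") as handle:
        handle.write(data)
```

The payload would come from one request, for example `urllib.request.urlopen(url).read()`, which raises `urllib.error.HTTPError` for a missing file, so a separate probe request would not be needed.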


lczech commented Jul 12, 2024

Another related issue, which I hope is okay to report here as well: the same underlying problem occurs in the variation wrapper "v3.13.6/bio/reference/ensembl-variation".

There, the path that I need to specify is

https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-59/plants/variation/vcf/arabidopsis_thaliana/arabidopsis_thaliana.vcf.gz

See https://plants.ensembl.org/info/data/ftp/index.html for the table where this URL is from.

However, for the wrapper to correctly assemble the URL, I need to leave out the branch specifier, yet a release < 98 causes an error when no branch is given.

Furthermore, the wrapper automatically capitalizes the species name, which would also lead to an error here.
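For illustration, the URL above could be assembled as follows. This is a sketch with a hypothetical function and hypothetical parameter names, not the variation wrapper's logic; it assumes the Ensembl Genomes layout shown in this comment, with the division placed after the release and the species name kept lowercase.

```python
def variation_url(
    release: str,
    division: str,
    species: str,
    base: str = "https://ftp.ebi.ac.uk/ensemblgenomes/pub",
) -> str:
    """Build a VCF URL for the Ensembl Genomes layout (assumed)."""
    species = species.lower()  # this layout uses the lowercase name throughout
    return (
        f"{base}/release-{release}/{division}/variation/vcf/"
        f"{species}/{species}.vcf.gz"
    )
```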


lczech commented Jul 12, 2024

More updates, hope that's okay. As a workaround, instead of trying to assemble the URL within the wrapper, I currently offer an alternative rule that downloads directly from a user-provided URL. That might be an easy solution here as well: offer an optional param full-url or the like that, if given, skips all the URL assembly steps.
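A sketch of how such an override might be wired in, under stated assumptions: `resolve_url` and the `assemble` callback are hypothetical, and the param is spelled `full_url` here because a hyphenated name is awkward to use as a key in Python code.

```python
def resolve_url(params: dict, assemble) -> str:
    """Prefer a user-provided URL; otherwise fall back to URL assembly."""
    full_url = params.get("full_url")  # hypothetical optional param
    if full_url:
        return full_url
    # No override given: use the existing assembly logic (passed in here).
    return assemble(params)
```

Existing rules that do not set the param would keep their current behavior, since the assembly path is only skipped when a URL is given.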

fgvieira (Collaborator) commented

Can you make a PR with your suggested changes?
