Add function to download the fasta file when retrieving the genome from FTP #107

Open
tcezard opened this issue May 24, 2023 · 3 comments · May be fixed by #110

Comments

@tcezard
Member

tcezard commented May 24, 2023

Currently, downloadAssemblyReport only accesses and downloads the assembly report from the FTP. We should add a function to retrieve the actual assembly that lives alongside the report.

@waterflow80
Collaborator

Hi @tcezard
Can you give me some guidance on where to find the actual assembly in FASTA format?
For example, the ENA database home tree looks like this:
Screenshot from 2023-05-24 17-04-44

And I can't figure out the path for the actual assembly.

@tcezard
Member Author

tcezard commented May 24, 2023

You can read up on the structure of the ENA FTP here.
That said, you can look at how the code browses the ENA FTP here and the GenBank FTP here.

@waterflow80
Collaborator

@tcezard @apriltuesday @sundarvenkata-EBI @nitin-ebi
I've tried to reverse-engineer the contig-alias logic for fetching and downloading the assembly report data.
I've managed to reuse that logic to download the fna (FASTA) file of an assembly, given its accession (using the NCBI database).

  • The process is as follows (some intermediate classes may have been omitted):

Screenshot from 2023-05-30 11-10-33
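
The path-resolution step of that process can be sketched roughly as below. This is only an illustration, not the code in the linked pull request: the class and method names (`NcbiAssemblyPaths`, `accessionDirectory`, `fastaFileName`) are hypothetical, and it assumes the usual NCBI genomes layout in which the FASTA file sits in the same directory as the report and differs only by its suffix (`_genomic.fna.gz` instead of `_assembly_report.txt`).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of contig-alias.
final class NcbiAssemblyPaths {
    // Matches accessions such as GCA_000001765.2 (nine digits, then a version).
    private static final Pattern ACCESSION =
            Pattern.compile("^(GC[AF])_(\\d{3})(\\d{3})(\\d{3})\\.\\d+$");

    // Assumed file suffixes on the NCBI genomes FTP.
    private static final String REPORT_SUFFIX = "_assembly_report.txt";
    private static final String FASTA_SUFFIX = "_genomic.fna.gz";

    // Directory (relative to the FTP root) that contains the versioned
    // assembly folders for the given accession.
    static String accessionDirectory(String accession) {
        Matcher m = ACCESSION.matcher(accession);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an assembly accession: " + accession);
        }
        return String.format("/genomes/all/%s/%s/%s/%s/",
                m.group(1), m.group(2), m.group(3), m.group(4));
    }

    // Derives the FASTA file name from the report file name found in the
    // same directory, by swapping the suffix.
    static String fastaFileName(String reportFileName) {
        if (!reportFileName.endsWith(REPORT_SUFFIX)) {
            throw new IllegalArgumentException("Not an assembly report: " + reportFileName);
        }
        String prefix = reportFileName.substring(
                0, reportFileName.length() - REPORT_SUFFIX.length());
        return prefix + FASTA_SUFFIX;
    }
}
```

With this approach, once the existing code has located the report file for an accession, the FASTA file name follows from it without a second directory search; whether that holds for every assembly on the FTP would still need checking.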

To do so, I had to add more classes to support the assembly's sequence file while trying not to disturb the existing code and design.

  • The classes that I've created so far are:

Screenshot from 2023-05-30 11-27-21

  • Note:
    • The naming of the classes and packages may not be very accurate or completely relevant, and it may be corrected later.
    • The AssemblySequenceEntity is not yet complete. It has been created for testing purposes only. It will be completed once I understand how to parse and interpret the information inside the fna/fasta file.

Parsing the file and retrieving information from it is still pending, because I still have a question about the file and the data it contains:

  1. How can I map (match up) the data in the assembly report with the data in the assembly sequence file (in fna/fasta format)? For example, the data below all refers to the same insdc_accession: GCA_000001765.2
  • Assembly table:
    assembly_table

  • Chromosome table:
    Screenshot from 2023-05-30 11-39-06

  • Fna/Fasta file (a tiny portion of it):
    Screenshot from 2023-05-30 11-40-04
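
One possible way to do this mapping, sketched under assumptions: in NCBI FASTA files the first whitespace-delimited token after `>` on each header line is the sequence accession, and in the tab-separated assembly report the fifth column (`GenBank-Accn`) holds the same accession for GenBank (GCA) assemblies. The class and method names below (`FastaReportMapper`, `headerAccession`, `indexByGenbankAccession`) are hypothetical, and the accessions used when trying it out should be treated as placeholders.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of contig-alias.
final class FastaReportMapper {
    // Zero-based index of the GenBank-Accn column in the assembly report
    // (assumed to be the fifth column).
    private static final int GENBANK_ACCN_COLUMN = 4;

    // Extracts the accession from a FASTA header line,
    // i.e. the first token after '>'.
    static String headerAccession(String headerLine) {
        if (headerLine == null || !headerLine.startsWith(">")) {
            throw new IllegalArgumentException("Not a FASTA header: " + headerLine);
        }
        String body = headerLine.substring(1).trim();
        int end = 0;
        while (end < body.length() && !Character.isWhitespace(body.charAt(end))) {
            end++;
        }
        return body.substring(0, end);
    }

    // Indexes assembly report rows by GenBank accession, skipping the
    // '#'-prefixed comment and header lines.
    static Map<String, String[]> indexByGenbankAccession(List<String> reportLines) {
        Map<String, String[]> rows = new HashMap<>();
        for (String line : reportLines) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] columns = line.split("\t", -1);
            if (columns.length > GENBANK_ACCN_COLUMN) {
                rows.put(columns[GENBANK_ACCN_COLUMN], columns);
            }
        }
        return rows;
    }
}
```

Each sequence in the FASTA file could then be matched to its report row (and from there to the corresponding chromosome record) by looking up `headerAccession(header)` in the index.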

  • Observation:

    • Working in the same repository as contig-alias will require adding more classes and packages that might have independent logic. On the other hand, reusing some of contig-alias's functionality for fetching data, such as the assembly report, will save us a lot of time that would otherwise go into rewriting the same code.
    • So I want to know whether building our code on top of the contig-alias repo would have any negative effects.

@waterflow80 linked a pull request on May 30, 2023 that will close this issue