Adding a 'database' for assembly accessions to map saved seqCol objects #13
Comments
However, this is useful only when creating new seqCol objects and populating our database.
We should be using the assembly accession to check that an assembly has already been ingested.
@tcezard
I've said above that we should be storing the assembly accession, but we should also try to think about other potential sources of sequence collections and how they would impact our design.
At the moment the main identifier is the assembly accession, since it is what we use as the ingestion parameter, so it makes sense to store it. However, it won't be enough to know exactly where the sequence collection comes from. For this I think storing the source URL would be more accurate. I also think that in the future we will want to be able to store sequence collections that are not linked to an INSDC accession. These will most likely have a URL, and we will also need to find some form of identifier to associate with them. To enable this we should choose generic column names.
All these points make me think we are building a set of metadata associated with the sequence collection, which should be stored in a separate table from the one we already use for storing the digests and JSON objects. There is the question of the same metadata being used for multiple collections because they have different naming conventions, and the possibility that the same sequence collection could exist in different sources. For this, I'm thinking we might need a many-to-many relationship between a sequence collection metadata set and a sequence collection.
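The many-to-many idea could be sketched roughly as below, using SQLite purely for illustration. Every table and column name here (`seqcol`, `seqcol_metadata`, `seqcol_metadata_link`, `source_id`, `source_url`, `naming_convention`) is a hypothetical placeholder, not the project's actual schema, and the accessions and URLs in the usage note are made up.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE seqcol (
    digest            TEXT PRIMARY KEY,
    naming_convention TEXT NOT NULL,
    object_json       TEXT NOT NULL
);
CREATE TABLE seqcol_metadata (
    id         INTEGER PRIMARY KEY,
    source_id  TEXT NOT NULL,   -- generic identifier, e.g. an assembly accession
    source_url TEXT NOT NULL,   -- where the collection was fetched from
    UNIQUE (source_id, source_url)
);
CREATE TABLE seqcol_metadata_link (
    metadata_id INTEGER NOT NULL REFERENCES seqcol_metadata(id),
    digest      TEXT    NOT NULL REFERENCES seqcol(digest),
    PRIMARY KEY (metadata_id, digest)
);
"""


def open_db():
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def link(conn, source_id, source_url, digest, naming_convention, object_json="{}"):
    """Store a collection and its metadata set (if new) and link the two."""
    conn.execute(
        "INSERT OR IGNORE INTO seqcol VALUES (?, ?, ?)",
        (digest, naming_convention, object_json),
    )
    conn.execute(
        "INSERT OR IGNORE INTO seqcol_metadata (source_id, source_url) VALUES (?, ?)",
        (source_id, source_url),
    )
    (metadata_id,) = conn.execute(
        "SELECT id FROM seqcol_metadata WHERE source_id = ? AND source_url = ?",
        (source_id, source_url),
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO seqcol_metadata_link VALUES (?, ?)",
        (metadata_id, digest),
    )


def digests_for_source(conn, source_id):
    """Return the sorted digests of all collections linked to a source identifier."""
    rows = conn.execute(
        """
        SELECT DISTINCT l.digest
        FROM seqcol_metadata m
        JOIN seqcol_metadata_link l ON l.metadata_id = m.id
        WHERE m.source_id = ?
        ORDER BY l.digest
        """,
        (source_id,),
    )
    return [row[0] for row in rows]
```

With this layout, one metadata set (say a placeholder source `ACC_1` at `https://example.org/a`) can point at several collections, one per naming convention, and a single collection digest can be linked from several sources without duplicating the stored JSON object.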
Concretely there are 3 pieces of metadata that I think are very relevant for a sequence collection:
There could also be some other information we might want to store like
When fetching and inserting seqCol objects, we test whether the seqCol's digest is already saved in the database; if it is, we don't proceed with the saving. But at that point we have already downloaded both the assembly report and the sequences FASTA file and processed them to create the seqCol object, which can be a lot of work for the server, especially when the sequences are very large.
So we could have a database (or a file) that maps assembly accessions to the saved seqCol objects. This would save a huge amount of time, because we would check for the existence of seqCol objects corresponding to that accession before downloading and processing anything.
Note (technical detail): in order to abort the fetch, we should make sure that the seqCol objects saved in the db for that accession cover all of the naming conventions that exist in the assembly report.
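A minimal sketch of that pre-check, assuming the accession-to-seqCol mapping can be read as a plain dictionary from accession to the set of naming conventions already stored. The function names, the dictionary shape and the convention labels are hypothetical; how the expected conventions are obtained before the heavy download is left open here.

```python
def missing_conventions(stored_index, accession, expected_conventions):
    """Return the naming conventions not yet stored for this accession.

    stored_index: dict mapping accession -> iterable of stored naming conventions
                  (hypothetical shape for the proposed accession mapping).
    """
    stored = set(stored_index.get(accession, ()))
    return set(expected_conventions) - stored


def should_fetch(stored_index, accession, expected_conventions):
    """Decide whether the download and processing step is still needed."""
    expected = set(expected_conventions)
    if not expected:
        # Nothing is known about the expected conventions, so we cannot
        # safely abort: fall back to fetching.
        return True
    return bool(missing_conventions(stored_index, accession, expected))
```

For example, if the mapping holds `{"ACC_1": {"UCSC", "GENBANK"}}` (placeholder values), a request for `ACC_1` expecting only those two conventions would be skipped, while one that also expects `REFSEQ` would still trigger the fetch.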