better version/build management #8

Open
stephenturner opened this issue Jan 19, 2017 · 3 comments

Comments

@stephenturner
Owner

With the changes in #6 it's much easier to recreate annotation tables. The files are named after a build, e.g. galgal5, but which version/build is actually used depends on what's current in Ensembl. For example, when I first built this package, chicken was on galgal4. I had to manually update the filenames, and I probably did the wrong thing by just deleting (rather than deprecating) the old datasets. Maybe that's okay since they're still versioned in a release. Not sure how best to handle these issues.

@aaronwolen
Contributor

🤔...

One potential solution: name recipes and tables based on species, so hsapiens.yml would create a table called hsapiens that includes annotations for whatever the most recent build/version happens to be.

Previous versions could be specified by appending the version number. Most users will (probably) want the most up-to-date info and would only need to type hsapiens; users with more specific needs would have to type something like hsapiens_GRCh37.
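
A minimal sketch of that naming scheme, written in Python purely for illustration (the package itself is R). The `TABLES` and `LATEST` registries and the `resolve` helper are hypothetical names, not part of annotables:

```python
# Hypothetical registry: older builds stay reachable under explicit,
# versioned names, and the newest build gets a versioned name too.
TABLES = {
    "hsapiens_GRCh38": "annotations for human, GRCh38",
    "hsapiens_GRCh37": "annotations for human, GRCh37",
}

# Species alias -> explicit table name. This mapping is the only thing
# that has to change when a new build becomes current.
LATEST = {"hsapiens": "hsapiens_GRCh38"}


def resolve(name: str) -> str:
    """Return the explicit table name for an alias or a versioned name."""
    explicit = LATEST.get(name, name)
    if explicit not in TABLES:
        raise KeyError(f"unknown annotation table: {name}")
    return explicit
```

Under these assumptions, `resolve("hsapiens")` follows the alias to the current build, while `resolve("hsapiens_GRCh37")` returns the pinned name unchanged.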

What's your opinion on providing previous genome versions?

We could maintain recipes for older builds and provide a function that allows users to build them locally. That way they're still easily accessible for reproducibility purposes without causing the package size to explode.

@stephenturner
Owner Author

I do think there's a need to be able to maintain or recreate older versions. I run a core facility, and I did analyses for folks years ago using, e.g., Galgal4; if I created or recreated the data now, it would be galgal5. Also, for human specifically, lots of folks (me included) are still using GRCh37.

There might be a few ways to manage this. I think you'd need to know which archive version of ensembl you'd need to go after to get the build you're interested in. Also, maybe there's some way to retrieve and record this information from the biomart query.
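
One way to picture recording that information is a small provenance record stored next to each table. This is only a sketch in Python for illustration (the package itself is R); `BuildRecord`, `RECORDS`, and `release_for` are hypothetical names, and the single entry relies on release 75 being the last Ensembl release built on GRCh37:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildRecord:
    species: str          # e.g. "hsapiens"
    build: str            # genome build, e.g. "GRCh37"
    ensembl_release: int  # Ensembl release the table was built from


# Hypothetical log of what each table was built from. Further entries
# would be appended whenever a table is created or recreated.
RECORDS = [BuildRecord("hsapiens", "GRCh37", 75)]


def release_for(species: str, build: str) -> int:
    """Look up which Ensembl release to query to recreate a given build."""
    for rec in RECORDS:
        if rec.species == species and rec.build.lower() == build.lower():
            return rec.ensembl_release
    raise LookupError(f"no recorded release for {species} {build}")
```

With a record like this saved at build time, recreating an old table would start from a lookup rather than from memory of which archive was current.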

I do like the idea of just typing hsapiens... I'm sure there's a way to "alias" different names to the same dataset. I'm not very experienced with R data package creation. This is my first and only one.

@aaronwolen
Contributor

This is a good point. Attaching GRCh38 data to an object called hsapiens would probably violate user assumptions. Perhaps it's better to be more explicit and stick to naming objects after the relevant genome version?

I'm also in a bioinformatics core and frequently switch between projects that require different genomes/builds, so I loved the idea of annotables. It can be a real time saver!
