Tools for data/results provenance #5

Open
MilesMcBain opened this issue Mar 17, 2017 · 2 comments

MilesMcBain commented Mar 17, 2017

I recently had the pleasure of using R as part of a team in a data science project. Despite the best reproducibility intentions, we ended up in a mighty tangle with dataset versions, versions of modelling results, and the matching up of the two.

It got me thinking about the issue of provenance and the tooling in R. I'd be keen to work on any of the following:

  • An API for dataset versioning from R, e.g. using dat or zenodo, potentially building out existing rOpenSci packages.
  • A tool for journaling and versioning some kind of execution record that can map data, code, and results.
    • Tools for creating useful summaries of this information.
  • A tool for conditionally executing code based on successful validation with a repository identifier.
  • Anything else that fits into this broad category.

A much more long-winded proposal that motivates all of these is available here: https://github.com/MilesMcBain/journalr/blob/master/Journalling_tool_proposal.Rmd
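
As a rough illustration of the journaling and conditional-execution ideas in the list above, here is a minimal sketch in base R. Everything in it is an assumption made for illustration: `object_checksum()` and `run_journaled()` are hypothetical names, not functions from journalr or any rOpenSci package, and a locally computed MD5 checksum stands in for a repository identifier of the kind dat or zenodo might provide.

```r
# Hypothetical helpers for illustration only; not part of any existing package.

# MD5 checksum of an arbitrary R object, computed from its serialized bytes.
object_checksum <- function(x) {
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(serialize(x, NULL), path)
  unname(tools::md5sum(path))
}

# Run `code` on `data` only if the data matches the expected checksum,
# and return a journal entry that ties data, code, and result together.
run_journaled <- function(data, expected_checksum, code) {
  actual <- object_checksum(data)
  if (!identical(actual, expected_checksum)) {
    stop("data checksum mismatch: expected ", expected_checksum,
         ", got ", actual)
  }
  result <- code(data)
  list(
    data_checksum   = actual,
    code            = deparse(code),
    result          = result,
    result_checksum = object_checksum(result),
    run_at          = Sys.time()
  )
}

# Made-up example data. The checksum recorded here plays the role of the
# identifier that would be stored alongside the dataset.
trial <- data.frame(dose = c(1, 2, 3), response = c(2.1, 3.9, 6.2))
registered <- object_checksum(trial)

# Succeeds, because the data still matches the registered checksum.
entry <- run_journaled(trial, registered, function(d) colMeans(d))

# A modified copy is refused before any analysis code runs.
tampered <- trial
tampered$response[2] <- 40
blocked <- tryCatch(
  run_journaled(tampered, registered, function(d) colMeans(d)),
  error = function(e) conditionMessage(e)
)
```

Each `entry` could then be appended to a log and summarised later. Where that log lives and how it is versioned is left open here.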

@jonocarroll

A possible component could be last year's suggested project of an 'R package to store/access metadata associated with data/functions': ropensci/auunconf#18

MilesMcBain commented Mar 17, 2017

Wow Jono, you are right, this is a very similar idea to that one!

I have in mind (and remember, this is all purely brainstorming at this point) the case where you load some data from a trusted source, validate that it is indeed unchanged (validate_checksum(data)), print out the context (context(data)$owner; context(data)$last_modified), etc. Ditto for functions that do what one thinks they do (context(my_function)$assumptions). The context travels with the data/function and can be tested against, e.g.

Yes. I would be happy to try to hack this up.
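
A minimal sketch of how a context could travel with an object, using a plain R attribute. `validate_checksum()` and `context()` are the names floated above, but the implementations here, the `set_context()` and `content_checksum()` helpers, and all example values are assumptions for illustration, not an existing API.

```r
# Hypothetical implementation for illustration only.

# Checksum of an object's contents, ignoring any attached context so that
# the stored checksum does not depend on itself.
content_checksum <- function(x) {
  attr(x, "context") <- NULL
  path <- tempfile()
  on.exit(unlink(path))
  writeBin(serialize(x, NULL), path)
  unname(tools::md5sum(path))
}

# Attach arbitrary context fields plus a checksum of the current contents.
set_context <- function(x, ...) {
  attr(x, "context") <- c(list(...), list(checksum = content_checksum(x)))
  x
}

# Read the context back.
context <- function(x) attr(x, "context", exact = TRUE)

# TRUE only if the contents still match the checksum stored in the context.
validate_checksum <- function(x) {
  identical(content_checksum(x), context(x)$checksum)
}

# Made-up data with a made-up owner.
counts <- set_context(
  data.frame(site = c(1, 2), count = c(10, 12)),
  owner = "field team",
  last_modified = Sys.Date()
)
unchanged <- validate_checksum(counts)    # contents untouched so far
owner <- context(counts)$owner

counts$count[1] <- 11
after_edit <- validate_checksum(counts)   # contents no longer match

# The same idea applied to a function and its stated assumptions.
my_function <- set_context(
  function(x) log(x),
  assumptions = "x is strictly positive"
)
assumptions <- context(my_function)$assumptions
```

Storing the context as an attribute keeps it attached to the object with no external registry, at the cost that some operations drop attributes, so a real tool would need to decide how the context survives transformations.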
