Tools for data/results provenance #5

MilesMcBain · 2017-03-17T04:55:54Z

I recently had the pleasure of using R as part of a team on a data science project. Despite the best reproducibility intentions, we got ourselves into a mighty tangle with dataset versions, modelling result versions, and matching up the two.

It got me thinking about the issue of provenance and the tooling in R. I'd be keen to work on any of the following:

- An API for dataset versioning from R, e.g. using dat or zenodo, potentially building out existing rOpenSci packages.
- A tool for journaling and versioning some kind of execution record that can map data, code, and results.
  - Tools for creating useful summaries of this information.
- A tool for conditionally executing code based on successful validation with a repository identifier.
- Anything else that fits into this broad category.
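
As a rough sketch of the conditional-execution and journaling ideas, assuming the "repository identifier" is nothing more than an MD5 checksum of the data file: `run_if_valid()` and the journal columns below are made-up names for illustration, not an existing API. Only `tools::md5sum()` is real, and it ships with base R.

```r
# Hypothetical helper: run `code` only if the file at `path` still has the
# checksum that was recorded for it.
run_if_valid <- function(path, expected_md5, code) {
  actual_md5 <- unname(tools::md5sum(path))  # NA if the file does not exist
  if (is.na(actual_md5) || !identical(actual_md5, expected_md5)) {
    stop("Checksum mismatch for ", path, ": refusing to run.", call. = FALSE)
  }
  code  # lazily evaluated, so it only runs once validation has passed
}

# Demo on a throwaway file standing in for a versioned dataset.
data_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c(2, 4, 6)), data_path, row.names = FALSE)

# In practice this value would be looked up from the data repository.
recorded_md5 <- unname(tools::md5sum(data_path))

fit <- run_if_valid(data_path, recorded_md5, {
  d <- read.csv(data_path)
  lm(y ~ x, data = d)
})

# A one-row journal entry tying the data version to the result it produced.
journal <- data.frame(
  data_md5 = recorded_md5,
  result   = "lm(y ~ x)",
  run_at   = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
  stringsAsFactors = FALSE
)
```

If the file changes after the checksum was recorded, `run_if_valid()` stops before the modelling code is ever evaluated, which is the behaviour the last tool in the list is after.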

A much more long-winded proposal that motivates all of these is available here: https://github.com/MilesMcBain/journalr/blob/master/Journalling_tool_proposal.Rmd

jonocarroll · 2017-03-17T05:00:02Z

A possible component could be last year's suggested project of an 'R package to store/access metadata associated with data/functions': ropensci/auunconf#18

MilesMcBain · 2017-03-17T06:14:55Z

Wow Jono, you are right, this is a very similar idea to that one!

I have in mind (and remember, this is all purely brainstorming at this point) the case where you load some data from a trusted source, validate that it is indeed unchanged (`validate_checksum(data)`), print out the context (`context(data)$owner`; `context(data)$last_modified`), etc. Ditto for functions, to confirm they do what one thinks they do (`context(my_function)$assumptions`). The context travels with the data/function and can be tested against, e.g.

Yes. I would be happy to try to hack this up.
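
To make the "context travels with the data" idea a bit more concrete, here is a minimal sketch that stores the context as an R attribute. Everything in it is an assumption for illustration: `set_context()`, `context()`, `validate_checksum()` and `object_checksum()` are hypothetical helpers borrowing the names used above, not functions from an existing package, and the checksum is just an MD5 of the serialized object.

```r
# Hypothetical helpers; none of these names come from an existing package.

# Checksum of an in-memory object, ignoring any attached context.
object_checksum <- function(x) {
  attr(x, "context") <- NULL
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(serialize(x, NULL), tmp)  # tools::md5sum() works on files
  unname(tools::md5sum(tmp))
}

# Attach context fields (owner, last_modified, ...) plus a checksum.
set_context <- function(x, ...) {
  attr(x, "context") <- c(list(...), list(checksum = object_checksum(x)))
  x
}

# Read the context back off the object.
context <- function(x) attr(x, "context")

# TRUE if the object still matches the checksum stored in its context.
validate_checksum <- function(x) {
  identical(object_checksum(x), context(x)$checksum)
}

data <- set_context(
  data.frame(id = 1:3, value = c(10, 20, 30)),
  owner = "data team",
  last_modified = as.Date("2017-03-17")
)

context(data)$owner
context(data)$last_modified
unchanged <- validate_checksum(data)

data$value[2] <- 99                        # an edit made after the context was recorded
still_valid <- validate_checksum(data)
```

Attributes are the simplest thing that could work here, but plenty of R operations drop them silently, so a real tool would probably want an S3 class or an external journal instead.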
