prototype workflow RO-Crate from snakemake workflow #1
Comments
Hi @douglowe, I am able to give this more thought this week, so I am wondering what the best next steps would be. At the moment all of the information I have pulled from the HTML report is just sitting in variables. Do you think it will be easy to turn this into a Provenance RO-Crate? |
Hi @tbrown91 - I'm getting a bit of time to look at this too, and have conflicting ideas about how to go about this. In the long term I think we can add to the Snakemake runner itself, creating an 'ro-crate' report option as an alternative to the HTML report. See this issue I created in a local copy of the snakemake repo: eScienceLab/snakemake#1. This should probably start with creating some example RO-Crate files (first a workflow crate, then scripting the building of a provenance crate from that, using the metadata pulled from the HTML report), so that we can build a test to include in the Snakemake test suite. Shall we have a go at creating that this week? |
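As a starting point for the example workflow crate mentioned above, here is a rough sketch of what a minimal `ro-crate-metadata.json` could contain, built with only the standard library. The function name, the default path `workflow/Snakefile`, and the local `#snakemake` language identifier are illustrative assumptions, not taken from this repository. The result also omits properties a valid crate needs (for example `datePublished` and `license` on the root dataset), so treat it as a skeleton rather than a conformant crate.

```python
import json


def workflow_crate_metadata(workflow_path="workflow/Snakefile", name="Example workflow"):
    """Build a skeleton Workflow RO-Crate metadata document as a dict.

    workflow_path and name are placeholders; the "#snakemake" identifier
    is a hypothetical local id for the language entity.
    """
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # metadata descriptor pointing at the root dataset
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "about": {"@id": "./"},
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            },
            {   # root dataset; mainEntity marks the main workflow
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "hasPart": [{"@id": workflow_path}],
                "mainEntity": {"@id": workflow_path},
            },
            {   # the workflow file itself
                "@id": workflow_path,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
                "programmingLanguage": {"@id": "#snakemake"},
            },
            {   # language entity referenced by the workflow
                "@id": "#snakemake",
                "@type": "ComputerLanguage",
                "name": "Snakemake",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(workflow_crate_metadata(), indent=2))
```

A hand-written file like this could serve as the fixture for a first test, before any of the report metadata is wired in.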
Baby steps (befd0dd). There are many things I don't like about the Snakemake report, in particular that the input and output files are not really listed or named. A number of wildcards are left in, but maybe this is not important for a Workflow RO-Crate. For the Provenance RO-Crate I think we will not be able to extract the information we are looking for. |
Hi, I was added by @douglowe as a collaborator to this repository. You're scraping the information directly from the HTML report, right? The HTML report itself is generated by a Snakemake report plugin and uses the data stored in There's a poetry template provided by the Snakemake project for new plugins. I think that's a cleaner way to get the needed information. Also, this does not break if the HTML report changes its structure, layout, or content. I can provide some code in the next few days, when my work schedule allows it. |
The skim2mt workflow ran through on our cluster. I added a usable Snakemake report plugin to the repo and documented how to rebuild it in the README. |
Snakemake interface skeletons are defined here: https://github.com/snakemake/snakemake-interface-report-plugins The ReporterBase is defined here: https://github.com/snakemake/snakemake-interface-report-plugins/blob/main/snakemake_interface_report_plugins/reporter.py. It has access to These are defined here: https://github.com/snakemake/snakemake-interface-report-plugins/blob/main/snakemake_interface_report_plugins/interfaces.py. |
I've made a first stab at extending the reporter, and got it to add the snakemake version number to the RO-Crate metadata (baby steps...). I've posted the following question to the snakemake developers discord channel, as the I'm working on an RO-Crate reporter plugin for Snakemake. Development code is in https://github.com/UoMResearchIT/ro-crate_snakemake_tooling/tree/develop (under the snakemake-report... directory). I am making use of the ro-crate-py library (https://github.com/ResearchObject/ro-crate-py) to work with the RO-Crate report. This library is a bit clunky, and I've realised I need to create an explicit exclude list for everything I don't want to be captured from the snakemake working directory. I've defined basic snakemake and git exclusions, but there are likely to be user-specific exclusions as well. How can I enable users to pass such information to the plugin at run time - can it be via a flag, or perhaps an environment variable? |
Note that I've not added |
I'm trying to catch up ... thanks for keeping this going @douglowe There is a new validator for RO-Crates: https://github.com/crs4/rocrate-validator
A parameter for an exclude list could be implemented here: ro-crate_snakemake_tooling/snakemake-report-plugin-wrroc/snakemake_report_plugin_wrroc/__init__.py Line 22 in 9d9e416
We could also start with some kind of whitelist, e.g. files we expect to exist, and ignore everything else for the time being. As a base we can take the Snakemake recommendation: |
Hi @fbartusch, I like the idea of incorporating the CRS4 RO-Crate validator into this plugin too. At our end @alexhambley has been exploring this validator as well, for use in automatically validating workflow RO-Crates elsewhere, so will be interested in what we're doing here. One thing we are missing for this tool is a profile for workflow run RO-Crates, so we will be limited in how much validation we can do to begin with. Regarding the exclude list: thanks for the pointer on how to add a parameter. I'll add this in when I'm next working on this. Unfortunately I don't think a whitelist will work with the current ro-crate-py library. I am going to raise an issue with them, asking if they can change or improve the way files are selected for inclusion in the RO-Crate. |
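To make the exclude-list idea a bit more concrete, here is a minimal standard-library sketch of how default Snakemake and git exclusions might be merged with user-supplied patterns from an environment variable. The variable name `WRROC_EXCLUDE`, the default patterns, and all function names are hypothetical. Nothing here is part of Snakemake, the plugin interface, or ro-crate-py, and a plugin setting passed on the command line could feed the same list instead.

```python
import fnmatch
import os
from pathlib import Path

# Hypothetical defaults; the real plugin's list may differ.
DEFAULT_EXCLUDES = [".snakemake", ".git", ".gitignore"]


def load_excludes(env_var="WRROC_EXCLUDE"):
    """Return the defaults plus comma-separated patterns read from env_var."""
    extra = os.environ.get(env_var, "")
    user_patterns = [p.strip() for p in extra.split(",") if p.strip()]
    return DEFAULT_EXCLUDES + user_patterns


def is_excluded(rel_path, patterns):
    """True if the whole path or any single path component matches a pattern."""
    rel = Path(rel_path)
    candidates = list(rel.parts) + [rel.as_posix()]
    return any(fnmatch.fnmatch(c, pat) for c in candidates for pat in patterns)


def files_to_include(root, patterns):
    """Walk root and return sorted relative paths of files that are not excluded."""
    root = Path(root)
    kept = []
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root)
            if not is_excluded(rel, patterns):
                kept.append(rel.as_posix())
    return sorted(kept)
```

With something like `WRROC_EXCLUDE="*.tmp,.vscode"` set before the run, `files_to_include(".", load_excludes())` would give the list of files that could then be added to the crate one by one.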
The current version of the validator (0.4.2) supports all three Workflow Run RO-Crates profiles: Process Run Crate, Workflow Run Crate, Provenance Run Crate. You can get a list of supported profiles by running:
|
Ahh - nice! My information is evidently out of date :) |
Actually, when building a new RO-Crate with
Indeed, the repo2rocrate package uses a sort of white list when building an RO-Crate from a directory. |
Our current use case is this - I'm using a Workflow RO-Crate created by WorkflowHub as the basis of the Workflow Run RO-Crate that we are asking Snakemake to build. This is awkward because both git and snakemake create a lot of hidden files and folders within this working directory, which I don't want to be included in the final RO-Crate (nor any files which might be created by VS Code, etc). It would be nice if, when reading in an RO-Crate, we could use the listed objects within that to identify what should be included, and exclude everything else by default (until they are explicitly added). |
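The "include only what the crate already lists" behaviour described above could be approximated outside ro-crate-py by reading `ro-crate-metadata.json` directly. The sketch below uses only the standard library. The function name is hypothetical, and it assumes the root dataset has the conventional `./` identifier and that local identifiers are plain relative paths (no URL-encoding).

```python
import json
from pathlib import Path


def listed_data_entities(crate_dir):
    """Return sorted @id values of local File and Dataset entities in a crate.

    Assumes crate_dir contains ro-crate-metadata.json with a flat @graph.
    """
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    metadata = json.loads(metadata_path.read_text())
    listed = []
    for entity in metadata.get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        entity_id = entity.get("@id", "")
        # Skip the root dataset and the metadata descriptor.
        if entity_id in ("./", "ro-crate-metadata.json"):
            continue
        # Skip remote entities and contextual entities with local "#" ids.
        if "://" in entity_id or entity_id.startswith("#"):
            continue
        if "File" in types or "Dataset" in types:
            listed.append(entity_id)
    return sorted(listed)
```

The returned list can act as the allow list: anything in the working directory that is not in it (hidden git or Snakemake folders, editor files) is simply never added to the output crate.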
Then it's probably better to use ro-crate-py only to build the final RO-Crate, but not to read the one you get from WorkflowHub. Unpack the RO-Crate obtained from WorkflowHub and treat it like a generic directory, adding files from it to a new RO-Crate:
from rocrate.rocrate import ROCrate
out_crate = ROCrate()
out_crate.add_file("crate_from_workflowhub/foo.txt")
...
out_crate.write("snakemake_crate")
If you need to preserve some metadata of the RO-Crate obtained from WorkflowHub, you can keep that open as a separate object and get what you need, for instance:
from rocrate.rocrate import ROCrate
in_crate = ROCrate("crate_from_workflowhub")
out_crate = ROCrate()
out_crate.add_file("crate_from_workflowhub/foo.txt")
out_crate.root_dataset["isBasedOn"] = in_crate.root_dataset["isBasedOn"]
...
out_crate.write("snakemake_crate") |
Work coming from the BGE hackathon in Leiden. Reporting of products made should go in the report here: https://docs.google.com/document/d/1if6ukMKN3xHQHAwGEQPhhgvp7iQcFnauj4W1ZtIs8wk/edit
The aim is to write a Python tool that creates a Workflow RO-Crate from the outputs and reports produced by a Snakemake workflow.
Snakemake workflow used: https://github.com/o-william-white/skim2mt.git