prototype workflow RO-Crate from snakemake workflow #1

Open
douglowe opened this issue May 8, 2024 · 15 comments

@douglowe
Contributor

douglowe commented May 8, 2024

Work coming from the BGE hackathon in Leiden. Reporting of products made should go in the report here: https://docs.google.com/document/d/1if6ukMKN3xHQHAwGEQPhhgvp7iQcFnauj4W1ZtIs8wk/edit

The aim is to write a Python tool that creates a Workflow RO-Crate from the outputs and reports produced by a Snakemake workflow.

Snakemake workflow used: https://github.com/o-william-white/skim2mt.git

@tbrown91
Collaborator

Hi @douglowe

I am able to give this more thought this week, so I am wondering what the best next steps would be. At the moment all of the information I have pulled from the HTML report is just sitting in variables. Do you think it will be easy to turn this into a provenance RO-Crate?

@douglowe
Contributor Author

Hi @tbrown91 - I'm getting a bit of time to look at this too, and have conflicting ideas about how to go about this.

In the long-term I think we can add to the snakemake runner itself, creating an 'ro-crate' report option, as an alternative to the html report. See this issue I created in a local copy of the snakemake repo: eScienceLab/snakemake#1

This probably should start with creating some example RO-Crate files (first a workflow crate, then script the building of a provenance crate from that, using the metadata pulled from the html report), so that we can build a test to include in the snakemake testing suites. Let's have a go at creating that this week?

@tbrown91
Collaborator

Baby steps befd0dd

There are many things I don't like about the Snakemake report, but particularly that the input and output files are not really listed or named. There are a number of wildcards left in, but maybe this is not important for a workflow RO-Crate. For the provenance RO-Crate I think we will not be able to extract the information we are looking for.

@fbartusch
Collaborator

Hi, I was added by @douglowe as collaborator to this repository.

You're scraping the information directly from the HTML report, right? The HTML report itself is generated by a Snakemake report plugin and uses the data stored in .snakemake/metadata/ in the workflow's main directory.
Since Snakemake 8 there has been a plugin system for some functionality, including the report function.

There's a Poetry template provided by the Snakemake project for new plugins.
I fiddled around with the template a few weeks ago and was able to get some provenance information rather quickly.

I think that's a cleaner way to get the needed information. Also, this does not break if the HTML report changes its structure/layout/content. I can provide some code in the next few days, when my work schedule allows it.
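As a rough illustration of reading that metadata directly instead of scraping HTML, the sketch below walks a metadata directory and loads each record. It assumes each record under .snakemake/metadata/ is a JSON file, and that records carry keys such as `rule` and `input`; both are assumptions about an internal Snakemake format that may differ between versions. The helper names `load_job_records` and `inputs_by_rule` are hypothetical.

```python
import json
from pathlib import Path


def load_job_records(metadata_dir):
    """Load every JSON record found under a Snakemake metadata directory.

    Assumption: each file holds one JSON object describing how an output
    file was produced. Files that cannot be parsed are skipped.
    """
    records = []
    for path in sorted(Path(metadata_dir).rglob("*")):
        if not path.is_file():
            continue
        try:
            record = json.loads(path.read_text())
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue
        if isinstance(record, dict):
            records.append(record)
    return records


def inputs_by_rule(records):
    """Group input files by rule name (the key names are assumptions)."""
    grouped = {}
    for record in records:
        rule = record.get("rule", "unknown")
        grouped.setdefault(rule, set()).update(record.get("input") or [])
    return grouped
```

A reporter plugin would normally receive this information through the plugin interface, so a reader like this is mainly useful for exploring what is available.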

@fbartusch
Collaborator

The skim2mt workflow ran through on our cluster. I added a usable Snakemake report plugin to the repo and documented how you can rebuild it in the README.
The plugin works on the skim2mt metadata on our cluster (i.e. it does not throw errors). Although the plugin does nothing useful at the moment, this is a good sign :)

@douglowe
Contributor Author

Snakemake interface skeletons are defined here: https://github.com/snakemake/snakemake-interface-report-plugins

The ReporterBase is defined here: https://github.com/snakemake/snakemake-interface-report-plugins/blob/main/snakemake_interface_report_plugins/reporter.py. It has access to rules, jobs, configfiles, settings, workflow_description, and dag.

These are defined here: https://github.com/snakemake/snakemake-interface-report-plugins/blob/main/snakemake_interface_report_plugins/interfaces.py.

@douglowe
Contributor Author

I've made a first stab at extending the reporter, and got it to add the Snakemake version number to the RO-Crate metadata (baby steps...). I've posted the following question to the Snakemake developers' Discord channel, as the ro-crate-py library is a bit clunky and I need some help with working around its limitations.


I'm working on an RO-Crate reporter plugin for Snakemake. Development code is in https://github.com/UoMResearchIT/ro-crate_snakemake_tooling/tree/develop (under the snakemake-report... directory).

I am making use of the ro-crate-py library (https://github.com/ResearchObject/ro-crate-py) to work with the RO-Crate report. This library is a bit clunky, and I've realised I need to create an explicit exclude list for everything I don't want to be captured from the snakemake working directory. I've defined basic snakemake and git exclusions, but there are likely to be user-specific exclusions as well. How can I enable users to pass such information to the plugin at run time - can it be via a flag, or perhaps an environment variable?
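One way the environment-variable option could look is sketched below. The variable name `ROCRATE_EXCLUDE`, the comma-separated format, and the helper names are all hypothetical; nothing here is defined by Snakemake or ro-crate-py. The sketch only shows merging user patterns with built-in defaults and testing a relative path against them.

```python
import fnmatch
import os

# Basic exclusions for Snakemake and git working files (illustrative only).
DEFAULT_EXCLUDES = [".snakemake", ".git", ".gitignore"]

# Hypothetical variable name; not something Snakemake itself defines.
ENV_VAR = "ROCRATE_EXCLUDE"


def exclude_patterns(environ=os.environ):
    """Combine the default exclusions with comma-separated user patterns."""
    raw = environ.get(ENV_VAR, "")
    extra = [item.strip() for item in raw.split(",") if item.strip()]
    return DEFAULT_EXCLUDES + extra


def is_excluded(rel_path, patterns):
    """True if any component of rel_path matches any glob pattern."""
    parts = rel_path.replace(os.sep, "/").split("/")
    return any(fnmatch.fnmatch(part, pat) for part in parts for pat in patterns)
```

With `ROCRATE_EXCLUDE=".vscode, *.tmp"` set, `is_excluded("results/a.tmp", exclude_patterns())` would be true while `results/a.fasta` would be kept. How the resulting list is handed to ro-crate-py is left open here.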

@douglowe
Contributor Author

Note that I've not added ro-crate-py as a dependency for the plugin yet - this needs adding to the poetry file I guess before release. Currently I'm manually running pip install ro-crate-py before creating the plugin.

@fbartusch
Collaborator

I'm trying to catch up ... thanks for keeping this going @douglowe
For me, installing ro-crate-py worked with: pip install rocrate

There is a new validator for RO-Crates: https://github.com/crs4/rocrate-validator
I use it for validating the RO-Crate produced by the Nextflow plugin I'm working on. I blocked some time tomorrow for setting up tests for the plugin, such as running a simple Snakemake pipeline and checking the resulting RO-Crate with the validator.

> How can I enable users to pass such information to the plugin at run time - can it be via a flag, or perhaps an environment variable?

A parameter for an excludelist could be implemented here:

We could also start with some kind of whitelist, e.g. files we expect to exist, and ignore everything else for the time being. As a base we can take the Snakemake recommendation:
https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html
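A minimal sketch of the whitelist idea follows. It treats the top-level entries of Snakemake's recommended repository layout as the allowlist; choosing exactly these names, and the helper name `select_allowed`, are assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Top-level entries from Snakemake's recommended repository layout.
# Treating exactly these as the allowlist is an assumption of this sketch.
ALLOWED_TOP_LEVEL = {
    "workflow",
    "config",
    "results",
    "resources",
    "README.md",
    "LICENSE.md",
}


def select_allowed(rel_paths, allowed=ALLOWED_TOP_LEVEL):
    """Keep only paths whose first component is on the allowlist."""
    kept = []
    for rel_path in rel_paths:
        parts = PurePosixPath(rel_path).parts
        if parts and parts[0] in allowed:
            kept.append(rel_path)
    return kept
```

Anything under .git/ or .snakemake/ is dropped without needing to be named, which is the main attraction over an exclude list.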

@douglowe
Contributor Author

douglowe commented Nov 5, 2024

Hi @fbartusch,

I like the idea of incorporating the CRS4 RO-Crate validator into this plugin too. At our end @alexhambley has also been exploring this validator, for use in automatically validating workflow RO-Crates elsewhere, so will be interested in what we're doing here.

One thing we are missing for this tool is a profile for workflow run RO-Crates. So we will be limited on how much validation we can do to begin with.

Regarding the excludelist - thanks for the pointer on how to add a parameter - I'll add this in when I'm next working on this.

Unfortunately I don't think a whitelist will work with the current ro-crate-py library. I am going to raise an issue with them, asking if they can change/improve the manner in which they select files for inclusion in the RO-Crate.

@simleo

simleo commented Nov 5, 2024

> One thing we are missing for this tool is a profile for workflow run RO-Crates. So we will be limited on how much validation we can do to begin with.

The current version of the validator (0.4.2) supports all three Workflow Run RO-Crates profiles: Process Run Crate, Workflow Run Crate, Provenance Run Crate. You can get a list of supported profiles by running:

rocrate-validator profiles list

@douglowe
Contributor Author

douglowe commented Nov 5, 2024

> One thing we are missing for this tool is a profile for workflow run RO-Crates. So we will be limited on how much validation we can do to begin with.
>
> The current version of the validator (0.4.2) supports all three Workflow Run RO-Crates profiles: Process Run Crate, Workflow Run Crate, Provenance Run Crate. You can get a list of supported profiles by running:
>
> rocrate-validator profiles list

Ahh - nice! My information is evidently out of date :)

@simleo

simleo commented Nov 5, 2024

> Unfortunately I don't think a whitelist will work with the current ro-crate-py library.

Actually, when building a new RO-Crate with ROCrate(), the ro-crate-py library does not include anything you don't want. See Creating an RO-Crate. The exclude argument is needed only in two cases:

  • When initializing an RO-Crate from a directory tree: ROCrate(some_dir, init=True)
  • When writing an RO-Crate that was previously read as an RO-Crate (copying an RO-Crate, possibly with modifications in between)

Indeed, the repo2rocrate package uses a sort of white list when building an RO-Crate from a directory.

@douglowe
Contributor Author

douglowe commented Nov 5, 2024

> • When writing an RO-Crate that was previously read as an RO-Crate (copying an RO-Crate, possibly with modifications in between)

Our current use case is this - I'm using a Workflow RO-Crate created by WorkflowHub as the basis of the Workflow Run RO-Crate that we are asking Snakemake to build. This is awkward because both git and snakemake create a lot of hidden files and folders within this working directory, which I don't want to be included in the final RO-Crate (nor any files which might be created by VS Code, etc).

It would be nice if, when reading in an RO-Crate, we could use the listed objects within that to identify what should be included, and exclude everything else by default (until they are explicitly added).
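To make the idea concrete, the sketch below uses only the standard library (not ro-crate-py) to read ro-crate-metadata.json, collect the File and Dataset entities it lists, and report which files on disk are not covered. The function names are hypothetical, and the handling of `@id` values is simplified: percent-encoded ids and nested crates are not considered.

```python
import json
from pathlib import Path


def listed_data_entities(crate_dir):
    """Return the relative paths of File and Dataset entities in a crate."""
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    metadata = json.loads(metadata_path.read_text())
    listed = set()
    for entity in metadata.get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        entity_id = entity.get("@id", "")
        # Skip the root dataset, the metadata descriptor, and remote entities.
        if entity_id in ("./", "ro-crate-metadata.json") or "://" in entity_id:
            continue
        if "File" in types or "Dataset" in types:
            listed.add(entity_id.rstrip("/"))
    return listed


def unlisted_files(crate_dir):
    """Files on disk not mentioned directly or via a listed directory."""
    root = Path(crate_dir)
    listed = listed_data_entities(root)
    extra = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel_path = path.relative_to(root)
        rel = rel_path.as_posix()
        if rel == "ro-crate-metadata.json":
            continue
        parents = {parent.as_posix() for parent in rel_path.parents}
        if rel not in listed and not (parents & listed):
            extra.append(rel)
    return extra
```

The output of `unlisted_files` is the set of paths that would be excluded by default under this proposal, such as hidden git and Snakemake files.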

@simleo

simleo commented Nov 5, 2024

> Our current use case is this - I'm using a Workflow RO-Crate created by WorkflowHub as the basis of the Workflow Run RO-Crate that we are asking Snakemake to build. This is awkward because both git and snakemake create a lot of hidden files and folders within this working directory, which I don't want to be included in the final RO-Crate (nor any files which might be created by VS Code, etc).
>
> It would be nice if, when reading in an RO-Crate, we could use the listed objects within that to identify what should be included, and exclude everything else by default (until they are explicitly added).

Then it's probably better to use ro-crate-py only to build the final RO-Crate, but not to read the one you get from WorkflowHub. Unpack the RO-Crate obtained from WorkflowHub and treat it like a generic directory, adding files from it to a new RO-Crate:

from rocrate.rocrate import ROCrate
out_crate = ROCrate()
out_crate.add_file("crate_from_workflowhub/foo.txt")
...
out_crate.write("snakemake_crate")

If you need to preserve some metadata of the RO-Crate obtained from WorkflowHub, you can keep that open as a separate object and get what you need, for instance:

from rocrate.rocrate import ROCrate
in_crate = ROCrate("crate_from_workflowhub")
out_crate = ROCrate()
out_crate.add_file("crate_from_workflowhub/foo.txt")
out_crate.root_dataset["isBasedOn"] = in_crate.root_dataset["isBasedOn"]
...
out_crate.write("snakemake_crate")


5 participants