[CBRAIN] CARMIN has no API to 'prepare' a file for a pipeline #107

Open
prioux opened this issue Jul 25, 2019 · 2 comments

@prioux
Member

prioux commented Jul 25, 2019

Note: this issue will be part of a series describing limitations and questions that arose while implementing CARMIN within CBRAIN.

CARMIN has a very simple data management model: everything is just files and directories that are stored server-side under an abstract 'root' chosen by the people who deployed the CARMIN server.

Launching pipelines on these files implies providing their paths as arguments to the pipeline's parameters. It is not clear if the paths are expected to be relative to the server's data root (e.g. some/stuff/file.txt) or absolute (e.g. /mnt/nfs/data1/carmin_data/bradley/some/stuff/file.txt). In the latter case, how does the CARMIN API user even get that path? In the former case, how can the CARMIN API user even be sure that some/stuff/file.txt will be an appropriate argument for any of the pipelines? Is it expected that all pipelines will run with their cwd set to the root directory of the CARMIN storage area?
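
To make the two readings concrete, here is a minimal sketch of the "relative to the data root" interpretation, in which the server maps a root-relative path to an absolute one before handing it to a pipeline. The function name `to_absolute` and the `DATA_ROOT` constant are assumptions for illustration; CARMIN does not define either, and the root shown is simply the example path from the paragraph above.

```python
from pathlib import PurePosixPath

# Assumed data root (taken from the example above); a real deployment chooses its own.
DATA_ROOT = PurePosixPath("/mnt/nfs/data1/carmin_data/bradley")


def to_absolute(relative_path: str, root: PurePosixPath = DATA_ROOT) -> PurePosixPath:
    """Map a root-relative path to an absolute one, refusing paths that leave the root."""
    rel = PurePosixPath(relative_path)
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"not a root-relative path: {relative_path!r}")
    return root / rel


print(to_absolute("some/stuff/file.txt"))
# prints /mnt/nfs/data1/carmin_data/bradley/some/stuff/file.txt
```

Under this reading the API user never sees the absolute form, which is why the question of the pipeline's cwd matters.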

In CBRAIN, data files are registered and given a unique numerical ID. I'm not going to go into the details of our framework, but the basic idea is that data files don't have a fixed path. The path is determined at the moment of the pipeline's start, because the pipeline can be executed on any number of remote servers that have distinct file system configurations (think supercomputer clusters). A CBRAIN pipeline (task) asks for a file to be 'synchronized' by ID, which brings a copy of the file to the remote server, and then its local path is provided to the pipeline.
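
The ID-based model described above can be sketched roughly as follows. This is not CBRAIN's actual code: the class names, the directory layout, and the ID value are all hypothetical, and the sketch only computes where a copy would land instead of transferring any data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredFile:
    """A registered data file: it has an ID and a name, but no fixed path."""
    file_id: int
    name: str


class ExecutionServer:
    """Hypothetical remote server with its own file system layout."""

    def __init__(self, work_dir: str):
        self.work_dir = work_dir
        self.synced = {}  # file_id -> local path on this server

    def synchronize(self, f: RegisteredFile) -> str:
        # A real system would copy the file content here; this sketch only
        # decides where the local copy lives and remembers it.
        local_path = f"{self.work_dir}/{f.file_id}/{f.name}"
        self.synced[f.file_id] = local_path
        return local_path


f = RegisteredFile(file_id=1234, name="file.txt")
cluster_a = ExecutionServer("/scratch/a")
cluster_b = ExecutionServer("/gpfs/b/cache")

# The same file ID resolves to a different local path on each server.
print(cluster_a.synchronize(f))
print(cluster_b.synchronize(f))
```

The point of the sketch is that the path only exists once a server has been chosen, so it cannot be known at upload time.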

So right now to use a CARMIN pipeline in CBRAIN, one has to:

  1. upload a PATH with CARMIN
  2. find the associated ID using CBRAIN's interface
  3. launch the pipeline providing the ID in the parameters, instead of the path

Steps 1 and 3 work in CARMIN, step 2 doesn't have any CARMIN API equivalent.

What we would need in CARMIN is an extension to an existing call:

  • Extension to GET /PATH : the JSON record should contain an entry for a platform-specific ID associated with the path
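
As a rough illustration of this extension, the sketch below builds a path record with one extra entry for the platform-specific ID and shows how a client could use it. All field names (`platformPath`, `isDirectory`, `size`, `platformId`), the parameter name `input_file`, and the values are assumptions chosen for the example, not taken from the CARMIN specification.

```python
import json
from typing import Optional


def path_record(platform_path: str, size: int, platform_id: Optional[int] = None) -> dict:
    """Build a JSON-serializable record describing one file path."""
    record = {"platformPath": platform_path, "isDirectory": False, "size": size}
    if platform_id is not None:
        # Hypothetical extension entry carrying the platform-specific ID
        record["platformId"] = platform_id
    return record


# Server side: the response body for a GET on the path.
response_body = json.dumps(path_record("some/stuff/file.txt", 2048, platform_id=1234))

# Client side: read the ID and pass it to the pipeline instead of the path.
file_id = json.loads(response_body)["platformId"]
pipeline_parameters = {"input_file": file_id}  # parameter name is illustrative
print(pipeline_parameters)
```

With such an entry, step 2 of the workflow becomes a single API call on the path that was uploaded in step 1.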

A more generic solution that other implementers would probably like (but that CBRAIN doesn't need) would be:

  • PUT /executions/{executionIdentifier}/preparePath/some/stuff/file.txt which would tell the server side to prepare the path /some/stuff/file.txt specifically for the execution by task executionIdentifier.
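
A client-side sketch of such a call, using only the Python standard library, could look like this. The endpoint is the one proposed above and does not exist in CARMIN; the base URL and execution identifier are placeholders, and the request is only constructed, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def prepare_path_url(base_url: str, execution_id: str, path: str) -> str:
    """Build the URL of the proposed preparePath call for one execution and one path."""
    encoded_id = quote(execution_id, safe="")
    encoded_path = quote(path.lstrip("/"))  # keeps "/" between path segments
    return f"{base_url.rstrip('/')}/executions/{encoded_id}/preparePath/{encoded_path}"


# Placeholder server and execution identifier, for illustration only.
url = prepare_path_url("https://carmin.example.org/rest", "exec-42", "/some/stuff/file.txt")
request = Request(url, method="PUT")
print(request.get_method(), request.full_url)
```

The server would answer once the file is available to the execution, whatever "prepare" means for that platform (copy, stage, or nothing at all).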
@glatard
Contributor

glatard commented Sep 9, 2019

Hi @prioux, a few comments on this issue:

It is not clear if the paths are expected to be relative to the server's data root (e.g. some/stuff/file.txt)

The paths in CARMIN are meant to be defined by the platform. There is no requirement that these paths actually correspond to real paths on any server, although a platform may decide to do so. In VIP for instance, paths are logical names registered in a file catalog where they are associated with physical location(s). The platform has the liberty to define its paths so that they are convenient for the platform and its users.
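
The logical-name approach can be pictured with a small in-memory catalog. This is only a sketch of the general idea, not VIP's implementation: the class, its methods, and the location strings are hypothetical.

```python
from typing import Dict, List


class FileCatalog:
    """Maps a logical path (the name users see) to one or more physical locations."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[str]] = {}

    def register(self, logical_path: str, physical_location: str) -> None:
        self._replicas.setdefault(logical_path, []).append(physical_location)

    def locate(self, logical_path: str) -> List[str]:
        if logical_path not in self._replicas:
            raise KeyError(f"unknown logical path: {logical_path}")
        return list(self._replicas[logical_path])


catalog = FileCatalog()
# Hypothetical storage sites; the logical path itself exists on neither of them.
catalog.register("some/stuff/file.txt", "site-a:/storage/0001")
catalog.register("some/stuff/file.txt", "site-b:/archive/7f3c")
print(catalog.locate("some/stuff/file.txt"))
```

Here the path is purely a key, so the question of relative versus absolute does not arise.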

In case of CBRAIN, it may make sense to have paths of the following form:

  • <file_id>-<file_name>, where <file_id> would be the CBRAIN file id (useful for the platform), and <file_name> would be the user file name (useful for the user). The file structure would essentially be flat, mapping the one in CBRAIN.
    What do you think?
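
A minimal sketch of this naming scheme, assuming numeric file IDs and a single hyphen as separator (the function names are made up for the example):

```python
import re
from typing import Tuple

_PATH_PATTERN = re.compile(r"([0-9]+)-(.+)")


def to_carmin_path(file_id: int, file_name: str) -> str:
    """Build a flat path of the form <file_id>-<file_name>."""
    return f"{file_id}-{file_name}"


def parse_carmin_path(path: str) -> Tuple[int, str]:
    """Split a <file_id>-<file_name> path back into its two parts."""
    match = _PATH_PATTERN.fullmatch(path)
    if match is None:
        raise ValueError(f"not of the form <file_id>-<file_name>: {path!r}")
    return int(match.group(1)), match.group(2)


path = to_carmin_path(1234, "my-scan.nii")
print(path)                     # prints 1234-my-scan.nii
print(parse_carmin_path(path))  # prints (1234, 'my-scan.nii')
```

Because the ID is everything before the first hyphen, file names that themselves contain hyphens still parse unambiguously.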

@axlbonnet
Contributor

As @glatard said, the data module has been designed around the notion of path, but with a lot of liberty given to the implementing platform to use these paths as it wants. The idea is that each platform should choose them to mirror how it stores and identifies the files internally.

That's why it's not clear whether paths are relative or absolute: it's not defined in CARMIN. In the end, what is necessary is that each path uniquely identifies a file. As Tristan said, it's easy in VIP, as we internally use a catalog as the database to map each logical file to its physical location, and the key/identifier is a path.

Discussion of @prioux's suggestions

@prioux There's something I don't get about what you did in CBRAIN:

So right now to use a CARMIN pipeline in CBRAIN, one has to:

  1. upload a PATH with CARMIN
  2. find the associated ID using CBRAIN's interface
  3. launch the pipeline providing the ID in the parameters, instead of the path

How does the user do step 2? What information do they have to find the file they uploaded? Is the path used in step 1 useful, or is it discarded after step 1?

About your first suggestion:

What we would need in CARMIN is an extension to an existing call:

  • Extension to GET /PATH : the JSON record should contain an entry for a platform-specific ID associated with the path

I don't get it. It would mean that you could identify the file by its path, which is the point of the current CARMIN data spec. Why couldn't you use this path as the execution input, then?

About your second suggestion:

A more generic solution that other implementers would probably like (but that CBRAIN doesn't need) would be:

  • PUT /executions/{executionIdentifier}/preparePath/some/stuff/file.txt which would tell the server side to prepare the path /some/stuff/file.txt specifically for the execution by task executionIdentifier.

As with my comment on the first suggestion, where does this path come from? I get the preparation part, but I can't see how this addresses the issue under discussion.

Proposed solution

I agree with @glatard in that ID-based platforms should have ID-based paths.

In case of CBRAIN, it may make sense to have paths of the following form:

<file_id>-<file_name>, where <file_id> would be the CBRAIN file id (useful for the platform), and <file_name> would be the user file name (useful for the user). The file structure would essentially be flat, mapping the one in CBRAIN.
What do you think?

The issue is that this is not currently possible in CARMIN, where the user chooses the path whereas the ID is generated by the platform. So we need a way for the user to provide the file content and the file name, and let the platform decide on the ID, build a path that includes that ID, and return it.

I think it would be complicated to add this feature to the current PUT /path/{completePath} method. We could enrich the json/base64 upload to include it, but I don't like that as it's inefficient. We could also use another MIME type to distinguish the two cases.

The solution I like most would be to add a POST /path/{filename} method where the platform would store the file, create an ID, and return a JSON record including a path like <file_id>-<file_name>.
I hesitate between POST /path/{filename} and POST /file/{filename}. path would be consistent with the rest of the data module, where everything begins with the same prefix, whereas file is semantically better in my opinion.
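
To illustrate what the server side of such a POST could do, here is a small in-memory sketch. It is an assumption-laden example, not a proposal for the spec text: the class and method names, the sequential ID scheme, and the response field names are all hypothetical.

```python
import itertools


class FlatFileStore:
    """In-memory stand-in for a platform that assigns IDs to uploaded files."""

    def __init__(self, first_id: int = 1):
        self._ids = itertools.count(first_id)
        self._files = {}  # platform-chosen path -> file content

    def post_file(self, filename: str, content: bytes) -> dict:
        """Handle POST /path/{filename}: store the file and return its platform-chosen path."""
        file_id = next(self._ids)
        platform_path = f"{file_id}-{filename}"
        self._files[platform_path] = content
        # Field names below are illustrative, not taken from the CARMIN spec.
        return {"platformPath": platform_path, "isDirectory": False, "size": len(content)}


store = FlatFileStore(first_id=1234)
first = store.post_file("file.txt", b"hello")
second = store.post_file("file.txt", b"other content")
print(first["platformPath"])   # prints 1234-file.txt
print(second["platformPath"])  # prints 1235-file.txt
```

Two uploads with the same file name get distinct paths, and the returned path can then be used directly as an execution input.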


4 participants