TaskVine files #3060

btovar · 2022-11-28T18:00:38Z

btovar
Nov 28, 2022
Maintainer

Some thoughts about TaskVine files:

On files in TaskVine

A TaskVine file is any named data read or written (usually to disk) by TaskVine. TaskVine files include regular files and directories. TaskVine files are first-class citizens, which means they have a life cycle independent of managers, tasks, and workers.

A TaskVine file may be designated as input or output to a task. There are two restrictions:

A TaskVine file may be at most the output of one task. It is an error for the same TaskVine file to be the output of two different tasks.
A TaskVine file may be an input or an output of a particular task, but not both.
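To make the two restrictions concrete, here is a minimal, self-contained sketch of how they could be checked when a file is attached to a task. Everything in it (file_rec, add_input, add_output, the error codes, integer task ids) is a hypothetical name made up for illustration, not part of the TaskVine API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONSUMERS 16

/* Hypothetical bookkeeping record for one TaskVine file. */
struct file_rec {
    int producer;                  /* id of the task that outputs the file, -1 if none */
    int consumers[MAX_CONSUMERS];  /* ids of the tasks that read the file */
    int n_consumers;
};

enum add_result { ADD_OK, ADD_ERR_SECOND_PRODUCER, ADD_ERR_INPUT_AND_OUTPUT };

void file_rec_init(struct file_rec *f) {
    f->producer = -1;
    f->n_consumers = 0;
}

static int is_consumer(const struct file_rec *f, int task_id) {
    for (int i = 0; i < f->n_consumers; i++)
        if (f->consumers[i] == task_id) return 1;
    return 0;
}

/* Attach the file as an input of task_id. */
enum add_result add_input(struct file_rec *f, int task_id) {
    assert(f != NULL);
    /* Restriction 2: not both input and output of the same task. */
    if (f->producer == task_id) return ADD_ERR_INPUT_AND_OUTPUT;
    if (!is_consumer(f, task_id)) {
        assert(f->n_consumers < MAX_CONSUMERS);
        f->consumers[f->n_consumers++] = task_id;
    }
    return ADD_OK;
}

/* Attach the file as an output of task_id. */
enum add_result add_output(struct file_rec *f, int task_id) {
    assert(f != NULL);
    if (is_consumer(f, task_id)) return ADD_ERR_INPUT_AND_OUTPUT;
    /* Restriction 1: at most one task may produce the file. */
    if (f->producer != -1 && f->producer != task_id) return ADD_ERR_SECOND_PRODUCER;
    f->producer = task_id;
    return ADD_OK;
}
```

With this sketch, a second task calling add_output on the same record gets ADD_ERR_SECOND_PRODUCER, and a task that already produces the file gets ADD_ERR_INPUT_AND_OUTPUT from add_input.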

Names

TaskVine has three different namespaces for files:

cache namespace: This namespace has global scope, that is, it encompasses all managers, workers, and tasks. A file name in this namespace is called a cache-name, and it is computed by TaskVine. The cache-name may be used as a global handle for the named data.
task namespace: The namespace for files used to execute a single task. Each task has its own task namespace, thus task-names may be repeated across tasks. No two different cache-names may be assigned the same task-name inside a given task namespace. The task-name is given by the creator of the task (usually the user, but sometimes TaskVine with mini-tasks).
manager namespace: The namespace for files at the manager. This includes named data at the manager's filesystem, and global URLs. A manager-name is given by the user, and not all TaskVine files have a manager-name. According to their role in a workflow, only three types of files need a manager-name:
- Input files that are not an output of any task.
- Final output files of the workflow.
- Intermediate output files that need to be copied back to the manager (e.g., when workers can't send intermediate output files to each other, or for output files that serve as checkpoints of the workflow).
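As an illustration of the task-namespace rule (task-names may repeat across tasks, but inside one task a task-name maps to exactly one cache-name), a small sketch follows. The types and functions (task_namespace, ns_bind, ns_lookup) and the cache-name strings are invented for this example and are not TaskVine identifiers.

```c
#include <assert.h>
#include <string.h>

#define MAX_BINDINGS 32

/* One entry of a task namespace: task-name -> cache-name. */
struct binding {
    const char *task_name;
    const char *cache_name;
};

/* Each task owns its own namespace, so names can repeat across tasks. */
struct task_namespace {
    struct binding b[MAX_BINDINGS];
    int n;
};

/* Bind task_name to cache_name. Returns 1 on success and 0 when the
 * task-name is already bound to a different cache-name (or the table is full). */
int ns_bind(struct task_namespace *ns, const char *task_name, const char *cache_name) {
    assert(ns && task_name && cache_name);
    for (int i = 0; i < ns->n; i++) {
        if (strcmp(ns->b[i].task_name, task_name) == 0)
            return strcmp(ns->b[i].cache_name, cache_name) == 0;
    }
    if (ns->n == MAX_BINDINGS) return 0;
    ns->b[ns->n].task_name = task_name;
    ns->b[ns->n].cache_name = cache_name;
    ns->n++;
    return 1;
}

/* Return the cache-name bound to task_name, or NULL if there is none. */
const char *ns_lookup(const struct task_namespace *ns, const char *task_name) {
    assert(ns && task_name);
    for (int i = 0; i < ns->n; i++)
        if (strcmp(ns->b[i].task_name, task_name) == 0)
            return ns->b[i].cache_name;
    return NULL;
}
```

Two different tasks can both bind "input.txt", each to a different cache-name, while a second conflicting bind inside the same task is rejected.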

Life cycle of a TaskVine file

A TaskVine file can be declared and released by a manager, or added to a task:

Declaring a TaskVine file

Any file that should be managed by TaskVine should be first declared:

struct vine_manager *m = ...;

const char *f = vine_manager_declare_file(m, manager_name, flags);
const char *u = vine_manager_declare_url(m, manager_name, flags);
const char *d = vine_manager_declare_directory(m, manager_name, flags);
const char *p = vine_manager_declare_poncho(m, manager_name);

where manager_name is the name of the file at the manager, and flags is a
bitwise OR of VINE_NOCACHE, VINE_CACHE, and VINE_UNPACK. All these
functions return a string that uniquely identifies the file (e.g., a hash of the
cache-name if we want to be mysterious).

Release a TaskVine file

struct vine_manager *m = ...;

int result = vine_manager_release_file(m, fileid);

where fileid is one of the strings returned by the vine_manager_declare_* functions above. The call simply marks the file for release; the file is not actually released while any active task (that is, a running task, a waiting task, or a declared minitask) is still using it. Files keep track of which tasks are using them, and once no task uses the file:

- The file is removed from the cache of all the workers that have it.
- Bookkeeping structures related to the file are freed.
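The deferred release described above can be sketched with a simple use counter. This is only an illustration under assumed names (tracked_file, file_task_begin, file_task_end, file_request_release); the real cleanup steps (removing the file from worker caches and freeing bookkeeping) are reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-file bookkeeping. */
struct tracked_file {
    int active_tasks;        /* running, waiting, or declared minitasks using the file */
    bool release_requested;  /* the user asked for the file to be released */
    bool released;           /* the cleanup has actually happened */
};

/* Perform the release only when it was requested and nobody uses the file. */
static void maybe_release(struct tracked_file *f) {
    if (f->release_requested && f->active_tasks == 0 && !f->released) {
        /* A real manager would remove the file from worker caches
         * and free its bookkeeping structures here. */
        f->released = true;
    }
}

/* A task starts using the file. */
void file_task_begin(struct tracked_file *f) {
    assert(f && !f->released);
    f->active_tasks++;
}

/* A task is done with the file. */
void file_task_end(struct tracked_file *f) {
    assert(f && f->active_tasks > 0);
    f->active_tasks--;
    maybe_release(f);
}

/* Mark the file for release; it happens now or when the last task ends. */
void file_request_release(struct tracked_file *f) {
    assert(f);
    f->release_requested = true;
    maybe_release(f);
}
```

Requesting the release while two tasks are active leaves the file in place; it is only released when the second task calls file_task_end.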

Adding a TaskVine file to a task

A declared file can be added to a task as input or output:

struct vine_task *t = vine_task_create(...);

vine_task_add_input(t, fileid, task_name);
vine_task_add_output(t, fileid, task_name, flags);

where flags is a bitwise OR of VINE_WATCH, VINE_FAILURES_ONLY, and VINE_SUCCESS_ONLY. Note that no flags are needed for input files.

Convenience functions

A file can be declared and added to a task at the same time with calls such as:

struct vine_manager *m = ...;
struct vine_task *t = ...;

fileid vine_task_add_input_file(m, t, manager_name, task_name, flags);
fileid vine_task_add_output_file(m, t, manager_name, task_name, flags);

For these calls, TaskVine guarantees that two files with the same manager_name will return the same fileid.

Further, one can obtain the fileid of a file by its task_name:

fileid vine_get_fileid(m, t, task_name);

Minitasks and TaskVine files

Minitasks are tasks that get executed to populate the worker's cache with particular files. They have the same syntax as regular tasks, only that they get executed as many times as required. Further, they can be chained, as:

const char *fileid_a = vine_manager_declare_file(m, "file_a", VINE_CACHE);
const char *fileid_b = vine_manager_declare_file(m, "file_b", VINE_CACHE);

struct vine_task *mini_a = vine_create_minitask(...);
vine_task_add_output(mini_a, fileid_a, "file_at_task_a");

struct vine_task *mini_b = vine_create_minitask(...);
vine_task_add_input(mini_b, fileid_a, "file_at_task_a");
vine_task_add_output(mini_b, fileid_b, "file_at_task_b");

vine_submit(mini_a);
vine_submit(mini_b);

// At this point TaskVine knows how to generate files `file_a` and `file_b`,
// but they don't get created anywhere until a regular task requires them:

t = vine_create_task(...);


vine_task_add_input(t, fileid_b, "file_at_task_b");
//or
// vine_task_add_input(t, vine_get_fileid(m, mini_b, "file_at_task_b"), "file_at_task_b");

vine_submit(t);

With the above, TaskVine needs to create the file with task-name file_at_task_b for task t. But to do that, it first needs to generate file_at_task_a.

dthain · 2022-12-07T19:05:20Z

dthain
Dec 7, 2022
Maintainer

Yes, this is exactly the right idea!

Now, the bit that is still fuzzy here (and is fuzzy in the current implementation) is how to name output files that do not exist yet.

I think it is important to distinguish between two cases.

One is input files that have their source in the conventional filesystem. These have a fileid that is generated from the contents and metadata of the file, so that it can be re-used across manager lifetimes:

const char *fileid = vine_manager_declare_file(m, "file_a");

And the other are files that do not exist until created by a task. I think they should be anonymous:

const char *fileid = vine_manager_declare_temp(m);

Now these files are different both internally and externally. Externally, they have no connection to anything in the conventional filesystem, so it makes no sense to give them an external name. But internally, the fileid is tricky, because it neither has content nor do we know the task used to generate it.

But, when you attach it to a task:

vine_task_add_output(m,taskid,fileid,"name-of-file-in-task.txt")

Now we know the content of the task and can compute its taskid. And then the fileid of the output file can be computed as something like hash(concat(taskid, "name-of-file-in-task.txt")) so that now it is named and reusable.
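A rough sketch of that derivation, to show the shape of the idea. The hash here is 64-bit FNV-1a, chosen only because it fits in a few lines; it is a stand-in, not what TaskVine uses. The function name output_fileid, the hex encoding, and the NUL separator between the two strings (so that "ab"+"c" and "a"+"bc" are hashed as different inputs) are all assumptions of this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fold len bytes into a running 64-bit FNV-1a hash. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len) {
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;  /* FNV 64-bit prime */
    }
    return h;
}

/* Derive a stable id for a task output from the task id and the name the
 * file has inside the task. Writes 16 hex digits plus a terminating NUL. */
void output_fileid(const char *taskid, const char *task_name, char out[17]) {
    assert(taskid && task_name && out);
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV 64-bit offset basis */
    /* strlen + 1 also hashes the NUL, which acts as a separator. */
    h = fnv1a(h, taskid, strlen(taskid) + 1);
    h = fnv1a(h, task_name, strlen(task_name));
    snprintf(out, 17, "%016llx", (unsigned long long)h);
}
```

Calling output_fileid twice with the same taskid and task-name yields the same string, which is what makes the id reusable across manager lifetimes, provided the taskid itself is computed deterministically.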

So the tricky bit here is that a persistent fileid is not known right away, and either the system needs to keep track of two of them (an initial temp ID, and then a final real ID) or.... well, I'm not sure what the alternative is.

Finally, it may be that you want to extract a file (any file) from the system, in which case you do this:

vine_manager_extract_file(m,fileid,"path-to-localfile");

0 replies

dthain · 2022-12-07T19:07:24Z

dthain
Dec 7, 2022
Maintainer

Or the alternative is to not even attempt to name the file until the task is named, like this:

fileid = vine_manager_declare_file(m,"input.txt");
taskid = vine_manager_declare_task(cmd);
vine_task_add_input(taskid,fileid,"input.txt")
outid = vine_task_add_output(taskid,"output.txt")

0 replies

btovar · 2022-12-07T19:18:42Z

btovar
Dec 7, 2022
Maintainer Author

We may want to attempt to name the files right away just because doing dags later may be a little bit easier, no?

1 reply

dthain Dec 7, 2022
Maintainer

Sorry, I don't follow..?

btovar · 2022-12-07T20:13:38Z

btovar
Dec 7, 2022
Maintainer Author

I was thinking of forming the dependencies with the names of files, rather than with the struct vine_file * or a name derived from the task id. The reason is that a given name may not change even if the task id of the task that generates it does (e.g., when tasks execute in a different order).

0 replies

dthain · 2023-02-07T16:30:32Z

dthain
Feb 7, 2023
Maintainer

Pondering what the implementation of this would look like.
My initial thought is that the vine_file structure should become private, so that files are only manipulated by fileid.
The manager would then have an internal table mapping fileid to struct vine_file *, and may need a reference count, so that we don't accidentally remove files currently in use by tasks.
If vine_task remains an opaque object (and we trust the reference counting) then it could still maintain pointers to file objects, so that we don't have to delete them.

A file would then have states reflecting its lifetime:

PENDING - the file is a task output and has not been created yet
AVAILABLE - the source is known and it could be downloaded if needed
EXISTS - at least one copy of the object is in the cluster, or it can be downloaded if needed.
DELETING - the user has requested a removal, but the object is still in use.
(GONE) - actually deleted

That would then make it possible for the user to submit tasks with input files in the PENDING state, and the scheduler would have to evaluate the state of input files before attempting to place a task.
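The scheduler check mentioned here could look roughly like the following. The enum names mirror the states listed above but are otherwise invented, and treating DELETING as "not ready" is an assumption of this sketch rather than something the thread settles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* States from the list above (names are illustrative). */
enum file_state {
    FILE_PENDING,    /* task output that has not been created yet */
    FILE_AVAILABLE,  /* source is known and could be downloaded */
    FILE_EXISTS,     /* at least one copy is in the cluster */
    FILE_DELETING,   /* removal requested, still in use */
    FILE_GONE        /* actually deleted */
};

/* A task may be placed only when every input can be materialized at a
 * worker right now, i.e. it is AVAILABLE or EXISTS. */
bool task_is_ready(const enum file_state *inputs, size_t n) {
    assert(inputs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (inputs[i] != FILE_AVAILABLE && inputs[i] != FILE_EXISTS)
            return false;
    }
    return true;
}
```

A task submitted with a PENDING input simply stays in the queue until the producing task finishes and the input's state changes.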

0 replies

dthain · 2023-02-07T16:31:12Z

dthain
Feb 7, 2023
Maintainer

Perhaps AVAILABLE and EXISTS are the same thing, because it is the worker's responsibility to materialize urls, mini-tasks and so forth.

1 reply

btovar Feb 7, 2023
Maintainer Author

Yes, I agree that they sound very similar. AVAILABLE?

btovar · 2023-02-07T16:47:14Z

btovar
Feb 7, 2023
Maintainer Author

I think we need LOST for when the file lived in a worker that went away. Eventually LOST files become PENDING again.

Also, we may want to have UNAVAILABLE for when a source is not working.
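Putting the proposed states together, one possible transition function is sketched below. The event names and the exact set of transitions are guesses made for illustration (for example, that UNAVAILABLE returns to AVAILABLE when the source recovers); only the states themselves and the LOST to PENDING step come from the discussion.

```c
#include <assert.h>

enum vstate { ST_PENDING, ST_AVAILABLE, ST_LOST, ST_UNAVAILABLE, ST_DELETING, ST_GONE };

/* Hypothetical events that move a file between states. */
enum vevent {
    EV_CREATED,           /* the producing task finished */
    EV_LAST_COPY_LOST,    /* the last worker holding the file went away */
    EV_RESUBMITTED,       /* the producing task was queued again */
    EV_SOURCE_DOWN,       /* the source (e.g. a url) stopped working */
    EV_SOURCE_UP,         /* the source works again */
    EV_DELETE_REQUESTED,  /* the user asked for removal */
    EV_LAST_USER_DONE     /* no task uses the file any more */
};

/* Return the state after event e; unknown combinations leave s unchanged. */
enum vstate next_state(enum vstate s, enum vevent e) {
    assert((int)s >= ST_PENDING && (int)s <= ST_GONE);
    if (e == EV_DELETE_REQUESTED && s != ST_GONE) return ST_DELETING;
    switch (s) {
    case ST_PENDING:
        if (e == EV_CREATED) return ST_AVAILABLE;
        break;
    case ST_AVAILABLE:
        if (e == EV_LAST_COPY_LOST) return ST_LOST;
        if (e == EV_SOURCE_DOWN) return ST_UNAVAILABLE;
        break;
    case ST_LOST:
        if (e == EV_RESUBMITTED) return ST_PENDING;  /* LOST files become PENDING again */
        break;
    case ST_UNAVAILABLE:
        if (e == EV_SOURCE_UP) return ST_AVAILABLE;
        break;
    case ST_DELETING:
        if (e == EV_LAST_USER_DONE) return ST_GONE;
        break;
    case ST_GONE:
        break;
    }
    return s;
}
```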

0 replies

dthain · 2023-02-07T18:44:55Z

dthain
Feb 7, 2023
Maintainer

Capturing our recent discussion:

Because we want intermediate files to be recreated if they are lost, such files must contain pointers back to the tasks that created them. And likewise, those tasks must have pointers back to the files that they consume.

So, what we need internally is a table of files and a table of tasks that together describe the DAG that the user has constructed (incrementally). Both must be reference counted.

An intermediate file cannot be deleted until its results are "safe" downstream. So, a proper "output file" that is brought back to the user can be deleted, and then the task that produced it, and so on back to the beginning.
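A minimal sketch of the back-pointer idea: files point to the task that produces them, tasks point to the files they consume, and a lost file is recreated by walking backwards until something that still exists is found. The structures and dag_materialize are hypothetical, "running" a task is reduced to incrementing a counter, and reference counting is left out to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_INPUTS 4

struct dag_task;

struct dag_file {
    struct dag_task *producer;  /* NULL for source files that come from outside the DAG */
    bool present;               /* true if a copy currently exists */
};

struct dag_task {
    struct dag_file *inputs[MAX_INPUTS];
    size_t n_inputs;
    struct dag_file *output;
    int runs;                   /* how many times the task was (re)executed */
};

/* Make sure f exists, re-running producers as needed.
 * Returns false if a required source file is missing. */
bool dag_materialize(struct dag_file *f) {
    assert(f != NULL);
    if (f->present) return true;
    struct dag_task *t = f->producer;
    if (t == NULL) return false;       /* a lost source cannot be recreated */
    assert(t->output == f);
    for (size_t i = 0; i < t->n_inputs; i++) {
        if (!dag_materialize(t->inputs[i])) return false;
    }
    t->runs++;                         /* stand-in for executing the task */
    f->present = true;
    return true;
}
```

For a chain src -> t1 -> mid -> t2 -> out where both mid and out were lost, dag_materialize(&out) re-runs t1 and then t2 exactly once each.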

0 replies

dthain · 2023-02-07T18:45:22Z

dthain
Feb 7, 2023
Maintainer

And so what we are doing is effectively enabling dynamic construction of a dag, as opposed to submit-wait dispatch.

0 replies

dthain · 2023-02-14T14:23:18Z

dthain
Feb 14, 2023
Maintainer

Now that I am playing around with the code, I see that there is also a garbage collection problem:

In the current API, we give the details of each file each time a task input or output is declared:

vine_task_add_input(task,vine_file_url(url),"data.txt",VINE_CACHE);

However, this complicates the process of matching up common references to files for which properties must be tracked across tasks. For temporary files in particular, we want to do this:

f = vine_file_temp(0);

vine_task_add_output( taska, f, "output.txt", VINE_CACHE);
vine_task_add_input( taskb, f, "input.txt", VINE_CACHE);

However, that doesn't quite work for several reasons:
1 - Garbage collection: the file is now associated with two tasks, so who will clean it up? (Reference counting is a solution here, but you will still need a vine_file_delete(f) at the end, which will surely be forgotten in practice.)
2 - The file structure confuses the source of a file with how it is used. (Note that add_input and add_output hackily modify the remote_name and flags of the file when connecting inputs and outputs.) It can't have two sets of flags at once.
3 - We want the manager to do some work to compute the proper cached name of a file the first (and only) time it is declared.

Right now, we get around that by making copies of the file. So vine_example_montage does something like this:

f = vine_file_temp(0);

vine_task_add_output( taska, vine_file_clone(f), "output.txt", VINE_CACHE);
vine_task_add_input( taskb, vine_file_clone(f), "input.txt", VINE_CACHE);

vine_file_delete(f);

That works for the moment, but is a real pain for users. What we really want is this:

file = vine_manager_declare_temp( manager);

vine_task_add_output( taska, file, "output.txt", VINE_CACHE );
vine_task_add_input( taskb, file, "input.txt", VINE_CACHE );

vine_manager_delete(manager);

To get to that point, we need to make several big changes:

1 - Separate the definition of a file from how it is mounted into the task. This requires changing the task input files and output files into lists of vine_mounts that contain the mount options.
2 - Move the file objects into a table of single definitions in the manager. Then, the mount structures point to entries in the file table. The unique names for files are computed once when declared at the manager level. The manager then garbage-collects everything in the file table on vine_delete().
3 - Modify the API to define files as a manager-level option, which then returns the unique name as a string for mounting.
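To illustrate the three changes, here is a small sketch that separates the file definition from the mount, keeps one definition per file in a manager-owned table, and frees everything when the manager is cleaned up. All names (s_file, s_mount, s_manager, s_declare, s_mount_file, s_manager_cleanup) are placeholders for this example and do not correspond to the actual vine_file and vine_mount structures.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_MAX 64

/* What the data is: defined once, owned by the manager. */
struct s_file {
    char *cached_name;
};

/* How one task sees the data: per-task name and flags. */
struct s_mount {
    const struct s_file *file;
    const char *remote_name;
    int flags;
};

struct s_manager {
    struct s_file *files[TABLE_MAX];
    int n_files;
};

/* Declare a file once; declaring the same name again returns the same entry. */
const struct s_file *s_declare(struct s_manager *m, const char *cached_name) {
    assert(m && cached_name);
    for (int i = 0; i < m->n_files; i++)
        if (strcmp(m->files[i]->cached_name, cached_name) == 0)
            return m->files[i];
    if (m->n_files == TABLE_MAX) return NULL;
    struct s_file *f = malloc(sizeof *f);
    if (!f) return NULL;
    f->cached_name = malloc(strlen(cached_name) + 1);
    if (!f->cached_name) { free(f); return NULL; }
    strcpy(f->cached_name, cached_name);
    m->files[m->n_files++] = f;
    return f;
}

/* Build a mount; the file itself is not modified. */
struct s_mount s_mount_file(const struct s_file *f, const char *remote_name, int flags) {
    struct s_mount mt = { f, remote_name, flags };
    return mt;
}

/* Free every file in the table; returns how many were freed. */
int s_manager_cleanup(struct s_manager *m) {
    assert(m);
    int freed = m->n_files;
    for (int i = 0; i < m->n_files; i++) {
        free(m->files[i]->cached_name);
        free(m->files[i]);
        m->files[i] = NULL;
    }
    m->n_files = 0;
    return freed;
}
```

Because the name and flags live in the mount, the same file can be mounted as "output.txt" in one task and as "input.txt" in another without cloning it, and the user never has to delete individual files.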

1 reply

btovar Feb 14, 2023
Maintainer Author

Makes sense. I'd call that function:

file = vine_file_declare(struct manager *m,  VINE_FILE|VINE_TEMP|VINE_EMPTY_DIR|etc. , char *json_metadata);

dthain · 2023-02-14T19:44:55Z

dthain
Feb 14, 2023
Maintainer

PR #3116 gets us part of the way there by separating vine_mount from vine_file. This allows files to have a lifetime separate from that of tasks. We can do this now:

file = vine_file_temp();

vine_task_add_output( taska, file, "output.txt", VINE_CACHE );
vine_task_add_input( taskb, file, "input.txt", VINE_CACHE );

// submit, wait, etc.

vine_file_delete(file);

0 replies

dthain · 2023-02-24T13:54:09Z

dthain
Feb 24, 2023
Maintainer

PR #3157 gets us closer again by declaring file objects through the manager.

0 replies

btovar · 2023-03-24T12:15:09Z

btovar
Mar 24, 2023
Maintainer Author

Implemented!

0 replies
