-
Yes, this is exactly the right idea! Now, the bit that is still fuzzy here (and is fuzzy in the current implementation) is how to name output files that do not exist yet. I think it is important to distinguish between two cases. One is input files that have their source in the conventional filesystem. These have a fileid that is generated from the contents and metadata of the file, so that it can be re-used across manager lifetimes:
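A minimal sketch of such a content-derived fileid, assuming a plain FNV-1a hash over the file's bytes, size, and modification time. The function names are hypothetical and not part of TaskVine, and a real system would likely use a stronger digest:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a (64-bit), continuing from state h. Illustration only. */
static uint64_t sketch_fnv1a(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL; /* FNV prime */
    }
    return h;
}

/* Hypothetical: derive an id from a file's content plus metadata, so the
   same file yields the same id in a later manager run on the same host. */
static uint64_t sketch_fileid(const void *content, size_t size, int64_t mtime)
{
    assert(content != NULL || size == 0);
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    h = sketch_fnv1a(h, content, size);
    h = sketch_fnv1a(h, &size, sizeof size);
    h = sketch_fnv1a(h, &mtime, sizeof mtime);
    return h;
}
```

Because nothing here depends on the manager's state, calling sketch_fileid again on an unchanged file gives the same value, which is what allows re-use across manager lifetimes.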
The other is files that do not exist until a task creates them. I think they should be anonymous:
Now these files are different both internally and externally. Externally, they have no connection to anything in the conventional filesystem, so it makes no sense to give them an external name. But internally, the fileid is tricky, because it neither has content nor do we know the task used to generate it. But, when you attach it to a task:
Now we know the content of the task and can compute its taskid, and the fileid of the output file can then be computed from that taskid.
So the tricky bit here is that a persistent fileid is not known right away, and either the system needs to keep track of two of them (an initial temp ID, and then a final real ID) or... well, I'm not sure what the alternative is.
Finally, it may be that you want to extract a file (any file) from the system, in which case you do this:
-
Or the alternative is to not even attempt to name the file until the task is named, like this:
-
We may want to attempt to name the files right away just because building DAGs later may be a little bit easier, no?
-
I was thinking of forming the dependencies with the names of files, rather than with the struct vine_file * or a name derived from the task id. The reason is that a given name may not change even if the task id of the task that generates it does. (E.g., when tasks execute in a different order.)
-
Pondering what the implementation of this would look like. A file would then have states reflecting its lifetime:
That would then make it possible for the user to submit tasks with input files in the PENDING state, and the scheduler would have to evaluate the state of input files before attempting to place a task.
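As a sketch of what that could look like: only PENDING is named in the discussion, so the other states and all type and function names below are assumptions made for illustration:

```c
#include <assert.h>
#include <stddef.h>

enum sketch_file_state {
    SKETCH_FILE_PENDING, /* declared, but the producing task has not finished */
    SKETCH_FILE_READY,   /* assumed state: content exists and can be used */
    SKETCH_FILE_LOST     /* assumed state: content existed but must be recreated */
};

struct sketch_file {
    enum sketch_file_state state;
};

struct sketch_task {
    struct sketch_file **inputs;
    size_t input_count;
};

/* Returns 1 only if every input is READY, i.e. the task may be placed. */
static int sketch_task_is_runnable(const struct sketch_task *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->input_count; i++) {
        if (t->inputs[i]->state != SKETCH_FILE_READY)
            return 0;
    }
    return 1;
}
```

The scheduler would call a check like this before placing a task, and simply skip tasks whose inputs are still PENDING.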
-
Perhaps
-
I think we need Also, we may want to have
-
Capturing our recent discussion: Because we want intermediate files to be recreated if they are lost, such files must contain pointers back to the tasks that created them. And likewise, those tasks must have pointers back to the files that they consume. So, what we need internally is a table of files and a table of tasks that together describe the DAG that the user has constructed (incrementally). Both must be reference counted. An intermediate file cannot be deleted until its results are "safe" downstream. So, a proper "output file" that is brought back to the user can be deleted, and then the task that produced it, and so on back to the beginning.
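A minimal sketch of the reference-counted file and task records described above. The struct layout, the fixed-size input array, and the function names are illustrative assumptions, not the actual implementation:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_INPUTS 4

struct sketch_task;

struct sketch_file {
    struct sketch_task *producer; /* NULL for files that come from the filesystem */
    int refcount;                 /* consumers (tasks or the user) still needing it */
    int deleted;
};

struct sketch_task {
    struct sketch_file *inputs[SKETCH_MAX_INPUTS]; /* files this task consumes */
    size_t input_count;
    int refcount;                 /* live output files pointing back at this task */
    int deleted;
};

static void sketch_file_unref(struct sketch_file *f);

/* Drop one reference to a task; when none remain, release its inputs. */
static void sketch_task_unref(struct sketch_task *t)
{
    assert(t != NULL && t->refcount > 0);
    if (--t->refcount > 0)
        return;
    t->deleted = 1;
    for (size_t i = 0; i < t->input_count; i++)
        sketch_file_unref(t->inputs[i]);
}

/* Drop one reference to a file; when none remain, release its producer. */
static void sketch_file_unref(struct sketch_file *f)
{
    assert(f != NULL && f->refcount > 0);
    if (--f->refcount > 0)
        return;
    f->deleted = 1;
    if (f->producer)
        sketch_task_unref(f->producer);
}
```

Releasing the final output file walks the chain backwards: the output goes first, then the task that produced it, then that task's inputs, and so on back to the beginning. A file that is still needed by another consumer keeps a nonzero count and stops the cascade.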
-
And so what we are doing is effectively enabling dynamic construction of a DAG, as opposed to submit-wait dispatch.
-
Now that I am playing around with the code, I see that there is also a garbage collection problem: In the current API, we give the details of each file each time a task input or output is declared:
However, this complicates the process of matching up common references to files for which properties must be tracked across tasks. For temporary files in particular, we want to do this:
However, that doesn't quite work for several reasons:
Right now, we get around that by making copies of the file. So
That works for the moment, but is a real pain for users. What we really want is this:
To get to that point, we need to make several big changes:
1 - Separate the definition of a file from how it is mounted into the task. This requires changing the task input files and output files into lists of
-
PR #3116 gets us part of the way there by separating vine_mount from vine_file. This allows files to have a lifetime separate from that of tasks. We can do this now:
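To illustrate the separation in isolation, here is a self-contained sketch. The sketch_* types only mimic the idea of a file object versus a mount and are not the real vine_file or vine_mount definitions:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_MOUNTS 8

/* The file: what the data is. It exists independently of any task. */
struct sketch_file {
    const char *source;
    int refcount; /* number of mounts currently referring to this file */
};

/* The mount: how one task sees one file. */
struct sketch_mount {
    struct sketch_file *file;
    const char *remote_name; /* name of the file inside the task */
    int flags;
};

struct sketch_task {
    struct sketch_mount inputs[SKETCH_MAX_MOUNTS];
    size_t input_count;
};

/* Mount an existing file into a task under a task-local name.
   Returns 1 on success, 0 if the task has no free mount slots. */
static int sketch_task_add_input(struct sketch_task *t, struct sketch_file *f,
                                 const char *remote_name, int flags)
{
    assert(t != NULL && f != NULL && remote_name != NULL);
    if (t->input_count == SKETCH_MAX_MOUNTS)
        return 0;
    struct sketch_mount *m = &t->inputs[t->input_count++];
    m->file = f;
    m->remote_name = remote_name;
    m->flags = flags;
    f->refcount++;
    return 1;
}
```

With this split, one file object can be mounted into several tasks under different names without being copied, and per-file properties are tracked in a single place.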
-
PR #3157 gets us closer again by declaring file objects through the manager.
-
Implemented!
-
Some thoughts about TaskVine files:
On files in TaskVine
A TaskVine file is any named data read or written (usually to disk) by TaskVine. TaskVine files include regular files and directories. TaskVine files are first-class citizens, which means they have a life-cycle independent of managers, tasks, and workers.
A TaskVine file may be designated as input or output to a task. There are two restrictions:
Names
TaskVine has three different namespaces for files:
cache namespace: This namespace has global scope, that is, it encompasses all managers, workers, and tasks. A file name in this namespace is called a cache-name, and it is computed by TaskVine. The cache-name may be used as a global handle for the named data.
task namespace: The namespace for files used to execute a single task. Each task has its own task namespace, thus task-names may be repeated across tasks. No two different cache-names may be assigned the same task-name inside a given task namespace. The task-name is given by the creator of the task (usually the user, but sometimes TaskVine with mini-tasks).
manager namespace: The namespace for files at the manager. This includes named data at the manager's filesystem, and global urls. A manager-name is given by the user, and not all TaskVine files have a manager-name. According to their role in a workflow, only three types of files need a manager-name:
A TaskVine file can be declared and released by a manager, or added to a task:
Declaring a TaskVine file
Any file that should be managed by TaskVine should be first declared:
where manager_name is the name of the file at the manager, and flags is an or-ed combination of VINE_NOCACHE, VINE_CACHE, and VINE_UNPACK. All these functions return a string that uniquely identifies the file. (E.g., hash of the cache-name if we want to be mysterious.)
Release a TaskVine file
where fileid is one of the strings returned by the vine_manager_declare_* calls above. The above call simply marks the file to be released, but files are not released until there are no more active tasks (that is, running, waiting, or a declared minitask) using them. Once no more tasks are using the file (files keep track of which tasks are using them), then:
Adding a TaskVine file to a task
A declared file can be added to a task as input or output:
where flags is an or-ed combination of VINE_WATCH, VINE_FAILURES_ONLY, and VINE_SUCCESS_ONLY. Note that no flags are needed for input files.
Convenience functions
A file can be declared and added to a task at the same time with calls such as:
For these calls, TaskVine guarantees that two files with the same manager_name will return the same fileid.
Further, one can obtain the fileid of a file by its task_name:
fileid vine_get_fileid(m, t, task_name);
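The guarantee that the same manager_name always yields the same fileid could be met with a small lookup table at the manager. The sketch below assumes a fixed-size table and a counter-based fileid purely for illustration; none of these names are TaskVine functions:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_FILES 16

struct sketch_entry {
    char manager_name[64];
    char fileid[16];
};

struct sketch_manager {
    struct sketch_entry files[SKETCH_MAX_FILES];
    int count;
};

/* Declare a file by its manager_name. Declaring the same name twice returns
   the fileid from the first declaration. Returns NULL if the table is full
   or the name does not fit. */
static const char *sketch_declare(struct sketch_manager *m,
                                  const char *manager_name)
{
    assert(m != NULL && manager_name != NULL);
    for (int i = 0; i < m->count; i++) {
        if (strcmp(m->files[i].manager_name, manager_name) == 0)
            return m->files[i].fileid;
    }
    if (m->count == SKETCH_MAX_FILES ||
        strlen(manager_name) >= sizeof m->files[0].manager_name)
        return NULL;
    struct sketch_entry *e = &m->files[m->count];
    strcpy(e->manager_name, manager_name);
    snprintf(e->fileid, sizeof e->fileid, "file-%d", m->count);
    m->count++;
    return e->fileid;
}
```

A convenience call that declares and adds a file in one step would go through the same lookup, so repeated use of one manager_name across tasks never creates a second file entry.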
Minitasks and TaskVine files
Minitasks are tasks that get executed to populate the worker's cache with particular files. They have the same syntax as regular tasks, only that they get executed as many times as required. Further, they can be chained, as:
With the above, TaskVine needs to create the file named file_at_task_b for task t. But for that, it first needs to generate file_at_task_a.
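The ordering constraint in a minitask chain can be sketched as a recursive walk: before a minitask runs, its own input is materialized first. The types and the single-input restriction below are simplifying assumptions, not how TaskVine represents minitasks:

```c
#include <assert.h>
#include <stddef.h>

struct sketch_file;

struct sketch_minitask {
    struct sketch_file *input; /* file this minitask consumes, or NULL */
    int runs;                  /* how many times it has been executed */
};

struct sketch_file {
    struct sketch_minitask *producer; /* minitask that creates it, or NULL */
    int present;                      /* 1 if already in the worker's cache */
};

/* Make sure f is present, running producing minitasks as needed.
   Returns the number of minitasks executed by this call. */
static int sketch_materialize(struct sketch_file *f)
{
    assert(f != NULL);
    if (f->present)
        return 0;
    int ran = 0;
    if (f->producer) {
        /* The producer's input has to exist before the producer can run. */
        if (f->producer->input)
            ran += sketch_materialize(f->producer->input);
        f->producer->runs++;
        ran++;
    }
    f->present = 1; /* produced by a minitask, or simply transferred */
    return ran;
}
```

Asking for file_at_task_b on an empty cache runs both minitasks, the one for file_at_task_a first. If file_at_task_b is later lost while file_at_task_a survives, only the second minitask runs again, which matches minitasks being executed as many times as required.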