[SYCL][Graph] Add initial draft of malloc/free nodes ... #352
Conversation
I tried not to nitpick this too much until we agree on the high-level design, but I think this looks good. The API is simple, there is a clear mapping from CUDA Graph entry points for porting code, and the dependency on sycl_ext_oneapi_virtual_mem hopefully means we can hand off the low-level details to that extension.
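For context, a minimal sketch of how add-time reservation and finalize-time mapping might lean on that extension. The `vmem` names below are from the sycl_ext_oneapi_virtual_mem proposal; the split between add time and finalize time is an assumption of this thread, not something this PR specifies.

```c++
#include <sycl/sycl.hpp>
#include <cstdint>
namespace vmem = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  sycl::context ctx = q.get_context();
  sycl::device dev = q.get_device();

  // Round the request up to the device's mapping granularity.
  size_t gran = vmem::get_mem_granularity(dev, ctx);
  size_t bytes = ((1024 * sizeof(float) + gran - 1) / gran) * gran;

  // add_malloc_device() time: reserve virtual address space only.
  uintptr_t va = vmem::reserve_virtual_mem(bytes, ctx);

  // finalize() time: create physical memory and map the reservation.
  vmem::physical_mem pm{dev, ctx, bytes};
  void *ptr = pm.map(va, bytes, vmem::address_access_mode::read_write);

  // add_free() / executable graph teardown: unmap and release.
  vmem::unmap(ptr, bytes, ctx);
  vmem::free_virtual_mem(va, bytes, ctx);
}
```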
> won't be valid until the allocation is made on graph finalization, as allocating
> at finalization is the only way to benefit from the known graph scope for optimal
> memory allocation, and even optimize to eliminate some allocations entirely.
I think people will have the question "if I have two executable graphs created from the same modifiable graph, will that node result in two different memory allocations?", so we should state something about that.
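Purely to frame the ambiguity (names follow the example later in this thread; the answer is not specified anywhere in the current draft):

```c++
auto [ptr, alloc_node] = my_graph.add_malloc_device<float>(N);
auto exec_a = my_graph.finalize();
auto exec_b = my_graph.finalize();
// Under allocate-at-finalization semantics, do exec_a and exec_b each
// bind ptr to their own physical allocation, and if so, what does the
// single ptr value point to when both are submitted?
```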
> Returns a pair of a pointer to memory and a node. The pointer is allocated on the `device`
> that is associated with the current graph by first execution of the `command_graph`.
> first execution of the `command_graph`

Question about whether we want to tighten this to say something like "allocated during the finalization of the command-graph to the returned executable graph". The current wording leaves open the suggestion that if you never execute the graph, nothing will be allocated, but we will probably do the allocation on finalization, which means less latency when submitting the graph.
How about "may be allocated during the finalization ..."? That way we keep our options open, depending on the backend.
Shall we align with line 1723 here? Line 1723 says allocating at finalization is the only way.
> Exceptions:
>
> * Throws synchronously with error code `feature_not_supported` if any devices in `context`
We don't have a context parameter, so something like "if the device associated with the command graph does not have ...".
Thanks, I meant: "any device in the context associated with the graph ..." but let's be more accurate!
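Just to illustrate how this would surface to users, assuming the error arrives as a standard `sycl::exception` (the handling below is a sketch, not part of the PR):

```c++
try {
  auto [ptr, alloc_node] = my_graph.add_malloc_device<float>(N);
} catch (const sycl::exception &e) {
  // feature_not_supported here means some device in the context
  // associated with the graph lacks the required support.
  if (e.code() == sycl::errc::feature_not_supported) {
    // Fall back to a regular USM allocation outside the graph.
  }
}
```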
> can be passed here with a list of nodes to create dependency edges on.
Heads up that once intel#12366 merges, I think we'll need to extend the `node_type` enum to include `malloc_device` and `free`.
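Something along these lines; the existing enumerators are my reading of the `node_type` enum proposed in intel#12366, and the last two are the suggested additions:

```c++
enum class node_type {
  empty,
  subgraph,
  kernel,
  memcpy,
  memset,
  memfill,
  prefetch,
  memadvise,
  ext_oneapi_barrier,
  host_task,
  malloc_device, // proposed addition for this extension
  free           // proposed addition for this extension
};
```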
An example with structured bindings could look like:

```c++
sycl_ext::command_graph my_graph(my_queue);

float scalar = 2.0f;
size_t N = 1024;
float *ptrX = malloc_device<float>(N, my_queue); // allocate device memory

// Only reserves virtual memory; no physical allocation yet.
auto [ptrY, node_a] = my_graph.add_malloc_device<float>(N);

auto node_b = my_graph.add(
    [=](handler &cgh) {
      cgh.parallel_for(N, [=](id<1> it) { ptrX[it] += scalar * ptrY[it]; });
    },
    {sycl_ext::property::node::depends_on(node_a)});

auto node_c = my_graph.add_free(ptrY);

auto my_exec = my_graph.finalize(); // may allocate physical memory and map
my_queue.ext_oneapi_graph(my_exec).wait();
```
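Note that `ptrY` can be captured by value in the kernel lambda even though it is not yet backed by a physical allocation at graph construction time: the captured value is just the reserved virtual address, which only becomes valid once the allocation is made for the executable graph.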
```c++
std::pair<void*, node>
add_malloc_device(size_t num_bytes, const property_list& propList = {});
```
Could we have this templated so that users can call it similarly to `sycl::malloc_device<T>`? It just saves users having to do a cast after the call.
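For instance (a suggested signature only, mirroring the `sycl::malloc_device<T>` free function):

```c++
// count is in elements rather than bytes, matching sycl::malloc_device<T>,
// and the returned pointer is already typed so no cast is needed.
template <typename T>
std::pair<T*, node>
add_malloc_device(size_t count, const property_list& propList = {});
```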
Yes, will add. My only reason to start with a single version of this API was not having to update too many functions, in case we change the basic design of this interface.
> USM allocations to be used in the graph submission. Before coming to this
> recommendation we considered the following explicit graph building interfaces
> for adding a memory allocation owned by the graph:
>
> The following interfaces enable users to define a memory allocation/free operation
Is there another plan/draft to record memory allocation/free with the record & replay APIs? Current functions such as `sycl::malloc_device` are not recorded into a graph by the record & replay APIs; we may need new async functions such as `sycl::malloc_device_async` with backend support for deferred memory allocation, as proposed in intel#8954.
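For illustration, the kind of usage that question is pointing at; `malloc_device_async` and `free_async` are hypothetical names from this thread and intel#8954, not existing API:

```c++
// Hypothetical record & replay flow with deferred allocation.
my_graph.begin_recording(my_queue);
float *tmp = sycl_ext::malloc_device_async<float>(N, my_queue); // hypothetical
my_queue.parallel_for(N, [=](id<1> it) { tmp[it] = 0.0f; });
sycl_ext::free_async(tmp, my_queue); // hypothetical
my_graph.end_recording(my_queue);
```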