
Execution of Inference

Mingyu Kim edited this page Mar 14, 2022 · 12 revisions


Network execution happens when the user calls inferRequest->infer() or inferRequest->start_async(). (link)

At a high level, all we need to do is enqueue OCL kernels with their buffers. For that purpose, we need to find the cldnn::network instance, as it contains the buffers required for execution. (TBD: Link to data structure doc) CPUStreamExecutor holds the streams, and each stream corresponds to a cldnn::network structure. (link)

The main body of network execution is cldnn::network::execute_impl. (link) In this function, set_arguments() is called to set the OpenCL kernel arguments, and execute_primitive is called to enqueue the kernels to the OCL queue. For a synchronous API call (i.e. inferRequest->infer()), we also need to wait for the kernels to complete. That wait is triggered from the cldnn::network_output::get_memory() function. (link)
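The flow described above can be sketched with a small mock. This is only an illustration, not the plugin's code: `MockQueue`, `MockPrimitive`, `MockNetwork`, `get_output_memory` and `run_sync_inference` are hypothetical names, and the sketch assumes that arguments are set for every primitive before any kernel is enqueued and that the wait happens only when the output is requested.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an in-order OCL queue; it only records what happened.
struct MockQueue {
    std::vector<std::string> log;
    bool finished = true;
    void enqueue(const std::string& kernel) {
        log.push_back("enqueue:" + kernel);
        finished = false;  // work is now pending on the "device"
    }
    void finish() {
        log.push_back("finish");
        finished = true;
    }
};

// Hypothetical stand-in for a primitive instance with one kernel.
struct MockPrimitive {
    std::string name;
    bool args_set = false;
    explicit MockPrimitive(std::string n) : name(std::move(n)) {}
};

// Hypothetical stand-in for cldnn::network (assumption: greatly simplified).
struct MockNetwork {
    MockQueue queue;
    std::vector<MockPrimitive> exec_order;

    // Bind buffers to kernel arguments for every primitive.
    void set_arguments() {
        for (auto& p : exec_order) {
            p.args_set = true;
            queue.log.push_back("set_args:" + p.name);
        }
    }

    // Enqueue one kernel; this does not wait for completion.
    void execute_primitive(const MockPrimitive& p) {
        assert(p.args_set && "arguments must be set before enqueue");
        queue.enqueue(p.name);
    }

    void execute_impl() {
        set_arguments();
        for (const auto& p : exec_order) execute_primitive(p);
    }

    // Synchronous path: wait for pending kernels when the output is requested.
    std::string get_output_memory() {
        if (!queue.finished) queue.finish();
        return "output";
    }
};

// Runs a two-primitive network synchronously and returns the operation log.
std::vector<std::string> run_sync_inference() {
    MockNetwork net;
    net.exec_order.emplace_back("conv1");
    net.exec_order.emplace_back("relu1");
    net.execute_impl();
    net.get_output_memory();
    return net.queue.log;
}
```

In this mock, the log returned by `run_sync_inference()` lists the argument setup first, then the enqueues, and the wait last. An asynchronous caller would simply defer the call that triggers the wait.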

Intermediate buffer dump during execution

cldnn::network::execute_impl also contains logic to dump intermediate buffers for debugging purposes. As this is related to memory usage, it deserves some description, too.

To dump an intermediate buffer, we need to catch the moment when the kernel is about to be called (for a source buffer) or has just been called (for a destination buffer). At any other moment the intermediate contents are not available, because the buffers are reused from the memory pool. (TBD: Link to data structure doc)
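The effect of buffer reuse on dumping can be shown with a toy model. Everything here is hypothetical (`PooledBuffer`, `Layer`, `run_layers`, `demo_reuse`); it only assumes that two layers with non-overlapping lifetimes may be handed the same pooled buffer, so an earlier layer's output is overwritten by the time the network finishes.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical pooled buffer; the real memory pool is far more involved.
struct PooledBuffer {
    std::vector<float> data;
};

// Hypothetical layer: its "kernel" fills the output buffer with one value.
struct Layer {
    std::string name;
    PooledBuffer* out;
    float fill_value;
};

using DumpFn = std::function<void(const std::string&, const std::vector<float>&)>;

// Runs layers in order and calls the dump hook right after each "kernel".
void run_layers(std::vector<Layer>& layers, const DumpFn& dump_after_each) {
    for (auto& layer : layers) {
        for (auto& v : layer.out->data) v = layer.fill_value;
        if (dump_after_each) dump_after_each(layer.name, layer.out->data);
    }
}

struct ReuseResult {
    float dumped_layer_a;  // value captured right after layer "a" ran
    float buf0_after_run;  // value found in the same buffer after the whole run
};

ReuseResult demo_reuse() {
    PooledBuffer buf0{std::vector<float>(4, 0.f)};
    PooledBuffer buf1{std::vector<float>(4, 0.f)};
    // Layers "a" and "c" share buf0 because their lifetimes do not overlap.
    std::vector<Layer> layers = {
        Layer{"a", &buf0, 1.f},
        Layer{"b", &buf1, 2.f},
        Layer{"c", &buf0, 3.f},
    };
    std::map<std::string, std::vector<float>> dumps;
    run_layers(layers, [&dumps](const std::string& name, const std::vector<float>& d) {
        dumps[name] = d;  // copy the contents while they are still valid
    });
    return ReuseResult{dumps["a"][0], buf0.data[0]};
}
```

In this model the value captured for layer "a" differs from what the shared buffer holds after the run, which is why the dump has to happen around the kernel call rather than at the end of execution.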

get_stream().finish() is called first, because the dump must be synchronized with kernel execution. (link) Then we access the intermediate buffer. (link) How it is accessed depends on the kind of buffer. If it is usm_host or usm_shared, it is accessed directly. If it is usm_device, it is accessed after copying the data into host memory, because the host cannot access usm_device memory directly. (link) If it is OCL memory, we map it into host memory. (link) Typical network execution uses usm_host for the network inputs and outputs and usm_device for the buffers inside the network.
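The per-buffer-kind access rules above can be summarized as a small dispatch. This is a minimal sketch with hypothetical types (`AllocType`, `AccessPath`, `MockMemory`, `MockStream`, `choose_access`, `read_for_dump`); it mirrors only the decision logic described in the text, not the actual cldnn memory classes or OpenCL calls.

```cpp
#include <cassert>
#include <vector>

// Hypothetical enumeration of the buffer kinds mentioned in the text.
enum class AllocType { usm_host, usm_shared, usm_device, cl_mem };

// How the host gets at the data for a dump.
enum class AccessPath { direct, copy_to_host, map_to_host };

AccessPath choose_access(AllocType type) {
    switch (type) {
    case AllocType::usm_host:
    case AllocType::usm_shared:
        return AccessPath::direct;        // host can dereference the pointer
    case AllocType::usm_device:
        return AccessPath::copy_to_host;  // host cannot read device USM directly
    case AllocType::cl_mem:
        return AccessPath::map_to_host;   // map the OCL buffer into host memory
    }
    return AccessPath::direct;  // unreachable; silences compiler warnings
}

// Hypothetical buffer: a host vector stands in for the real storage.
struct MockMemory {
    AllocType type;
    std::vector<float> storage;
};

// Hypothetical stream that only counts the operations performed on it.
struct MockStream {
    int finish_calls = 0;
    int device_to_host_copies = 0;
    int maps = 0;
    void finish() { ++finish_calls; }
};

// Returns host-readable contents of an intermediate buffer for dumping.
std::vector<float> read_for_dump(const MockMemory& mem, MockStream& stream) {
    stream.finish();  // synchronize with kernel execution first
    switch (choose_access(mem.type)) {
    case AccessPath::copy_to_host:
        ++stream.device_to_host_copies;  // stands in for a device-to-host copy
        break;
    case AccessPath::map_to_host:
        ++stream.maps;                   // stands in for mapping the buffer
        break;
    case AccessPath::direct:
        break;                           // nothing extra to do
    }
    return mem.storage;
}
```

Keeping the decision in one function makes it easy to see that only usm_device buffers pay for an extra copy, which matters because, per the text, those are the typical buffers inside the network.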
