From af2fec9a00ae9dcf6df631120793ad157f13f041 Mon Sep 17 00:00:00 2001 From: Andrey Zaytsev Date: Tue, 29 Jun 2021 18:16:46 +0300 Subject: [PATCH] Feature/azaytsev/changes from baychub colin q2 (#6437) * Q2 changes * Changed Convert_RNNT.md Co-authored-by: baychub --- docs/IE_DG/Int8Inference.md | 18 +-- .../pytorch_specific/Convert_F3Net.md | 15 ++- .../pytorch_specific/Convert_RNNT.md | 19 ++- .../installing-openvino-conda.md | 6 +- .../install_guides/installing-openvino-pip.md | 14 +- docs/install_guides/pypi-openvino-dev.md | 9 +- .../dldt_optimization_guide.md | 121 +++++++++--------- 7 files changed, 105 insertions(+), 97 deletions(-) diff --git a/docs/IE_DG/Int8Inference.md b/docs/IE_DG/Int8Inference.md index 917c7836de293b..335333e17eeeb9 100644 --- a/docs/IE_DG/Int8Inference.md +++ b/docs/IE_DG/Int8Inference.md @@ -17,25 +17,25 @@ Low-precision 8-bit inference is optimized for: ## Introduction -A lot of investigation was made in the field of deep learning with the idea of using low precision computations during inference in order to boost deep learning pipelines and gather higher performance. For example, one of the popular approaches is to shrink the precision of activations and weights values from `fp32` precision to smaller ones, for example, to `fp11` or `int8`. For more information about this approach, refer to +A lot of investigation was made in the field of deep learning with the idea of using low-precision computation during inference in order to boost deep learning pipelines and achieve higher performance. For example, one of the popular approaches is to shrink the precision of activations and weights values from `fp32` precision to smaller ones, for example, to `fp11` or `int8`. For more information about this approach, refer to the **Brief History of Lower Precision in Deep Learning** section in [this whitepaper](https://software.intel.com/en-us/articles/lower-numerical-precision-deep-learning-inference-and-training). -8-bit computations (referred to as `int8`) offer better performance compared to the results of inference in higher precision (for example, `fp32`), because they allow loading more data into a single processor instruction. Usually the cost for significant boost is a reduced accuracy. However, it is proved that an accuracy drop can be negligible and depends on task requirements, so that the application engineer can set up the maximum accuracy drop that is acceptable. +8-bit computation (referred to as `int8`) offers better performance compared to the results of inference in higher precision (for example, `fp32`), because they allow loading more data into a single processor instruction. Usually the cost for significant boost is reduced accuracy. However, it has been proven that the drop in accuracy can be negligible and depends on task requirements, so that an application engineer configure the maximum accuracy drop that is acceptable. - -Let's explore quantized [TensorFlow* implementation of ResNet-50](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/resnet-50-tf) model. Use [Model Downloader](@ref omz_tools_downloader) tool to download the `fp16` model from [OpenVINO™ Toolkit - Open Model Zoo repository](https://github.com/openvinotoolkit/open_model_zoo): +Let's explore the quantized [TensorFlow* implementation of ResNet-50](https://github.com/openvinotoolkit/open_model_zoo/tree/master/models/public/resnet-50-tf) model. 
Use the [Model Downloader](@ref omz_tools_downloader) tool to download the `fp16` model from [OpenVINO™ Toolkit - Open Model Zoo repository](https://github.com/openvinotoolkit/open_model_zoo): ```sh -./downloader.py --name resnet-50-tf --precisions FP16-INT8 +cd $INTEL_OPENVINO_DIR/deployment_tools/tools/model_downloader +./downloader.py --name resnet-50-tf --precisions FP16-INT8 --output_dir ``` -After that you should quantize model by the [Model Quantizer](@ref omz_tools_downloader) tool. +After that, you should quantize the model by the [Model Quantizer](@ref omz_tools_downloader) tool. For the dataset, you can choose to download the ImageNet dataset from [here](https://www.image-net.org/download.php). ```sh -./quantizer.py --model_dir public/resnet-50-tf --dataset_dir --precisions=FP16-INT8 +./quantizer.py --model_dir --name public/resnet-50-tf --dataset_dir --precisions=FP16-INT8 ``` -The simplest way to infer the model and collect performance counters is [C++ Benchmark Application](../../inference-engine/samples/benchmark_app/README.md). +The simplest way to infer the model and collect performance counters is the [C++ Benchmark Application](../../inference-engine/samples/benchmark_app/README.md). ```sh ./benchmark_app -m resnet-50-tf.xml -d CPU -niter 1 -api sync -report_type average_counters -report_folder pc_report_dir ``` -If you infer the model with the OpenVINO™ CPU plugin and collect performance counters, all operations (except last not quantized SoftMax) are executed in INT8 precision. +If you infer the model with the OpenVINO™ CPU plugin and collect performance counters, all operations (except the last non-quantized SoftMax) are executed in INT8 precision. ## Low-Precision 8-bit Integer Inference Workflow diff --git a/docs/MO_DG/prepare_model/convert_model/pytorch_specific/Convert_F3Net.md b/docs/MO_DG/prepare_model/convert_model/pytorch_specific/Convert_F3Net.md index ffb16eb5f7cc5f..1d26b35e97379a 100644 --- a/docs/MO_DG/prepare_model/convert_model/pytorch_specific/Convert_F3Net.md +++ b/docs/MO_DG/prepare_model/convert_model/pytorch_specific/Convert_F3Net.md @@ -2,15 +2,20 @@ [F3Net](https://github.com/weijun88/F3Net): Fusion, Feedback and Focus for Salient Object Detection +## Clone the F3Net Repository + +To clone the repository, run the following command: + +```sh +git clone http://github.com/weijun88/F3Net.git" +``` + ## Download and Convert the Model to ONNX* To download the pre-trained model or train the model yourself, refer to the -[instruction](https://github.com/weijun88/F3Net/blob/master/README.md) in the F3Net model repository. Firstly, -convert the model to ONNX\* format. Create and run the script with the following content in the `src` -directory of the model repository: +[instruction](https://github.com/weijun88/F3Net/blob/master/README.md) in the F3Net model repository. First, convert the model to ONNX\* format. Create and run the following Python script in the `src` directory of the model repository: ```python import torch - from dataset import Config from net import F3Net @@ -19,7 +24,7 @@ net = F3Net(cfg) image = torch.zeros([1, 3, 352, 352]) torch.onnx.export(net, image, 'f3net.onnx', export_params=True, do_constant_folding=True, opset_version=11) ``` -The script generates the ONNX\* model file f3net.onnx. The model conversion was tested with the repository hash commit `eecace3adf1e8946b571a4f4397681252f9dc1b8`. +The script generates the ONNX\* model file f3net.onnx. 
This model conversion was tested with the repository hash commit `eecace3adf1e8946b571a4f4397681252f9dc1b8`. ## Convert ONNX* F3Net Model to IR diff --git a/docs/MO_DG/prepare_model/convert_model/pytorch_specific/Convert_RNNT.md b/docs/MO_DG/prepare_model/convert_model/pytorch_specific/Convert_RNNT.md index a58e886d4f4230..31de647f379158 100644 --- a/docs/MO_DG/prepare_model/convert_model/pytorch_specific/Convert_RNNT.md +++ b/docs/MO_DG/prepare_model/convert_model/pytorch_specific/Convert_RNNT.md @@ -20,15 +20,15 @@ mkdir rnnt_for_openvino cd rnnt_for_openvino ``` -**Step 3**. Download pretrained weights for PyTorch implementation from https://zenodo.org/record/3662521#.YG21DugzZaQ. -For UNIX*-like systems you can use wget: +**Step 3**. Download pretrained weights for PyTorch implementation from [https://zenodo.org/record/3662521#.YG21DugzZaQ](https://zenodo.org/record/3662521#.YG21DugzZaQ). +For UNIX*-like systems you can use `wget`: ```bash wget https://zenodo.org/record/3662521/files/DistributedDataParallel_1576581068.9962234-epoch-100.pt ``` The link was taken from `setup.sh` in the `speech_recoginitin/rnnt` subfolder. You will get exactly the same weights as -if you were following the steps from https://github.com/mlcommons/inference/tree/master/speech_recognition/rnnt. +if you were following the steps from [https://github.com/mlcommons/inference/tree/master/speech_recognition/rnnt](https://github.com/mlcommons/inference/tree/master/speech_recognition/rnnt). -**Step 4**. Install required python* packages: +**Step 4**. Install required Python packages: ```bash pip3 install torch toml ``` @@ -37,7 +37,7 @@ pip3 install torch toml `export_rnnt_to_onnx.py` and run it in the current directory `rnnt_for_openvino`: > **NOTE**: If you already have a full clone of MLCommons inference repository, you need to -> specify `mlcommons_inference_path` variable. +> specify the `mlcommons_inference_path` variable. ```python import toml @@ -92,8 +92,7 @@ torch.onnx.export(model.joint, (f, g), "rnnt_joint.onnx", opset_version=12, python3 export_rnnt_to_onnx.py ``` -After completing this step, the files rnnt_encoder.onnx, rnnt_prediction.onnx, and rnnt_joint.onnx will be saved in -the current directory. +After completing this step, the files `rnnt_encoder.onnx`, `rnnt_prediction.onnx`, and `rnnt_joint.onnx` will be saved in the current directory. **Step 6**. Run the conversion command: @@ -102,6 +101,6 @@ python3 {path_to_openvino}/mo.py --input_model rnnt_encoder.onnx --input "input. python3 {path_to_openvino}/mo.py --input_model rnnt_prediction.onnx --input "input.1[1 1],1[2 1 320],2[2 1 320]" python3 {path_to_openvino}/mo.py --input_model rnnt_joint.onnx --input "0[1 1 1024],1[1 1 320]" ``` -Please note that hardcoded value for sequence length = 157 was taken from the MLCommons, but conversion to IR preserves -network [reshapeability](../../../../IE_DG/ShapeInference.md); this means you can change input shapes manually to any value either during conversion or -inference. +Please note that hardcoded value for sequence length = 157 was taken from the MLCommons but conversion to IR preserves +network [reshapeability](../../../../IE_DG/ShapeInference.md), this means you can change input shapes manually to any value either during conversion or +inference. 
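For illustration, the converted IR can also be reshaped at runtime rather than at conversion time. Below is a minimal sketch, assuming the OpenVINO™ 2021.x Inference Engine Python API; the file names follow from converting `rnnt_encoder.onnx` with the command above, and the new sequence length of 200 is an arbitrary example:

```python
# A minimal sketch: reshape the converted RNN-T encoder IR to a new sequence
# length at runtime. Assumes the OpenVINO™ 2021.x Inference Engine Python API;
# file names follow from converting rnnt_encoder.onnx, and the new length
# of 200 is an arbitrary example.
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="rnnt_encoder.xml", weights="rnnt_encoder.bin")

# The IR above was converted with a hardcoded sequence length of 157;
# reshape only the first input, the other inputs keep their shapes.
net.reshape({"input.1": [200, 1, 240]})

exec_net = ie.load_network(network=net, device_name="CPU")
```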
\ No newline at end of file diff --git a/docs/install_guides/installing-openvino-conda.md b/docs/install_guides/installing-openvino-conda.md index a5cefbfb97e579..fa607605410cd8 100644 --- a/docs/install_guides/installing-openvino-conda.md +++ b/docs/install_guides/installing-openvino-conda.md @@ -31,6 +31,10 @@ This guide provides installation steps for Intel® Distribution of OpenVINO™ t conda update --all ``` 3. Install the Intel® Distribution of OpenVINO™ Toolkit: + - Ubuntu* 20.04 + ```sh + conda install openvino-ie4py-ubuntu20 -c intel + ``` - Ubuntu* 18.04 ```sh conda install openvino-ie4py-ubuntu18 -c intel @@ -47,7 +51,7 @@ This guide provides installation steps for Intel® Distribution of OpenVINO™ t ```sh python -c "import openvino" ``` - + Now you can start to develop and run your application. diff --git a/docs/install_guides/installing-openvino-pip.md b/docs/install_guides/installing-openvino-pip.md index 7a639faff86120..25f362482486f7 100644 --- a/docs/install_guides/installing-openvino-pip.md +++ b/docs/install_guides/installing-openvino-pip.md @@ -1,15 +1,15 @@ # Install Intel® Distribution of OpenVINO™ Toolkit from PyPI Repository {#openvino_docs_install_guides_installing_openvino_pip} -OpenVINO™ toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing, recommendation systems, and many others. Based on latest generations of artificial neural networks, including Convolutional Neural Networks (CNNs), recurrent and attention-based networks, the toolkit extends computer vision and non-vision workloads across Intel® hardware, maximizing performance. It accelerates applications with high-performance, AI and deep learning inference deployed from edge to cloud. +OpenVINO™ toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing, recommendation systems, and many others. Based on the latest generations of artificial neural networks, including Convolutional Neural Networks (CNNs), recurrent and attention-based networks, the toolkit extends computer vision and non-vision workloads across Intel® hardware, maximizing performance. It accelerates applications with high-performance AI and deep learning inference deployed from edge to cloud. Intel® Distribution of OpenVINO™ Toolkit provides the following packages available for installation through the PyPI repository: -* Runtime package with the Inference Engine inside: [https://pypi.org/project/openvino/](https://pypi.org/project/openvino/). -* Developer package that includes the runtime package as a dependency, Model Optimizer and other developer tools: [https://pypi.org/project/openvino-dev](https://pypi.org/project/openvino-dev). +* Runtime package with the Inference Engine inside: [https://pypi.org/project/openvino/](https://pypi.org/project/openvino/) +* Developers package (including the runtime package as a dependency), Model Optimizer, Accuracy Checker and Post-Training Optimization Tool: [https://pypi.org/project/openvino-dev](https://pypi.org/project/openvino-dev) ## Additional Resources -- [Intel® Distribution of OpenVINO™ toolkit](https://software.intel.com/en-us/openvino-toolkit). -- [Model Optimizer Developer Guide](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md). 
-- [Inference Engine Developer Guide](../IE_DG/Deep_Learning_Inference_Engine_DevGuide.md). -- [Inference Engine Samples Overview](../IE_DG/Samples_Overview.md). +- [Intel® Distribution of OpenVINO™ toolkit](https://software.intel.com/en-us/openvino-toolkit) +- [Model Optimizer Developer Guide](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md) +- [Inference Engine Developer Guide](../IE_DG/Deep_Learning_Inference_Engine_DevGuide.md) +- [Inference Engine Samples Overview](../IE_DG/Samples_Overview.md) diff --git a/docs/install_guides/pypi-openvino-dev.md b/docs/install_guides/pypi-openvino-dev.md index 89bb5f3db614a3..0cdbd009a2c956 100644 --- a/docs/install_guides/pypi-openvino-dev.md +++ b/docs/install_guides/pypi-openvino-dev.md @@ -11,7 +11,7 @@ license terms for third party or open source software included in or with the So OpenVINO™ toolkit is a comprehensive toolkit for quickly developing applications and solutions that solve a variety of tasks including emulation of human vision, automatic speech recognition, natural language processing, recommendation systems, and many others. Based on latest generations of artificial neural networks, including Convolutional Neural Networks (CNNs), recurrent and attention-based networks, the toolkit extends computer vision and non-vision workloads across Intel® hardware, maximizing performance. It accelerates applications with high-performance, AI and deep learning inference deployed from edge to cloud. -The **developer package** includes the following components installed by default: +**The developer package includes the following components installed by default:** | Component | Console Script | Description | |------------------|---------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| @@ -21,8 +21,9 @@ The **developer package** includes the following components installed by default | [Post-Training Optimization Tool](https://docs.openvinotoolkit.org/latest/pot_README.html)| `pot` |**Post-Training Optimization Tool** allows you to optimize trained models with advanced capabilities, such as quantization and low-precision optimizations, without the need to retrain or fine-tune models. Optimizations are also available through the [API](https://docs.openvinotoolkit.org/latest/pot_compression_api_README.html). | | [Model Downloader and other Open Model Zoo tools](https://docs.openvinotoolkit.org/latest/omz_tools_downloader.html)| `omz_downloader`
`omz_converter`
`omz_quantizer`
`omz_info_dumper`| **Model Downloader** is a tool for getting access to the collection of high-quality and extremely fast pre-trained deep learning [public](https://docs.openvinotoolkit.org/latest/omz_models_group_public.html) and [intel](https://docs.openvinotoolkit.org/latest/omz_models_group_intel.html)-trained models. Use these free pre-trained models instead of training your own models to speed up the development and production deployment process. The principle of the tool is as follows: it downloads model files from online sources and, if necessary, patches them with Model Optimizer to make them more usable. A number of additional tools are also provided to automate the process of working with downloaded models:
**Model Converter** is a tool for converting downloaded models that are stored in a framework-specific format into the Intermediate Representation (IR) using Model Optimizer.
**Model Quantizer** is a tool for automatic quantization of full-precision models in the IR format into low-precision versions using the Post-Training Optimization Tool.
**Model Information Dumper** is a helper utility for dumping information about the models in a stable machine-readable format.| +> **NOTE**: The developer package also installs the OpenVINO™ runtime package as a dependency. -**Developer package** also provides the **runtime package** installed as a dependency. The runtime package includes the following components: +**The runtime package installs the following components:** | Component | Description | |-----------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| @@ -87,10 +88,10 @@ python -m pip install --upgrade pip To install and configure the components of the development package for working with specific frameworks, use the `pip install openvino-dev[extras]` command, where `extras` is a list of extras from the table below: -| DL Framework | Extra | +| DL Framework | Extra | | :------------------------------------------------------------------------------- | :-------------------------------| | [Caffe*](https://caffe.berkeleyvision.org/) | caffe | -| [Caffe2*](https://caffe2.ai/) | caffe2 | +| [Caffe2*](https://caffe2.ai/) | caffe2 | | [Kaldi*](https://kaldi-asr.org/) | kaldi | | [MXNet*](https://mxnet.apache.org/) | mxnet | | [ONNX*](https://github.com/microsoft/onnxruntime/) | onnx | diff --git a/docs/optimization_guide/dldt_optimization_guide.md b/docs/optimization_guide/dldt_optimization_guide.md index e0ce090a420e84..fab8a576e3bfee 100644 --- a/docs/optimization_guide/dldt_optimization_guide.md +++ b/docs/optimization_guide/dldt_optimization_guide.md @@ -8,7 +8,7 @@ For information on the general workflow, refer to the documentation in -Deep Learning Inference Engine is a part of Intel® Deep Learning Deployment Toolkit (Intel® DL Deployment Toolkit) and OpenVINO™ toolkit. Inference Engine facilitates deployment of deep learning solutions by delivering a unified, device-agnostic API. +Deep Learning Inference Engine is a part of OpenVINO™ toolkit. Inference Engine facilitates deployment of deep learning solutions by delivering a unified, device-agnostic API. Below, there are the three main steps of the deployment process: @@ -25,14 +25,14 @@ Below, there are the three main steps of the deployment process: - *Performance flow*: Upon conversion to IR, the execution starts with existing [Inference Engine samples](../IE_DG/Samples_Overview.md) to measure and tweak the performance of the network on different devices.
> **NOTE**: While consuming the same IR, each plugin performs additional device-specific optimizations at load time, so the resulting accuracy might differ. Also, enabling and optimizing custom kernels is error-prone (see Optimizing Custom Kernels). - - *Tools*: Beyond inference performance that samples report (see Latency vs. Throughput), you can get further device- and kernel-level timing with the Inference Engine performance counters and Intel® VTune™. + - *Tools*: Beyond inference performance that samples report (see Latency vs. Throughput), you can get further device- and kernel-level timing with the Inference Engine performance counters and Intel® VTune™. 3. **Integration to the product**
After model inference is verified with the [samples](../IE_DG/Samples_Overview.md), the Inference Engine code is typically integrated into a real application or pipeline. - *Performance flow*: The most important point is to preserve the sustained performance achieved with the stand-alone model execution. Take precautions when combining with other APIs and be careful testing the performance of every integration step. - - *Tools*: Beyond tracking the actual wall-clock time of your application, see Intel® VTune™ Examples for application-level and system-level information. + - *Tools*: Beyond tracking the actual wall-clock time of your application, see Intel® VTune™ Examples for application-level and system-level information. ## Gathering the Performance Numbers @@ -50,12 +50,12 @@ When evaluating performance of your model with the Inference Engine, you must me ### Latency vs. Throughput -In the asynchronous case (see Request-Based API and “GetBlob” Idiom), the performance of an individual infer request is usually of less concern. Instead, you typically execute multiple requests asynchronously and measure the throughput in images per second by dividing the number of images that were processed by the processing time. -In contrast, for the latency-oriented tasks, the time to a single frame is more important. +In the asynchronous case (see Request-Based API and “GetBlob” Idiom), the performance of an individual infer request is usually of less concern. Instead, you typically execute multiple requests asynchronously and measure the throughput in images per second by dividing the number of images that were processed by the processing time. +In contrast, for latency-oriented tasks, the time to a single frame is more important. Refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample, which allows latency vs. throughput measuring. -> **NOTE**: The [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample also supports batching, that is automatically packing multiple input images into a single request. However, high batch size results in a latency penalty. So for more real-time oriented usages, batch sizes that are as low as a single input are usually used. Still, devices like CPU, Intel®Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, or Intel® Vision Accelerator Design with Intel® Movidius™ VPU require a number of parallel requests instead of batching to leverage the performance. Running multiple requests should be coupled with a device configured to the corresponding number of streams. See details on CPU streams for an example. +> **NOTE**: The [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample also supports batching, that is, automatically packing multiple input images into a single request. However, high batch size results in a latency penalty. So for more real-time oriented usages, batch sizes that are as low as a single input are usually used. Still, devices like CPU, Intel®Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, or Intel® Vision Accelerator Design with Intel® Movidius™ VPU require a number of parallel requests instead of batching to leverage the performance. Running multiple requests should be coupled with a device configured to the corresponding number of streams. See details on CPU streams for an example. 
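For illustration, below is a minimal Python sketch of the parallel-requests approach; it is not a replacement for the Benchmark App. It assumes the OpenVINO™ 2021.x `openvino.inference_engine` API; the model path, the request count of 4, and the `CPU_THROUGHPUT_STREAMS` configuration key are assumptions to adapt to your setup:

```python
# A sketch of throughput measurement with several infer requests in flight
# instead of batching. Assumes the OpenVINO™ 2021.x Python API; the model
# path, request count, and frame count are placeholders.
import time
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# Let the plugin pick a suitable number of CPU streams for throughput mode.
ie.set_config({"CPU_THROUGHPUT_STREAMS": "CPU_THROUGHPUT_AUTO"}, "CPU")

net = ie.read_network(model="model.xml", weights="model.bin")
input_name = next(iter(net.input_info))
shape = net.input_info[input_name].input_data.shape

exec_net = ie.load_network(network=net, device_name="CPU", num_requests=4)
frames = [np.random.rand(*shape).astype(np.float32) for _ in range(100)]

start = time.perf_counter()
for i, frame in enumerate(frames):
    request = exec_net.requests[i % len(exec_net.requests)]
    if i >= len(exec_net.requests):
        request.wait()  # free the request slot before reusing it
    request.async_infer({input_name: frame})
for request in exec_net.requests:
    request.wait()
print("Throughput: {:.1f} FPS".format(len(frames) / (time.perf_counter() - start)))
```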
[OpenVINO™ Deep Learning Workbench tool](https://docs.openvinotoolkit.org/latest/workbench_docs_Workbench_DG_Introduction.html) provides throughput versus latency charts for different numbers of streams, requests, and batch sizes to find the performance sweet spot. @@ -65,7 +65,7 @@ When comparing the Inference Engine performance with the framework or another re - Wrap exactly the inference execution (refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample for an example). - Track model loading time separately. -- Ensure the inputs are identical for the Inference Engine and the framework. For example, Caffe\* allows to auto-populate the input with random values. Notice that it might give different performance than on real images. +- Ensure the inputs are identical for the Inference Engine and the framework. For example, Caffe\* allows you to auto-populate the input with random values. Notice that it might give different performance than on real images. - Similarly, for correct performance comparison, make sure the access pattern, for example, input layouts, is optimal for Inference Engine (currently, it is NCHW). - Any user-side pre-processing should be tracked separately. - Make sure to try the same environment settings that the framework developers recommend, for example, for TensorFlow*. In many cases, things that are more machine friendly, like respecting NUMA (see CPU Checklist), might work well for the Inference Engine as well. @@ -83,11 +83,11 @@ Refer to the [Benchmark App](../../inference-engine/samples/benchmark_app/README ## Model Optimizer Knobs Related to Performance -Networks training is typically done on high-end data centers, using popular training frameworks like Caffe\*, TensorFlow\*, and MXNet\*. Model Optimizer converts the trained model in original proprietary formats to IR that describes the topology. IR is accompanied by a binary file with weights. These files in turn are consumed by the Inference Engine and used for scoring. +Network training is typically done on high-end data centers, using popular training frameworks like Caffe\*, TensorFlow\*, and MXNet\*. Model Optimizer converts the trained model in original proprietary formats to IR that describes the topology. IR is accompanied by a binary file with weights. These files in turn are consumed by the Inference Engine and used for scoring. ![](../img/workflow_steps.png) -As described in the [Model Optimizer Guide](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md), there are a number of device-agnostic optimizations the tool performs. For example, certain primitives like linear operations (BatchNorm and ScaleShift), are automatically fused into convolutions. Generally, these layers should not be manifested in the resulting IR: +As described in the [Model Optimizer Guide](../MO_DG/Deep_Learning_Model_Optimizer_DevGuide.md), there are a number of device-agnostic optimizations the tool performs. For example, certain primitives like linear operations (BatchNorm and ScaleShift) are automatically fused into convolutions. Generally, these layers should not be manifested in the resulting IR: ![](../img/resnet_269.png) @@ -109,43 +109,42 @@ Also: Notice that the devices like GPU are doing better with larger batch size. While it is possible to set the batch size in the runtime using the Inference Engine [ShapeInference feature](../IE_DG/ShapeInference.md). - **Resulting IR precision**
-The resulting IR precision, for instance, `FP16` or `FP32`, directly affects performance. As CPU now supports `FP16` (while internally upscaling to `FP32` anyway) and because this is the best precision for a GPU target, you may want to always convert models to `FP16`. Notice that this is the only precision that Intel® Movidius™ Myriad™ 2 and Intel® Myriad™ X VPUs support. +The resulting IR precision, for instance, `FP16` or `FP32`, directly affects performance. As CPU now supports `FP16` (while internally upscaling to `FP32` anyway) and because this is the best precision for a GPU target, you may want to always convert models to `FP16`. Notice that this is the only precision that Intel® Movidius™ Myriad™ 2 and Intel® Myriad™ X VPUs support. ## Multi-Device Execution -OpenVINO™ toolkit supports automatic multi-device execution, please see [MULTI-Device plugin description](../IE_DG/supported_plugins/MULTI.md). +OpenVINO™ toolkit supports automatic multi-device execution, please see [MULTI-Device plugin description](../IE_DG/supported_plugins/MULTI.md). In the next chapter you can find the device-specific tips, while this section covers few recommendations for the multi-device execution: -- MULTI usually performs best when the fastest device is specified first in the list of the devices. - This is particularly important when the parallelism is not sufficient - (e.g. the number of request in the flight is not enough to saturate all devices). -- It is highly recommended to query the optimal number of inference requests directly from the instance of the ExecutionNetwork - (resulted from the LoadNetwork call with the specific multi-device configuration as a parameter). -Please refer to the code of the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample for details. -- Notice that for example CPU+GPU execution performs better with certain knobs +- MULTI usually performs best when the fastest device is specified first in the list of the devices. + This is particularly important when the parallelism is not sufficient + (e.g., the number of request in the flight is not enough to saturate all devices). +- It is highly recommended to query the optimal number of inference requests directly from the instance of the ExecutionNetwork + (resulted from the LoadNetwork call with the specific multi-device configuration as a parameter). +Please refer to the code of the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample for details. +- Notice that for example CPU+GPU execution performs better with certain knobs which you can find in the code of the same [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample. One specific example is disabling GPU driver polling, which in turn requires multiple GPU streams (which is already a default for the GPU) to amortize slower inference completion from the device to the host. -- Multi-device logic always attempts to save on the (e.g. inputs) data copies between device-agnostic, user-facing inference requests +- Multi-device logic always attempts to save on the (e.g., inputs) data copies between device-agnostic, user-facing inference requests and device-specific 'worker' requests that are being actually scheduled behind the scene. To facilitate the copy savings, it is recommended to start the requests in the order that they were created (with ExecutableNetwork's CreateInferRequest). 
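For illustration, a minimal sketch of that query with the Python API is shown below; the `MULTI:GPU,CPU` device list and the model path are placeholders, and the API names assume the OpenVINO™ 2021.x release:

```python
# A sketch of querying the optimal number of infer requests for a multi-device
# configuration. Assumes the OpenVINO™ 2021.x Python API; the device list and
# model path are placeholders.
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")

# The fastest device is listed first, as recommended above.
exec_net = ie.load_network(network=net, device_name="MULTI:GPU,CPU")

# Query the loaded network rather than guessing, then size your request pool
# (for example, reload with num_requests set to this value).
n_requests = exec_net.get_metric("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
print("Optimal number of infer requests:", n_requests)
```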
- ## Device-Specific Optimizations -The Inference Engine supports several target devices (CPU, GPU, Intel® Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, Intel® Vision Accelerator Design with Intel® Movidius™ Vision Processing Units (VPU) and FPGA), and each of them has a corresponding plugin. If you want to optimize a specific device, you must keep in mind the following tips to increase the performance. +The Inference Engine supports several target devices (CPU, GPU, Intel® Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, Intel® Vision Accelerator Design with Intel® Movidius™ Vision Processing Units (VPU) and FPGA), and each of them has a corresponding plugin. If you want to optimize a specific device, keep in mind the following tips to increase the performance. ### CPU Checklist -CPU plugin completely relies on the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) for major primitives acceleration, for example, Convolutions or FullyConnected. +The CPU plugin completely relies on the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) for major primitives acceleration, for example, Convolutions or FullyConnected. -The only hint you can get from that is how the major primitives are accelerated (and you cannot change this). For example, on the Core machines, you should see variations of the `jit_avx2` when inspecting the internal inference performance counters (and additional '_int8' postfix for [int8 inference](../IE_DG/Int8Inference.md)). If you are an advanced user, you can further trace the CPU execution with (see Intel® VTune™). +The only hint you can get from that is how the major primitives are accelerated (and you cannot change this). For example, on machines with Intel® Core™ processors, you should see variations of the `jit_avx2` when inspecting the internal inference performance counters (and additional '_int8' postfix for [int8 inference](../IE_DG/Int8Inference.md)). If you are an advanced user, you can further trace the CPU execution with (see Intel® VTune™). -Internally, the Inference Engine has a threading abstraction level, which allows for compiling the [open source version](https://github.com/openvinotoolkit/openvino) with either Intel® Threading Building Blocks (Intel® TBB) which is now default, or OpenMP* as an alternative parallelism solution. When using inference on the CPU, this is particularly important to align threading model with the rest of your application (and any third-party libraries that you use) to avoid oversubscription. For more information, see Note on the App-Level Threading section. +Internally, the Inference Engine has a threading abstraction level, which allows for compiling the [open source version](https://github.com/openvinotoolkit/openvino) with either Intel® Threading Building Blocks (Intel® TBB) which is now default, or OpenMP* as an alternative parallelism solution. When using inference on the CPU, this is particularly important to align threading model with the rest of your application (and any third-party libraries that you use) to avoid oversubscription. For more information, see Note on the App-Level Threading section. - Since R1 2019, the OpenVINO™ toolkit comes pre-compiled with Intel TBB, - so any OpenMP* API or environment settings (like `OMP_NUM_THREADS`) has no effect. + Since R1 2019, OpenVINO™ toolkit comes pre-compiled with Intel TBB, + so any OpenMP* API or environment settings (like `OMP_NUM_THREADS`) have no effect. 
Certain tweaks (like number of threads used for inference on the CPU) are still possible via [CPU configuration options](../IE_DG/supported_plugins/CPU.md). Finally, the OpenVINO CPU inference is NUMA-aware, please refer to the Tips for inference on NUMA systems section. @@ -165,7 +164,7 @@ This feature usually provides much better performance for the networks than batc Compared with the batching, the parallelism is somewhat transposed (i.e. performed over inputs, and much less within CNN ops): ![](../img/cpu_streams_explained.png) -Try the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample and play with number of streams running in parallel. The rule of thumb is tying up to a number of CPU cores on your machine. +Try the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md) sample and play with the number of streams running in parallel. The rule of thumb is tying up to a number of CPU cores on your machine. For example, on an 8-core CPU, compare the `-nstreams 1` (which is a legacy, latency-oriented scenario) to the 2, 4, and 8 streams. Notice that on a multi-socket machine, the bare minimum of streams for a latency scenario equals the number of sockets. @@ -178,7 +177,7 @@ If your application is hard or impossible to change in accordance with the multi ### GPU Checklist -Inference Engine relies on the [Compute Library for Deep Neural Networks (clDNN)](https://01.org/cldnn) for Convolutional Neural Networks acceleration on Intel® GPUs. Internally, clDNN uses OpenCL™ to implement the kernels. Thus, many general tips apply: +Inference Engine relies on the [Compute Library for Deep Neural Networks (clDNN)](https://01.org/cldnn) for Convolutional Neural Networks acceleration on Intel® GPUs. Internally, clDNN uses OpenCL™ to implement the kernels. Thus, many general tips apply: - Prefer `FP16` over `FP32`, as the Model Optimizer can generate both variants and the `FP32` is default. - Try to group individual infer jobs by using batches. @@ -190,17 +189,17 @@ Inference Engine relies on the [Compute Library for Deep Neural Networks (clDNN) Notice that while disabling the polling, this option might reduce the GPU performance, so usually this option is used with multiple [GPU streams](../IE_DG/supported_plugins/GPU.md). -### Intel® Movidius™ Myriad™ X Visual Processing Unit and Intel® Vision Accelerator Design with Intel® Movidius™ VPUs +### Intel® Movidius™ Myriad™ X Visual Processing Unit and Intel® Vision Accelerator Design with Intel® Movidius™ VPUs -Since Intel® Movidius™ Myriad™ X Visual Processing Unit (Intel® Movidius™ Myriad™ 2 VPU) communicates with the host over USB, minimum four infer requests in flight are recommended to hide the data transfer costs. See Request-Based API and “GetBlob” Idiom and [Benchmark App Sample](../../inference-engine/samples/benchmark_app/README.md) for more information. +Since Intel® Movidius™ Myriad™ X Visual Processing Unit (Intel® Movidius™ Myriad™ 2 VPU) communicates with the host over USB, minimum four infer requests in flight are recommended to hide the data transfer costs. See Request-Based API and “GetBlob” Idiom and [Benchmark App Sample](../../inference-engine/samples/benchmark_app/README.md) for more information. -Intel® Vision Accelerator Design with Intel® Movidius™ VPUs requires to keep at least 32 inference requests in flight to fully saturate the device. +Intel® Vision Accelerator Design with Intel® Movidius™ VPUs requires keeping at least 32 inference requests in flight to fully saturate the device. 
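For illustration, a minimal Python sketch of creating such a pool of requests is shown below; the `HDDL` device name, the model path, and the request count are assumptions based on the recommendation above:

```python
# A sketch of creating enough parallel infer requests to saturate an
# Intel® Vision Accelerator Design with Intel® Movidius™ VPUs device.
# Assumes the OpenVINO™ 2021.x Python API and the HDDL plugin name;
# the model path is a placeholder.
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")

# Keep at least 32 requests in flight, as recommended above; schedule them
# asynchronously in the same way as in the throughput sketch earlier.
exec_net = ie.load_network(network=net, device_name="HDDL", num_requests=32)
print("Created", len(exec_net.requests), "infer requests")
```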
### FPGA Below are listed the most important tips for the efficient usage of the FPGA: -- Just like for the Intel® Movidius™ Myriad™ VPU flavors, for the FPGA, it is important to hide the communication overheads by running multiple inference requests in parallel. For examples, refer to the [Benchmark App Sample](../../inference-engine/samples/benchmark_app/README.md). +- Just like for the Intel® Movidius™ Myriad™ VPU flavors, for the FPGA, it is important to hide the communication overheads by running multiple inference requests in parallel. For examples, refer to the [Benchmark App Sample](../../inference-engine/samples/benchmark_app/README.md). - Since the first inference iteration with FPGA is always significantly slower than the subsequent ones, make sure you run multiple iterations (all samples, except GUI-based demos, have the `-ni` or 'niter' option to do that). - FPGA performance heavily depends on the bitstream. - Number of the infer request per executable network is limited to five, so “channel” parallelism (keeping individual infer request per camera/video input) would not work beyond five inputs. Instead, you need to mux the inputs into some queue that will internally use a pool of (5) requests. @@ -231,15 +230,15 @@ The execution through heterogeneous plugin has three distinct steps: - The affinity setting is made before loading the network to the (heterogeneous) plugin, so this is always a **static** setup with respect to execution. 2. **Loading a network to the heterogeneous plugin**, which internally splits the network into subgraphs.
- You can check the decisions the plugin makes, see Analysing the Heterogeneous Execution. + You can check the decisions the plugin makes, see Analyzing the Heterogeneous Execution. 3. **Executing the infer requests**. From user’s side, this looks identical to a single-device case, while internally, the subgraphs are executed by actual plugins/devices. -Performance benefits of the heterogeneous execution depend heavily on the communications granularity between devices. If transmitting/converting data from one part device to another takes more time than the execution, the heterogeneous approach makes little or no sense. Using Intel® VTune™ helps to visualize the execution flow on a timeline (see Intel® VTune™ Examples). +Performance benefits of the heterogeneous execution depend heavily on the communications granularity between devices. If transmitting/converting data from one part device to another takes more time than the execution, the heterogeneous approach makes little or no sense. Using Intel® VTune™ helps to visualize the execution flow on a timeline (see Intel® VTune™ Examples). -Similarly, if there are too much subgraphs, the synchronization and data transfers might eat the entire performance. In some cases, you can define the (coarser) affinity manually to avoid sending data back and forth many times during one inference. +Similarly, if there are too many subgraphs, the synchronization and data transfers might eat the entire performance. In some cases, you can define the (coarser) affinity manually to avoid sending data back and forth many times during one inference. -The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and "glue" or helper kernels on the CPU. Notice that this includes the granularity considerations. For example, running some custom activation (that comes after every accelerator-equipped convolution) on the CPU might result in performance degradation due to too much data type and/or layout conversions, even though the activation itself can be extremely fast. In this case, it might make sense to consider implementing the kernel for the accelerator (see Optimizing Custom Kernels). The conversions typically manifest themselves as outstanding (comparing to CPU-only execution) 'Reorder' entries (see Internal Inference Performance Counters). +The general affinity rule of thumb is to keep computationally-intensive kernels on the accelerator, and "glue" or helper kernels on the CPU. Notice that this includes the granularity considerations. For example, running some custom activation (that comes after every accelerator-equipped convolution) on the CPU might result in performance degradation due to too much data type and/or layout conversions, even though the activation itself can be extremely fast. In this case, it might make sense to consider implementing the kernel for the accelerator (see Optimizing Custom Kernels). The conversions typically manifest themselves as outstanding (comparing to CPU-only execution) 'Reorder' entries (see Internal Inference Performance Counters). For general details on the heterogeneous plugin, refer to the [corresponding section in the Inference Engine Developer Guide](../IE_DG/supported_plugins/HETERO.md). @@ -264,7 +263,7 @@ You can point more than two devices: `-d HETERO:FPGA,GPU,CPU`. As FPGA is considered as an inference accelerator, most performance issues are related to the fact that due to the fallback, the CPU can be still used quite heavily. 
- Yet in most cases, the CPU does only small/lightweight layers, for example, post-processing (`SoftMax` in most classification models or `DetectionOutput` in the SSD*-based topologies). In that case, limiting the number of CPU threads with [`KEY_CPU_THREADS_NUM`](../IE_DG/supported_plugins/CPU.md) config would further reduce the CPU utilization without significantly degrading the overall performance. -- Also, if you are still using OpenVINO version earlier than R1 2019, or if you have recompiled the Inference Engine with OpemMP (say for backward compatibility), setting the `KMP_BLOCKTIME` environment variable to something less than default 200ms (we suggest 1ms) is particularly helpful. Use `KMP_BLOCKTIME=0` if the CPU subgraph is small. +- Also, if you are still using OpenVINO™ toolkit version earlier than R1 2019, or if you have recompiled the Inference Engine with OpenMP (say for backward compatibility), setting the `KMP_BLOCKTIME` environment variable to something less than default 200ms (we suggest 1ms) is particularly helpful. Use `KMP_BLOCKTIME=0` if the CPU subgraph is small. > **NOTE**: General threading tips (see Note on the App-Level Threading) apply well, even when the entire topology fits the FPGA, because there is still a host-side code for data pre- and post-processing. @@ -278,11 +277,11 @@ The following tips are provided to give general guidance on optimizing execution - The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and "glue" (or helper) kernels on the CPU. Notice that this includes the granularity considerations. For example, running some (custom) activation on the CPU would result in too many conversions. -- It is advised to do performance analysis to determine “hotspot” kernels, which should be the first candidates for offloading. At the same time, it is often more efficient to offload some reasonably sized sequence of kernels, rather than individual kernels, to minimize scheduling and other runtime overhead. +- It is advised to do performance analysis to determine “hotspot” kernels, which should be the first candidates for offloading. At the same time, it is often more efficient to offload some reasonably sized sequence of kernels, rather than individual kernels, to minimize scheduling and other runtime overhead. -- Notice that GPU can be busy with other tasks (like rendering). Similarly, the CPU can be in charge for the general OS routines and other application threads (see Note on the App-Level Threading). Also, a high interrupt rate due to many subgraphs can raise the frequency of the one device and drag the frequency of another down. +- Notice that the GPU can be busy with other tasks (like rendering). Similarly, the CPU can be in charge for the general OS routines and other application threads (see Note on the App-Level Threading). Also, a high interrupt rate due to many subgraphs can raise the frequency of the device and drag down the frequency of another. -- Device performance can be affected by dynamic frequency scaling. For example, running long kernels on both devices simultaneously might eventually result in one or both devices stopping use of the Intel® Turbo Boost Technology. This might result in overall performance decrease, even comparing to single-device scenario. +- Device performance can be affected by dynamic frequency scaling. For example, running long kernels on both devices simultaneously might eventually result in one or both devices stopping use of the Intel® Turbo Boost Technology. 
This might result in overall performance decrease, even comparing to single-device scenario. - Mixing the `FP16` (GPU) and `FP32` (CPU) execution results in conversions and, thus, performance issues. If you are seeing a lot of heavy outstanding (compared to the CPU-only execution) Reorders, consider implementing actual GPU kernels. Refer to Internal Inference Performance Counters for more information. @@ -295,22 +294,22 @@ After enabling the configuration key, the heterogeneous plugin generates two fil - `hetero_affinity.dot` - per-layer affinities. This file is generated only if default fallback policy was executed (as otherwise you have set the affinities by yourself, so you know them). - `hetero_subgraphs.dot` - affinities per sub-graph. This file is written to the disk during execution of `Core::LoadNetwork` for the heterogeneous flow. -You can use GraphViz\* utility or `.dot` converters (for example, to `.png` or `.pdf`), like xdot\*, available on Linux\* OS with `sudo apt-get install xdot`. Below is an example of the output trimmed to the two last layers (one executed on the FPGA and another on the CPU): +You can use the GraphViz\* utility or `.dot` converters (for example, to `.png` or `.pdf`), like xdot\*, available on Linux\* OS with `sudo apt-get install xdot`. Below is an example of the output trimmed to the two last layers (one executed on the FPGA and another on the CPU): ![](../img/output_trimmed.png) -You can also use performance data (in the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md), it is an option `-pc`) to get performance data on each subgraph. Again, refer to the [HETERO plugin documentation](https://docs.openvinotoolkit.org/latest/openvino_docs_IE_DG_supported_plugins_HETERO.html#analyzing_heterogeneous_execution) and to Internal Inference Performance Counters for a general counters information. +You can also use performance data (in the [Benchmark App](../../inference-engine/samples/benchmark_app/README.md), it is an option `-pc`) to get performance data on each subgraph. Again, refer to the [HETERO plugin documentation](https://docs.openvinotoolkit.org/latest/openvino_docs_IE_DG_supported_plugins_HETERO.html#analyzing_heterogeneous_execution) and to Internal Inference Performance Counters for general counter information. ## Optimizing Custom Kernels -### Few Initial Performance Considerations +### A Few Initial Performance Considerations The Inference Engine supports CPU, GPU and VPU custom kernels. Typically, custom kernels are used to quickly implement missing layers for new topologies. You should not override standard layers implementation, especially on the critical path, for example, Convolutions. Also, overriding existing layers can disable some existing performance optimizations, such as fusing. It is usually easier to start with the CPU extension and switch to the GPU after debugging with the CPU path. Sometimes, when the custom layers are at the very end of your pipeline, it is easier to implement them as regular post-processing in your application without wrapping them as kernels. This is particularly true for the kernels that do not fit the GPU well, for example, output bounding boxes sorting. In many cases, you can do such post-processing on the CPU. -There are many cases when sequence of the custom kernels can be implemented as a "super" kernel allowing to save on data accesses. +There are many cases when sequence of the custom kernels can be implemented as a "super" kernel, allowing you to save on data accesses. 
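As an illustration of the post-processing point above, a simple confidence filter and sort over a detection output can live directly in the application code. The sketch below is NumPy-only and assumes an SSD-style `DetectionOutput` layout, which may not match your model:

```python
# A sketch of doing light post-processing in the application instead of a
# custom kernel. Assumes an SSD-style DetectionOutput blob of shape
# [1, 1, N, 7] with rows [image_id, label, conf, x_min, y_min, x_max, y_max];
# adapt the layout to your model.
import numpy as np

def filter_and_sort_detections(detection_output, conf_threshold=0.5):
    rows = detection_output.reshape(-1, 7)
    kept = rows[rows[:, 2] > conf_threshold]
    # Sort the remaining boxes by confidence, highest first.
    return kept[np.argsort(-kept[:, 2])]
```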
Finally, with the heterogeneous execution, it is possible to execute the vast majority of intensive computations with the accelerator and keep the custom pieces on the CPU. The tradeoff is granularity/costs of communication between different devices. @@ -322,10 +321,10 @@ In most cases, before actually implementing a full-blown code for the kernel, yo Other than that, when implementing the kernels, you can try the methods from the previous chapter to understand actual contribution and, if any custom kernel is in the hotspots, optimize that. -### Few Device-Specific Tips +### A Few Device-Specific Tips - As already outlined in the CPU Checklist, align the threading model that you use in your CPU kernels with the model that the rest of the Inference Engine compiled with. -- For CPU extensions, consider kernel flavor that supports blocked layout, if your kernel is in the hotspots (see Internal Inference Performance Counters). Since Intel MKL-DNN internally operates on the blocked layouts, this would save you a data packing (Reorder) on tensor inputs/outputs of your kernel. For example of the blocked layout support, please, refer to the extensions in the `/deployment_tools/samples/extension/`. +- For CPU extensions, consider kernel flavor that supports blocked layout, if your kernel is in the hotspots (see Internal Inference Performance Counters). Since Intel MKL-DNN internally operates on the blocked layouts, this would save you a data packing (Reorder) on tensor inputs/outputs of your kernel. For example of the blocked layout support, please, refer to the extensions in the `/deployment_tools/samples/extension/` directory. ## Plugging Inference Engine to Applications @@ -338,8 +337,8 @@ For inference on the CPU there are multiple threads binding options, see If you are building an app-level pipeline with third-party components like GStreamer*, the general guidance for NUMA machines is as follows: - Whenever possible, use at least one instance of the pipeline per NUMA node: - Pin the _entire_ pipeline instance to the specific NUMA node at the outer-most level (for example, use Kubernetes* and/or `numactl` command with proper settings before actual GStreamer commands). - - Disable any individual pinning by the pipeline components (e.g. set [CPU_BIND_THREADS to 'NO'](../IE_DG/supported_plugins/CPU.md)). - - Limit each instance with respect to number of inference threads. Use [CPU_THREADS_NUM](../IE_DG/supported_plugins/CPU.md) or or other means (e.g. virtualization, Kubernetes*, etc), to avoid oversubscription. + - Disable any individual pinning by the pipeline components (e.g., set [CPU_BIND_THREADS to 'NO'](../IE_DG/supported_plugins/CPU.md)). + - Limit each instance with respect to number of inference threads. Use [CPU_THREADS_NUM](../IE_DG/supported_plugins/CPU.md) or or other means (e.g., virtualization, Kubernetes*, etc), to avoid oversubscription. - If pinning instancing/pinning of the entire pipeline is not possible or desirable, relax the inference threads pinning to just 'NUMA'. - This is less restrictive compared to the default pinning of threads to cores, yet avoids NUMA penalties. @@ -349,7 +348,7 @@ If you are building an app-level pipeline with third-party components like GStre - The rule of thumb is that you should try to have the overall number of active threads in your application equal to the number of cores in your machine. Keep in mind the spare core(s) that the OpenCL driver under the GPU plugin might also need. 
- One specific workaround to limit the number of threads for the Inference Engine is using the [CPU configuration options](../IE_DG/supported_plugins/CPU.md). - To avoid further oversubscription, use the same threading model in all modules/libraries that your application uses. Notice that third party components might bring their own threading. For example, using Inference Engine which is now compiled with the TBB by default might lead to [performance troubles](https://www.threadingbuildingblocks.org/docs/help/reference/appendices/known_issues/interoperability.html) when mixed in the same app with another computationally-intensive library, but compiled with OpenMP. You can try to compile the [open source version](https://github.com/openvinotoolkit/openvino) of the Inference Engine to use the OpenMP as well. But notice that in general, the TBB offers much better composability, than other threading solutions. -- If your code (or third party libraries) uses GNU OpenMP, the Intel® OpenMP (if you have recompiled Inference Engine with that) must be initialized first. This can be achieved by linking your application with the Intel OpenMP instead of GNU OpenMP, or using `LD_PRELOAD` on Linux* OS. +- If your code (or third party libraries) uses GNU OpenMP, the Intel® OpenMP (if you have recompiled Inference Engine with that) must be initialized first. This can be achieved by linking your application with the Intel OpenMP instead of GNU OpenMP, or using `LD_PRELOAD` on Linux* OS. ### Letting the Inference Engine Accelerate Image Pre-processing/Conversion @@ -363,7 +362,7 @@ Note that in many cases, you can directly share the (input) data with the Infere ### Basic Interoperability with Other APIs -The general approach for sharing data between Inference Engine and media/graphics APIs like Intel® Media Server Studio (Intel® MSS) is based on sharing the *system* memory. That is, in your code, you should map or copy the data from the API to the CPU address space first. +The general approach for sharing data between Inference Engine and media/graphics APIs like Intel® Media Server Studio (Intel® MSS) is based on sharing the *system* memory. That is, in your code, you should map or copy the data from the API to the CPU address space first. For Intel® Media SDK, it is recommended to perform a viable pre-processing, for example, crop/resize, and then convert to RGB again with the [Video Processing Procedures (VPP)](https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onevpl.htm). Then lock the result and create an Inference Engine blob on top of that. The resulting pointer can be used for `SetBlob`: @@ -408,11 +407,11 @@ If your application simultaneously executes multiple infer requests: @snippet snippets/dldt_optimization_guide7.cpp part7 -
For more information on the executable networks notation, see Request-Based API and “GetBlob” Idiom. +
For more information on the executable networks notation, see Request-Based API and “GetBlob” Idiom. - - The heterogeneous device uses the `EXCLUSIVE_ASYNC_REQUESTS` by default. +- The heterogeneous device uses the `EXCLUSIVE_ASYNC_REQUESTS` by default. - - `KEY_EXCLUSIVE_ASYNC_REQUESTS` option affects only device queues of the individual application. +- The `KEY_EXCLUSIVE_ASYNC_REQUESTS` option affects only device queues of the individual application. - For FPGA and GPU, the actual work is serialized by a plugin and/or a driver anyway. @@ -432,33 +431,33 @@ You can compare the pseudo-codes for the regular and async-based approaches: @snippet snippets/dldt_optimization_guide8.cpp part8 -![Intel® VTune™ screenshot](../img/vtune_regular.png) +![Intel® VTune™ screenshot](../img/vtune_regular.png) - In the "true" async mode, the `NEXT` request is populated in the main (application) thread, while the `CURRENT` request is processed:
@snippet snippets/dldt_optimization_guide9.cpp part9 -![Intel® VTune™ screenshot](../img/vtune_async.png) +![Intel® VTune™ screenshot](../img/vtune_async.png) The technique can be generalized to any available parallel slack. For example, you can do inference and simultaneously encode the resulting or previous frames or run further inference, like emotion detection on top of the face detection results. There are important performance caveats though: for example, the tasks that run in parallel should try to avoid oversubscribing the shared compute resources. If the inference is performed on the FPGA and the CPU is essentially idle, it makes sense to do things on the CPU in parallel. However, multiple infer requests can oversubscribe that. Notice that heterogeneous execution can implicitly use the CPU, refer to Heterogeneity. -Also, if the inference is performed on the graphics processing unit (GPU), it can take little gain to do the encoding, for instance, of the resulting video, on the same GPU in parallel, because the device is already busy. +Also, if the inference is performed on the graphics processing unit (GPU), there is very little to gain by doing the encoding, for instance, of the resulting video on the same GPU in parallel, because the device is already busy. Refer to the [Object Detection SSD Demo](@ref omz_demos_object_detection_demo_cpp) (latency-oriented Async API showcase) and [Benchmark App Sample](../../inference-engine/samples/benchmark_app/README.md) (which has both latency and throughput-oriented modes) for complete examples of the Async API in action. ## Using Tools -Whether you are tuning for the first time or doing advanced performance optimization, you need a a tool that provides accurate insights. Intel® VTune™ Amplifier gives you the tool to mine it and interpret the profiling data. +Whether you are tuning for the first time or doing advanced performance optimization, you need a a tool that provides accurate insights. Intel® VTune™ Amplifier gives you the tool to mine it and interpret the profiling data. Alternatively, you can gather the raw profiling data that samples report, the second chapter provides example of how to interpret these. -### Intel® VTune™ Examples +### Intel® VTune™ Examples -All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel® VTune™ timelines and aggregations plus correlating them to the underlying APIs, like OpenCL. In turn, this enables careful per-layer execution breakdown. +All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel® VTune™ timelines and aggregations plus correlating them to the underlying APIs, like OpenCL. In turn, this enables careful per-layer execution breakdown. -When choosing the Analysis type in Intel® VTune™ Amplifier, make sure to select the **Analyze user tasks, events, and counters** option: +When choosing the Analysis type in Intel® VTune™ Amplifier, make sure to select the **Analyze user tasks, events, and counters** option: ![](../img/vtune_option.jpg) @@ -478,7 +477,7 @@ Example of Inference Engine calls: Similarly, you can use any GPU analysis in the Intel VTune Amplifier and get general correlation with Inference Engine API as well as the execution breakdown for OpenCL kernels. 
Just like with a regular native application, further drill-down in the counters is possible; however, this is mostly useful for optimizing custom kernels. Finally, with the Intel VTune Amplifier, the profiling is not limited to your user-level code (see the [corresponding section in the Intel® VTune™ Amplifier User's Guide](https://software.intel.com/en-us/vtune-amplifier-help-analyze-performance)).

### Internal Inference Performance Counters
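For reference, below is a hedged Python sketch of reading these counters programmatically; it assumes the OpenVINO™ 2021.x Python API and the `PERF_COUNT` configuration key, and the model path is a placeholder:

```python
# A sketch of reading per-layer performance counters after an inference run.
# Assumes the OpenVINO™ 2021.x Python API and the PERF_COUNT configuration key;
# the model path is a placeholder and the counter fields may vary by release.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU",
                           config={"PERF_COUNT": "YES"})

input_name = next(iter(net.input_info))
dummy = np.zeros(net.input_info[input_name].input_data.shape, dtype=np.float32)
exec_net.infer({input_name: dummy})

# Each entry describes one executed layer (status, type, real/CPU time, etc.).
for layer_name, counters in exec_net.requests[0].get_perf_counts().items():
    print(layer_name, counters)
```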