R2.10.1 Fixes #126

Open
wants to merge 14 commits into base: main

claynerobison
Member

No description provided.

WafaaT and others added 14 commits September 19, 2022 09:28
* revert bf16 changes (#488)

* Add partials and spec yml for the end2end DLSA pipeline (#460)

* Add partials and specs for the end2end DLSA pipeline

* Add missing end line

* Update name to include ipex

* update specs to use the public image as a base for one and SPR for the other

* Dockerfile updates for the updated DLSA repo

* Update pip install list

* Rename to public

* Removing partials that aren't used anymore

* Fixes for 'kmp-blocktime' env var (#493)

* Fixes for 'kmp-blocktime' env var

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update per review feedback

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'kmp-blocktime' for mlperf-gnmt (#494)

* Add 'kmp-blocktime' for mlperf-gnmt

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove duplicate parameter definition

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* add sample_input for resnet50 training (#495)

* remove the case when fragment_size not equal args.batch_size (#500)

* Changed the transformer_mlperf fp32 model so that we can fuse the ops… (#389)

* Changed the transformer_mlperf fp32 model so that we can fuse the ops in the model, and also minor changes for python3

* Changed the transformer_mlperf int8 model so that we can fuse the ops in the model, and also minor changes for python3

* SPR updates for WW12, 2022 (#492)

* SPR updates for WW12, 2022

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update for PyTorch SPR WW2022-12

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pytorch base for SPR too

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Stick with specific 'keras-nightly' version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Updates per code review

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update maskrcnn training_multinode.sh (#502)

* Fixed a bug in the transformer_mlperf model threads setting (#482)

* Fixed a bug in the transformer_mlperf model threads setting

* Fix failing tests

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Added the default threads setting for transformer_mlperf inference in… (#504)

* Added the default threads setting for transformer_mlperf inference in case there is no command line input

* Fix unit tests

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* PyTorch Image Classification TL notebook (#490)

* Adds new TL notebook with documentation

* Added newline

* Added to main TL README

* Small fixes

* Updated for review feedback

* Added more models and a download limit arg

* Removed py3.9 requirement and changed default model

* Adds Kitti torchvision dataset to TL notebook (#512)

* Adds Kitti torchvision dataset to TL notebook

* Fixed citations formatting

* update maskrcnn model (#515)

* minor update. (#465)

* Create unit-test github action workflow (#518)

* Create unit-test github action workflow

Tested here: https://github.com/sriester/frameworks.ai.models.intel-models/runs/6089350443?check_suite_focus=true
Runs tox py.test on push.

* Containerize job

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Added login credentials to docker

Trying to fix pull rate issue

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

Changed pip install command.

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

Changed docker credentials to imzbot

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update distilbert model to 4.18 transformers and enable int8 path (#521)

* rnnt: use launcher to set output file path and name (#524)

* Update BareMetalSetup.md (#526)

Always use the latest torchvision

* Reduce memory usage for dlrm acc test (#527)

* update distilbert with text_classification (#529)

* add patch for distilbert (#530)

* Update the model-builder dockerfile to use ubuntu 20.04 (#532)

* Add script for coco training dataset processing (#525)

* and update tensorflow ssd-resnet34 training dataset instructions

* update patch (#533)

Co-authored-by: Wang, Chuanqi <[email protected]>

* [RNN-T training] Enable FP32 gemm using oneDNN (#531)

* Update the Readme guide for distilbert (#534)

* Update the Readme guide for distilbert

* Fix accuracy grep bug, and grep accuracy for distilbert

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update end2end public dockerfile to look for IPEX in the conda directory (#535)

* Notebook to script conversion example (#516)

* Add notebook script conversion example

* Fixed doc

* Replaces custom preprocessor with built-in one

* Changed tag to remove_for_custom_dataset

* Add URL check prior to calling urlretrieve (#538)

* Add URL check prior to calling urlretrieve

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* disable for ssd since fused cat cat kernel is slow (#537)

* fix bug when adding steps in rnnt inference (#528)

* Fix and updates for TensorFlow WW18-2022 SPR (#542)

* Fix and updates for TensorFlow WW18-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix TensorFlow SPR nightly versions

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pre-trained models download URLs

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Install Python 3.8 development tools

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix OpenMPI install and setup

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Horovod Installation for SPR and CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Python3.8 version for CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo in TensorFlow 3d-unet partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a broken partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add TCMalloc to TF base container for SPR and remove OpenSSL

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove some repositories

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'matplotlib' for '3d-unet'

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* switch to build OpenMPI due to issue in Market Place provided version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYTORCH_WHEEL and IPEX_WHEEL arg values

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix and updates for PyTorch WW14-2022 SPR (#543)

* Fix and updates for PyTorch WW14-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix and updates for TensorFlow WW18-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix TensorFlow SPR nightly versions

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pre-trained models download URLs

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Install Python 3.8 development tools

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix OpenMPI install and setup

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Horovod Installation for SPR and CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Python3.8 version for CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo in TensorFlow 3d-unet partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a broken partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add TCMalloc to TF base container for SPR and remove OpenSSL

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Updates required to the base image

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove some repositories

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'matplotlib' for '3d-unet'

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* switch to build OpenMPI due to issue in Market Place provided version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYTORCH_WHEEL and IPEX_WHEEL arg values

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYT resnet50 quickstart scripts for both Linux and Windows (#547)

* fix quickstart scripts, detect platform type, update to run with pytorch only

* Fix SPR PyTorch MaskRCNN inference documentation for CHECKPOINT_DIR (#548)

* Enable bert large multi stream inference (#554)

* test bert multi stream module

* enable input split and output concat for accuracy run

* change the default num_streams batchsize cores to 56

* change ssd multi stream throughput to 1 core 1 batch

* change the default parameter for rn50 ssd multi stream module

* modify enable_ipex_for_squad.diff to align new multistream hint implementation

* enable warmup and multi socket support

* change default parameter for rn50 ssd multi stream inference

* Add train-no-eval for rn50 pytorch (#555)

* PyTorch SPR BERT large training updates (h5py and dataset instructions) and update LD_PRELOAD for SPR entrypoints (#550)

* Add h5py install to bert training dockerfile

* documentation updates

* update docs, and add input_preprocessing to the wrapper package

* Update LD_PRELOAD trailing :

* Fix syntax

* removing unnecessary change

* Update DLRM entrypoint

* Update docs to note that phase2 has bert_config.json in the CHECKPOINT_DIR

* Fix syntax

* increase shm-size to 10g

* [RNN-T training] Update scripts -- run on 1S (#561)

* Update maskrcnn training script to run on 1s (#562)

* use single node to do ssd-rn34 training (#563)

* Update training.sh (#564)

* Update training.sh (#565)

Use tcmalloc instead of jemalloc

* use single node to do resnet50 training (#568)

* add numactl -C and remove jit warm in main thread (#569)

* Update unit-test.yml (#546)

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Fixed make command, updated pip install.

Fixed make command to run from the root directory. Replaced pip install tox with a pip install -r requirements-tests.txt to install all dependencies for the tests.

* Add tox to test dependencies. 

Added tox to the dependencies so that the Workflow and others may install it with pip install -r requirements-test.txt and be covered for running make lint and make unit-test.

* Update unit-test.yml

Changed 'make unit-test' to 'make unit_test' as that is the actual target defined in the Makefile.

* Update unit-test.yml

Changed apt-get install command.

* re-enable int8 for api change (#579)

* separate full convergence test from training test (#581)

Co-authored-by: jianan-gu <[email protected]>

* ssd enable new int8 (#580)

* v1

* enable new int8 method

* Revert "ssd enable new int8 (#580)" (#584)

This reverts commit 9eb3211.

* Revert "re-enable int8 for api change (#579)" (#583)

This reverts commit 0bded92.

* Update training script using 1s (#560)

* Enable checkpoint during training for bert-large (#573)

* minor fix

* Add readme for enabling checkpoint

* update phase1 to enable checkpoint by default

* Update README.md

* Enable ssd bf32 inference training (#589)

* enable ssd bf32 inference

* enable ssd bf32 train

* enable RNN-T bf32 inference (#591)

* Enable bf32 for bert and distilbert for inference (#593)

* enable bf32 distilbert

* enable bert bf32

* Enable RNN-T bf32 training (#594)

* enable maskrcnn bf32 inference and training (#595)

* enable resnet50 and resnext101 bf16 path (#596)

* enable bert bf32 train (#600)

* update resnet int8 path using new int8 api (#603)

* re-enable int8 for api change (#604)

Co-authored-by: jianan-gu <[email protected]>

* Leslie/ssd enable new int8 (#605)

* v1

* enable new int8 method

* update json file

* add rn50 int8 weight sharing

Co-authored-by: Jiang, Xiaofei <[email protected]>

* update ssd training bs to a multiple of the core count (#606)

* enable bf32 for dlrm (#607)

Co-authored-by: jianan-gu <[email protected]>

* Update IPEX new int8 API enabling for distilbert/bert-large (#608)

* enable distilbert

* enable bert

* fix max-ind-range and add memory info (#609)

Co-authored-by: jianan-gu <[email protected]>

* Remove debug code (#610)

* update training steps (#611)

* fix bandit scan fails (#612)

* PYT Image recognition models support on Windows (#549)

* fix all image recognition scripts to run on windows and linux with PYT, and only linux with IPEX

* [RNN-T training] fix bandit scan fails (#614)

* RNN-T inference: fix IMZ Bandit scan fails (#615)

* Update unit-test.yml (#570)

Changed the docker user credential to utilize GitHub Secret.

* MaskRCNN: fix IMZ Bandit scan fails (#623)

* Fix for horovod-related failures in TF nightly runs (#613)

* cpp17 horovod failure fix

* minor debugging changes

* minor fixes - directory name

* cleanup

* addressing reviewer comments

* Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 (#624)

* Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Set 'HOROVOD_WITH_MPI=1' explicitly

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update GCC version to GCC 9

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'horovodrun --check-build' for sanity check

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* remove force install inside Docker

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* [RNN-T training] Fix ddp sample number issue (#625)

* update BF32 usage (#627)

* resnet50 training: add warm up before collecting time (#628)

* image to bf16 (#629)

* Update end2end DLSA dockerfile due to SPR wheel path update and removing int8 patch (#631)

* Update mlpc path for SPR wheels

* remove patch

* Update Horovod commit id for BareMetal, Docker will be updated next (#630)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* fix dlrm convergence and change training performance BS to 32K (#633)

Co-authored-by: jianan-gu <[email protected]>

* [RNN-T training] Merge sh files to one (#635)

* update torch-ccl into 1.12 (#636)

* Liangan1/update torch ccl version (#637)

* Update torch_ccl version

* resnet50_distributed_training: don't set MASTER_ADDR by user (#638)

* Update torch_ccl in script (#639)

* Enable offline download distilbert (#632)

* enable offline download distilbert

* add convert

* Update README.md

* add accuracy.py

* add file

* refine download

* refine path

* refine path

* add license

* Update dlrm_s_pytorch.py (#643)

* Update README.md (#649)

* init pytorch T5 language model (#648)

* init pytorch T5 language model

* update README.md

* update doc

* update fpn models (#650)

* pytorch resnet50: directly call ipex.quantization (#653)

* fix int8 accuracy (#655)

Co-authored-by: Zhang, Weizhuo <[email protected]>

* Made fixes to the broken links (#652)

* Made fixes to the broken links

* Changed the ResNet50v1_5 version back to v2_7_0

* Modified the setup AI kit instructions

Co-authored-by: msalopan <[email protected]>

* Update Security Center URL (#657)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Weizhuoz/fix for pt 1.12 (#656)

* fix vgg11_bn accuracy syntax error

* remove exact_match from roberta-base

* modify maskrcnn BS to 2*num_cores

* Update dlrm_s_pytorch.py (#660)

* Update dlrm_s_pytorch.py

Reduce int8 memory usage.

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Add BF32 DDP for bert-large (#663)

* Update run_ddp_bert_pretrain_phase1.sh

* Update run_ddp_bert_pretrain_phase2.sh

* Update README.md

* move OMP_NUM_THREADS=1 into dlrm_s_pytorch.py (#664)

minor changes

* remove rn50 ao (#665)

* Re-organize models list to be grouped by framework  (#654)

* re-organize models list to be grouped by framework

* update tensorflow ssd-resnet34 training dataset

* add T5 in benchmark/README.md

* manually set torch num threads only for int8 (#666)

* Update inference_performance.sh (#669)

* improve ssdrn34 perf. (#671)

* improve ssdrn34 perf.

* minor update.

* Fix linting

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix unit tests too

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* update py version in base spec (#678)

* TF addons upgrade to 0.17.1 (#689)

* updated tf addons version

* remove comment

* Sriniva2/ssd rn34 (#682)

* improve ssdrn34 perf.

* minor update.

* enabling synthetic data.

* Update base_benchmark_util.py

* Fix linting error

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Update Dockerfiles prior to IMZ 2.8 release (#693)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update Documents prior to IMZ 2.8 release (#694)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* add support for openSUSE Leap operating system (#708) (#715)

* updated tpps (#725)

* remove tf bert int8 from main readmes, model is not supported in this release. (#743)

* Adding Scipy for TensorFlow serving SSD-MobileNet model (#764) (#766)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* remove .github

Signed-off-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: leslie-fang-intel <[email protected]>
Co-authored-by: Dina Suehiro Jones <[email protected]>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: XiaobingZhang <[email protected]>
Co-authored-by: Xiaoming (Jason) Cui <[email protected]>
Co-authored-by: jiayisunx <[email protected]>
Co-authored-by: Melanie Buehler <[email protected]>
Co-authored-by: Srini511 <[email protected]>
Co-authored-by: Sean-Michael Riesterer <[email protected]>
Co-authored-by: jianan-gu <[email protected]>
Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: zhuhaozhe <[email protected]>
Co-authored-by: Wang, Chuanqi <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: Weizhuo Zhang <[email protected]>
Co-authored-by: xiaofeij <[email protected]>
Co-authored-by: liangan1 <[email protected]>
Co-authored-by: blzheng <[email protected]>
Co-authored-by: Om Thakkar <[email protected]>
Co-authored-by: mahathis <[email protected]>
Co-authored-by: msalopan <[email protected]>
Co-authored-by: Jitendra Patil <[email protected]>
* revert bf16 changes (#488)

* Add partials and spec yml for the end2end DLSA pipeline (#460)

* Add partials and specs for the end2end DLSA pipeline

* Add missing end line

* Update name to include ipex

* update specs to use the public image as a base for one and SPR for the other

* Dockerfile updates for the updated DLSA repo

* Update pip install list

* Rename to public

* Removing partials that aren't used anymore

* Fixes for 'kmp-blocktime' env var (#493)

* Fixes for 'kmp-blocktime' env var

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update per review feedback

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'kmp-blocktime' for mlperf-gnmt (#494)

* Add 'kmp-blocktime' for mlperf-gnmt

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove duplicate parameter definition

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* add sample_input for resnet50 training (#495)

* remove the case when fragment_size not equal args.batch_size (#500)

* Changed the transformer_mlperf fp32 model so that we can fuse the ops… (#389)

* Changed the transformer_mlperf fp32 model so that we can fuse the ops in the model, and also minor changes for python3

* Changed the transformer_mlperf int8 model so that we can fuse the ops in the model, and also minor changes for python3

* SPR updates for WW12, 2022 (#492)

* SPR updates for WW12, 2022

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update for PyTorch SPR WW2022-12

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pytorch base for SPR too

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Stick with specific 'keras-nightly' version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Updates per code review

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update maskrcnn training_multinode.sh (#502)

* Fixed a bug in the transformer_mlperf model threads setting (#482)

* Fixed a bug in the transformer_mlperf model threads setting

* Fix failing tests

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Added the default threads setting for transformer_mlperf inference in… (#504)

* Added the default threads setting for transformer_mlperf inference in case there is no command line input

* Fix unit tests

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* PyTorch Image Classification TL notebook (#490)

* Adds new TL notebook with documentation

* Added newline

* Added to main TL README

* Small fixes

* Updated for review feedback

* Added more models and a download limit arg

* Removed py3.9 requirement and changed default model

* Adds Kitti torchvision dataset to TL notebook (#512)

* Adds Kitti torchvision dataset to TL notebook

* Fixed citations formatting

* update maskrcnn model (#515)

* minor update. (#465)

* Create unit-test github action workflow (#518)

* Create unit-test github action workflow

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update distilbert model to 4.18 transformers and enable int8 path (#521)

* rnnt: use launcher to set output file path and name (#524)

* Update BareMetalSetup.md (#526)

Always use the latest torchvision

* Reduce memory usage for dlrm acc test (#527)

* update distilbert with text_classification (#529)

* add patch for distilbert (#530)

* Update the model-builder dockerfile to use ubuntu 20.04 (#532)

* Add script for coco training dataset processing (#525)

* and update tensorflow ssd-resnet34 training dataset instructions

* update patch (#533)

Co-authored-by: Wang, Chuanqi <[email protected]>

* [RNN-T training] Enable FP32 gemm using oneDNN (#531)

* Update the Readme guide for distilbert (#534)

* Update the Readme guide for distilbert

* Fix accuracy grep bug, and grep accuracy for distilbert

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update end2end public dockerfile to look for IPEX in the conda directory (#535)

* Notebook to script conversion example (#516)

* Add notebook script conversion example

* Fixed doc

* Replaces custom preprocessor with built-in one

* Changed tag to remove_for_custom_dataset

* Add URL check prior to calling urlretrieve (#538)

* Add URL check prior to calling urlretrieve

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* disable for ssd since fused cat cat kernel is slow (#537)

* fix bug when adding steps in rnnt inference (#528)

* Fix and updates for TensorFlow WW18-2022 SPR (#542)

* Fix and updates for TensorFlow WW18-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix TensorFlow SPR nightly versions

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pre-trained models download URLs

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Install Python 3.8 development tools

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix OpenMPI install and setup

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Horovod Installation for SPR and CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Python3.8 version for CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo in TensorFlow 3d-unet partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a broken partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add TCMalloc to TF base container for SPR and remove OpenSSL

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove some repositories

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'matplotlib' for '3d-unet'

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* switch to build OpenMPI due to issue in Market Place provided version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYTORCH_WHEEL and IPEX_WHEEL arg values

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix and updates for PyTorch WW14-2022 SPR (#543)

* Fix and updates for PyTorch WW14-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix and updates for TensorFlow WW18-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix TensorFlow SPR nightly versions

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pre-trained models download URLs

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Install Python 3.8 development tools

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix OpenMPI install and setup

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Horovod Installation for SPR and CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Python3.8 version for CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo in TensorFlow 3d-unet partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a broken partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add TCMalloc to TF base container for SPR and remove OpenSSL

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Updates required to the base image

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove some repositories

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'matplotlib' for '3d-unet'

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* switch to build OpenMPI due to issue in Market Place provided version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYTORCH_WHEEL and IPEX_WHEEL arg values

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYT resnet50 quickstart scripts for both Linux and Windows (#547)

* fix quickstart scripts, detect platform type, update to run with pytorch only

* Fix SPR PyTorch MaskRCNN inference documentation for CHECKPOINT_DIR (#548)

* Enable bert large multi stream inference (#554)

* test bert multi stream module

* enable input split and output concat for accuracy run

* change the default num_streams batchsize cores to 56

* change ssd multi stream throughput to 1 core 1 batch

* change the default parameter for rn50 ssd multi stream module

* modify enable_ipex_for_squad.diff to align new multistream hint implementation

* enable warmup and multi socket support

* change default parameter for rn50 ssd multi stream inference

* Add train-no-eval for rn50 pytorch (#555)

* PyTorch SPR BERT large training updates (h5py and dataset instructions) and update LD_PRELOAD for SPR entrypoints (#550)

* Add h5py install to bert training dockerfile

* documentation updates

* update docs, and add input_preprocessing to the wrapper package

* Update LD_PRELOAD trailing :

* Fix syntax

* removing unnecessary change

* Update DLRM entrypoint

* Update docs to note that phase2 has bert_config.json in the CHECKPOINT_DIR

* Fix syntax

* increase shm-size to 10g

* [RNN-T training] Update scripts -- run on 1S (#561)

* Update maskrcnn training script to run on 1s (#562)

* use single node to do ssd-rn34 training (#563)

* Update training.sh (#564)

* Update training.sh (#565)

Use tcmalloc instead of jemalloc

* use single node to do resnet50 training (#568)

* add numactl -C and remove jit warm in main thread (#569)

* Update unit-test.yml (#546)

* re-enable int8 for api change (#579)

* separate full convergence test from training test (#581)

Co-authored-by: jianan-gu <[email protected]>

* ssd enable new int8 (#580)

* v1

* enable new int8 method

* Revert "ssd enable new int8 (#580)" (#584)

This reverts commit 9eb3211.

* Revert "re-enable int8 for api change (#579)" (#583)

This reverts commit 0bded92.

* Update training script using 1s (#560)

* Enable checkpoint during training for bert-large (#573)

* minor fix

* Add readme for enabling checkpoint

* update phase1 to enable checkpoint by default

* Update README.md

* Enable ssd bf32 inference training (#589)

* enable ssd bf32 inference

* enable ssd bf32 train

* enable RNN-T bf32 inference (#591)

* Enable bf32 for bert and distilbert for inference (#593)

* enable bf32 distilbert

* enable bert bf32

* Enable RNN-T bf32 training (#594)

* enable maskrcnn bf32 inference and training (#595)

* enable resnet50 and resnext101 bf16 path (#596)

* enable bert bf32 train (#600)

* update resnet int8 path using new int8 api (#603)

* re-enable int8 for api change (#604)

Co-authored-by: jianan-gu <[email protected]>

* Leslie/ssd enable new int8 (#605)

* v1

* enable new int8 method

* update json file

* add rn50 int8 weight sharing

Co-authored-by: Jiang, Xiaofei <[email protected]>

* update ssd training bs to a multiple of the core count (#606)

* enable bf32 for dlrm (#607)

Co-authored-by: jianan-gu <[email protected]>

* Update IPEX new int8 API enabling for distilbert/bert-large (#608)

* enable distilbert

* enable bert

* fix max-ind-range and add memory info (#609)

Co-authored-by: jianan-gu <[email protected]>

* Remove debug code (#610)

* update training steps (#611)

* fix bandit scan fails (#612)

* PYT Image recognition models support on Windows (#549)

* fix all image recognition scripts to run on windows and linux with PYT, and only linux with IPEX

* [RNN-T training] fix bandit scan fails (#614)

* RNN-T inference: fix IMZ Bandit scan fails (#615)

* Update unit-test.yml (#570)

* MaskRCNN: fix IMZ Bandit scan fails (#623)

* Fix for horovod-related failures in TF nightly runs (#613)

* cpp17 horovod failure fix

* minor debugging changes

* minor fixes - directory name

* cleanup

* addressing reviewer comments

* Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 (#624)

* Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Set 'HOROVOD_WITH_MPI=1' explicitly

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update GCC version to GCC 9

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'horovodrun --check-build' for sanity check

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* remove force install inside Docker

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* [RNN-T training] Fix ddp sample number issue (#625)

* update BF32 usage (#627)

* resnet50 training: add warm up before collecting time (#628)

* image to bf16 (#629)

* Update end2end DLSA dockerfile due to SPR wheel path update and removing int8 patch (#631)

* Update mlpc path for SPR wheels

* remove patch

* Update Horovod commit id for BareMetal, Docker will be updated next (#630)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* fix dlrm convergence and change training performance BS to 32K (#633)

Co-authored-by: jianan-gu <[email protected]>

* [RNN-T training] Merge sh files to one (#635)

* update torch-ccl into 1.12 (#636)

* Liangan1/update torch ccl version (#637)

* Update torch_ccl version

* resnet50_distributed_training: don't set MASTER_ADDR by user (#638)

* Update torch_ccl in script (#639)

* Enable offline download distilbert (#632)

* enable offline download distilbert

* add convert

* Update README.md

* add accuracy.py

* add file

* refine download

* refine path

* refine path

* add license

* Update dlrm_s_pytorch.py (#643)

* Update README.md (#649)

* init pytorch T5 language model (#648)

* init pytorch T5 language model

* update README.md

* update doc

* update fpn models (#650)

* pytorch resnet50: directly call ipex.quantization (#653)

* fix int8 accuracy (#655)

Co-authored-by: Zhang, Weizhuo <[email protected]>

* Made fixes to the broken links (#652)

* Made fixes to the broken links

* Changed the ResNet50v1_5 version back to v2_7_0

* Modified the setup AI kit instructions

Co-authored-by: msalopan <[email protected]>

* Update Security Center URL (#657)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Weizhuoz/fix for pt 1.12 (#656)

* fix vgg11_bn accuracy syntax error

* remove exact_match from roberta-base

* modify maskrcnn BS to 2*num_cores

* Update dlrm_s_pytorch.py (#660)

* Update dlrm_s_pytorch.py

Reduce int8 memory usage.

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Add BF32 DDP for bert-large (#663)

* Update run_ddp_bert_pretrain_phase1.sh

* Update run_ddp_bert_pretrain_phase2.sh

* Update README.md

* move OMP_NUM_THREADS=1 into dlrm_s_pytorch.py (#664)

minor changes

* remove rn50 ao (#665)

* Re-organize models list to be grouped by framework  (#654)

* re-organize models list to be grouped by framework

* update tensorflow ssd-resnet34 training dataset

* add T5 in benchmark/README.md

* manually set torch num threads only for int8 (#666)

* Update inference_performance.sh (#669)

* improve ssdrn34 perf. (#671)

* improve ssdrn34 perf.

* minor update.

* Fix linting

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix unit tests too

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Use IPEX Pytorch whls instead of building IPEX from source (#674)

Co-authored-by: Clayne Robison <[email protected]>

* Lpot2inc (#446)

Co-authored-by: ltsai1 <[email protected]>

* Sriniva2/ssd rn34 (#682)

* improve ssdrn34 perf.

* minor update.

* enabling synthetic data.

* Update base_benchmark_util.py

* Fix linting error

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Add doc updates for '--synthetic-data' option (#683)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Change checkpoint setting for Bert train phase 1 (#602)

* Change checkpoint setting for Bert train phase 1

* fix model and config saving

* fix error when running gpu path (#686)

* fix load pretrained model error when using torch_ccl (#688)

* update py version in base spec (#678) (#690)

* TF addons upgrade to 0.17.1 (#689) (#691)

* updated tf addons version

* remove comment

* Update Dockerfiles prior to IMZ 2.8 release (#693)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update Documents prior to IMZ 2.8 release (#694)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update README.md (#697)

* change numpy version requirement (#703)

* Remove MiniGo training from IMZ (#644)

* remove MiniGo training scripts and unit test

* [RNN-T] [Inference] optimize the batch decoder (#711)

* reduce fill_ OP in rnnt embedding kernel

* optimize add between int and log to reduce dtype conversion

* rnnt: support dump tracing file and print profile table (#712)

* add support for openSUSE Leap operating system (#708)

* rnnt inference: pre convert data to bf16 (#713)

* remove squeeze/slice/transpose (#714)

* update resnet50 training code (#710)

* update resnet50 training code

* not using ipex optimize for resnet50 training

* use ipex.optimize() on the whole model (#718)

* resnet50 bf32: calling ipex.optimize to enable bf32 path (#719)

* Added batch size as an env variable to the quickstart scripts (#676)

Co-authored-by: Clayne Robison <[email protected]>

* Added batchsize as an env variable to quickstart scripts (#680)

* updated readme: nit fix (#723)

Co-authored-by: Rahul Nair <[email protected]>

* compute throughput by test_mini_batch_size (#740)

* pytorch resnet50: fix bf32 training path error (#739)

* Fix a subtle 'E275' style issue that causes unknown behavior (#742)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* rearrange the paragraphs and fix Markdown headers (#744)

* Align Transformers version for BERT models (#738)

* align transformer version(4.18) for bert models

* change scripts to legacy

* redo calibration

* patch fix

* Update README.md (#746)

* Add support for stock PYT- object detection models (#732)

* stock PYT and windows support for object detection models

* Weizhuoz/reduce model zoo steps (#762)

* reduce steps for bert-base, roberta, fpn models

* modify max_iter for fpn models

* reduce all img classification models steps

* update new config for bert models (#763)

* Adding SciPy for TensorFlow serving SSD-MobileNet model (#764)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update TF ResNet50v1.5 inference for SPR (baremetal) (#749)

* Added matplotlib dependency to image_segmentation requirements (#768)

* Update readmes for the path to output directory (#769)

* update wide & deep readme for the path to pretrained model directory (#771)

* add a check for ubuntu 22.04 support (#721)

* Changes to add bfloat16 support for DIEN training (#679)

* Changes to add bfloat16 support for DIEN training
* Some fixes for reporting performance
* Fixes for dien training and unit tests

* updated tpp file with r2.8 approvals (#773)

* Add Windows stock PyTorch support for TransNet v2 (#779)

* update TransNet v2 to work with stock pytorch
* update Windows.md path in all relevant docs

* add P99 metric for LZ models (#780)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Rn50 training multiple epochs output 1 KPI and add training_steps argument. (#775)

* enable --training_steps and 1 training KPI output with multiple epochs

* add prefix

* update print freq

* fix display bug

* enable PyTorch resnet50 fp16 path (#783)

* enable PyTorch resnet50 fp16 path

* fix conflict

* Extract p99 metric from log to summary (#784)

* enable fp16 bert train and inference (#782)

* Vruddarr/pt update windows readmes (#778)

* remove bfloat16 experimental support note (#786)

* Update IPEX installation path (#788)

* Clean up _pycache_ files, remove symlinks, and add license headers for dien training bf16 (#787)

* update readme for jemalloc and iomp path (#789)

* update readme for jemalloc and iomp path

* Updated IOMP path as path to the intel-openmp directory

* PyTorch: fix resnext101 running script (#795)

* Update 3dunet mlperf bash scripts and README (#797)

* update 3dunet mlperf doc to use quickstart scripts, rename quickstart scripts for multi-instance

* fix tests job (#803)

* rnnt inference: align replace lstm API due to IPEX change (#802)

* Adding quick start scripts to MobileNetV1 bfloat16 precision (#793)

* Adding quick start scripts to ssd-mobilenet bfloat16 precision (#798)

* Update T5 model with windows quick start scripts (#790)

* Update T5 model with windows quick start scripts

* Updated Readme by specifying values to environment variables

* Update inference int8 readme and script of 4 CV models using INC (#698)

* update docs to add INC int8 models as an option
* add instructions for how to quantize a fp32 model using INC

* rnnt: fix stft due to PyTorch API change (#811)

* rnnt training: fix stft due to PyTorch API change (#813)

* Update BareMetalSetup.md (#817)

* Gerardod/build container (#807)

First phase of GHA WF to build the image of a Model Zoo workload container and push it to CAAS.

* Sharvils/tf workload (#808)

* TFv2.10 support added. Horovod version updated.

* Vruddarr/tf add language translation bert fp32 quick start scripts (#804)

* Adding quick start scripts to language translation BERT FP32 model

* Updated TL notebooks for SPR Launch (#810)

* Updates for TL PyTorch notebook

* Edits for two more TL notebooks

* Reverting previous change for virtualenv

* Removed --no-deps and some nonexistent links

* Added TFHub cache dir

* Updated TL notebook README for legal/branding

* Update typo in Readme (#821)

Co-authored-by: veena.mounika.ruddarraju <[email protected]>

* PyTorch: using ipex.optimize for bf16 training (#824)

* Fix CVEs for Pillow and notebook packages (#831)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* add intel-alphafold2 optimized w/ IPEX from realm of AIDD (#737)

* add alphafold2 from AIDD realm

* Remove unused variable in mlperf 3DUnet performance run (#832)

* Update Model Zoo name, Python version and message for IPEX (#833)

* Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updt… (#830)

* Update models main tables (#836)

* update main readmes

* Adding jemalloc instructions and environment variables (#838)

* Add support for dGPU models (#840)

* add support for dGPU support

* remove spr dockerfiles and spec files (#842)

* delete links to 3dunet mlperf and bert large int8 (#841)

* update tbb files (#843)

* fix vulnerability issues reported by snyk scans (#848)

* update for new precision (#849)

* upgrade for ipex 1.13

* delete workflows

Signed-off-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: leslie-fang-intel <[email protected]>
Co-authored-by: Dina Suehiro Jones <[email protected]>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: XiaobingZhang <[email protected]>
Co-authored-by: Xiaoming (Jason) Cui <[email protected]>
Co-authored-by: jiayisunx <[email protected]>
Co-authored-by: Melanie Buehler <[email protected]>
Co-authored-by: Srini511 <[email protected]>
Co-authored-by: Sean-Michael Riesterer <[email protected]>
Co-authored-by: jianan-gu <[email protected]>
Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: zhuhaozhe <[email protected]>
Co-authored-by: Wang, Chuanqi <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: Weizhuo Zhang <[email protected]>
Co-authored-by: xiaofeij <[email protected]>
Co-authored-by: liangan1 <[email protected]>
Co-authored-by: blzheng <[email protected]>
Co-authored-by: Om Thakkar <[email protected]>
Co-authored-by: mahathis <[email protected]>
Co-authored-by: Clayne Robison <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Neo Zhang Jianyu <[email protected]>
Co-authored-by: ltsai1 <[email protected]>
Co-authored-by: Jitendra Patil <[email protected]>
Co-authored-by: Kanvi Khanna <[email protected]>
Co-authored-by: Rahul Nair <[email protected]>
Co-authored-by: Veena2207 <[email protected]>
Co-authored-by: jojivk-intel-nervana <[email protected]>
Co-authored-by: xiangdong <[email protected]>
Co-authored-by: Huang, Zhiwei <[email protected]>
Co-authored-by: gera-aldama <[email protected]>
Co-authored-by: Sharvil Shah <[email protected]>
Co-authored-by: wyang2 <[email protected]>
Co-authored-by: Yimei Sun <[email protected]>
Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update Pillow to '>=9.3.0' (#884)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* remove supported OS checks (#926)

* Remove Linux/windows OS platform support checks (#927)

* upgrade Pillow version for Yolov4

Signed-off-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
* rnnt: use launcher to set output file path and name (#524)

* Update BareMetalSetup.md (#526)

Always use the latest torchvision

* Reduce memory usage for dlrm acc test (#527)

* update distilbert with text_classification (#529)

* add patch for distilbert (#530)

* Update the model-builder dockerfile to use ubuntu 20.04 (#532)

* Add script for coco training dataset processing (#525)

* and update tensorflow ssd-resnet34 training dataset instructions

* update patch (#533)

Co-authored-by: Wang, Chuanqi <[email protected]>

* [RNN-T training] Enable FP32 gemm using oneDNN (#531)

* Update the Readme guide for distilbert (#534)

* Update the Readme guide for distilbert

* Fix accuracy grep bug, and grep accuracy for distilbert

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update end2end public dockerfile to look for IPEX in the conda directory (#535)

* Notebook to script conversion example (#516)

* Add notebook script conversion example

* Fixed doc

* Replaces custom preprocessor with built-in one

* Changed tag to remove_for_custom_dataset

* Add URL check prior to calling urlretrieve (#538)

* Add URL check prior to calling urlretrieve

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* disable for ssd since fused cat cat kernel is slow (#537)

* fix bug when adding steps in rnnt inference (#528)

* Fix and updates for TensorFlow WW18-2022 SPR (#542)

* Fix and updates for TensorFlow WW18-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix TensorFlow SPR nightly versions

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pre-trained models download URLs

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Install Python 3.8 development tools

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix OpenMPI install and setup

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Horovod Installation for SPR and CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Python3.8 version for CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo in TensorFlow 3d-unet partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a broken partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add TCMalloc to TF base container for SPR and remove OpenSSL

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove some repositories

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'matplotlib' for '3d-unet'

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* switch to build OpenMPI due to issue in Market Place provided version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYTORCH_WHEEL and IPEX_WHEEL arg values

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix and updates for PyTorch WW14-2022 SPR (#543)

* Fix and updates for PyTorch WW14-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix and updates for TensorFlow WW18-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix TensorFlow SPR nightly versions

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pre-trained models download URLs

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Install Python 3.8 development tools

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix OpenMPI install and setup

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Horovod Installation for SPR and CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Python3.8 version for CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo in TensorFlow 3d-unet partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a broken partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add TCMalloc to TF base container for SPR and remove OpenSSL

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Updates required to the base image

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove some repositories

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'matplotlib' for '3d-unet'

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* switch to build OpenMPI due to issue in Market Place provided version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYTORCH_WHEEL and IPEX_WHEEL arg values

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYT resnet50 quickstart scripts for both Linux and Windows (#547)

* fix quickstart scripts, detect platform type, update to run with pytorch only

* Fix SPR PyTorch MaskRCNN inference documentation for CHECKPOINT_DIR (#548)

* Enable bert large multi stream inference (#554)

* test bert multi stream module

* enable input split and output concat for accuracy run

* change the default num_streams batchsize cores to 56

* change ssd multi stream throughput to 1 core 1 batch

* change the default parameter for rn50 ssd multi stream module

* modify enable_ipex_for_squad.diff to align new multistream hint implementation

* enable warmup and multi socket support

* change default parameter for rn50 ssd multi stream inference

* Add train-no-eval for rn50 pytorch (#555)

* PyTorch SPR BERT large training updates (h5py and dataset instructions) and update LD_PRELOAD for SPR entrypoints (#550)

* Add h5py install to bert training dockerfile

* documentation updates

* update docs, and add input_preprocessing to the wrapper package

* Update LD_PRELOAD trailing :

* Fix syntax

* removing unnecessary change

* Update DLRM entrypoint

* Update docs to note that phase2 has bert_config.json in the CHECKPOINT_DIR

* Fix syntax

* increase shm-size to 10g

* [RNN-T training] Update scripts -- run on 1S (#561)

* Update maskrcnn training script to run on 1s (#562)

* use single node to do ssd-rn34 training (#563)

* Update training.sh (#564)

* Update training.sh (#565)

Use tcmalloc instead of jemalloc

* use single node to do resnet50 training (#568)

* add numactl -C and remove jit warm in main thread (#569)

* Update unit-test.yml (#546)

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Fixed make command, updated pip install.

Fixed make command to run from the root directory. Replaced pip install tox with a pip install -r requirements-tests.txt to install all dependencies for the tests.

* Add tox to test dependencies. 

Added tox to the dependencies so that the Workflow and others may install it with pip install -r requirements-test.txt and be covered for running make lint and make unit-test.

* Update unit-test.yml

Changed 'make unit-test' to 'make unit_test' as that is the actual target defined in the Makefile.

* Update unit-test.yml

Changed apt-get install command.

* re-enable int8 for api change (#579)

* separate full convergence test from training test (#581)

Co-authored-by: jianan-gu <[email protected]>

* ssd enable new int8 (#580)

* v1

* enable new int8 method

* Revert "ssd enable new int8 (#580)" (#584)

This reverts commit 9eb3211.

* Revert "re-enable int8 for api change (#579)" (#583)

This reverts commit 0bded92.

* Update training script using 1s (#560)

* Enable checkpoint during training for bert-large (#573)

* minor fix

* Add readme for enabling checkpoint

* update phase1 to enable checkpoint by default

* Update README.md

* Enable ssd bf32 inference training (#589)

* enable ssd bf32 inference

* enable ssd bf32 train

* enable RNN-T bf32 inference (#591)

* Enable bf32 for bert and distilbert for inference (#593)

* enable bf32 distilbert

* enable bert bf32

* Enable RNN-T bf32 training (#594)

* enable maskrcnn bf32 inference and training (#595)

* enable resnet50 and resnext101 bf16 path (#596)

* enable bert bf32 train (#600)

* update resnet int8 path using new int8 api (#603)

* re-enable int8 for api change (#604)

Co-authored-by: jianan-gu <[email protected]>

* Leslie/ssd enable new int8 (#605)

* v1

* enable new int8 method

* update json file

* add rn50 int8 weight sharing

Co-authored-by: Jiang, Xiaofei <[email protected]>

* update ssd training bs to a multiple of core numbers (#606)

* enable bf32 for dlrm (#607)

Co-authored-by: jianan-gu <[email protected]>

* Update IPEX new int8 API enabling for distilbert/bert-large (#608)

* enable distilbert

* enable bert

* fix max-ind-range and add memory info (#609)

Co-authored-by: jianan-gu <[email protected]>

* Remove debug code (#610)

* update training steps (#611)

* fix bandit scan fails (#612)

* PYT Image recognition models support on Windows (#549)

* fix all image recognition scripts to run on windows and linux with PYT, and only linux with IPEX

* [RNN-T training] fix bandit scan fails (#614)

* RNN-T inference: fix IMZ Bandit scan fails (#615)

* Update unit-test.yml (#570)

Changed the docker user credential to utilize GitHub Secret.

* MaskRCNN: fix IMZ Bandit scan fails (#623)

* Fix for horovod-related failures in TF nightly runs (#613)

* cpp17 horovod failure fix

* minor debugging changes

* minor fixes - directory name

* cleanup

* addressing reviewer comments

* Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 (#624)

* Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Set 'HOROVOD_WITH_MPI=1' explicitly

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update GCC version to GCC 9

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'horovodrun --check-build' for sanity check

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* remove force install inside Docker

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* [RNN-T training] Fix ddp sample number issue (#625)

* update BF32 usage (#627)

* resnet50 training: add warm up before collecting time (#628)

* image to bf16 (#629)

* Update end2end DLSA dockerfile due to SPR wheel path update and removing int8 patch (#631)

* Update mlpc path for SPR wheels

* remove patch

* Update Horovod commit id for BareMetal, Docker will be updated next (#630)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* fix dlrm convergence and change training performance BS to 32K (#633)

Co-authored-by: jianan-gu <[email protected]>

* [RNN-T training] Merge sh files to one (#635)

* update torch-ccl into 1.12 (#636)

* Liangan1/update torch ccl version (#637)

* Update torch_ccl version

* resnet50_distributed_training: don't set MASTER_ADDR by user (#638)

* Update torch_ccl in script (#639)

* Enable offline download distilbert (#632)

* enable offline download distilbert

* add convert

* Update README.md

* add accuracy.py

* add file

* refine download

* refine path

* refine path

* add license

* Update dlrm_s_pytorch.py (#643)

* Update README.md (#649)

* init pytorch T5 language model (#648)

* init pytorch T5 language model

* update README.md

* update doc

* update fpn models (#650)

* pytorch resnet50: directly call ipex.quantization (#653)

* fix int8 accuracy (#655)

Co-authored-by: Zhang, Weizhuo <[email protected]>

* Made fixes to the broken links (#652)

* Changed the ResNet50v1_5 version back to v2_7_0

* Update Security Center URL (#657)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Weizhuoz/fix for pt 1.12 (#656)

* fix vgg11_bn accuracy syntax error

* remove exact_match from roberta-base

* modify maskrcnn BS to 2*num_cores

* Update dlrm_s_pytorch.py (#660)

* Update dlrm_s_pytorch.py

Reduce int8 memory usage.

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Add BF32 DDP for bert-large (#663)

* Update run_ddp_bert_pretrain_phase1.sh

* Update run_ddp_bert_pretrain_phase2.sh

* Update README.md

* move OMP_NUM_THREADS=1 into dlrm_s_pytorch.py (#664)

minor changes

* remove rn50 ao (#665)

* Re-organize models list to be grouped by framework  (#654)

* re-organize models list to be grouped by framework

* update tensorflow ssd-resnet34 training dataset

* add T5 in benchmark/README.md

* manually set torch num threads only for int8 (#666)

* Update inference_performance.sh (#669)

* improve ssdrn34 perf. (#671)

* improve ssdrn34 perf.

* minor update.

* Fix linting

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix unit tests too

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Use IPEX Pytorch whls instead of building IPEX from source (#674)

* Use IPEX Pytorch whls instead of building IPEX from source

* Corrected the link to install pytorch/IPEX

* Corrected the link to install pytorch/IPEX

* Updated the link with latest tutorial to install pytorch/IPEX

* Update docs/general/pytorch/BareMetalSetup.md

Co-authored-by: Clayne Robison <[email protected]>

* Update docs/general/pytorch/BareMetalSetup.md

Co-authored-by: Clayne Robison <[email protected]>

* Made the suggested tweaks in the names

* Adding condition to install jemalloc and tcmalloc

Co-authored-by: Clayne Robison <[email protected]>

* Added condition to install jemalloc, tcmalloc, vision and torch-ccl

* Added some tweaks

Co-authored-by: Clayne Robison <[email protected]>
Co-authored-by: root <[email protected]>

* Lpot2inc (#446)

* draft for lpot quantization and perf analysis jupyter notebook

* update with formal name of model zoo, correct wrong words, add license in python file

* rm empty line

* rename LPOT to INC in text and code, and use new api

* Update README.md

* Update set_env.sh

* Update README.md

* Update ut.sh

* Update local_banchmark.sh

* Create local_benchmark.sh

* Update README.md

* Update inc_for_tensorflow.ipynb

* Update ut.sh

* Update README.md

* rename to local_benchmark.sh

* Update ut.sh

* Update ut.sh

* Update run_jupyter.sh

* Delete lpot_for_tensorflow.ipynb

* Delete lpot_quantize_model.py

* Update README.md

* Update README.md

* Update README.md

* Update inc_for_tensorflow.ipynb

* Update README.md

* Update README.md

* Update inc_for_tensorflow.ipynb

* Update requirements.txt

Co-authored-by: ltsai1 <[email protected]>

* Sriniva2/ssd rn34 (#682)

* improve ssdrn34 perf.

* minor update.

* enabling synthetic data.

* Update base_benchmark_util.py

* Fix linting error

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Add doc updates for '--synthetic-data' option (#683)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Change checkpoint setting for Bert train phase 1 (#602)

* Change checkpoint setting for Bert train phase 1

* fix model and config saving

* fix error when running gpu path (#686)

* fix load pretrained model error when using torch_ccl (#688)

* update py version in base spec (#678) (#690)

* TF addons upgrade to 0.17.1 (#689) (#691)

* updated tf addons version

* remove comment

* Update Dockerfiles prior to IMZ 2.8 release (#693)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update Documents prior to IMZ 2.8 release (#694)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update README.md (#697)

* change numpy version requirement (#703)

* Remove MiniGo training from IMZ (#644)

* remove MiniGo training scripts and unit test

* [RNN-T] [Inference] optimize the batch decoder (#711)

* reduce fill_ OP in rnnt embedding kernel

* optimize add between int and log to reduce dtype conversion

* rnnt: support dump tracing file and print profile table (#712)

* add support for openSUSE Leap operating system (#708)

* rnnt inference: pre convert data to bf16 (#713)

* remove squeeze/slice/transpose (#714)

* update resnet50 training code (#710)

* update resnet50 training code

* not using ipex optimize for resnet50 training

* use ipex.optimize() on the whole model (#718)

* resnet50 bf32: calling ipex.optimize to enable bf32 path (#719)

* Added batch size as an env variable to the quickstart scripts (#676)

* WIP: Adding batch size as an environment variable to the quickstart scripts

* Added instructions in README.md for all workloads

* Update README.md

* Corrected typo in launch_benchmark

* Made corrections to .docs and ran model-builder

* Delete .README.md.swp

* Delete .fp32_accuracy.sh.swp

* Update quickstart/image_segmentation/tensorflow/3d_unet_mlperf/inference/cpu/inference_throughput.sh

Co-authored-by: Clayne Robison <[email protected]>

* Update quickstart/language_translation/tensorflow/transformer_mlperf/inference/cpu/inference_realtime.sh

Co-authored-by: Clayne Robison <[email protected]>

* Update benchmarks/launch_benchmark.py

Co-authored-by: Clayne Robison <[email protected]>

* Made corrections to batch-size parameter

* Made changes in launch_benchmark for batch-size arg

* Made modifications to the README's

* Resolved merge conflict by keeping README.md file.

* Modified readme for windows

* Resolved merge conflict by keeping README.md file.

* Corrected SPR run.sh scripts

* Removed echo from run.sh

Co-authored-by: Clayne Robison <[email protected]>

* Added batchsize as an env variable to quickstart scripts (#680)

* Added batchsize as an env variable to quickstart scripts

* Made modifications to .docs and scripts

* Made modifications to README

* Resolved merge conflict by incorporating both suggestions.

* Made corrections in README.md

* Made corrections in README.md

* Undo changes in training.sh file

* updated readme: nit fix (#723)

Co-authored-by: Rahul Nair <[email protected]>

* compute throughput by test_mini_batch_size (#740)

* pytorch resnet50: fix bf32 training path error (#739)

* Fix a subtle 'E275' style issue that causes unknown behavior (#742)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* rearrange the paragraphs and fix Markdown headers (#744)

* Align Transformers version for BERT models (#738)

* align transformer version(4.18) for bert models

* change scripts to legacy

* redo calibration

* patch fix

* Update README.md (#746)

* Add support for stock PYT- object detection models (#732)

* stock PYT and windows support for object detection models

* Weizhuoz/reduce model zoo steps (#762)

* reduce steps for bert-base, roberta, fpn models

* modify max_iter for fpn models

* reduce all img classification models steps

* update new config for bert models (#763)

* Adding SciPy for TensorFlow serving SSD-MobileNet model (#764)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update TF ResNet50v1.5 inference for SPR (baremetal) (#749)

* Added matplotlib dependency to image_segmentation requirements (#768)

* Update readmes for the path to output directory (#769)

* update wide & deep readme for the path to pretrained model directory (#771)

* add a check for ubuntu 22.04 support (#721)

* Changes to add bfloat16 support for DIEN training (#679)

* Changes to add bfloat16 support for DIEN training
* Some fixes for reporting performance
* Fixes for dien training and unit tests

* updated tpp file with r2.8 approvals (#773)

* Add Windows stock PyTorch support for TransNet v2 (#779)

* update TransNet v2 to work with stock pytorch
* update Windows.md path in all relevant docs

* add P99 metric for LZ models (#780)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Rn50 training: output 1 KPI across multiple epochs and add training_steps argument. (#775)

* enable --training_steps and 1 training KPI output with multiple epochs

* add prefix

* update print freq

* fix display bug

* enable PyTorch resnet50 fp16 path (#783)

* enable PyTorch resnet50 fp16 path

* fix conflict

* Extract p99 metric from log to summary (#784)

* enable fp16 bert train and inference (#782)

* Vruddarr/pt update windows readmes (#778)

* remove bfloat16 experimental support note (#786)

* Update IPEX installation path (#788)

* Clean up _pycache_ files, remove symlinks, and add license headers for dien training bf16 (#787)

* update readme for jemalloc and iomp path (#789)

* update readme for jemalloc and iomp path

* Updated IOMP path as path to the intel-openmp directory

* PyTorch: fix resnext101 running script (#795)

* Update 3dunet mlperf bash scripts and README (#797)

* update 3dunet mlperf doc to use quickstart scripts, rename quickstart scripts for multi-instance

* fix tests job (#803)

* rnnt inference: align replace lstm API due to IPEX change (#802)

* Adding quick start scripts to MobileNetV1 bfloat16 precision (#793)

* Adding quick start scripts to ssd-mobilenet bfloat16 precision (#798)

* Update T5 model with windows quick start scripts (#790)

* Update T5 model with windows quick start scripts

* Updated Readme by specifying values to environment variables

* Update inference int8 readme and script of 4 CV models using INC (#698)

* update docs to add INC int8 models as an option
* add instructions for how to quantize a fp32 model using INC

* rnnt: fix stft due to PyTorch API change (#811)

* rnnt training: fix stft due to PyTorch API change (#813)

* Update BareMetalSetup.md (#817)

* Gerardod/build container (#807)

First phase of GHA WF to build the image of a Model Zoo workload container and push it to CAAS.

* Sharvils/tf workload (#808)

* TFv2.10 support added. Horovod version updated.

* Vruddarr/tf add language translation bert fp32 quick start scripts (#804)

* Adding quick start scripts to language translation BERT FP32 model

* Changed path to the Readme

* Adding spec file <bert-fp32-inference_spec.yml>

* Update spec file and model link in Readme tables

* Update Readme path in windows.md

* Updated TL notebooks for SPR Launch (#810)

* Updates for TL PyTorch notebook

* Edits for two more TL notebooks

* Reverting previous change for virtualenv

* Removed --no-deps and some nonexistent links

* Added TFHub cache dir

* Updated TL notebook README for legal/branding

* Update typo in Readme (#821)

* PyTorch: using ipex.optimize for bf16 training (#824)

* Fix CVEs for Pillow and notebook packages (#831)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* add intel-alphafold2 optimized w/ IPEX from realm of AIDD (#737)

* add alphafold2 from AIDD realm

* Remove unused variable in mlperf 3DUnet performance run (#832)

* Update Model Zoo name, Python version and message for IPEX (#833)

* Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updt… (#830)

* Update instructions for Miniconda, Jemalloc, PyTorch and IPEX, and update the readme by replacing conda with Miniconda.

* Adding comment to install torch in BareMetalSetup.md

* Adding IPEX version and removing *s

* Update models main tables (#836)

* update main readmes

* Adding jemalloc instructions and environment variables (#838)

* DLRM hybrid gradient product (#814)

* enable hybrid mergedembedding

* Hybrid Merge embedding

* refine code

* Update model file

* Fix data loader issue for distributed training

* Update the print info

* Fix lr issue for sparse table
both 2 and 8 ranks converge with 0.75 epochs

Co-authored-by: root <[email protected]>

* update the TTT evaluation method by excluding dataloader & metric evaluation (#844)

Co-authored-by: Zhang, Liangang <[email protected]>

* PyTorch: resnet50 distributed training using lars optimizer (#826)

* modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850)

* modify recall/precision/f1/ap 's eval as optional (#856)

* Port dataloader optimization for distributed training of dlrm (#847)

* update the TTT evaluation method by excluding dataloader & metric evaluation

* port dataloader optimization for distributed training of dlrm

* modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850)

* modify recall/precision/f1/ap 's eval as optional (#856)

* port dataloader optimization for distributed training of dlrm

* delete local bs computation in evaluation stage

* modify the TTT output name

Co-authored-by: Zhang, Liangang <[email protected]>

* Update horovod version to fix run time failure due to Status call (#859)

* fix regression for dlrm single node training (#864)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update pytorch model zoo table of BF32 with landing zoo models (#865)

* Added SNYK scan (#855)

* Update SSD-ResNet34 code in start.sh (#862)

* Add Distilbert base model for inference (Tensorflow) to model zoo (#815)

* Add fp32 inference for distilbert base model

* Fix Bert spec file (#873)

* 1) Add torch.profiler (#871)

2) change the distributed_training.sh for dlrm to diamond cluster

* Update Wide & Deep docs (#875)

* The copy of #867(Porting evaluation iteration overlapping) (#876)

* port evaluation overlapping

* remove debug code

* remove debug code

* remove unused code

* remove unused code

* add resnet50 distributed training script (#879)

* add resnet50 distributed training script

* collect TTT

Co-authored-by: XiaobingSuper <[email protected]>

* reduce redundant bus traffic (#880)

* Port all_to_all index overlapping with interaction and top mlp. (#878)

* port all_to_all index overlapping with interaction and top mlp

* fix seg fault

* Add int8 support for distilbert (#823)

* Add fp32 inference for distilbert base model
Co-authored-by: syedshahbaaz <[email protected]>

* Update DIEN inference docs & quickstart scripts (#869)

* Update DIEN docs
* update for spr ww42
Co-authored-by: WafaaT <[email protected]>

* Update ResNet50v1.5 docs (#820)

* Update and Validate ResNet50v1.5 Inference and training model for TF SPR
* Update and validate docs for TF SPR

Co-authored-by: WafaaT <[email protected]>

* Update Wide & Deep using Large Dataset docs (#877)

* Vruddarr/tf bfloat32 precision check (#893)

* Update Wide and Deep Large Dataset Training Model docs (#881)

* Vruddarr/tf update image recognition models docs (#816)

* Update Inceptionv3,DenseNet 169, Inceptionv4, ResNet50, ResNet101, MobileNet V1 quickstart scripts and docs

* Update and validate MobileNet v1 for TF SPR

Co-authored-by: WafaaT <[email protected]>

* Fix BFloat32 precision check code for Resnet50v1.5 training model (#894)

* Update 3DUNet MLperf for SPR (#889)

* Updated Bert Large SPR READMEs (#887)

* Included tensorflow and keras versions

* updated to downloaded bert checkpoints

* Fix typos in MobilenetV1 scripts (#899)

* modify time function to solve int8 benchmark issue on windows (#898)

* modify time function to solve int8 benchmark issue on windows

* Replace time.time calls with time.perf_counter to improve timing resolution. Updated for the additional 5 models
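
The switch described above can be illustrated with a minimal sketch. It is not the code from #898: `time_call` and `throughput` are hypothetical helper names, and the workload is a stand-in. It only shows why `time.perf_counter` suits short benchmark intervals, since it is a monotonic clock with the highest resolution the platform offers.

```python
import time


def time_call(fn, repeats=5):
    """Return the shortest wall-clock duration of fn() over `repeats` runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def throughput(batch_size, seconds):
    """Samples per second for one batch processed in `seconds`."""
    return batch_size / seconds


# Stand-in workload (hypothetical); a real benchmark would run a model step here.
duration = time_call(lambda: sum(range(10_000)))
```

With a coarse clock, a fast int8 step can measure as zero seconds and break a throughput division; a higher-resolution clock avoids that for intervals of this size.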

Co-authored-by: Ying <[email protected]>

* Update DIEN Training docs (#882)

* Adding permissions to scripts in DIEN and correcting pb file paths in README_SPR_baremetal (#901)

* Adding SPR_baremetal_readme and fixing model paths in the tables (#904)

* fix acc test for single node (#903)

* fix acc test for single node

* Update dlrm_s_pytorch.py

Co-authored-by: Weizhuo Zhang <[email protected]>

* commit cherry-picks from r2.9 (#900)

* update tbb files (#843)

* fix vulnerability issues reported by snyk scans (#848)

* upgrade for ipex 1.13

* Update Pillow to '>=9.3.0' (#884)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* fix some bugs for p99 (#909)

* Update tensorflow benchmarks to use latest horovod commit (#908)

* Update start.sh

* Update start.sh

* Update to use shortened commit hash

* do not convert data to bf16 while using fp32 and bf32 (#911)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update SSD-Resnet34 training docs for SPR task (#914)

* Update SSD-Resnet34 training & docs for SPR

* Vruddarr/tf update ssd mobilenet docs (#846)

* Update quick start scripts and spec file to run for all precisions

* Update and validate SSD-Mobilenet docs for TF SPR

Co-authored-by: WafaaT <[email protected]>

* fix print issue (#915)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update rfcn docs to use same quick start scripts (#897)

* Update rfcn docs to use same quick start scripts

Co-authored-by: WafaaT <[email protected]>

* Sharvils/spr ssd training (#917)

* Dockerfile updated

* Update SSD-ResNet34 Inference docs (#866)

* Update ResNet34 Inference to use same scripts & docs for all precisions

* Update for SPR WW42

Co-authored-by: WafaaT <[email protected]>

* Update transformer_mlperf scripts and README for SPR WW42 (#891)


Co-authored-by: Wafaa Taie <[email protected]>

* Update TF models spec files for SPR WW42 (#919)

* update TF models spec files for spr ww42

* update docker partial for tf addons version

* workaround rdma config for spr (#925)

* remove supported OS checks (#926)

* Update Model paths in main readme (#928)

* Remove Linux/windows OS platform support checks (#927)

* update resnet50 distributed training script (#923)

* resnet50 distributed training: use logical core for ccl (#930)

* Update bert scripts to add same quick start scripts to all precisions (#910)

* Update MobilenetV1 SPR docs (#931)

* Update Resnet50v1_5_SPR_docs (#934)

* Update SSD-Mobilenet SPR docs (#935)

* Update ResNet50v1.5 inference SPR docs (#933)

* Fix DIEN inference.sh script and add pretrained model env var in mobilenetv1  SPR baremetal readme (#939)

* Update DIEN Inference and Training SPR docs (#937)

* Update SSD-Resnet34 training SPR docs (#936)

* Update SSD-Resnet34 Inference SPR docs (#938)

* Update README_SPR_baremetal.md
remove steps and warm_up steps env vars

Co-authored-by: Wafaa Taie <[email protected]>

* BERT training dockerfile fixed (#921)

* BERT repo version fixed for SPR container (#920)

* Update spr baremetal instructions for 3dunet, bert large and transformer mlperf (#932)

* Update Transformer MLPerf inference docs for pre-trained models (#940)

* Fix Language Translation BERT quickstart scripts (#941)

* fix scripts to detect the number of cores

* Update mlperf_gnmt docs (#945)

* Updating Transformer_LT_official scripts (#913)

* Add support for dGPU models (#840) (#948)

* Add support for dGPU models (#840)

* upgrade Pillow version for Yolov4

* Update main README.md (#947)

* update main readme

* edit transformer_mlperf and bert SPR docs

* remove workflows

* Fix CVEs based on Snyk scans in TL notebooks (#951)

* fix snyk critical issues in TL jupyter notebooks

* Remove INC dependency for Snyk issues (#953)

* removed neuralcompressor to avoid vulnerability in Snyk scans

* Remove pointers to BERT Large int8 docs (#952)

* fix int8 model link (#958)

* Fixed num_intra_threads for bfloat16 (#959) (#960)

* Fixed num_intra_threads for bfloat16

* Modified open mpi instructions

* Added kmp_blocktime for bfloat16

Co-authored-by: mahathis <[email protected]>

* Fix syntax error and pythonpath in ssd-resnet34 training (#962) (#965)

Co-authored-by: Veena2207 <[email protected]>

* fix training bkms (#967) (#968)

* fix T5 inference script (#969)

---------

Signed-off-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: XiaobingZhang <[email protected]>
Co-authored-by: zhuhaozhe <[email protected]>
Co-authored-by: jianan-gu <[email protected]>
Co-authored-by: Dina Suehiro Jones <[email protected]>
Co-authored-by: Wang, Chuanqi <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: Weizhuo Zhang <[email protected]>
Co-authored-by: Melanie Buehler <[email protected]>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: leslie-fang-intel <[email protected]>
Co-authored-by: xiaofeij <[email protected]>
Co-authored-by: jiayisunx <[email protected]>
Co-authored-by: Sean-Michael Riesterer <[email protected]>
Co-authored-by: liangan1 <[email protected]>
Co-authored-by: blzheng <[email protected]>
Co-authored-by: Om Thakkar <[email protected]>
Co-authored-by: mahathis <[email protected]>
Co-authored-by: Srini511 <[email protected]>
Co-authored-by: Clayne Robison <[email protected]>
Co-authored-by: Neo Zhang Jianyu <[email protected]>
Co-authored-by: ltsai1 <[email protected]>
Co-authored-by: Jitendra Patil <[email protected]>
Co-authored-by: Kanvi Khanna <[email protected]>
Co-authored-by: Rahul Nair <[email protected]>
Co-authored-by: Veena2207 <[email protected]>
Co-authored-by: jojivk-intel-nervana <[email protected]>
Co-authored-by: xiangdong <[email protected]>
Co-authored-by: Huang, Zhiwei <[email protected]>
Co-authored-by: gera-aldama <[email protected]>
Co-authored-by: Sharvil Shah <[email protected]>
Co-authored-by: wyang2 <[email protected]>
Co-authored-by: Yimei Sun <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: tangleintel <[email protected]>
Co-authored-by: Syed Shahbaaz Ahmed <[email protected]>
Co-authored-by: Er-Xin (Edwin) Shang <[email protected]>
Co-authored-by: Ying <[email protected]>
Co-authored-by: sevdeawesome <[email protected]>
Co-authored-by: DiweiSun <[email protected]>
* [RNN-T training] Enable FP32 gemm using oneDNN (#531)

* Update the Readme guide for distilbert (#534)

* Update the Readme guide for distilbert

* Fix accuracy grep bug, and grep accuracy for distilbert

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update end2end public dockerfile to look for IPEX in the conda directory (#535)

* Notebook to script conversion example (#516)

* Add notebook script conversion example

* Fixed doc

* Replaces custom preprocessor with built-in one

* Changed tag to remove_for_custom_dataset

* Add URL check prior to calling urlretrieve (#538)

* Add URL check prior to calling urlretrieve

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo

Signed-off-by: Abolfazl Shahbazi <[email protected]>
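
A URL check of the kind named in #538 might look like the sketch below. The commit message does not say what the check is, so the scheme allow-list, the `ALLOWED_SCHEMES` constant, and both function names are assumptions made for illustration. Only `urlparse` and `urlretrieve` are real standard-library calls.

```python
from urllib.parse import urlparse
from urllib.request import urlretrieve

# Assumption: only web URLs are acceptable download sources.
ALLOWED_SCHEMES = ("http", "https")


def is_allowed_url(url):
    """Return True when the URL uses an allowed scheme and names a host."""
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


def safe_urlretrieve(url, filename):
    """Validate the URL, then download it to `filename` with urlretrieve."""
    if not is_allowed_url(url):
        raise ValueError(f"Refusing to download from unsupported URL: {url!r}")
    return urlretrieve(url, filename)
```

Rejecting schemes such as `file:` before the call keeps `urlretrieve` from reading local paths, which is the concern static scanners usually raise about it.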

* disable for ssd since fused cat cat kernel is slow (#537)

* fix bug when adding steps in rnnt inference (#528)

* Fix and updates for TensorFlow WW18-2022 SPR (#542)

* Fix and updates for TensorFlow WW18-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix TensorFlow SPR nightly versions

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pre-trained models download URLs

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Install Python 3.8 development tools

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix OpenMPI install and setup

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Horovod Installation for SPR and CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Python3.8 version for CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo in TensorFlow 3d-unet partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a broken partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add TCMalloc to TF base container for SPR and remove OpenSSL

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove some repositories

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'matplotlib' for '3d-unet'

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* switch to build OpenMPI due to issue in Market Place provided version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYTORCH_WHEEL and IPEX_WHEEL arg values

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix and updates for PyTorch WW14-2022 SPR (#543)

* Fix and updates for PyTorch WW14-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix and updates for TensorFlow WW18-2022 SPR

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix TensorFlow SPR nightly versions

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update pre-trained models download URLs

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Install Python 3.8 development tools

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix OpenMPI install and setup

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update to Horovod commit 11c1389 to fix TF v2.9 + Horovod install failure (#519)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Horovod Installation for SPR and CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix Python3.8 version for CentOS

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a typo in TensorFlow 3d-unet partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix a broken partial

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add TCMalloc to TF base container for SPR and remove OpenSSL

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Updates required to the base image

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Remove some repositories

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'matplotlib' for '3d-unet'

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* switch to build OpenMPI due to issue in Market Place provided version

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYTORCH_WHEEL and IPEX_WHEEL arg values

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix PYT resnet50 quickstart scripts for both Linux and Windows (#547)

* fix quickstart scripts, detect platform type, update to run with pytorch only

* Fix SPR PyTorch MaskRCNN inference documentation for CHECKPOINT_DIR (#548)

* Enable bert large multi stream inference (#554)

* test bert multi stream module

* enable input split and output concat for accuracy run

* change the default num_streams batchsize cores to 56

* change ssd multi stream throughput to 1 core 1 batch

* change the default parameter for rn50 ssd multi stream module

* modify enable_ipex_for_squad.diff to align new multistream hint implementation

* enable warmup and multi socket support

* change default parameter for rn50 ssd multi stream inference

* Add train-no-eval for rn50 pytorch (#555)

* PyTorch SPR BERT large training updates (h5py and dataset instructions) and update LD_PRELOAD for SPR entrypoints (#550)

* Add h5py install to bert training dockerfile

* documentation updates

* update docs, and add input_preprocessing to the wrapper package

* Update LD_PRELOAD trailing :

* Fix syntax

* removing unnecessary change

* Update DLRM entrypoint

* Update docs to note that phase2 has bert_config.json in the CHECKPOINT_DIR

* Fix syntax

* increase shm-size to 10g

* [RNN-T training] Update scripts -- run on 1S (#561)

* Update maskrcnn training script to run on 1s (#562)

* use single node to do ssd-rn34 training (#563)

* Update training.sh (#564)

* Update training.sh (#565)

Use tcmalloc instead of jemalloc

* use single node to do resnet50 training (#568)

* add numactl -C and remove jit warm in main thread (#569)

* Update unit-test.yml (#546)

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Update unit-test.yml

* Fixed make command, updated pip install.

Fixed make command to run from the root directory. Replaced pip install tox with a pip install -r requirements-tests.txt to install all dependencies for the tests.

* Add tox to test dependencies. 

Added tox to the dependencies so that the Workflow and others may install it with pip install -r requirements-test.txt and be covered for running make lint and make unit-test.

* Update unit-test.yml

Changed 'make unit-test' to 'make unit_test' as that is the actual target defined in the Makefile.

* Update unit-test.yml

Changed apt-get install command.

* re-enable int8 for api change (#579)

* separate full convergence test from training test (#581)

Co-authored-by: jianan-gu <[email protected]>

* ssd enable new int8 (#580)

* v1

* enable new int8 method

* Revert "ssd enable new int8 (#580)" (#584)

This reverts commit 9eb3211.

* Revert "re-enable int8 for api change (#579)" (#583)

This reverts commit 0bded92.

* Update training script using 1s (#560)

* Enable checkpoint during training for bert-large (#573)

* minor fix

* Add readme for enabling checkpoint

* update phase1 to enable checkpoint by default

* Update README.md

* Enable ssd bf32 inference training (#589)

* enable ssd bf32 inference

* enable ssd bf32 train

* enable RNN-T bf32 inference (#591)

* Enable bf32 for bert and distilbert for inference (#593)

* enable bf32 distilbert

* enable bert bf32

* Enable RNN-T bf32 training (#594)

* enable maskrcnn bf32 inference and training (#595)

* enable resnet50 and resnext101 bf16 path (#596)

* enable bert bf32 train (#600)

* update resnet int8 path using new int8 api (#603)

* re-enable int8 for api change (#604)

Co-authored-by: jianan-gu <[email protected]>

* Leslie/ssd enable new int8 (#605)

* v1

* enable new int8 method

* update json file

* add rn50 int8 weight sharing

Co-authored-by: Jiang, Xiaofei <[email protected]>

* update ssd training bs to a multiple of the core count (#606)

* enable bf32 for dlrm (#607)

Co-authored-by: jianan-gu <[email protected]>

* Update IPEX new int8 API enabling for distilbert/bert-large (#608)

* enable distilbert

* enable bert

* fix max-ind-range and add memory info (#609)

Co-authored-by: jianan-gu <[email protected]>

* Remove debug code (#610)

* update training steps (#611)

* fix bandit scan fails (#612)

* PYT Image recognition models support on Windows (#549)

* fix all image recognition scripts to run on windows and linux with PYT, and only linux with IPEX

* [RNN-T training] fix bandit scan fails (#614)

* RNN-T inference: fix IMZ Bandit scan fails (#615)

* Update unit-test.yml (#570)

Changed the docker user credential to utilize GitHub Secret.

* MaskRCNN: fix IMZ Bandit scan fails (#623)

* Fix for horovod-related failures in TF nightly runs (#613)

* cpp17 horovod failure fix

* minor debugging changes

* minor fixes - directory name

* cleanup

* addressing reviewer comments

* Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34 (#624)

* Minor fix for Horovod install and adding 'tf_slim' for SSD ResNet34

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Set 'HOROVOD_WITH_MPI=1' explicitly

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* update GCC version to GCC 9

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Add 'horovodrun --check-build' for sanity check

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* remove force install inside Docker

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* [RNN-T training] Fix ddp sample number issue (#625)

* update BF32 usage (#627)

* resnet50 training: add warm up before collecting time (#628)

* image to bf16 (#629)

* Update end2end DLSA dockerfile due to SPR wheel path update and removing int8 patch (#631)

* Update mlpc path for SPR wheels

* remove patch

* Update Horovod commit id for BareMetal, Docker will be updated next (#630)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* fix dlrm convergence and change training performance BS to 32K (#633)

Co-authored-by: jianan-gu <[email protected]>

* [RNN-T training] Merge sh files to one (#635)

* update torch-ccl into 1.12 (#636)

* Liangan1/update torch ccl version (#637)

* Update torch_ccl version

* resnet50_distributed_training: don't set MASTER_ADDR by user (#638)

* Update torch_ccl in script (#639)

* Enable offline download distilbert (#632)

* enable offline download distilbert

* add convert

* Update README.md

* add accuracy.py

* add file

* refine download

* refine path

* refine path

* add license

* Update dlrm_s_pytorch.py (#643)

* Update README.md (#649)

* init pytorch T5 language model (#648)

* init pytorch T5 language model

* update README.md

* update doc

* update fpn models (#650)

* pytorch resnet50: directly call ipex.quantization (#653)

* fix int8 accuracy (#655)

Co-authored-by: Zhang, Weizhuo <[email protected]>

* Made fixes to the broken links (#652)

* Update Security Center URL (#657)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Weizhuoz/fix for pt 1.12 (#656)

* fix vgg11_bn accuracy syntax error

* remove exact_match from roberta-base

* modify maskrcnn BS to 2*num_cores

* Update dlrm_s_pytorch.py (#660)

* Update dlrm_s_pytorch.py

Reduce int8 memory usage.

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Update dlrm_s_pytorch.py

* Add BF32 DDP for bert-large (#663)

* Update run_ddp_bert_pretrain_phase1.sh

* Update run_ddp_bert_pretrain_phase2.sh

* Update README.md

* move OMP_NUM_THREADS=1 into dlrm_s_pytorch.py (#664)

minor changes

* remove rn50 ao (#665)

* Re-organize models list to be grouped by framework  (#654)

* re-organize models list to be grouped by framework

* update tensorflow ssd-resnet34 training dataset

* add T5 in benchmark/README.md

* manually set torch num threads only for int8 (#666)

* Update inference_performance.sh (#669)

* improve ssdrn34 perf. (#671)

* improve ssdrn34 perf.

* minor update.

* Fix linting

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Fix unit tests too

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Use IPEX Pytorch whls instead of building IPEX from source (#674)

* Use IPEX Pytorch whls instead of building IPEX from source

* Corrected the link to install pytorch/IPEX

* Corrected the link to install pytorch/IPEX

* Updated the link with latest tutorial to install pytorch/IPEX

* Update docs/general/pytorch/BareMetalSetup.md

Co-authored-by: Clayne Robison <[email protected]>

* Update docs/general/pytorch/BareMetalSetup.md

Co-authored-by: Clayne Robison <[email protected]>

* Made the suggested tweaks in the names

* Adding condition to install jemalloc and tcmalloc

Co-authored-by: Clayne Robison <[email protected]>

* Added condition to install jemalloc, tcmalloc, vision and torch-ccl

* Added some tweaks

Co-authored-by: Clayne Robison <[email protected]>

* Lpot2inc (#446)

* draft for lpot quantization and perf analysis jupyter notebook

* update with formal name of model zoo, correct wrong words, add license in python file

* rm empty line

* rename LPOT to INC in text and code, and use new api

* Update README.md

* Update set_env.sh

* Update README.md

* Update ut.sh

* Update local_banchmark.sh

* Create local_benchmark.sh

* Update README.md

* Update inc_for_tensorflow.ipynb

* Update ut.sh

* Update README.md

* rename to local_benchmark.sh

* Update ut.sh

* Update ut.sh

* Update run_jupyter.sh

* Delete lpot_for_tensorflow.ipynb

* Delete lpot_quantize_model.py

* Update README.md

* Update README.md

* Update README.md

* Update inc_for_tensorflow.ipynb

* Update README.md

* Update README.md

* Update inc_for_tensorflow.ipynb

* Update requirements.txt

Co-authored-by: ltsai1 <[email protected]>

* Sriniva2/ssd rn34 (#682)

* improve ssdrn34 perf.

* minor update.

* enabling synthetic data.

* Update base_benchmark_util.py

* Fix linting error

Signed-off-by: Abolfazl Shahbazi <[email protected]>

Co-authored-by: Abolfazl Shahbazi <[email protected]>

* Add doc updates for '--synthetic-data' option (#683)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Change checkpoint setting for Bert train phase 1 (#602)

* Change checkpoint setting for Bert train phase 1

* fix model and config saving

* fix error when running gpu path (#686)

* fix load pretrained model error when using torch_ccl (#688)

* update py version in base spec (#678) (#690)

* TF addons upgrade to 0.17.1 (#689) (#691)

* updated tf addons version

* remove comment

* Update Dockerfiles prior to IMZ 2.8 release (#693)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update Documents prior to IMZ 2.8 release (#694)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update README.md (#697)

* change numpy version requirement (#703)

* Remove MiniGo training from IMZ (#644)

* remove MiniGo training scripts and unit test

* [RNN-T] [Inference] optimize the batch decoder (#711)

* reduce fill_ OP in rnnt embedding kernel

* optimize add between int and log to reduce dtype conversion

* rnnt: support dump tracing file and print profile table (#712)

* add support for open SUSE leap operating system (#708)

* rnnt inference: pre convert data to bf16 (#713)

* remove squeeze/slice/transpose (#714)

* update resnet50 training code (#710)

* update resnet50 training code

* not using ipex optimize for resnet50 training

* use ipex.optimize() on the whole model (#718)

* resnet50 bf32: calling ipex.optimize to enable bf32 path (#719)

* Added batch size as an env variable to the quickstart scripts (#676)

* WIP: Adding batch size as an environment variable to the quickstart scripts

* Added instructions in README.md for all workloads

* Update README.md

* Corrected typo in launch_benchmark

* Made corrections to .docs and ran model-builder

* Delete .README.md.swp

* Delete .fp32_accuracy.sh.swp

* Update quickstart/image_segmentation/tensorflow/3d_unet_mlperf/inference/cpu/inference_throughput.sh

Co-authored-by: Clayne Robison <[email protected]>

* Update quickstart/language_translation/tensorflow/transformer_mlperf/inference/cpu/inference_realtime.sh

Co-authored-by: Clayne Robison <[email protected]>

* Update benchmarks/launch_benchmark.py

Co-authored-by: Clayne Robison <[email protected]>

* Made corrections to batch-size parameter

* Made changes in launch_benchmark for batch-size arg

* Made modifications to the README's

* Resolved merge conflict by keeping README.md file.

* Modified readme for windows

* Resolved merge conflict by keeping README.md file.

* Corrected SPR run.sh scripts

* Removed echo from run.sh

Co-authored-by: Clayne Robison <[email protected]>
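
The batch-size change in #676 can be sketched as follows. This is an illustration under assumptions, not the repository's code: the variable name `BATCH_SIZE`, the helper names, and the fallback behaviour are hypothetical, and the `--batch-size` flag is inferred from the "batch-size arg" mentioned in the commit list.

```python
import os


def resolve_batch_size(default, env=None):
    """Return the batch size from BATCH_SIZE (assumed name), or `default` if unset."""
    env = os.environ if env is None else env
    raw = env.get("BATCH_SIZE", "").strip()
    if not raw:
        return default  # variable unset or empty: keep the script's default
    value = int(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"BATCH_SIZE must be positive, got {value}")
    return value


def build_args(default_batch_size, env=None):
    """Build the argument pair a launcher would receive for the batch size."""
    return ["--batch-size", str(resolve_batch_size(default_batch_size, env))]
```

Passing `env` explicitly keeps the helper testable without mutating the process environment; a quickstart wrapper would call `build_args(128)` and rely on `os.environ`.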

* Added batchsize as an env variable to quickstart scripts (#680)

* Added batchsize as an env variable to quickstart scripts

* Made modifications to .docs and scripts

* Made modifications to README

* Resolved merge conflict by incorporating both suggestions.

* Made corrections in README.md

* Made corrections in README.md

* Undo changes in training.sh file

* updated readme: nit fix (#723)

Co-authored-by: Rahul Nair <[email protected]>

* compute throughput by test_mini_batch_size (#740)

* pytorch resnet50: fix bf32 training path error (#739)

* Fix a subtle 'E275' style issue that causes unknown behavior (#742)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* rearrange the paragraphs and fix Markdown headers (#744)

* Align Transformers version for BERT models (#738)

* align transformer version(4.18) for bert models

* change scripts to legacy

* redo calibration

* patch fix

* Update README.md (#746)

* Add support for stock PYT- object detection models (#732)

* stock PYT and windows support for object detection models

* Weizhuoz/reduce model zoo steps (#762)

* reduce steps for bert-base, roberta, fpn models

* modify max_iter for fpn models

* reduce all img classification models steps

* update new config for bert models (#763)

* Add SciPy for TensorFlow serving SSD-MobileNet model (#764)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* Update TF ResNet50v1.5 inference for SPR (baremetal) (#749)

* Added matplotlib dependency to image_segmentation requirements (#768)

* Update readmes for the path to output directory (#769)

* update wide & deep readme for the path to pretrained model directory (#771)

* add a check for ubuntu 22.04 support (#721)

* Changes to add bfloat16 support for DIEN training (#679)

* Changes to add bfloat16 support for DIEN training
* Some fixes for reporting performance
* Fixes for dien training and unit tests

* updated tpp file with r2.8 approvals (#773)

* Add Windows stock PyTorch support for TransNet v2 (#779)

* update TransNet v2 to work with stock pytorch
* update Windows.md path in all relevant docs

* add P99 metric for LZ models (#780)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Rn50 training multiple epochs output 1 KPI and add training_steps argument. (#775)

* enable --training_steps and 1 training KPI output with multiple epochs

* add prefix

* update print freq

* fix display bug

* enable PyTorch resnet50 fp16 path (#783)

* enable PyTorch resnet50 fp16 path

* fix conflict

* Extract p99 metric from log to summary (#784)

* enable fp16 bert train and inference (#782)

* Vruddarr/pt update windows readmes (#778)

* remove bfloat16 experimental support note (#786)

* Update IPEX installation path (#788)

* Clean up _pycache_ files, remove symlinks, and add license headers for dien training bf16 (#787)

* update readme for jemalloc and iomp path (#789)

* update readme for jemalloc and iomp path

* Updated IOMP path as path to the intel-openmp directory

* PyTorch: fix resnext101 running script (#795)

* Update 3dunet mlperf bash scripts and README (#797)

* update 3dunet mlperf doc to use quickstart scripts, rename quickstart scripts for multi-instance

* fix tests job (#803)

* rnnt inference: align replace lstm API due to IPEX change (#802)

* Adding quick start scripts to MobileNetV1 bfloat16 precision (#793)

* Adding quick start scripts to MobileNetV1 bfloat16 precision

* Adding executable permissions to files

* Adding aikit.md to docs file

* updated the comments on readme

Co-authored-by: veena.mounika.ruddarraju <[email protected]>

* Adding quick start scripts to ssd-mobilenet bfloat16 precision (#798)

* Adding quick start scripts to ssd-mobilenet bfloat16 precision

* changed file permissions

* Updated comments on readme file

Co-authored-by: veena.mounika.ruddarraju <[email protected]>

* Update T5 model with windows quick start scripts (#790)

* Update T5 model with windows quick start scripts

* Updated Readme by specifying values to environment variables

* Update inference int8 readme and script of 4 CV models using INC (#698)

* update docs to add INC int8 models as an option
* add instructions for how to quantize a fp32 model using INC

* rnnt: fix stft due to PyTorch API change (#811)

* rnnt training: fix stft due to PyTorch API change (#813)

* Update BareMetalSetup.md (#817)

* Gerardod/build container (#807)

First phase of GHA WF to build the image of a Model Zoo workload container and push it to CAAS.

* Sharvils/tf workload (#808)

* TFv2.10 support added. Horovod version updated.

* Vruddarr/tf add language translation bert fp32 quick start scripts (#804)

* Adding quick start scripts to language translation BERT FP32 model

* Corrected typo errors

* Changed path to the Readme

* Adding spec file <bert-fp32-inference_spec.yml>

* Update spec file and model link in Readme tables

* Update Readme path in windows.md

* Updated TL notebooks for SPR Launch (#810)

* Updates for TL PyTorch notebook

* Edits for two more TL notebooks

* Reverting previous change for virtualenv

* Removed --no-deps and some nonexistent links

* Added TFHub cache dir

* Updated TL notebook README for legal/branding

* Update typo in Readme (#821)

* PyTorch: using ipex.optimize for bf16 training (#824)

* Fix CVEs for Pillow and notebook packages (#831)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* add intel-alphafold2 optimized w/ IPEX from realm of AIDD (#737)

* add alphafold2 from AIDD realm

* Remove unused variable in mlperf 3DUnet performance run (#832)

* Update Model Zoo name, Python version and message for IPEX (#833)

Co-authored-by: veena.mounika.ruddarraju <[email protected]>

* Update instruction for Miniconda, Jemalloc, PyTorch and IPEX and updt… (#830)

* Update instructions for Miniconda, Jemalloc, PyTorch and IPEX, and update the readme to replace conda with Miniconda.

* Adding comment to install torch in BareMetalSetup.md

* Update models main tables (#836)

* update main readmes

* Adding jemalloc instructions and environment variables (#838)

* DLRM hybrid gradient product (#814)

* enable hybrid mergedembedding

* Hybrid Merge embedding

* refine code

* Update model file

* Fix data loader issue for distributed training

* Update the print info

* Fix lr issue for sparse table
both 2-rank and 8-rank runs converge in 0.75 epochs

Co-authored-by: root <[email protected]>

* update the TTT evaluation method by excluding dataloader & metric evaluation (#844)

Co-authored-by: Zhang, Liangang <[email protected]>

* PyTorch: resnet50 distributed training using lars optimizer (#826)

* modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850)

* modify recall/precision/f1/ap 's eval as optional (#856)

* Port dataloader optimization for distributed training of dlrm (#847)

* update the TTT evaluation method by excluding dataloader & metric evaluation

* port dataloader optimization for distributed training of dlrm

* modify dlrm's sklearn metric eval func to ipex's multi-thread version (#850)

* modify recall/precision/f1/ap 's eval as optional (#856)

* port dataloader optimization for distributed training of dlrm

* delete local bs computation in evaluation stage

* modify the TTT output name

Co-authored-by: Zhang, Liangang <[email protected]>

* Update horovod version to fix run time failure due to Status call (#859)

* fix regression for dlrm single node training (#864)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update pytorch model zoo table of BF32 with landing zoo models (#865)

* Added SNYK scan (#855)

* Update SSD-ResNet34 code in start.sh (#862)

* Add Distilbert base model for inference (Tensorflow) to model zoo (#815)

* Add fp32 inference for distilbert base model

* Fix Bert spec file (#873)

* 1) Add torch.profiler (#871)

2) change the distributed_training.sh for dlrm to diamond cluster

* Update Wide & Deep docs (#875)

* The copy of #867(Porting evaluation iteration overlapping) (#876)

* port evaluation overlapping

* remove debug code

* remove debug code

* remove unused code

* remove unused code

* add resnet50 distributed training script (#879)

* add resnet50 distributed training script

* collect TTT

Co-authored-by: XiaobingSuper <[email protected]>

* reduce redundant bus traffic (#880)

* Port all_to_all index overlapping with interaction and top mlp. (#878)

* port all_to_all index overlapping with interaction and top mlp

* fix seg fault

* Add int8 support for distilbert (#823)

* Add fp32 inference for distilbert base model
Co-authored-by: syedshahbaaz <[email protected]>

* Update DIEN inference docs & quickstart scripts (#869)

* Update DIEN docs
* update for spr ww42
Co-authored-by: WafaaT <[email protected]>

* Update ResNet50v1.5 docs (#820)

* Update and Validate ResNet50v1.5 Inference and training model for TF SPR
* Update and validate docs for TF SPR

Co-authored-by: WafaaT <[email protected]>

* Update Wide & Deep using Large Dataset docs (#877)

* Vruddarr/tf bfloat32 precision check (#893)

* Update Wide and Deep Large Dataset Training Model docs (#881)

* Vruddarr/tf update image recognition models docs (#816)

* Update Inceptionv3,DenseNet 169, Inceptionv4, ResNet50, ResNet101, MobileNet V1 quickstart scripts and docs

* Update and validate MobileNet v1 for TF SPR

Co-authored-by: WafaaT <[email protected]>

* Fix BFloat32 precision check code for Resnet50v1.5 training model (#894)

* Update 3DUNet MLperf for SPR (#889)

* Updated Bert Large SPR READMEs (#887)

* Updated Bert Large SPR READMEs

* Included tensorflow and keras versions

* Updated bert large README for spr

* Updated scripts and README as per reviews

* Update SPR quickstart description

* updated to downloaded bert checkpoints

* Fix typos in MobilenetV1 scripts (#899)

* modify time function to solve int8 benchmark issue on windows (#898)

* modify time function to solve int8 benchmark issue on windows

* Replace the time.time function calls with time.perf_counter to improve timing resolution. Updated for the additional 5 models

Co-authored-by: Ying <[email protected]>

* Update DIEN Training docs (#882)

* Adding permissions to scripts in DIEN and correcting pb file paths in README_SPR_baremetal (#901)

* Adding SPR_baremetal_readme and fixing model paths in the tables (#904)

* fix acc test for single node (#903)

* fix acc test for single node

* Update dlrm_s_pytorch.py

Co-authored-by: Weizhuo Zhang <[email protected]>

* commit cherry-picks from r2.9 (#900)

* update tbb files (#843)

* fix vulnerability issues reported by snyk scans (#848)

* upgrade for ipex 1.13

* Update Pillow to '>=9.3.0' (#884)

Signed-off-by: Abolfazl Shahbazi <[email protected]>

* fix some bugs for p99 (#909)

* Update tensorflow benchmarks to use latest horovod commit (#908)

* Update start.sh

* Update start.sh

* Update to use shortened commit hash

* do not convert data to bf16 while using fp32 and bf32 (#911)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update SSD-Resnet34 training docs for SPR task (#914)

* Update SSD-Resnet34 training & docs for SPR

* Vruddarr/tf update ssd mobilenet docs (#846)

* Update quick start scripts and spec file to run for all precisions

* Update and validate SSD-Mobilenet docs for TF SPR

Co-authored-by: WafaaT <[email protected]>

* fix print issue (#915)

Co-authored-by: Weizhuo Zhang <[email protected]>

* Update rfcn docs to use same quick start scripts (#897)

* Update rfcn docs to use same quick start scripts

Co-authored-by: WafaaT <[email protected]>

* Sharvils/spr ssd training (#917)

* Dockerfile updated

* Update SSD-ResNet34 Inference docs (#866)

* Update ResNet34 Inference to use same scripts & docs for all precisions

* Update for SPR WW42

Co-authored-by: WafaaT <[email protected]>

* Update transformer_mlperf scripts and README for SPR WW42 (#891)

Co-authored-by: Wafaa Taie <[email protected]>

* Update TF models spec files for SPR WW42 (#919)

* update TF models spec files for spr ww42

* update docker partial for tf addons version

* workaround rdma config for spr (#925)

* remove supported OS checks (#926)

* Update Model paths in main readme (#928)

* Remove Linux/windows OS platform support checks (#927)

* update resnet50 distributed training script (#923)

* resnet50 distributed training: use logical core for ccl (#930)

* Update bert scripts to add same quick start scripts to all precisions (#910)

* Update MobilenetV1 SPR docs (#931)

* Update Resnet50v1_5_SPR_docs (#934)

* Update SSD-Mobilenet SPR docs (#935)

* Update ResNet50v1.5 inference SPR docs (#933)

* Fix DIEN inference.sh script and add pretrained model env var in mobilenetv1 SPR baremetal readme (#939)

* Update DIEN Inference and Training SPR docs (#937)

* Update SSD-Resnet34 training SPR docs (#936)

* Update SSD-Resnet34 Inference SPR docs (#938)

* Update README_SPR_baremetal.md
remove steps and warm_up steps env vars

Co-authored-by: Wafaa Taie <[email protected]>

* BERT training dockerfile fixed (#921)

* BERT repo version fixed for SPR container (#920)

* Update spr baremetal instructions for 3dunet, bert large and transformer mlperf (#932)

* Update Transformer MLPerf inference docs for pre-trained models (#940)

* Fix Language Translation BERT quickstart scripts (#941)

* fix scripts to detect the number of cores

* Update mlperf_gnmt docs (#945)

* Updating Transformer_LT_official scripts (#913)

* Add support for dGPU models (#840) (#948)

* Add support for dGPU models (#840)

* upgrade Pillow version for Yolov4

* Update main README.md (#947)

* update main readme

* edit transformer_mlperf and bert SPR docs

* remove workflows

* Fix CVEs based on Snyk scans in TL notebooks (#951)

* fix snyk critical issues in TL jupyter notebooks

* Remove INC dependency for Snyk issues (#953)

* removed neuralcompressor to avoid vulnerability in Snyk scans

* Remove pointers to BERT Large int8 docs (#952)

* fix int8 model link (#958)

* Fixed num_intra_threads for bfloat16 (#959) (#960)

* Fixed num_intra_threads for bfloat16

* Modified open mpi instructions

* Added kmp_blocktime for bfloat16

Co-authored-by: mahathis <[email protected]>

* Fix syntax error and pythonpath in ssd-resnet34 training (#962) (#965)

Co-authored-by: Veena2207 <[email protected]>

* fix training bkms (#967) (#968)

* fix T5 inference script (#969)

* Fix resnet50v1.5 weightsharing for int8 (#996)

* Corrected typo in SPR quickstart scripts (#991)

* fix model_init for int8 weightsharing

---------

Co-authored-by: mahathis <[email protected]>

* TF SPR DevCatalog READMEs (#983)

* add image recognition devcats

* add tf object detection devcats

* add TF language translation devcats

* add tf image segmentation devcats

* add tf language modeling devcats

* add recommendation tf devcats

* fix swapped containers and precision in run command

* add README_SPR to all getting started links and correct script names

* rename files and point getting started to itself

* fix last link

* fix minor error (#994)

* Update TF SPR ww42 containers partials, spec-files and dockerfiles (#998)

TF SPR Containers Built and Validated

* Sharvils/tf devcats fixes (#995)

Minor fixes to SPR TF DevCatalogs
---------

Co-authored-by: sharvil.shah

* SPR PyTorch DevCatalogs (#993)

Added Devcatalog files targeting SPR container launch

* Delete SPR containers README_SPR.md (#999)

* delete README_SPR.md

* remove references in spec-files

* fix for auto-merge

---------

Signed-off-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: YanbingJiang <[email protected]>
Co-authored-by: jianan-gu <[email protected]>
Co-authored-by: Weizhuo Zhang <[email protected]>
Co-authored-by: Dina Suehiro Jones <[email protected]>
Co-authored-by: Melanie Buehler <[email protected]>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
Co-authored-by: leslie-fang-intel <[email protected]>
Co-authored-by: xiaofeij <[email protected]>
Co-authored-by: jiayisunx <[email protected]>
Co-authored-by: zhuhaozhe <[email protected]>
Co-authored-by: XiaobingZhang <[email protected]>
Co-authored-by: Sean-Michael Riesterer <[email protected]>
Co-authored-by: liangan1 <[email protected]>
Co-authored-by: Chunyuan WU <[email protected]>
Co-authored-by: blzheng <[email protected]>
Co-authored-by: Om Thakkar <[email protected]>
Co-authored-by: mahathis <[email protected]>
Co-authored-by: Srini511 <[email protected]>
Co-authored-by: Clayne Robison <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: Neo Zhang Jianyu <[email protected]>
Co-authored-by: ltsai1 <[email protected]>
Co-authored-by: Jitendra Patil <[email protected]>
Co-authored-by: Kanvi Khanna <[email protected]>
Co-authored-by: Rahul Nair <[email protected]>
Co-authored-by: Veena2207 <[email protected]>
Co-authored-by: jojivk-intel-nervana <[email protected]>
Co-authored-by: xiangdong <[email protected]>
Co-authored-by: Huang, Zhiwei <[email protected]>
Co-authored-by: gera-aldama <[email protected]>
Co-authored-by: Sharvil Shah <[email protected]>
Co-authored-by: wyang2 <[email protected]>
Co-authored-by: Yimei Sun <[email protected]>
Co-authored-by: root <[email protected]>
Co-authored-by: tangleintel <[email protected]>
Co-authored-by: Syed Shahbaaz Ahmed <[email protected]>
Co-authored-by: Er-Xin (Edwin) Shang <[email protected]>
Co-authored-by: Ying <[email protected]>
Co-authored-by: sevdeawesome <[email protected]>
Co-authored-by: DiweiSun <[email protected]>
Co-authored-by: Tyler Titsworth <[email protected]>
Co-authored-by: Srikanth Ramakrishna <[email protected]>

WafaaT commented May 5, 2023

@claynerobison is this PR still needed?
