From 93e33f8083b7511647ef8f39270a13a14e1c28a7 Mon Sep 17 00:00:00 2001
From: Tuomas Rossi
Date: Tue, 19 Nov 2024 14:28:16 +0200
Subject: [PATCH 01/28] Update installations

---
 .../Install_oneAPI_Mahti.md                   | 22 +++++++------------
 .../computing_platform/Usage_oneAPI_Mahti.md  |  6 ++---
 2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/exercises/computing_platform/Install_oneAPI_Mahti.md b/exercises/computing_platform/Install_oneAPI_Mahti.md
index 94431c1..6bdde11 100644
--- a/exercises/computing_platform/Install_oneAPI_Mahti.md
+++ b/exercises/computing_platform/Install_oneAPI_Mahti.md
@@ -1,29 +1,23 @@
 # Installing oneAPI on Mahti
 
-Load cuda:
+Download the oneAPI base toolkit:
 
-    module load cuda
-
-Download the oneAPI basekit:
-
-    wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/163da6e4-56eb-4948-aba3-debcec61c064/l_BaseKit_p_2024.0.1.46_offline.sh
-    chmod +x l_BaseKit_p_2024.0.1.46_offline.sh
+    wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/96aa5993-5b22-4a9b-91ab-da679f422594/intel-oneapi-base-toolkit-2025.0.0.885_offline.sh
 
 Install:
 
-    ./l_BaseKit_p_2024.0.1.46_offline.sh -a -s --eula accept --download-cache /scratch/project_2008874/$USER/oneapi_tmp/ --install-dir /scratch/project_2008874/$USER/intel/oneapi
+    sh ./intel-oneapi-base-toolkit-2025.0.0.885_offline.sh -a --silent --cli --eula accept --download-cache $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi
 
-Get the cuda plugin (the link below might be changed in the future):
+Get the CUDA plugin:
 
-    curl -LOJ "https://developer.codeplay.com/api/v1/products/download?product=oneapi&variant=nvidia&version=2024.0.1&filters[]=12.0&filters[]=linux"
+    curl -LOJ "https://developer.codeplay.com/api/v1/products/download?product=oneapi&variant=nvidia&version=2025.0.0&filters[]=12.0&filters[]=linux"
 
 Install:
 
-    ./oneapi-for-nvidia-gpus-2024.0.1-cuda-12.0-linux.sh -y --extract-folder /scratch/project_2008874/$USER/oneapi_tmp/ --install-dir /scratch/project_2008874/$USER/intel/oneapi
+    sh ./oneapi-for-nvidia-gpus-2025.0.0-cuda-12.0-linux.sh -y --extract-folder $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi
 
 ## References
 
-* [Intel oneAPI installation guide](https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2024-0/install-with-command-line.html)
-* [Intel oneAPI download page](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?operatingsystem=linux&distributions=offline)
-* [Codeplay oneAPI for NVIDIA GPUs](https://developer.codeplay.com/products/oneapi/nvidia/2024.0.1/guides/get-started-guide-nvidia)
+* [Intel oneAPI download page](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?packages=oneapi-toolkit&oneapi-toolkit-os=linux&oneapi-lin=offline)
+* [Codeplay oneAPI for NVIDIA GPUs](https://developer.codeplay.com/products/oneapi/nvidia/2025.0.0/guides/get-started-guide-nvidia#installation)

diff --git a/exercises/computing_platform/Usage_oneAPI_Mahti.md b/exercises/computing_platform/Usage_oneAPI_Mahti.md
index 18aad94..d888677 100644
--- a/exercises/computing_platform/Usage_oneAPI_Mahti.md
+++ b/exercises/computing_platform/Usage_oneAPI_Mahti.md
@@ -2,16 +2,16 @@
 
 Set the environment paths:
 
-    . /projappl/project_2012125/intel/oneapi/setvars.sh --include-intel-llvm
+    source /projappl/project_2012125/intel/oneapi/setvars.sh --include-intel-llvm
     ml cuda/11.5.0 openmpi/4.1.2-cuda
 
 Compile for NVIDIA and CPU targets:
 
-    clang++ -std=c++17 -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64_x86_64 -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 .cpp
+    icpx -fuse-ld=lld -std=c++17 -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64_x86_64 -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 sycl_code.cpp
 
 Run as a usual GPU program:
 
-    srun --partition=gputest --account=project_2008874 --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --gres=gpu:a100:1 --time=00:15:00 ./a.out
+    srun --partition=gputest --account=project_2012125 --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --gres=gpu:a100:1 --time=00:15:00 ./a.out
 
 ## The Intel® DPC++ Compatibility Tool

From af07ddc4d3e898772463d222e84b53bf13ea319e Mon Sep 17 00:00:00 2001
From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com>
Date: Wed, 20 Nov 2024 11:54:20 +0200
Subject: [PATCH 02/28] Update Exercises_Instructions.md

---
 Exercises_Instructions.md | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/Exercises_Instructions.md b/Exercises_Instructions.md
index 8200b9e..38bd10e 100644
--- a/Exercises_Instructions.md
+++ b/Exercises_Instructions.md
@@ -38,21 +38,21 @@ The Intel DevCloud can be accessed via the [web interface](https://console.cloud.in
 
 ### Disk area
 
+The (computing and storage) resources can be accessed on supercomputers via a project-based allocation system, where users are granted access based on the specific needs and goals of their projects. Running applications and the storage area are directly linked to these projects. For this event we have been granted access to the training `project_2012125` on Mahti and `project_462000456` on LUMI.
 
 All the exercises on the supercomputers have to be carried out in the **scratch** disk area. The name of the scratch directory can be queried with the commands `csc-workspaces` on Mahti and `lumi-workspaces` on LUMI. As the base directory is shared between members of the project, you should create your own directory:
 
 on Mahti
 ```
-cd /scratch/project_2008874
+cd /scratch/project_2012125
 mkdir -p $USER
 cd $USER
 ```
 
 on LUMI
 ```
-cd /scratch/project_462000456
+cd /scratch/project_462000752
 mkdir -p $USER
 cd $USER
 ```
@@ -122,14 +122,14 @@ In order to use the intel SYCL compiler one has to set the environment variables
 on Mahti:
 ```
-. /projappl/project_2008874/intel/oneapi/setvars.sh --include-intel-llvm
+. /projappl/project_2012125/intel/oneapi/setvars.sh --include-intel-llvm
 
 module load cuda # This is needed for compiling sycl code for nvidia gpus
 module load openmpi/4.1.2-cuda # This is needed for using CUDA-aware MPI
 ```
 on LUMI:
 ```
-. /projappl/project_462000456/intel/oneapi/setvars.sh --include-intel-llvm
+. /projappl/project_462000752/intel/oneapi/setvars.sh --include-intel-llvm
 
 module load LUMI/22.08
 module load partition/G
 module load rocm/5.3.3
 module load cce/16.0.1
 export MPICH_GPU_SUPPORT_ENABLED=1 # Needed for GPU aware MPI
 ```
 After this one can load other modules that might be needed for compiling the codes. With the environment set up we can compile and run the SYCL codes.
@@ -158,7 +158,7 @@ module load cuda # This is needed for compiling sycl code for nvidia gpus
 module load openmpi/4.1.2-cuda # This is needed for using CUDA-aware MPI
 ```
 ```
-/projappl/project_2008874/AdaptiveCpp/bin/acpp -fuse-ld=lld -O3 -L/appl/spack/v017/install-tree/gcc-8.5.0/gcc-11.2.0-zshp2k/lib64 .cpp
+/projappl/project_2012125/AdaptiveCpp/bin/acpp -fuse-ld=lld -O3 -L/appl/spack/v017/install-tree/gcc-8.5.0/gcc-11.2.0-zshp2k/lib64 .cpp
 ```
 on LUMI:
 ```
 module load LUMI/22.08
 module load partition/G
 module load rocm/5.3.3
 module load cce/16.0.1
 export MPICH_GPU_SUPPORT_ENABLED=1
 export LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-22.08/G/EB/Boost/1.79.0-cpeCray-22.08/lib:$LD_LIBRARY_PATH
 export LD_PRELOAD=/pfs/lustrep4/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/llvm/lib/libomp.so
 ```
@@ -172,7 +172,7 @@ export LD_PRELOAD=/pfs/lustrep4/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/llvm/lib
 ```
 
 ```
- /projappl/project_462000456/AdaptiveCpp/bin/acpp -O3 .cpp
+ /projappl/project_462000752/AdaptiveCpp/bin/acpp -O3 .cpp
 ```
 In general one can set specific targets via the `--acpp-targets` flag, but we set up AdaptiveCpp so that on Mahti the `acpp` compiler will automatically generate code for CPU and Nvidia GPUs, while on LUMI for CPU and AMD GPUs.
@@ -198,12 +198,12 @@ icpx -fuse-ld=lld -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backen
 ```
 or
 ```
-/projappl/project_2008874/AdaptiveCpp/bin/acpp -fuse-ld=lld -O3 -L/appl/spack/v017/install-tree/gcc-8.5.0/gcc-11.2.0-zshp2k/lib64 `mpicxx --showme:compile` `mpicxx --showme:link` .cpp
+/projappl/project_2012125/AdaptiveCpp/bin/acpp -fuse-ld=lld -O3 -L/appl/spack/v017/install-tree/gcc-8.5.0/gcc-11.2.0-zshp2k/lib64 `mpicxx --showme:compile` `mpicxx --showme:link` .cpp
 ```
 
 Similarly on LUMI. First we set up the environment and load the modules as indicated above
 ```
-. /projappl/project_462000456/intel/oneapi/setvars.sh --include-intel-llvm
+. /projappl/project_462000752/intel/oneapi/setvars.sh --include-intel-llvm
 
 module load LUMI/22.08
 module load partition/G
 module load rocm/5.3.3
 module load cce/16.0.1
 export MPICH_GPU_SUPPORT_ENABLED=1
 ```
 Now compile with intel compilers:
 ```
@@ -219,7 +219,7 @@ icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa,spir64_x86_64 -Xsycl-target-backend
 Or with AdaptiveCpp:
 ```
 export LD_PRELOAD=/pfs/lustrep4/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/llvm/lib/libomp.so
-/projappl/project_462000456/AdaptiveCpp/bin/acpp -O3 `CC --cray-print-opts=cflags` .cpp `CC --cray-print-opts=libs`
+/projappl/project_462000752/AdaptiveCpp/bin/acpp -O3 `CC --cray-print-opts=cflags` .cpp `CC --cray-print-opts=libs`
 ```
 
 ## Running applications in supercomputers
@@ -242,7 +242,7 @@ Use [`SYCL_PI_TRACE`](https://intel.github.io/llvm-docs/EnvironmentVariables.htm
 ```
 #!/bin/bash
 #SBATCH --job-name=example
-#SBATCH --account=project_2008874
+#SBATCH --account=project_2012125
 #SBATCH --partition=medium
 #SBATCH --reservation=hlgp-cpu-f2024
 #SBATCH --time=00:05:00
 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=1
 
 srun my_cpu_exe
 ```
 The output of the job will be in the file `slurm-xxxxx.out`. You can check the status of your jobs with `squeue -u $USER` and cancel them with
 `scancel JOBID`.
 
 The reservation `hlgp-cpu-f2024` for partition `medium` is available during the training days and it
-is accessible only if the users are part of `project_2008874`.
+is accessible only if the users are part of `project_2012125`.
 
 Some applications use MPI; in this case the number of nodes and the number of tasks per node will have to be adjusted accordingly.
@@ -270,7 +270,7 @@ single GPU with single MPI task and a single thread use:
 ```
 #!/bin/bash
 #SBATCH --job-name=example
-#SBATCH --account=project_2008874
+#SBATCH --account=project_2012125
 #SBATCH --partition=gpusmall
 #SBATCH --reservation=hlgp-gpu-f2024-thu
 #SBATCH --nodes=1
 #SBATCH --ntasks=1
 #SBATCH --time=00:05:00
 #SBATCH --gres=gpu:a100:1
 
 srun my_gpu_exe
 ```
 The reservation `hlgp-gpu-f2024-wed` is valid on Wednesday, 15:00 to 17:00.
 On Thursday we will use `hlgp-gpu-f2024-thu`, while on Friday `hlgp-gpu-f2024-fri`. Outside the course hours, you can use the `gputest` partition instead without the reservation argument, i.e.,
 ```
-srun --account=project_2008874 --nodes=1 --partition=gputest --gres=gpu:a100:1 --time=00:05:00 ./my_gpu_exe
+srun --account=project_2012125 --nodes=1 --partition=gputest --gres=gpu:a100:1 --time=00:05:00 ./my_gpu_exe
 ```
@@ -295,7 +295,7 @@ LUMI is similar to Mahti.
 ```
 #!/bin/bash
 #SBATCH --job-name=example
-#SBATCH --account=project_462000456
+#SBATCH --account=project_462000752
 #SBATCH --partition=standard
 ##SBATCH --reservation=hlgp-cpu-f2024 # The reservation does not work
 #SBATCH --time=00:05:00
 #SBATCH --nodes=1
 #SBATCH --ntasks-per-node=1
 
 srun my_cpu_exe
 ```
@@ -312,7 +312,7 @@
 ```
 #!/bin/bash
 #SBATCH --job-name=example
-#SBATCH --account=project_462000456
+#SBATCH --account=project_462000752
 #SBATCH --partition=standard-g
 #SBATCH --reservation=hlgp-gpu-f2024
 #SBATCH --time=00:05:00

From a5c4e433e50d8c4d09a1474715ddb7dde6a7466a Mon Sep 17 00:00:00 2001
From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com>
Date: Fri, 22 Nov 2024 15:03:14 +0200
Subject: [PATCH 03/28] Update Exercises_Instructions.md

---
 Exercises_Instructions.md | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/Exercises_Instructions.md b/Exercises_Instructions.md
index 38bd10e..e4a69b8 100644
--- a/Exercises_Instructions.md
+++ b/Exercises_Instructions.md
@@ -131,10 +131,9 @@ on LUMI:
 ```
 . /projappl/project_462000752/intel/oneapi/setvars.sh --include-intel-llvm
 
-module load LUMI/22.08
+module load LUMI
 module load partition/G
-module load rocm/5.3.3
-module load cce/16.0.1
+module load rocm/6.0.3
 export MPICH_GPU_SUPPORT_ENABLED=1 # Needed for GPU aware MPI
 ```
 After this one can load other modules that might be needed for compiling the codes. With the environment set up we can compile and run the SYCL codes.
@@ -162,13 +161,12 @@ module load cuda # This is needed for compiling sycl code for nvidia gpus
 module load openmpi/4.1.2-cuda # This is needed for using CUDA-aware MPI
 ```
 on LUMI:
 ```
-module load LUMI/22.08
+module load LUMI
 module load partition/G
-module load rocm/5.3.3
-module load cce/16.0.1
+module load rocm/6.0.3
 export MPICH_GPU_SUPPORT_ENABLED=1
-export LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-22.08/G/EB/Boost/1.79.0-cpeCray-22.08/lib:$LD_LIBRARY_PATH
-export LD_PRELOAD=/pfs/lustrep4/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/llvm/lib/libomp.so
+#export LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-22.08/G/EB/Boost/1.79.0-cpeCray-22.08/lib:$LD_LIBRARY_PATH ???
+#export LD_PRELOAD=/pfs/lustrep4/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/llvm/lib/libomp.so ??????
 ```
 
 ```
@@ -205,10 +203,9 @@ Similarly on LUMI. First we set up the environment and load the modules as indicated above
 ```
 . 
/projappl/project_462000752/intel/oneapi/setvars.sh --include-intel-llvm -module load LUMI/22.08 +module load LUMI module load partition/G -module load rocm/5.3.3 -module load cce/16.0.1 +module load rocm/6.0.3 export MPICH_GPU_SUPPORT_ENABLED=1 ``` Now compile with intel compilers: @@ -218,7 +215,7 @@ icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa,spir64_x86_64 -Xsycl-target-backend ``` Or with AdaptiveCpp: ``` -export LD_PRELOAD=/pfs/lustrep4/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/llvm/lib/libomp.so +#export LD_PRELOAD=/pfs/lustrep4/appl/lumi/SW/LUMI-22.08/G/EB/rocm/5.3.3/llvm/lib/libomp.so /projappl/project_462000752/AdaptiveCpp/bin/acpp -O3 `CC --cray-print-opts=cflags` .cpp `CC --cray-print-opts=libs` ``` @@ -233,7 +230,7 @@ The `job.sh` file contains all the necessary information (number of nodes, tasks Use [`SYCL_PI_TRACE`](https://intel.github.io/llvm-docs/EnvironmentVariables.html#sycl-pi-trace-options) to enable runtime tracing (e.g. device discovery): - export SYCL_PI_TRACE=1 + export SYCL_UR_TRACE=1 ### Running on Mahti @@ -272,7 +269,7 @@ single GPU with single MPI task and a single thread use: #SBATCH --job-name=example #SBATCH --account=project_2012125 #SBATCH --partition=gpusmall -#SBATCH --reservation=hlgp-gpu-f2024-thu +#SBATCH --reservation=hlgp-gpu-f2024-thu ????????? #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=00:05:00 @@ -297,7 +294,7 @@ LUMI is similar to Mahti. #SBATCH --job-name=example #SBATCH --account=project_462000752 #SBATCH --partition=standard -##SBATCH --reservation=hlgp-cpu-f2024 # The reservation does not work +##SBATCH --reservation=hlgp-cpu-f2024 ??????# The reservation does not work #SBATCH --time=00:05:00 #SBATCH --nodes=1 #SBATCH --ntasks-per-node=1 @@ -314,7 +311,7 @@ srun my_cpu_exe #SBATCH --job-name=example #SBATCH --account=project_462000752 #SBATCH --partition=standard-g -#SBATCH --reservation=hlgp-gpu-f2024 +#SBATCH --reservation=hlgp-gpu-f2024 ?????? 
#SBATCH --time=00:05:00 #SBATCH --nodes=1 #SBATCH --ntasks-per-node=1 From 6595b29e32e3fd4084126c51ccc00e213560a88d Mon Sep 17 00:00:00 2001 From: Tuomas Rossi Date: Fri, 22 Nov 2024 14:46:20 +0200 Subject: [PATCH 04/28] Add CI job for pages --- .github/workflows/pages-html.yml | 25 +++++++++++++ .github/workflows/pages-pdf.yml | 17 +++++++++ .github/workflows/pages.yml | 62 ++++++++++++++++++++++++++++++++ about.yml | 3 ++ 4 files changed, 107 insertions(+) create mode 100644 .github/workflows/pages-html.yml create mode 100644 .github/workflows/pages-pdf.yml create mode 100644 .github/workflows/pages.yml create mode 100644 about.yml diff --git a/.github/workflows/pages-html.yml b/.github/workflows/pages-html.yml new file mode 100644 index 0000000..6db3d0a --- /dev/null +++ b/.github/workflows/pages-html.yml @@ -0,0 +1,25 @@ +name: Deploy HTML slides to Pages + +on: + # Runs on pushes targeting the default branch + push: + branches: + - "main" + paths: + - "docs/**" + - ".github/workflows/pages.yml" + + # Allows you to run this workflow manually from the Actions tab + workflow_dispatch: + +# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages +permissions: + contents: read + pages: write + id-token: write + +jobs: + pages-html: + uses: ./.github/workflows/pages.yml + with: + include_pdf: false diff --git a/.github/workflows/pages-pdf.yml b/.github/workflows/pages-pdf.yml new file mode 100644 index 0000000..9244020 --- /dev/null +++ b/.github/workflows/pages-pdf.yml @@ -0,0 +1,17 @@ +name: Deploy HTML and PDF slides to Pages + +on: + # Allows you to run this workflow manually from the Actions tab + workflow_dispatch: + +# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages +permissions: + contents: read + pages: write + id-token: write + +jobs: + pages-pdf: + uses: ./.github/workflows/pages.yml + with: + include_pdf: true diff --git a/.github/workflows/pages.yml b/.github/workflows/pages.yml new file mode 100644 index 0000000..8230bae --- /dev/null +++ b/.github/workflows/pages.yml @@ -0,0 +1,62 @@ +# Script based on examples in https://github.com/actions/starter-workflows/tree/main/pages +name: Deploy slides to Pages + +on: + workflow_call: + inputs: + include_pdf: + required: true + type: boolean + +# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages +permissions: + contents: read + pages: write + id-token: write + +# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued. +# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete. 
+concurrency:
+  group: "pages"
+  cancel-in-progress: false
+
+jobs:
+  build:
+    timeout-minutes: 30
+    runs-on: ubuntu-latest
+    container:
+      image: ghcr.io/csc-training/slidefactory:3.1.0-beta.6
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Setup Pages
+        id: pages
+        uses: actions/configure-pages@v4
+      - name: Build slides
+        env:
+          INCLUDE_PDF: ${{ inputs.include_pdf }}
+        run: |
+          git config --global --add safe.directory $PWD
+          GIT_SHORT_SHA=$(git rev-parse --short $GITHUB_SHA)
+          GIT_DATE=$(git show -s --format=%ci $GITHUB_SHA)
+
+          ARGS=""
+          [[ "$INCLUDE_PDF" == "true" ]] && ARGS="--with-pdf"
+
+          slidefactory pages about.yml build --info_content "Updated for [$GIT_SHORT_SHA]($GITHUB_SERVER_URL/$GITHUB_REPOSITORY/commit/$GITHUB_SHA) ($GIT_DATE)" $ARGS
+
+      - name: Upload artifact
+        uses: actions/upload-pages-artifact@v3
+        with:
+          path: ./build
+
+  deploy:
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+    runs-on: ubuntu-latest
+    needs: build
+    steps:
+      - name: Deploy to GitHub Pages
+        id: deployment
+        uses: actions/deploy-pages@v4
diff --git a/about.yml b/about.yml
new file mode 100644
index 0000000..f0b9fb7
--- /dev/null
+++ b/about.yml
@@ -0,0 +1,3 @@
+# This file is used in the generation of the web page
+title: High-Level GPU Programming
+slidesdir: docs

From 0d5cecf618389a7cf7657e7e8376904c923a70bf Mon Sep 17 00:00:00 2001
From: Tuomas Rossi
Date: Fri, 22 Nov 2024 15:15:19 +0200
Subject: [PATCH 05/28] Move list of topics

---
 docs/List-of-topics.md => List-of-topics.md | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename docs/List-of-topics.md => List-of-topics.md (100%)

diff --git a/docs/List-of-topics.md b/List-of-topics.md
similarity index 100%
rename from docs/List-of-topics.md
rename to List-of-topics.md

From 114b9fdd72a680d7704b948a01898c8bc931ef93 Mon Sep 17 00:00:00 2001
From: Tuomas Rossi
Date: Fri, 22 Nov 2024 15:18:36 +0200
Subject: [PATCH 06/28] Ignore build directory

---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index e5b559e..b41c394 100644
--- a/.gitignore
+++ b/.gitignore
@@ -4,3 +4,4 @@ a.out
 *.qdrep
 *.sqlite
 heat
+build/

From 78280ac94722956e4090d49bb48ad84039577f4e Mon Sep 17 00:00:00 2001
From: Tuomas Rossi
Date: Fri, 22 Nov 2024 16:01:24 +0200
Subject: [PATCH 07/28] Update oneAPI instructions

---
 README_setup.md                               | 74 +++++++++++++++++++
 .../computing_platform/Install_oneAPI_LUMI.md | 34 ---------
 .../Install_oneAPI_Mahti.md                   | 23 ------
 .../computing_platform/Usage_oneAPI_LUMI.md   | 23 ------
 .../computing_platform/Usage_oneAPI_Mahti.md  | 23 ------
 5 files changed, 74 insertions(+), 103 deletions(-)
 create mode 100644 README_setup.md
 delete mode 100644 exercises/computing_platform/Install_oneAPI_LUMI.md
 delete mode 100644 exercises/computing_platform/Install_oneAPI_Mahti.md
 delete mode 100644 exercises/computing_platform/Usage_oneAPI_LUMI.md
 delete mode 100644 exercises/computing_platform/Usage_oneAPI_Mahti.md

diff --git a/README_setup.md b/README_setup.md
new file mode 100644
index 0000000..4c67d54
--- /dev/null
+++ b/README_setup.md
@@ -0,0 +1,74 @@
+# Usage
+
+## OneAPI on Mahti
+
+Set up the environment:
+
+    source /projappl/project_2012125/intel/oneapi/setvars.sh --include-intel-llvm
+    ml cuda/11.5.0 openmpi/4.1.2-cuda
+
+Compile for NVIDIA and CPU targets:
+
+    icpx -fuse-ld=lld -std=c++20 -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64_x86_64 -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 sycl_code.cpp
+
+Run as a usual GPU program:
+
+    srun -A project_2012125 -p gputest --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --gres=gpu:a100:1 --time=00:15:00 ./a.out
+
+## OneAPI on LUMI
+
+Set up the environment:
+
+    source /projappl/project_462000752/intel/oneapi/setvars.sh --include-intel-llvm
+    ml rocm/6.0.3
+    export MPICH_GPU_SUPPORT_ENABLED=1
+
+Compile for AMD and CPU targets:
+
+    icpx -fuse-ld=lld -std=c++20 -O3 -fsycl -fsycl-targets=amdgcn-amd-amdhsa,spir64_x86_64 -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a sycl_code.cpp
+
+Run as a usual GPU program:
+
+    srun -A project_462000752 -p dev-g --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --gpus-per-node=1 --time=00:15:00 ./a.out
+
+
+# Installations
+
+*Here are instructions on how the modules used above were installed.*
+
+## OneAPI on Mahti
+
+Download [Intel oneAPI base toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?packages=oneapi-toolkit&oneapi-toolkit-os=linux&oneapi-lin=offline):
+
+    wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/96aa5993-5b22-4a9b-91ab-da679f422594/intel-oneapi-base-toolkit-2025.0.0.885_offline.sh
+
+Install:
+
+    sh ./intel-oneapi-base-toolkit-2025.0.0.885_offline.sh -a --silent --cli --eula accept --download-cache $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi
+
+Get [Codeplay oneAPI for NVIDIA GPUs](https://developer.codeplay.com/products/oneapi/nvidia/2025.0.0/guides/get-started-guide-nvidia#installation):
+
+    curl -LOJ "https://developer.codeplay.com/api/v1/products/download?product=oneapi&variant=nvidia&version=2025.0.0&filters[]=12.0&filters[]=linux"
+
+Install:
+
+    sh ./oneapi-for-nvidia-gpus-2025.0.0-cuda-12.0-linux.sh -y --extract-folder $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi
+
+## OneAPI on LUMI
+
+Download [Intel oneAPI base toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?packages=oneapi-toolkit&oneapi-toolkit-os=linux&oneapi-lin=offline):
+
+    wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/96aa5993-5b22-4a9b-91ab-da679f422594/intel-oneapi-base-toolkit-2025.0.0.885_offline.sh
+
+Install:
+
+    sh ./intel-oneapi-base-toolkit-2025.0.0.885_offline.sh -a --silent --cli --eula accept --download-cache $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi
+
+Get [Codeplay oneAPI for AMD GPUs](https://developer.codeplay.com/products/oneapi/amd/2025.0.0/guides/get-started-guide-amd#installation):
+
+    curl -LOJ "https://developer.codeplay.com/api/v1/products/download?product=oneapi&variant=amd&version=2025.0.0&filters[]=6.0.2&filters[]=linux"
+
+Install:
+
+    sh ./oneapi-for-amd-gpus-2025.0.0-rocm-6.0.2-linux.sh -y --extract-folder $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi
+
diff --git a/exercises/computing_platform/Install_oneAPI_LUMI.md b/exercises/computing_platform/Install_oneAPI_LUMI.md
deleted file mode 100644
index a433406..0000000
--- a/exercises/computing_platform/Install_oneAPI_LUMI.md
+++ /dev/null
@@ -1,34 +0,0 @@
-
-
-
-Download and intall the oneapi basekit:
-
-```
-wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/163da6e4-56eb-4948-aba3-debcec61c064/l_BaseKit_p_2024.0.1.46_offline.sh
-
-module load LUMI/22.08
-module load partition/G
-module load rocm/5.3.3
-module load cce/16.0.1
-chmod +x l_BaseKit_p_2024.0.1.46_offline.sh
-./l_BaseKit_p_2024.0.1.46_offline.sh -a -s --eula accept --download-cache /scratch/project_462000456/cristian/tttt/ --install-dir /scratch/project_462000456/intel/oneapi
-```
-
-Now 
get the hip plugin (the link below might be changed in the future): - -Get and install the plug-in: -``` -curl -LOJ "https://developer.codeplay.com/api/v1/products/download?product=oneapi&variant=amd&version=2024.0.1&filters[]=5.4.3&filters[]=linux" -sh oneapi-for-amd-gpus-2024.0.1-rocm-5.4.3-linux.sh -y --extract-folder /scratch/project_462000456/tttt/ --install-dir /scratch/project_462000456/intel/oneapi -``` - -## Usage - -``` -. /scratch/project_462000456/intel/oneapi/setvars.sh --include-intel-llvm -module load LUMI/22.08 -module load partition/G -module load rocm/5.3.3 -module load cce/16.0.1 -icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa,spir64_x86_64 -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a .cpp -``` diff --git a/exercises/computing_platform/Install_oneAPI_Mahti.md b/exercises/computing_platform/Install_oneAPI_Mahti.md deleted file mode 100644 index 6bdde11..0000000 --- a/exercises/computing_platform/Install_oneAPI_Mahti.md +++ /dev/null @@ -1,23 +0,0 @@ -# Installing oneAPI on Mahti - -Download the oneAPI base toolkit: - - wget https://registrationcenter-download.intel.com/akdlm/IRC_NAS/96aa5993-5b22-4a9b-91ab-da679f422594/intel-oneapi-base-toolkit-2025.0.0.885_offline.sh - -Install: - - sh ./intel-oneapi-base-toolkit-2025.0.0.885_offline.sh -a --silent --cli --eula accept --download-cache $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi - -Get the cuda plugin: - - curl -LOJ "https://developer.codeplay.com/api/v1/products/download?product=oneapi&variant=nvidia&version=2025.0.0&filters[]=12.0&filters[]=linux" - -Install: - - sh ./oneapi-for-nvidia-gpus-2025.0.0-cuda-12.0-linux.sh -y --extract-folder $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi - -## References - -* [Intel oneAPI download page](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?packages=oneapi-toolkit&oneapi-toolkit-os=linux&oneapi-lin=offline) -* [Codeplay oneAPI for NVIDIA GPUs](https://developer.codeplay.com/products/oneapi/nvidia/2025.0.0/guides/get-started-guide-nvidia#installation) - diff --git a/exercises/computing_platform/Usage_oneAPI_LUMI.md b/exercises/computing_platform/Usage_oneAPI_LUMI.md deleted file mode 100644 index 5039681..0000000 --- a/exercises/computing_platform/Usage_oneAPI_LUMI.md +++ /dev/null @@ -1,23 +0,0 @@ -## Usage - -``` -module load LUMI/22.08 -module load partition/G -module load rocm/5.3.3 -module load cce/16.0.1 -. 
/scratch/project_462000456//intel/oneapi/setvars.sh --include-intel-llvm -export MPICH_GPU_SUPPORT_ENABLED=1 -``` - -``` -icpx -fsycl -fsycl-targets=amdgcn-amd-amdhsa,spir64_x86_64 -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx90a .cpp -``` - -## Running -CPU -``` -srun -p debug --exclusive -n 1 --cpus-per-task=128 --time=00:05:00 --account=project_462000456 ./a.out -``` -``` -#SBATCH -``` diff --git a/exercises/computing_platform/Usage_oneAPI_Mahti.md b/exercises/computing_platform/Usage_oneAPI_Mahti.md deleted file mode 100644 index d888677..0000000 --- a/exercises/computing_platform/Usage_oneAPI_Mahti.md +++ /dev/null @@ -1,23 +0,0 @@ -# Using oneAPI on Mahti - -Set the environments paths: - - source /projappl/project_2012125/intel/oneapi/setvars.sh --include-intel-llvm - ml cuda/11.5.0 openmpi/4.1.2-cuda - -Compile for nvidia and cpu targets: - - icpx -fuse-ld=lld -std=c++17 -O3 -fsycl -fsycl-targets=nvptx64-nvidia-cuda,spir64_x86_64 -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 sycl_code.cpp - -Run as an usual gpu program: - - srun --partition=gputest --account=project_2012125 --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --gres=gpu:a100:1 --time=00:15:00 ./a.out - - -## The Intel® DPC++ Compatibility Tool - -The Intel® DPC++ Compatibility Tool (syclomatic) is included in the oneAPI basekit. For migrating cuda to sycl use (for example): - - dpct --in-root=./ src/vector_add.cu - -See [the heat equation exercise](sycl/10-heat-equation-from-cuda/) for a complete example. From 4e65be0994345dac9d0602d23488b02ede73de41 Mon Sep 17 00:00:00 2001 From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com> Date: Sat, 23 Nov 2024 13:04:32 +0200 Subject: [PATCH 08/28] Update Install_AdaptiveCpp_LUMI.md --- .../computing_platform/Install_AdaptiveCpp_LUMI.md | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md index 79dd941..622c01c 100644 --- a/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md +++ b/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md @@ -7,18 +7,20 @@ git clone https://github.com/AdaptiveCpp/AdaptiveCpp.git ``` Load the modules needed: ``` -module load LUMI/22.08 +module load LUMI/24.03 module load partition/G -module load Boost/1.79.0-cpeCray-22.08 -module load rocm/5.3.3 -module load cce/16.0.1 +module load Boost +module load rocm/6.0.3 +module load cce/17.0.1 ``` Compile with both cpu and gpu (mi250x) acceleration: ``` -cd /scratch/project_462000456/AdaptiveCpp +git clone https://github.com/AdaptiveCpp/AdaptiveCpp.git +cd AdaptiveCpp mkdir build cd build -cmake -DCMAKE_INSTALL_PREFIX=//AdaptiveCpp/ -DROCM_PATH=$ROCM_PATH -DWITH_CPU_BACKEND=ON -DWITH_CUDA_BACKEND=OFF -DWITH_ROCM_BACKEND=ON -DDEFAULT_GPU_ARCH=gfx90a -DWITH_ACCELERATED_CPU=ON -DWITH_SSCP_COMPILER=OFF -DWITH_OPENCL_BACKEND=OFF -DWITH_LEVEL_ZERO_BACKEND=OFF .. +cmake -DCMAKE_INSTALL_PREFIX=/projappl/project_462000752/ACPP/ -DROCM_PATH=$ROCM_PATH -DWITH_CPU_BACKEND=ON -DWITH_CUDA_BACKEND=OFF -DWITH_ROCM_BACKEND=ON -DACPP_TARGETS="gfx90a" -DWITH_ACCELERATED_CPU=ON -DWITH_SSCP_COMPILER=OFF -DWITH_OPENCL_BACKEND=OFF -DWITH_LEVEL_ZERO_BACKEND=OFF -DBOOST_ROOT=/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/ .. 
+make -j 64 make install ``` From dbb825420b9dc1611862f05840c579304850bcd9 Mon Sep 17 00:00:00 2001 From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com> Date: Sat, 23 Nov 2024 13:04:47 +0200 Subject: [PATCH 09/28] Update Install_AdaptiveCpp_LUMI.md --- exercises/computing_platform/Install_AdaptiveCpp_LUMI.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md index 622c01c..296644f 100644 --- a/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md +++ b/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md @@ -9,9 +9,7 @@ Load the modules needed: ``` module load LUMI/24.03 module load partition/G -module load Boost module load rocm/6.0.3 -module load cce/17.0.1 ``` Compile with both cpu and gpu (mi250x) acceleration: ``` From c35aed86a983ade1481153415a2983d370600cbe Mon Sep 17 00:00:00 2001 From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com> Date: Sat, 23 Nov 2024 13:09:41 +0200 Subject: [PATCH 10/28] Update Install_AdaptiveCpp_LUMI.md --- .../computing_platform/Install_AdaptiveCpp_LUMI.md | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md index 296644f..2a14e13 100644 --- a/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md +++ b/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md @@ -1,10 +1,5 @@ ## Install from source: -Clone and switch to the appropriate version: -``` -git clone https://github.com/AdaptiveCpp/AdaptiveCpp.git -# git switch --detach v0.9.4 # use this only if there are problems with the latest version -``` Load the modules needed: ``` module load LUMI/24.03 @@ -26,11 +21,10 @@ make install Load the modules needed: ``` -module load LUMI/22.08 +module load LUMI/24.03 module load partition/G -module load Boost/1.79.0-cpeCray-22.08 -module load rocm/5.3.3 -module load cce/16.0.1 +module load rocm/6.0.3 +/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/ ``` Compile `sycl` codes: ``` From a06c6858dfb9b1a6830a53825d9781411395c7d4 Mon Sep 17 00:00:00 2001 From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com> Date: Sat, 23 Nov 2024 13:10:44 +0200 Subject: [PATCH 11/28] Update Usage_AdaptiveCpp_LUMI.md --- exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md index 5f7e260..fe83492 100644 --- a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md +++ b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md @@ -2,13 +2,12 @@ Load the modules needed: ``` -module load LUMI/22.08 +module load LUMI/24.03 module load partition/G -module load Boost/1.79.0-cpeCray-22.08 -module load rocm/5.3.3 -module load cce/16.0.1 +module load rocm/6.0.3 +/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/ ``` Compile `sycl` codes: ``` -/scratch/project_462000456/AdaptiveCpp/bin/syclcc -O2 --hipsycl-targets="omp.accelerated;hip:gfx90a" .cpp +/projappl/project_462000752/ACPP/bin/acpp -O2 --hipsycl-targets="omp.accelerated;hip:gfx90a" .cpp ``` From d17881b41ec37d9647acbc5a4544aaa864e34b75 Mon Sep 17 00:00:00 2001 From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com> Date: Sat, 23 Nov 2024 13:18:52 +0200 Subject: [PATCH 12/28] Update 
Usage_AdaptiveCpp_LUMI.md

---
 exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md
index fe83492..a9d7bf6 100644
--- a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md
+++ b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md
@@ -6,6 +6,9 @@ module load LUMI/24.03
 module load partition/G
 module load rocm/6.0.3
 /appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/
+export LD_PRELOAD=/projappl/project_462000752/ACPP/lib/hipSYCL/librt-backend-omp.so
+PATH=/projappl/project_462000752/ACPP/bin/:$PATH
+LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/:$LD_LIBRARY_PATH
 ```
 Compile `sycl` codes:
 ```

From d4dc392c39f24c3d75fb5812565c0d7e29507e2a Mon Sep 17 00:00:00 2001
From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com>
Date: Sat, 23 Nov 2024 13:20:28 +0200
Subject: [PATCH 13/28] Update Usage_AdaptiveCpp_LUMI.md

---
 exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md
index a9d7bf6..0f2bab5 100644
--- a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md
+++ b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md
@@ -5,10 +5,9 @@ Load the modules needed:
 module load LUMI/24.03
 module load partition/G
 module load rocm/6.0.3
-/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/
 export LD_PRELOAD=/projappl/project_462000752/ACPP/lib/hipSYCL/librt-backend-omp.so
-PATH=/projappl/project_462000752/ACPP/bin/:$PATH
-LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/:$LD_LIBRARY_PATH
+export PATH=/projappl/project_462000752/ACPP/bin/:$PATH
+export LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/:$LD_LIBRARY_PATH
 ```
 Compile `sycl` codes:
 ```

From 0d468fcd3e1183b78b5d0acb4cadba73ad6f64fd Mon Sep 17 00:00:00 2001
From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com>
Date: Sat, 23 Nov 2024 14:22:41 +0200
Subject: [PATCH 14/28] Update README_setup.md

---
 README_setup.md | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/README_setup.md b/README_setup.md
index 4c67d54..a064fde 100644
--- a/README_setup.md
+++ b/README_setup.md
@@ -31,7 +31,25 @@ Run as a usual GPU program:
 
     srun -A project_462000752 -p dev-g --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --gpus-per-node=1 --time=00:15:00 ./a.out
 
+## AdaptiveCpp on LUMI
+
+Set up the environment:
+
+    module load LUMI/24.03
+    module load partition/G
+    module load rocm/6.0.3
+    export LD_PRELOAD=/projappl/project_462000752/ACPP/lib/hipSYCL/librt-backend-omp.so
+    export PATH=/projappl/project_462000752/ACPP/bin/:$PATH
+    export LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/:$LD_LIBRARY_PATH
+
+Compile for AMD and CPU targets:
+
+    acpp -O2 --hipsycl-targets="omp.accelerated;hip:gfx90a" .cpp
+
+Run as a usual GPU program:
+
+    srun -A project_462000752 -p dev-g --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --gpus-per-node=1 --time=00:15:00 ./a.out
+
 # Installations

From 14838ce6f5cd151fb64ca74788bb6d98805acb4d Mon Sep 17 00:00:00 2001
From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com>
Date: Sat, 23 Nov 2024 14:27:30 +0200
Subject: 
[PATCH 15/28] Update Usage_AdaptiveCpp_LUMI.md --- exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md index 0f2bab5..d6cec06 100644 --- a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md +++ b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md @@ -5,9 +5,9 @@ Load the modules needed: module load LUMI/24.03 module load partition/G module load rocm/6.0.3 -export LD_PRELOAD=/projappl/project_462000752/ACPP/lib/hipSYCL/librt-backend-omp.so export PATH=/projappl/project_462000752/ACPP/bin/:$PATH export LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/:$LD_LIBRARY_PATH +export LD_PRELOAD=/opt/rocm-6.0.3/llvm/lib/libomp.so ``` Compile `sycl` codes: ``` From a8cfb4e42298d5cdcf7b5135c66413173a8031d0 Mon Sep 17 00:00:00 2001 From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com> Date: Sat, 23 Nov 2024 14:28:01 +0200 Subject: [PATCH 16/28] Update Install_AdaptiveCpp_LUMI.md --- exercises/computing_platform/Install_AdaptiveCpp_LUMI.md | 1 + 1 file changed, 1 insertion(+) diff --git a/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md index 2a14e13..6d2c9a4 100644 --- a/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md +++ b/exercises/computing_platform/Install_AdaptiveCpp_LUMI.md @@ -25,6 +25,7 @@ module load LUMI/24.03 module load partition/G module load rocm/6.0.3 /appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/ +export LD_PRELOAD=/opt/rocm-6.0.3/llvm/lib/libomp.so ``` Compile `sycl` codes: ``` From 2ce00cf6a8e9e033306ce2cc36e0d274048c28e3 Mon Sep 17 00:00:00 2001 From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com> Date: Sat, 23 Nov 2024 14:28:26 +0200 Subject: [PATCH 17/28] Update README_setup.md --- README_setup.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README_setup.md b/README_setup.md index a064fde..4a692d7 100644 --- a/README_setup.md +++ b/README_setup.md @@ -38,9 +38,9 @@ Set up the environment: module load LUMI/24.03 module load partition/G module load rocm/6.0.3 - export LD_PRELOAD=/projappl/project_462000752/ACPP/lib/hipSYCL/librt-backend-omp.so export PATH=/projappl/project_462000752/ACPP/bin/:$PATH export LD_LIBRARY_PATH=/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/lib64/:$LD_LIBRARY_PATH + export LD_PRELOAD=/opt/rocm-6.0.3/llvm/lib/libomp.so Compile for amd and cpu targets: From 955e2add3685e9784249783062346865dcff7dac Mon Sep 17 00:00:00 2001 From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com> Date: Sat, 23 Nov 2024 14:35:09 +0200 Subject: [PATCH 18/28] Update Usage_AdaptiveCpp_LUMI.md --- exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md index d6cec06..9aa9eec 100644 --- a/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md +++ b/exercises/computing_platform/Usage_AdaptiveCpp_LUMI.md @@ -11,5 +11,5 @@ export LD_PRELOAD=/opt/rocm-6.0.3/llvm/lib/libomp.so ``` Compile `sycl` codes: ``` -/projappl/project_462000752/ACPP/bin/acpp -O2 --hipsycl-targets="omp.accelerated;hip:gfx90a" .cpp +/projappl/project_462000752/ACPP/bin/acpp -O2 --acpp-targets="omp.accelerated;hip:gfx90a" .cpp ``` From 
0a9462e6ed5d8990a9393dad10e86f0b78d9ca80 Mon Sep 17 00:00:00 2001
From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com>
Date: Sat, 23 Nov 2024 14:35:46 +0200
Subject: [PATCH 19/28] Update README_setup.md

---
 README_setup.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README_setup.md b/README_setup.md
index 4a692d7..e765da1 100644
--- a/README_setup.md
+++ b/README_setup.md
@@ -44,7 +44,7 @@ Set up the environment:
 
 Compile for AMD and CPU targets:
 
-    acpp -O2 --hipsycl-targets="omp.accelerated;hip:gfx90a" .cpp
+    acpp -O2 --acpp-targets="omp.accelerated;hip:gfx90a" .cpp
 
 Run as a usual GPU program:

From 21c9259a3127acf4103aeb879c722994d40f29f6 Mon Sep 17 00:00:00 2001
From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com>
Date: Sat, 23 Nov 2024 14:42:43 +0200
Subject: [PATCH 20/28] Update README_setup.md

---
 README_setup.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/README_setup.md b/README_setup.md
index e765da1..8318203 100644
--- a/README_setup.md
+++ b/README_setup.md
@@ -90,3 +90,22 @@ Install:
 
     sh ./oneapi-for-amd-gpus-2025.0.0-rocm-6.0.2-linux.sh -y --extract-folder $SCRATCH/$USER/oneapi_tmp --install-dir $PROJAPPL/intel/oneapi
 
+## AdaptiveCpp on LUMI
+
+Load the modules needed:
+```
+module load LUMI/24.03
+module load partition/G
+module load rocm/6.0.3
+```
+Clone the repository and compile with both CPU and ROCm support:
+```
+git clone https://github.com/AdaptiveCpp/AdaptiveCpp.git
+cd AdaptiveCpp
+mkdir build
+cd build
+cmake -DCMAKE_INSTALL_PREFIX=/projappl/project_462000752/ACPP/ -DROCM_PATH=$ROCM_PATH -DWITH_CPU_BACKEND=ON -DWITH_CUDA_BACKEND=OFF -DWITH_ROCM_BACKEND=ON -DACPP_TARGETS="gfx90a" -DWITH_ACCELERATED_CPU=ON -DWITH_SSCP_COMPILER=OFF -DWITH_OPENCL_BACKEND=OFF -DWITH_LEVEL_ZERO_BACKEND=OFF -DBOOST_ROOT=/appl/lumi/SW/LUMI-24.03/G/EB/Boost/1.83.0-cpeGNU-24.03/ ..
+make -j 64
+make install
+```

From 50bf485b06fea0dfc2937bf7524fd472fa5fcf0c Mon Sep 17 00:00:00 2001
From: Cristian-Vasile Achim <66278390+csccva@users.noreply.github.com>
Date: Sun, 24 Nov 2024 14:54:49 +0200
Subject: [PATCH 21/28] Update Exercises_Instructions.md

---
 Exercises_Instructions.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Exercises_Instructions.md b/Exercises_Instructions.md
index e4a69b8..71a70f1 100644
--- a/Exercises_Instructions.md
+++ b/Exercises_Instructions.md
@@ -228,7 +228,7 @@ The `job.sh` file contains all the necessary information (number of nodes, tasks
 
 ### Useful environment variables
 
-Use [`SYCL_PI_TRACE`](https://intel.github.io/llvm-docs/EnvironmentVariables.html#sycl-pi-trace-options) to enable runtime tracing (e.g. device discovery):
+Use [`SYCL_UR_TRACE`](https://intel.github.io/llvm-docs/EnvironmentVariables.html#sycl-pi-trace-options) to enable runtime tracing (e.g. device discovery):
 
     export SYCL_UR_TRACE=1

From 390058daa6b50d20fe219bc9a5868402192d743e Mon Sep 17 00:00:00 2001
From: csccva
Date: Mon, 25 Nov 2024 15:13:15 +0200
Subject: [PATCH 22/28] Sycl Exercises!
--- exercises/sycl/02-vector_add/Readme.md | 84 +++++++++++++++++++ .../solution/vector_add_buffer.cpp | 43 ++++++++++ .../solution/vector_add_usm_device.cpp | 53 ++++++++++++ .../solution/vector_add_usm_managed.cpp | 47 +++++++++++ .../sycl/02-vector_add/vector_add_buffer.cpp | 12 ++- .../sycl/02-vector_add/vector_add_usm.cpp | 55 ------------ .../02-vector_add/vector_add_usm_device.cpp | 53 ++++++++++++ .../02-vector_add/vector_add_usm_managed.cpp | 47 +++++++++++ 8 files changed, 332 insertions(+), 62 deletions(-) create mode 100644 exercises/sycl/02-vector_add/Readme.md create mode 100644 exercises/sycl/02-vector_add/solution/vector_add_buffer.cpp create mode 100644 exercises/sycl/02-vector_add/solution/vector_add_usm_device.cpp create mode 100644 exercises/sycl/02-vector_add/solution/vector_add_usm_managed.cpp delete mode 100644 exercises/sycl/02-vector_add/vector_add_usm.cpp create mode 100644 exercises/sycl/02-vector_add/vector_add_usm_device.cpp create mode 100644 exercises/sycl/02-vector_add/vector_add_usm_managed.cpp diff --git a/exercises/sycl/02-vector_add/Readme.md b/exercises/sycl/02-vector_add/Readme.md new file mode 100644 index 0000000..32776d8 --- /dev/null +++ b/exercises/sycl/02-vector_add/Readme.md @@ -0,0 +1,84 @@ +# Vector Addition +The task is to compute the element-wise addition of two vectors (C = A + B) in parallel. + +A skeleton code is provided in vector_add_<..>.cpp. You need to complete the missing parts to calculate the result in parallel. Try running the program on both CPU and GPU devices. + +A typical application running on an accelerator follows these steps: + + 1. Initialize data on the host. + 1. Create a queue and associate it with the desired device. + 1. Manage memory on the device by creating necessary constructs. + 1. Launch the kernel. + 1. Retrieve and verify the results on the host. + +In this exercise, we will explore various memory models. + +## Memory management using Buffers and Accessors + +Use the skeleton provided in `vector_add_buffer.cpp`. Look for the **//TODO** lines. + +### Step 1: Define a Queue +Start by defining a **queue** and selecting the appropriate device selector. SYCL provides predefined selectors, such as: default, gpu, cpu, accelerator or you can use the procedure from the [previous exercise](../01-info/enumerate_device.cpp). + +### Step 2: Create Buffers +Next, create buffers to encapsulate the data. For a one-dimensional array of length `N`, with pointer `P`, a buffer can be constructed as follows: + +``` +sycl::buffer a_buf(P, sycl::range<1>(N)); +``` +### Step 3: Create Accessors +Accessors provide a mechanism to access data inside the buffers. Accessors on the device must be created within command groups. There are two ways to create accessors. Using the `sycl::accessor` class constructor + +``` + sycl::accessor a{a_buf, h, sycl::read_write}; +``` +or using the buffer `.getaccess<...>(h)` member function: +``` +a = a_buf.get_access(h); +``` +**Important** Use appropriate access modes for your data: + - **Input Buffers:** Use `sycl::access::mode::read` to avoid unnecessary device-to-host data transfers. + - **Output Buffers:** Use `sycl::access::mode::write` to avoid unnecessary host-to-device data transfers. + - **Input/Ouput Buffers:** Use `sycl::access::mode::read_write` for the variables that are input, but they also get modified during the compuations. + +### Step 4: Submit the Task +Once accessors are ready, submit the task to the device using the `.parallel_for()` member function. 
The basic submission: + +``` + h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) { + c[idx] = a[idx] + b[idx]; + }); +``` +Here: + - `sycl::range{N}` or `sycl::range(N)` specify number of work-items be launched +- `sycl::id<1>` represents the index used within the kernel. + +#### Using **item** class instead of **id** +Modify the lambda function to use the **sycl::item** class instead of the **id** class. In this case the index `idx` is obtained from the `.get_id()` member. + +#### Using ND-Range +This basic launching serves our purpose for this simpler example, however it is useful to test also the **ND-RANGE**. In case we specify to the runtime the total size of the grid of work-items and size of a work-group as well: + +``` + h.parallel_for(sycl::nd_range<1>(sycl::range<1>(((N+local_size-1)/local_size)*local_size), sycl::range<1>(local_size)), [=](sycl::nd_itemi<1> item) { + auto idx=item.get_global_id(0); + c[idx] = a[idx] + b[idx]; + }); +``` +**Note** that **ND-RANGE** requires that the total number of work-items to be divisible by the size of the work-group. + +### Step 5: Retrieve Data +The final task in this exercise is to move the checking of the results within the scope of the buffers (before the ending curly bracket) and add the appropriate method to access this data. + +By default, buffers are automatically synchronized with the host when they go out of scope. However, if you need to access data within the buffer’s scope, use **host accessors**. Host accessors can also be created in two ways: +Similar to the device, it is possible to define host accessors in two ways. By using the accessor class constructor +``` +host_accessor c{c_buf, sycl::access::mode::read}; +``` +or by using the `.get_access` memebr function of the buffer +``` +auto = c_buf.get_access(); +``` + +## Memory management with Unified Shared Memory + diff --git a/exercises/sycl/02-vector_add/solution/vector_add_buffer.cpp b/exercises/sycl/02-vector_add/solution/vector_add_buffer.cpp new file mode 100644 index 0000000..fcd26c8 --- /dev/null +++ b/exercises/sycl/02-vector_add/solution/vector_add_buffer.cpp @@ -0,0 +1,43 @@ +// Copyright (C) 2023 Intel Corporation + +// SPDX-License-Identifier: MIT +#include +#include +using namespace sycl; + +int main() { + // Set up queue on any available device + queue q; + + // Initialize input and output memory on the host + constexpr size_t N = 256; + std::vector a(N), b(N), c(N); + std::fill(a.begin(), a.end(), 1); + std::fill(b.begin(), b.end(), 2); + std::fill(c.begin(), c.end(), 0); + + { + // Create buffers for data and sum + buffer a_buf(a.data(), range<1>(N)); + buffer b_buf(b.data(), range<1>(N)); + buffer c_buf(c.data(), range<1>(N)); + + // Submit the kernel to the queue + q.submit([&](handler& h) { + accessor a{a_buf, h}; + accessor b{b_buf, h}; + accessor c{c_buf, h}; + + h.parallel_for(range{N}, [=](id<1> idx) { + c[idx] = a[idx] + b[idx]; + }); + }); + } + + // Check that all outputs match expected value + bool passed = std::all_of(c.begin(), c.end(), + [](int i) { return (i == 3); }); + std::cout << ((passed) ? "SUCCESS" : "FAILURE") + << std::endl; + return (passed) ? 
0 : 1; +} diff --git a/exercises/sycl/02-vector_add/solution/vector_add_usm_device.cpp b/exercises/sycl/02-vector_add/solution/vector_add_usm_device.cpp new file mode 100644 index 0000000..c0505a2 --- /dev/null +++ b/exercises/sycl/02-vector_add/solution/vector_add_usm_device.cpp @@ -0,0 +1,53 @@ +// Copyright (C) 2023 Intel Corporation +// SPDX-License-Identifier: MIT +#include +#include +using namespace sycl; + +int main() { + // Set up queue on any available device + queue q; + + // Initialize input and output memory on the host + constexpr size_t N = 256; + std::vector a(N), b(N), c(N); + std::fill(a.begin(), a.end(), 1); + std::fill(b.begin(), b.end(), 2); + std::fill(c.begin(), c.end(), 0); + + // Allocate the memory using universal share memory + int* a_usm = malloc_device(N, q); + int* b_usm = malloc_device(N, q); + int* c_usm = malloc_device(N, q); + + // Copy data from host to USM + q.memcpy(a_usm, a.data(), N * sizeof(int)).wait(); + q.memcpy(b_usm, b.data(), N * sizeof(int)).wait(); + q.memcpy(c_usm, c.data(), N * sizeof(int)).wait(); + + // Submit the kernel to the queue + q.submit([&](handler& h) { + + h.parallel_for(range{N}, [=](id<1> idx) { + c_usm[idx] = a_usm[idx] + b_usm[idx]; + }); + + }); + + // Wait for the kernel to finish + q.wait(); + + // Copy data from USM to host + q.memcpy(c.data(), c_usm, N * sizeof(int)).wait(); + + // Free USM allocations + free(a_usm, q); + free(b_usm, q); + free(c_usm, q); + + // Check that all outputs match the expected value + bool passed = std::all_of(c.begin(), c.end(), + [](int i) { return (i == 3); }); + std::cout << ((passed) ? "SUCCESS" : "FAILURE") << std::endl; + return (passed) ? 0 : 1; +} diff --git a/exercises/sycl/02-vector_add/solution/vector_add_usm_managed.cpp b/exercises/sycl/02-vector_add/solution/vector_add_usm_managed.cpp new file mode 100644 index 0000000..32fb3b7 --- /dev/null +++ b/exercises/sycl/02-vector_add/solution/vector_add_usm_managed.cpp @@ -0,0 +1,47 @@ +// Copyright (C) 2023 Intel Corporation +// SPDX-License-Identifier: MIT +#include +#include +using namespace sycl; + +int main() { + // Set up queue on any available device + queue q; + + // Initialize input and output memory on the host + constexpr size_t N = 256; + + // Allocate memory using malloc_managed + int* a = malloc_managed(N, q); + int* b = malloc_managed(N, q); + int* c = malloc_managed(N, q); + + // Initialize input memory on the host + std::fill(a, a + N, 1); + std::fill(b, b + N, 2); + std::fill(c, c + N, 0); + + // Submit the kernel to the queue + q.submit([&](handler& h) { + + h.parallel_for(range{N}, [=](id<1> idx) { + c[idx] = a[idx] + b[idx]; + }); + + }); + + // Wait for the kernel to finish + q.wait(); + + // Check that all outputs match the expected value + bool passed = std::all_of(c.begin(), c.end(), + [](int i) { return (i == 3); }); + std::cout << ((passed) ? "SUCCESS" : "FAILURE") << std::endl; + + // Free managed allocations + free(a, q); + free(b, q); + free(c, q); + + return (passed) ? 
0 : 1; +} diff --git a/exercises/sycl/02-vector_add/vector_add_buffer.cpp b/exercises/sycl/02-vector_add/vector_add_buffer.cpp index 94e061a..3ebdfaa 100644 --- a/exercises/sycl/02-vector_add/vector_add_buffer.cpp +++ b/exercises/sycl/02-vector_add/vector_add_buffer.cpp @@ -7,32 +7,30 @@ using namespace sycl; int main() { // Set up queue on any available device + //TODO queue q; // Initialize input and output memory on the host constexpr size_t N = 256; - std::vector a(N), b(N), c(N); + std::vector a_host(N), b_host(N), c_host(N); std::fill(a.begin(), a.end(), 1); std::fill(b.begin(), b.end(), 2); std::fill(c.begin(), c.end(), 0); { - // Create buffers for data and sum - buffer a_buf(a.data(), range<1>(N)); - buffer b_buf(b.data(), range<1>(N)); - buffer c_buf(c.data(), range<1>(N)); + // Create buffers for the + // TODO // Submit the kernel to the queue q.submit([&](handler& h) { + // Create an accessor accessor a{a_buf, h}; accessor b{b_buf, h}; accessor c{c_buf, h}; - // BEGIN CODE SNIP h.parallel_for(range{N}, [=](id<1> idx) { c[idx] = a[idx] + b[idx]; }); - // END CODE SNIP }); } diff --git a/exercises/sycl/02-vector_add/vector_add_usm.cpp b/exercises/sycl/02-vector_add/vector_add_usm.cpp deleted file mode 100644 index 4ba2ca5..0000000 --- a/exercises/sycl/02-vector_add/vector_add_usm.cpp +++ /dev/null @@ -1,55 +0,0 @@ -// Copyright (C) 2023 Intel Corporation -// SPDX-License-Identifier: MIT -#include -#include -using namespace sycl; - -int main() { - // Set up queue on any available device - queue q; - - // Initialize input and output memory on the host - constexpr size_t N = 256; - std::vector a(N), b(N), c(N); - std::fill(a.begin(), a.end(), 1); - std::fill(b.begin(), b.end(), 2); - std::fill(c.begin(), c.end(), 0); - - { - // Allocate the memory using universal share memory - int* a_usm = malloc_device(N, q); - int* b_usm = malloc_device(N, q); - int* c_usm = malloc_device(N, q); - - // Copy data from host to USM - q.memcpy(a_usm, a.data(), N * sizeof(int)).wait(); - q.memcpy(b_usm, b.data(), N * sizeof(int)).wait(); - q.memcpy(c_usm, c.data(), N * sizeof(int)).wait(); - - // Submit the kernel to the queue - q.submit([&](handler& h) { - // BEGIN CODE SNIP - h.parallel_for(range{N}, [=](id<1> idx) { - c_usm[idx] = a_usm[idx] + b_usm[idx]; - }); - // END CODE SNIP - }); - - // Wait for the kernel to finish - q.wait(); - - // Copy data from USM to host - q.memcpy(c.data(), c_usm, N * sizeof(int)).wait(); - - // Free USM allocations - free(a_usm, q); - free(b_usm, q); - free(c_usm, q); - } - - // Check that all outputs match the expected value - bool passed = std::all_of(c.begin(), c.end(), - [](int i) { return (i == 3); }); - std::cout << ((passed) ? "SUCCESS" : "FAILURE") << std::endl; - return (passed) ? 
0 : 1; -} diff --git a/exercises/sycl/02-vector_add/vector_add_usm_device.cpp b/exercises/sycl/02-vector_add/vector_add_usm_device.cpp new file mode 100644 index 0000000..c0505a2 --- /dev/null +++ b/exercises/sycl/02-vector_add/vector_add_usm_device.cpp @@ -0,0 +1,53 @@ +// Copyright (C) 2023 Intel Corporation +// SPDX-License-Identifier: MIT +#include +#include +using namespace sycl; + +int main() { + // Set up queue on any available device + queue q; + + // Initialize input and output memory on the host + constexpr size_t N = 256; + std::vector a(N), b(N), c(N); + std::fill(a.begin(), a.end(), 1); + std::fill(b.begin(), b.end(), 2); + std::fill(c.begin(), c.end(), 0); + + // Allocate the memory using universal share memory + int* a_usm = malloc_device(N, q); + int* b_usm = malloc_device(N, q); + int* c_usm = malloc_device(N, q); + + // Copy data from host to USM + q.memcpy(a_usm, a.data(), N * sizeof(int)).wait(); + q.memcpy(b_usm, b.data(), N * sizeof(int)).wait(); + q.memcpy(c_usm, c.data(), N * sizeof(int)).wait(); + + // Submit the kernel to the queue + q.submit([&](handler& h) { + + h.parallel_for(range{N}, [=](id<1> idx) { + c_usm[idx] = a_usm[idx] + b_usm[idx]; + }); + + }); + + // Wait for the kernel to finish + q.wait(); + + // Copy data from USM to host + q.memcpy(c.data(), c_usm, N * sizeof(int)).wait(); + + // Free USM allocations + free(a_usm, q); + free(b_usm, q); + free(c_usm, q); + + // Check that all outputs match the expected value + bool passed = std::all_of(c.begin(), c.end(), + [](int i) { return (i == 3); }); + std::cout << ((passed) ? "SUCCESS" : "FAILURE") << std::endl; + return (passed) ? 0 : 1; +} diff --git a/exercises/sycl/02-vector_add/vector_add_usm_managed.cpp b/exercises/sycl/02-vector_add/vector_add_usm_managed.cpp new file mode 100644 index 0000000..32fb3b7 --- /dev/null +++ b/exercises/sycl/02-vector_add/vector_add_usm_managed.cpp @@ -0,0 +1,47 @@ +// Copyright (C) 2023 Intel Corporation +// SPDX-License-Identifier: MIT +#include +#include +using namespace sycl; + +int main() { + // Set up queue on any available device + queue q; + + // Initialize input and output memory on the host + constexpr size_t N = 256; + + // Allocate memory using malloc_managed + int* a = malloc_managed(N, q); + int* b = malloc_managed(N, q); + int* c = malloc_managed(N, q); + + // Initialize input memory on the host + std::fill(a, a + N, 1); + std::fill(b, b + N, 2); + std::fill(c, c + N, 0); + + // Submit the kernel to the queue + q.submit([&](handler& h) { + + h.parallel_for(range{N}, [=](id<1> idx) { + c[idx] = a[idx] + b[idx]; + }); + + }); + + // Wait for the kernel to finish + q.wait(); + + // Check that all outputs match the expected value + bool passed = std::all_of(c.begin(), c.end(), + [](int i) { return (i == 3); }); + std::cout << ((passed) ? "SUCCESS" : "FAILURE") << std::endl; + + // Free managed allocations + free(a, q); + free(b, q); + free(c, q); + + return (passed) ? 0 : 1; +} From 19ea833313372f9211c0e220b11628d4ea68aace Mon Sep 17 00:00:00 2001 From: csccva Date: Mon, 25 Nov 2024 15:21:08 +0200 Subject: [PATCH 23/28] Sycl Exercises! 
---
 exercises/sycl/02-vector_add/Readme.md       | 70 ++++++++++++++++++-
 .../sycl/02-vector_add/vector_add_buffer.cpp | 13 ++--
 2 files changed, 73 insertions(+), 10 deletions(-)

diff --git a/exercises/sycl/02-vector_add/Readme.md b/exercises/sycl/02-vector_add/Readme.md
index 32776d8..87b21e6 100644
--- a/exercises/sycl/02-vector_add/Readme.md
+++ b/exercises/sycl/02-vector_add/Readme.md
@@ -39,7 +39,7 @@ a = a_buf.get_access(h);
 **Important** Use appropriate access modes for your data:
  - **Input Buffers:** Use `sycl::access::mode::read` to avoid unnecessary device-to-host data transfers.
  - **Output Buffers:** Use `sycl::access::mode::write` to avoid unnecessary host-to-device data transfers.
- - **Input/Ouput Buffers:** Use `sycl::access::mode::read_write` for the variables that are input, but they also get modified during the compuations.
+ - **Input/Output Buffers:** Use `sycl::access::mode::read_write` for the variables that are input, but also get modified during the computations.
 
 ### Step 4: Submit the Task
 Once accessors are ready, submit the task to the device using the `.parallel_for()` member function. The basic submission:
@@ -75,10 +75,76 @@ Similar to the device, it is possible to define host accessors in two ways. By u
 ```
 host_accessor c{c_buf, sycl::access::mode::read};
 ```
-or by using the `.get_access` memebr function of the buffer
+or by using the `.get_access` member function of the buffer
 ```
 auto = c_buf.get_access();
 ```
 
 ## Memory management with Unified Shared Memory
 
+### **malloc_device**
+
+Use the skeleton provided in `vector_add_buffer.cpp`. Look for the **//TODO** lines.
+
+### Step 1: Define a Queue
+Same as using buffers
+
+### Step 2: Create Buffers
+Next, create buffers to encapsulate the data. For a one-dimensional array of length `N`, with pointer `P`, a buffer can be constructed as follows:
+
+```
+sycl::buffer a_buf(P, sycl::range<1>(N));
+```
+### Step 3: Create Accessors
+Accessors provide a mechanism to access data inside the buffers. Accessors on the device must be created within command groups. There are two ways to create accessors. Using the `sycl::accessor` class constructor
+
+```
+ sycl::accessor a{a_buf, h, sycl::read_write};
+```
+or using the buffer `.get_access<...>(h)` member function:
+```
+a = a_buf.get_access(h);
+```
+**Important** Use appropriate access modes for your data:
+ - **Input Buffers:** Use `sycl::access::mode::read` to avoid unnecessary device-to-host data transfers.
+ - **Output Buffers:** Use `sycl::access::mode::write` to avoid unnecessary host-to-device data transfers.
+ - **Input/Output Buffers:** Use `sycl::access::mode::read_write` for the variables that are input, but also get modified during the computations.
+
+### Step 4: Submit the Task
+Once accessors are ready, submit the task to the device using the `.parallel_for()` member function. The basic submission:
+
+```
+ h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
+      c[idx] = a[idx] + b[idx];
+ });
+```
+Here:
+ - `sycl::range{N}` or `sycl::range(N)` specify the number of work-items to be launched
+- `sycl::id<1>` represents the index used within the kernel.
+
+#### Using **item** class instead of **id**
+Modify the lambda function to use the **sycl::item** class instead of the **id** class. In this case the index `idx` is obtained from the `.get_id()` member.
+
+#### Using ND-Range
+This basic launch serves our purpose for this simple example; however, it is also useful to test the **ND-RANGE** launch. 
In this case we specify to the runtime both the total size of the grid of work-items and the size of a work-group:
+
+```
+ h.parallel_for(sycl::nd_range<1>(sycl::range<1>(((N+local_size-1)/local_size)*local_size), sycl::range<1>(local_size)), [=](sycl::nd_item<1> item) {
+  auto idx=item.get_global_id(0);
+   c[idx] = a[idx] + b[idx];
+  });
+```
+**Note** that **ND-RANGE** requires the total number of work-items to be divisible by the size of the work-group; since the global size may be padded above `N`, guard the kernel body (e.g. `if (idx < N)`) to avoid out-of-bounds accesses.
+
+### Step 5: Retrieve Data
+The final task in this exercise is to move the checking of the results within the scope of the buffers (before the ending curly bracket) and add the appropriate method to access this data.
+
+By default, buffers are automatically synchronized with the host when they go out of scope. However, if you need to access data within the buffer’s scope, use **host accessors**. Host accessors can also be created in two ways:
+Similar to the device, it is possible to define host accessors in two ways. By using the accessor class constructor
+```
+host_accessor c{c_buf, sycl::access::mode::read};
+```
+or by using the `.get_access` member function of the buffer
+```
+auto = c_buf.get_access();
+```
diff --git a/exercises/sycl/02-vector_add/vector_add_buffer.cpp b/exercises/sycl/02-vector_add/vector_add_buffer.cpp
index 3ebdfaa..3e66b08 100644
--- a/exercises/sycl/02-vector_add/vector_add_buffer.cpp
+++ b/exercises/sycl/02-vector_add/vector_add_buffer.cpp
@@ -23,15 +23,12 @@ int main() {
 
   // Submit the kernel to the queue
   q.submit([&](handler& h) {
-    // Create an accessor
-    accessor a{a_buf, h};
-    accessor b{b_buf, h};
-    accessor c{c_buf, h};
+    // Create accessors
+    //TODO
 
-    h.parallel_for(range{N}, [=](id<1> idx) {
-      c[idx] = a[idx] + b[idx];
-    });
-  });
+    h.parallel_for(
+      //TODO
+    );
   }
 
   // Check that all outputs match expected value

From b32e4a36d843a8daaeade71f0c4a080702bf2e0b Mon Sep 17 00:00:00 2001
From: csccva
Date: Mon, 25 Nov 2024 15:50:40 +0200
Subject: [PATCH 24/28] Vector Add done

---
 exercises/sycl/02-vector_add/Readme.md        | 101 +++++++++++++-----
 .../sycl/02-vector_add/vector_add_buffer.cpp  |   3 +
 .../02-vector_add/vector_add_usm_device.cpp   |  25 ++---
 .../02-vector_add/vector_add_usm_managed.cpp  |  19 ++--
 4 files changed, 96 insertions(+), 52 deletions(-)

diff --git a/exercises/sycl/02-vector_add/Readme.md b/exercises/sycl/02-vector_add/Readme.md
index 87b21e6..9f07e8d 100644
--- a/exercises/sycl/02-vector_add/Readme.md
+++ b/exercises/sycl/02-vector_add/Readme.md
@@ -21,7 +21,7 @@ Use the skeleton provided in `vector_add_buffer.cpp`. Look for the **//TODO** li
 Start by defining a **queue** and selecting the appropriate device selector. SYCL provides predefined selectors, such as: default, gpu, cpu, accelerator or you can use the procedure from the [previous exercise](../01-info/enumerate_device.cpp).
 
 ### Step 2: Create Buffers
-Next, create buffers to encapsulate the data. For a one-dimensional array of length `N`, with pointer `P`, a buffer can be constructed as follows:
+Next, create buffers to encapsulate the data. For a one-dimensional array of integers of length `N`, with pointer `P`, a buffer can be constructed as follows:
 
 ```
 sycl::buffer a_buf(P, sycl::range<1>(N));
@@ -84,38 +84,90 @@ auto = c_buf.get_access();
 
 ### **malloc_device**
 
-Use the skeleton provided in `vector_add_buffer.cpp`. Look for the **//TODO** lines.
+Use the skeleton provided in `vector_add_usm_device.cpp`. Look for the **//TODO** lines. 
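
As in the buffer exercise, Step 1 below reuses the same queue setup. If you want to target a specific device instead of relying on the default selector, a minimal sketch using the standard selector objects looks like this (illustrative, not part of the skeleton):

```
sycl::queue q{sycl::gpu_selector_v};  // or sycl::cpu_selector_v / sycl::default_selector_v
```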
### Step 1: Define a Queue
Same as using buffers

### Step 2: Allocate Memory on the Device Using `malloc_device`
Instead of creating buffers, allocate memory directly on the device using `sycl::malloc_device`. For a one-dimensional array of integers of length N, memory can be allocated as follows:

```
int* a_usm = sycl::malloc_device<int>(N, q);
```
### Step 3: Copy Data to the Device
You need to copy the data from the host to the device memory. Use the queue's `memcpy` member function to transfer data from the host memory to device memory before launching the kernel:

```
q.memcpy(a_usm, a.data(), N * sizeof(int)).wait();
```

### Step 4: Submit the Task
Once memory is allocated and data is copied, submit the task to the device using the `.parallel_for()` member function. The basic submission:

```
 h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
      c_usm[idx] = a_usm[idx] + b_usm[idx];
 });
```
Here:
 - `sycl::range{N}` or `sycl::range(N)` specify the number of work-items to be launched
- `sycl::id<1>` represents the index used within the kernel.

#### Using **item** class instead of **id**
Modify the lambda function to use the **sycl::item** class instead of the **id** class. In this case the index `idx` is obtained from the `.get_id()` member.

#### Using ND-Range
This basic launch serves our purpose for this simple example; however, it is also useful to test the **ND-RANGE** launch. In this case we specify to the runtime both the total size of the grid of work-items and the size of a work-group:

```
 h.parallel_for(sycl::nd_range<1>(sycl::range<1>(((N+local_size-1)/local_size)*local_size), sycl::range<1>(local_size)), [=](sycl::nd_item<1> item) {
  auto idx=item.get_global_id(0);
   c_usm[idx] = a_usm[idx] + b_usm[idx];
  });
```
**Note** that **ND-RANGE** requires the total number of work-items to be divisible by the size of the work-group; since the global size may be padded above `N`, guard the kernel body (e.g. `if (idx < N)`) to avoid out-of-bounds accesses.

### Step 5: Retrieve Data

After the kernel execution is complete, you need to copy the result back from the device to the host. Use the queue's `memcpy` again to transfer the result:
```
q.memcpy(c.data(), c_usm, N * sizeof(int)).wait();
```
### Step 6: Free Device Memory

Once you're done with the device memory, free the allocated memory using `sycl::free`:

```
sycl::free(a_usm, q);
```
This ensures that the allocated memory is properly released on the device.



### **malloc_shared**

Use the skeleton provided in `vector_add_usm_managed.cpp`. Look for the **//TODO** lines. 
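
Before filling in the shared-memory skeleton, it can help to see how the `malloc_device` steps of the previous subsection fit together. A minimal end-to-end sketch, assuming the names used above (`a_usm`, `b_usm`, `c_usm` on the device; host vectors `a`, `b`, `c`) and not meant as the official solution:

```
sycl::queue q;
int* a_usm = sycl::malloc_device<int>(N, q);
int* b_usm = sycl::malloc_device<int>(N, q);
int* c_usm = sycl::malloc_device<int>(N, q);
q.memcpy(a_usm, a.data(), N * sizeof(int)).wait();  // host to device
q.memcpy(b_usm, b.data(), N * sizeof(int)).wait();
q.submit([&](sycl::handler& h) {
  h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
    c_usm[idx] = a_usm[idx] + b_usm[idx];
  });
});
q.wait();                                           // let the kernel finish
q.memcpy(c.data(), c_usm, N * sizeof(int)).wait();  // device to host
sycl::free(a_usm, q); sycl::free(b_usm, q); sycl::free(c_usm, q);
```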
### Step 1: Define a Queue
Same as before

### Step 2: Allocate Memory Using `malloc_shared`
Allocate memory that can be migrated between host and device using `sycl::malloc_shared`. For a one-dimensional array of integers of length N, memory can be allocated as follows:

```
int* a = sycl::malloc_shared<int>(N, q);
```
### Step 3: Initialize Data on Host

This part is already in the skeleton, it is done using `std::fill`. Though if you have time you can replace it with a **for loop**.

### Step 4: Submit the Task
Once memory is allocated and data is copied, submit the task to the device using the `.parallel_for()` member function. The basic submission:

```
 h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
      c[idx] = a[idx] + b[idx];
 });
```
Here:
 - `sycl::range{N}` or `sycl::range(N)` specify the number of work-items to be launched
- `sycl::id<1>` represents the index used within the kernel.

#### Using **item** class instead of **id**
Modify the lambda function to use the **sycl::item** class instead of the **id** class. In this case the index `idx` is obtained from the `.get_id()` member.

#### Using ND-Range
This basic launch serves our purpose for this simple example; however, it is also useful to test the **ND-RANGE** launch. In this case we specify to the runtime both the total size of the grid of work-items and the size of a work-group:

```
 h.parallel_for(sycl::nd_range<1>(sycl::range<1>(((N+local_size-1)/local_size)*local_size), sycl::range<1>(local_size)), [=](sycl::nd_item<1> item) {
  auto idx=item.get_global_id(0);
   c[idx] = a[idx] + b[idx];
  });
```
**Note** that **ND-RANGE** requires the total number of work-items to be divisible by the size of the work-group; since the global size may be padded above `N`, guard the kernel body (e.g. `if (idx < N)`) to avoid out-of-bounds accesses.

### Step 5: Synchronize and Check Results

Since `malloc_shared` migrates data automatically between the host and device, no explicit memory transfer is required. Ensure the queue finishes execution before accessing the results using `q.wait()`.

### Step 6: Free Device Memory

Once you're done with the device memory, free the allocated memory using `sycl::free`:

```
sycl::free(a, q);
```
This ensures that the allocated memory is properly released on the device. 
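
The shared-allocation variant is short enough to sketch in one piece, since no explicit copies are needed. A minimal sketch under the same assumptions (skeleton names; illustrative only, not the exercise solution):

```
sycl::queue q;
int* a = sycl::malloc_shared<int>(N, q);
int* b = sycl::malloc_shared<int>(N, q);
int* c = sycl::malloc_shared<int>(N, q);
std::fill(a, a + N, 1);
std::fill(b, b + N, 2);
q.submit([&](sycl::handler& h) {
  h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
    c[idx] = a[idx] + b[idx];
  });
});
q.wait();  // after this, c[0..N-1] can be read directly on the host
sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
```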
\ No newline at end of file
diff --git a/exercises/sycl/02-vector_add/vector_add_buffer.cpp b/exercises/sycl/02-vector_add/vector_add_buffer.cpp
index 3e66b08..265907e 100644
--- a/exercises/sycl/02-vector_add/vector_add_buffer.cpp
+++ b/exercises/sycl/02-vector_add/vector_add_buffer.cpp
@@ -29,6 +29,9 @@ int main() {
     h.parallel_for(
       //TODO
     );
+
+    //TODO after the submission works:
+    //Check the results inside the scope of the buffers
   }
 
   // Check that all outputs match expected value
diff --git a/exercises/sycl/02-vector_add/vector_add_usm_device.cpp b/exercises/sycl/02-vector_add/vector_add_usm_device.cpp
index c0505a2..7a78cbe 100644
--- a/exercises/sycl/02-vector_add/vector_add_usm_device.cpp
+++ b/exercises/sycl/02-vector_add/vector_add_usm_device.cpp
@@ -6,7 +6,7 @@ using namespace sycl;
 
 int main() {
   // Set up queue on any available device
-  queue q;
+  //TODO
 
   // Initialize input and output memory on the host
   constexpr size_t N = 256;
@@ -15,22 +15,18 @@ int main() {
   std::fill(b.begin(), b.end(), 2);
   std::fill(c.begin(), c.end(), 0);
 
-  // Allocate the memory using unified shared memory
-  int* a_usm = malloc_device<int>(N, q);
-  int* b_usm = malloc_device<int>(N, q);
-  int* c_usm = malloc_device<int>(N, q);
+  // Allocate the memory using malloc_device
+  //TODO
 
   // Copy data from host to USM
-  q.memcpy(a_usm, a.data(), N * sizeof(int)).wait();
-  q.memcpy(b_usm, b.data(), N * sizeof(int)).wait();
-  q.memcpy(c_usm, c.data(), N * sizeof(int)).wait();
+  //TODO
 
   // Submit the kernel to the queue
   q.submit([&](handler& h) {
 
-    h.parallel_for(range{N}, [=](id<1> idx) {
-      c_usm[idx] = a_usm[idx] + b_usm[idx];
-    });
+    h.parallel_for(
+      //TODO
+    );
 
   });
 
@@ -38,13 +34,12 @@ int main() {
   q.wait();
 
   // Copy data from USM to host
-  q.memcpy(c.data(), c_usm, N * sizeof(int)).wait();
+  //TODO
 
   // Free USM allocations
-  free(a_usm, q);
-  free(b_usm, q);
-  free(c_usm, q);
+  //TODO
+
   // Check that all outputs match the expected value
   bool passed = std::all_of(c.begin(), c.end(),
diff --git a/exercises/sycl/02-vector_add/vector_add_usm_managed.cpp b/exercises/sycl/02-vector_add/vector_add_usm_managed.cpp
index 32fb3b7..4ba158a 100644
--- a/exercises/sycl/02-vector_add/vector_add_usm_managed.cpp
+++ b/exercises/sycl/02-vector_add/vector_add_usm_managed.cpp
@@ -6,15 +6,13 @@ using namespace sycl;
 
 int main() {
   // Set up queue on any available device
-  queue q;
+  //TODO
 
   // Initialize input and output memory on the host
   constexpr size_t N = 256;
 
   // Allocate memory using malloc_shared
-  int* a = malloc_shared<int>(N, q);
-  int* b = malloc_shared<int>(N, q);
-  int* c = malloc_shared<int>(N, q);
+  //TODO
 
   // Initialize input memory on the host
   std::fill(a, a + N, 1);
@@ -24,10 +22,9 @@ int main() {
   // Submit the kernel to the queue
   q.submit([&](handler& h) {
 
-    h.parallel_for(range{N}, [=](id<1> idx) {
-      c[idx] = a[idx] + b[idx];
-    });
-
+    h.parallel_for(
+      //TODO
+    );
   });
 
   // Wait for the kernel to finish
@@ -38,10 +35,8 @@ int main() {
   [](int i) { return (i == 3); });
   std::cout << ((passed) ? "SUCCESS" : "FAILURE") << std::endl;
 
-  // Free managed allocations
-  free(a, q);
-  free(b, q);
-  free(c, q);
+  // Free the memory
+  //TODO
 
   return (passed) ? 
0 : 1;
 }

From 1e57becb767fca3021c15b1ed73c849d4948ee5a Mon Sep 17 00:00:00 2001
From: csccva
Date: Mon, 25 Nov 2024 15:53:17 +0200
Subject: [PATCH 25/28] Vector Add done

---
 exercises/sycl/02-vector_add/Readme.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/exercises/sycl/02-vector_add/Readme.md b/exercises/sycl/02-vector_add/Readme.md
index 9f07e8d..74531b6 100644
--- a/exercises/sycl/02-vector_add/Readme.md
+++ b/exercises/sycl/02-vector_add/Readme.md
@@ -13,7 +13,7 @@ A typical application running on an accelerator follows these steps:
 
 In this exercise, we will explore various memory models.
 
-## Memory management using Buffers and Accessors
+## I. Memory management using Buffers and Accessors
 
 Use the skeleton provided in `vector_add_buffer.cpp`. Look for the **//TODO** lines.
@@ -80,9 +80,9 @@ or by using the `.get_access` member function of the buffer
 auto = c_buf.get_access();
 ```
 
-## Memory management with Unified Shared Memory
+## II. Memory management with Unified Shared Memory
 
-### **malloc_device**
+### IIa) **malloc_device**
 
 Use the skeleton provided in `vector_add_usm_device.cpp`. Look for the **//TODO** lines.
@@ -145,7 +145,7 @@
 
-### **malloc_shared**
+### IIb) **malloc_shared**
 
 Use the skeleton provided in `vector_add_usm_managed.cpp`. Look for the **//TODO** lines.

From 1adc8338dbd8505ee7c2fc41c9a68e00ae8b62cd Mon Sep 17 00:00:00 2001
From: csccva
Date: Mon, 25 Nov 2024 15:54:49 +0200
Subject: [PATCH 26/28] Vector Add done

---
 exercises/sycl/02-vector_add/Readme.md | 50 ++------------------------
 1 file changed, 2 insertions(+), 48 deletions(-)

diff --git a/exercises/sycl/02-vector_add/Readme.md b/exercises/sycl/02-vector_add/Readme.md
index 74531b6..f444f35 100644
--- a/exercises/sycl/02-vector_add/Readme.md
+++ b/exercises/sycl/02-vector_add/Readme.md
@@ -103,30 +103,7 @@ q.memcpy(a_usm, a.data(), N * sizeof(int)).wait();
 ```
 
 ### Step 4: Submit the Task
-Once memory is allocated and data is copied, submit the task to the device using the `.parallel_for()` member function. The basic submission:
-
-```
- h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
-      c_usm[idx] = a_usm[idx] + b_usm[idx];
- });
-```
-Here:
- - `sycl::range{N}` or `sycl::range(N)` specify the number of work-items to be launched
-- `sycl::id<1>` represents the index used within the kernel.
-
-#### Using **item** class instead of **id**
-Modify the lambda function to use the **sycl::item** class instead of the **id** class. In this case the index `idx` is obtained from the `.get_id()` member.
-
-#### Using ND-Range
-This basic launch serves our purpose for this simple example; however, it is also useful to test the **ND-RANGE** launch. In this case we specify to the runtime both the total size of the grid of work-items and the size of a work-group:
-
-```
- h.parallel_for(sycl::nd_range<1>(sycl::range<1>(((N+local_size-1)/local_size)*local_size), sycl::range<1>(local_size)), [=](sycl::nd_item<1> item) {
-  auto idx=item.get_global_id(0);
-   c_usm[idx] = a_usm[idx] + b_usm[idx];
-  });
-```
-**Note** that **ND-RANGE** requires the total number of work-items to be divisible by the size of the work-group; since the global size may be padded above `N`, guard the kernel body (e.g. `if (idx < N)`) to avoid out-of-bounds accesses.
+Same as using buffers.
 
 ### Step 5: Retrieve Data
 
@@ -163,30 +140,7 @@ ### Step 3: Initialize Data on Host
 
 This part is already in the skeleton, it is done using `std::fill`. Though if you have time you can replace it with a **for loop**. 
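
For example, the `std::fill` calls could be replaced by a loop like the following (a sketch; `a`, `b`, `c` are the pointers allocated in Step 2):

```
for (size_t i = 0; i < N; i++) {
  a[i] = 1;
  b[i] = 2;
  c[i] = 0;
}
```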
### Step 4: Submit the Task
-Once memory is allocated and data is copied, submit the task to the device using the `.parallel_for()` member function. The basic submission:
-
-```
- h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
-      c[idx] = a[idx] + b[idx];
- });
-```
-Here:
- - `sycl::range{N}` or `sycl::range(N)` specify the number of work-items to be launched
-- `sycl::id<1>` represents the index used within the kernel.
-
-#### Using **item** class instead of **id**
-Modify the lambda function to use the **sycl::item** class instead of the **id** class. In this case the index `idx` is obtained from the `.get_id()` member.
-
-#### Using ND-Range
-This basic launch serves our purpose for this simple example; however, it is also useful to test the **ND-RANGE** launch. In this case we specify to the runtime both the total size of the grid of work-items and the size of a work-group:
-
-```
- h.parallel_for(sycl::nd_range<1>(sycl::range<1>(((N+local_size-1)/local_size)*local_size), sycl::range<1>(local_size)), [=](sycl::nd_item<1> item) {
-     auto idx=item.get_global_id(0);
-     c[idx] = a[idx] + b[idx];
- });
-```
-**Note** that **ND-RANGE** requires the total number of work-items to be divisible by the size of the work-group; since the global size may be padded above `N`, guard the kernel body (e.g. `if (idx < N)`) to avoid out-of-bounds accesses.
+Same as using buffers.
 
 ### Step 5: Synchronize and Check Results
 

From ad919023143f26a2b6f1761591c3919c9a014713 Mon Sep 17 00:00:00 2001
From: csccva
Date: Mon, 25 Nov 2024 15:56:10 +0200
Subject: [PATCH 27/28] Vector Add done

---
 exercises/sycl/02-vector_add/Readme.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/exercises/sycl/02-vector_add/Readme.md b/exercises/sycl/02-vector_add/Readme.md
index f444f35..2c6c912 100644
--- a/exercises/sycl/02-vector_add/Readme.md
+++ b/exercises/sycl/02-vector_add/Readme.md
@@ -70,8 +70,9 @@ This basic launch serves our purpose for this simple example; however, it is
 ### Step 5: Retrieve Data
 The final task in this exercise is to move the checking of the results within the scope of the buffers (before the ending curly bracket) and add the appropriate method to access this data.
 
-By default, buffers are automatically synchronized with the host when they go out of scope. However, if you need to access data within the buffer’s scope, use **host accessors**. Host accessors can also be created in two ways:
-Similar to the device, it is possible to define host accessors in two ways. By using the accessor class constructor
+By default, buffers are automatically synchronized with the host when they go out of scope. However, if you need to access data within the buffer’s scope, use **host accessors**.
+
+Similar to the device accessors, it is possible to define host accessors in two ways. By using the accessor class constructor
 ```
 host_accessor c{c_buf, sycl::access::mode::read};
 ```

From 44e61e1af2d67cb7f9c082226f56ae2b91879915 Mon Sep 17 00:00:00 2001
From: csccva
Date: Mon, 25 Nov 2024 15:58:45 +0200
Subject: [PATCH 28/28] Vector Add done

---
 exercises/sycl/02-vector_add/Readme.md | 44 +++++++++++++++++-----------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/exercises/sycl/02-vector_add/Readme.md b/exercises/sycl/02-vector_add/Readme.md
index 2c6c912..8fc753b 100644
--- a/exercises/sycl/02-vector_add/Readme.md
+++ b/exercises/sycl/02-vector_add/Readme.md
@@ -23,18 +23,18 @@ Start by defining a **queue** and selecting the appropriate device selector. SY
 ### Step 2: Create Buffers
 Next, create buffers to encapsulate the data. 
For a one-dimensional array of integers of length `N`, with pointer `P`, a buffer can be constructed as follows:
 
-```
-sycl::buffer a_buf(P, sycl::range<1>(N));
+```cpp
+   sycl::buffer a_buf(P, sycl::range<1>(N));
 ```
 ### Step 3: Create Accessors
 Accessors provide a mechanism to access data inside the buffers. Accessors on the device must be created within command groups. There are two ways to create accessors. Using the `sycl::accessor` class constructor
 
-```
+```cpp
     sycl::accessor a{a_buf, h, sycl::read_write};
 ```
 or using the buffer `.get_access<...>(h)` member function:
-```
-a = a_buf.get_access(h);
+```cpp
+   auto a = a_buf.get_access(h);
 ```
 **Important** Use appropriate access modes for your data:
  - **Input Buffers:** Use `sycl::access::mode::read` to avoid unnecessary device-to-host data transfers.
  - **Output Buffers:** Use `sycl::access::mode::write` to avoid unnecessary host-to-device data transfers.
 - **Input/Output Buffers:** Use `sycl::access::mode::read_write` for the variables that are input, but also get modified during the computations.
 
 ### Step 4: Submit the Task
 Once accessors are ready, submit the task to the device using the `.parallel_for()` member function. The basic submission:
 
-```
+```cpp
     h.parallel_for(sycl::range{N}, [=](sycl::id<1> idx) {
       c[idx] = a[idx] + b[idx];
   });
 ```
 Here:
  - `sycl::range{N}` or `sycl::range(N)` specify the number of work-items to be launched
- - `sycl::id<1>` represents the index used within the kernel.
+  - `sycl::id<1>` represents the index used within the kernel.
 
 #### Using **item** class instead of **id**
 Modify the lambda function to use the **sycl::item** class instead of the **id** class. In this case the index `idx` is obtained from the `.get_id()` member.
 
 #### Using ND-Range
 This basic launch serves our purpose for this simple example; however, it is also useful to test the **ND-RANGE** launch. In this case we specify to the runtime both the total size of the grid of work-items and the size of a work-group:
 
-```
+```cpp
     h.parallel_for(sycl::nd_range<1>(sycl::range<1>(((N+local_size-1)/local_size)*local_size), sycl::range<1>(local_size)), [=](sycl::nd_item<1> item) {
     auto idx=item.get_global_id(0);
       c[idx] = a[idx] + b[idx];
   });
 ```
 **Note** that **ND-RANGE** requires the total number of work-items to be divisible by the size of the work-group; since the global size may be padded above `N`, guard the kernel body (e.g. `if (idx < N)`) to avoid out-of-bounds accesses.
 
 ### Step 5: Retrieve Data
 The final task in this exercise is to move the checking of the results within the scope of the buffers (before the ending curly bracket) and add the appropriate method to access this data.
 
 By default, buffers are automatically synchronized with the host when they go out of scope. However, if you need to access data within the buffer’s scope, use **host accessors**.
 
 Similar to the device accessors, it is possible to define host accessors in two ways. By using the accessor class constructor
-```
-host_accessor c{c_buf, sycl::access::mode::read};
+```cpp
+   host_accessor c{c_buf, sycl::access::mode::read};
 ```
 or by using the `.get_access` member function of the buffer
-```
-auto = c_buf.get_access();
+```cpp
+   auto c = c_buf.get_access();
 ```
 
 ## II. Memory management with Unified Shared Memory
 
 ### IIa) **malloc_device**
 
 Use the skeleton provided in `vector_add_usm_device.cpp`. Look for the **//TODO** lines.
@@ -93,14 +93,14 @@ Same as using buffers
 ### Step 2: Allocate Memory on the Device Using `malloc_device`
 Instead of creating buffers, allocate memory directly on the device using `sycl::malloc_device`. For a one-dimensional array of integers of length N, memory can be allocated as follows:
 
-```
-int* a_usm = sycl::malloc_device<int>(N, q);
+```cpp
+   int* a_usm = sycl::malloc_device<int>(N, q);
 ```
 ### Step 3: Copy Data to the Device
 You need to copy the data from the host to the device memory. 
Use the queue's `memcpy` member function to transfer data from the host memory to device memory before launching the kernel:
 
-```
-q.memcpy(a_usm, a.data(), N * sizeof(int)).wait();
+```cpp
+   q.memcpy(a_usm, a.data(), N * sizeof(int)).wait();
 ```
 
 ### Step 4: Submit the Task
 Same as using buffers.
 
 ### Step 5: Retrieve Data
 After the kernel execution is complete, you need to copy the result back from the device to the host. Use the queue's `memcpy` again to transfer the result:
 
-```
-q.memcpy(c.data(), c_usm, N * sizeof(int)).wait();
+```cpp
+   q.memcpy(c.data(), c_usm, N * sizeof(int)).wait();
 ```
 ### Step 6: Free Device Memory
 
 Once you're done with the device memory, free the allocated memory using `sycl::free`:
 
-```
-sycl::free(a_usm, q);
+```cpp
+   sycl::free(a_usm, q);
 ```
 This ensures that the allocated memory is properly released on the device.
 
 
 
 ### IIb) **malloc_shared**
 
 Use the skeleton provided in `vector_add_usm_managed.cpp`. Look for the **//TODO** lines.
 
 ### Step 1: Define a Queue
 Same as before
 
 ### Step 2: Allocate Memory Using `malloc_shared`
 Allocate memory that can be migrated between host and device using `sycl::malloc_shared`. For a one-dimensional array of integers of length N, memory can be allocated as follows:
 
-```
+```cpp
     int* a = sycl::malloc_shared<int>(N, q);
 ```
 ### Step 3: Initialize Data on Host
 
 This part is already in the skeleton, it is done using `std::fill`. Though if you have time you can replace it with a **for loop**.
 
 ### Step 4: Submit the Task
 Same as using buffers.
 
 ### Step 5: Synchronize and Check Results
 
 Since `malloc_shared` migrates data automatically between the host and device, no explicit memory transfer is required. Ensure the queue finishes execution before accessing the results using `q.wait()`.
 
 ### Step 6: Free Device Memory
 
 Once you're done with the device memory, free the allocated memory using `sycl::free`:
 
-```
+```cpp
     sycl::free(a, q);
 ```
 This ensures that the allocated memory is properly released on the device.
\ No newline at end of file
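
As a closing reference, here is how the buffer-and-accessor steps of section I combine into one self-contained program. This is a minimal sketch (default device, result check done via a host accessor inside the buffer scope); it is illustrative, not the course's official solution:

```cpp
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  sycl::queue q;  // Step 1: queue on the default device
  constexpr size_t N = 256;
  std::vector<int> a(N, 1), b(N, 2), c(N, 0);
  bool passed = true;
  {
    // Step 2: buffers encapsulating the host data
    sycl::buffer<int> a_buf(a.data(), sycl::range<1>(N));
    sycl::buffer<int> b_buf(b.data(), sycl::range<1>(N));
    sycl::buffer<int> c_buf(c.data(), sycl::range<1>(N));

    q.submit([&](sycl::handler& h) {
      // Step 3: accessors with explicit access modes
      sycl::accessor a_acc{a_buf, h, sycl::read_only};
      sycl::accessor b_acc{b_buf, h, sycl::read_only};
      sycl::accessor c_acc{c_buf, h, sycl::write_only};
      // Step 4: one work-item per element
      h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> idx) {
        c_acc[idx] = a_acc[idx] + b_acc[idx];
      });
    });

    // Step 5: a host accessor waits for the kernel and reads the result
    sycl::host_accessor c_host{c_buf, sycl::read_only};
    for (size_t i = 0; i < N; i++) passed = passed && (c_host[i] == 3);
  }
  std::cout << (passed ? "SUCCESS" : "FAILURE") << std::endl;
  return passed ? 0 : 1;
}
```

Compared with the USM variants, no explicit copies or frees appear here: the host accessor and the buffer destructors handle synchronization with the host vectors.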