Fix mlperf deepcam test #40

Open
JPRichings opened this issue Aug 23, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@JPRichings
Contributor

JPRichings commented Aug 23, 2024

The test was written with hard-coded paths to Python environments. These need to be moved to a central location or built as part of the test.
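One possible way to avoid the hard-coded paths is to read the environment location from an environment variable and fall back to a shared default. This is only a minimal sketch: the variable name `HYPOTHETICAL_ML_ENV_DIR`, the default path, and both helper functions are assumptions made for illustration and do not come from the repository.

```python
import os
from pathlib import Path


def resolve_env_path(var_name: str, default: str) -> Path:
    """Return the Python environment directory named by an environment
    variable, falling back to a default when the variable is unset."""
    return Path(os.environ.get(var_name, default))


def activate_command(env_path: Path) -> str:
    """Build the shell line that activates a virtual environment."""
    return f"source {env_path / 'bin' / 'activate'}"


# Hypothetical variable name and default location, for illustration only.
env_dir = resolve_env_path("HYPOTHETICAL_ML_ENV_DIR", "/path/to/shared/envs/deepcam")
print(activate_command(env_dir))
```

A test could then place the resulting line in its pre-run commands instead of a literal path, so each system only needs to set the variable once.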

@RuiApostolo
Contributor

I've run the deepcam tests with the most recent commit (9f80583). On ARCHER2, the sanity checks fail because the PyTorch version is too low for power metrics. On Cirrus, the CPU test isn't run because the test has no valid programming environment there, but the GPU test passes.

ARCHER2:

rapostol@ln03:~/work/epcc-reframe> reframe -R -r -C configuration/archer2.py -c tests/mlperf/deepcam/                                            
[ReFrame Setup]
  version:           4.2.1
  command:           '/work/y07/shared/utils/core/reframe/4.2.1/bin/reframe -R -r -C configuration/archer2.py -c tests/mlperf/deepcam/'
  launched by:       rapostol@ln03
  working directory: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe'
  settings files:    '<builtin>', 'configuration/archer2.py'
  check search path: (R) '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/tests/mlperf/deepcam'
  stage directory:   '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage'
  output directory:  '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/output'
  log files:         '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.out', '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.log'

[==========] Running 2 check(s)
[==========] Started on Thu Sep 26 17:28:33 2024

[----------] start processing checks
[ RUN      ] DeepCamCPUCheck /226c6510 @archer2:compute+PrgEnv-gnu
[ RUN      ] DeepCamGPUBenchmark %num_gpus=4 /8347fc39 @archer2:compute-gpu-torch+rocm-PrgEnv-gnu
[     FAIL ] (1/2) DeepCamGPUBenchmark %num_gpus=4 /8347fc39 @archer2:compute-gpu-torch+rocm-PrgEnv-gnu
==> test failed during 'sanity': test staged in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute-gpu-torch/rocm-PrgEnv-gnu/DeepCamGPUBenchmark_8347fc39'
[     FAIL ] (2/2) DeepCamCPUCheck /226c6510 @archer2:compute+PrgEnv-gnu
==> test failed during 'sanity': test staged in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute/PrgEnv-gnu/DeepCamCPUCheck'
[----------] all spawned checks have finished

[  FAILED  ] Ran 2/2 test case(s) from 2 check(s) (2 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Sep 26 17:31:37 2024
==================================================================================================================================================================================================
SUMMARY OF FAILURES
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for DeepCamCPUCheck (run: 1/1)
  * Description: DeepCam CPU Benchmark
  * System partition: archer2:compute
  * Environment: PrgEnv-gnu
  * Stage directory: /mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute/PrgEnv-gnu/DeepCamCPUCheck
  * Node list:
  * Job type: batch job (id=7686033)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /226c6510 -p PrgEnv-gnu --system archer2:compute -r'
  * Reason: sanity error: pattern 'Processing Speed' not found in 'rfm_job.out'
--- rfm_job.out (first 10 lines) ---
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---

Lmod is automatically replacing "cce/15.0.0" with "gcc/11.2.0".


Lmod is automatically replacing "PrgEnv-cray/8.3.3" with "PrgEnv-gnu/8.3.3".


Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.23

--- rfm_job.err ---
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for DeepCamGPUBenchmark %num_gpus=4 (run: 1/1)
  * Description: Deepcam GPU Benchmark
  * System partition: archer2:compute-gpu-torch
  * Environment: rocm-PrgEnv-gnu
  * Stage directory: /mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute-gpu-torch/rocm-PrgEnv-gnu/DeepCamGPUBenchmark_8347fc39
  * Node list:
  * Job type: batch job (id=7686034)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /8347fc39 -p rocm-PrgEnv-gnu --system archer2:compute-gpu-torch -r'
  * Reason: sanity error: pattern 'Processing Speed' not found in 'rfm_job.out'
--- rfm_job.out (first 10 lines) ---
Torch Version Too Low for GPU Power Metrics
Torch Version Too Low for GPU Power Metrics
Torch Version Too Low for GPU Power Metrics2.0.0a0+git96ca226Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226


2.0.0a0+git96ca226
2.0.0a0+git96ca226
:::MLLOG {"namespace": "deepcam", "time_ms": 1727368225217, "event_type": "POINT_IN_TIME", "key": "opt_name", "value": "ADAM", "metadata": {"file": "/work/z043/shared/chris-ml-intern/ML_HPC/gc.py", "lineno": 115}}
:::MLLOG {"namespace": "deepcam", "time_ms": 1727368225694, "event_type": "POINT_IN_TIME", "key": "opt_adam_epsilon", "value": 1e-06, "metadata": {"file": "/work/z043/shared/chris-ml-intern/ML_HPC/gc.py", "lineno": 117}}
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---

Lmod is automatically replacing "cce/15.0.0" with "gcc/11.2.0".


Lmod is automatically replacing "PrgEnv-cray/8.3.3" with "PrgEnv-gnu/8.3.3".


Due to MODULEPATH changes, the following have been reloaded:
  1) cray-mpich/8.1.23

--- rfm_job.err ---
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Log file(s) saved in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.out', '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.log'
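The ARCHER2 output above reports the installed version as `2.0.0a0+git96ca226`. A version guard of the kind that prints this message could be sketched as below. The minimum version `2.1.0` is a placeholder assumption, since the actual threshold used by the benchmark is not stated in this thread, and the helper names are hypothetical. Note that this sketch ignores pre-release and local suffixes, so `2.0.0a0` is treated as `2.0.0`.

```python
import re


def parse_release(version: str) -> tuple:
    """Extract (major, minor, patch) from a version string, ignoring any
    pre-release or local suffix such as 'a0+git96ca226'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(version: str, minimum: str) -> bool:
    """Return True when `version` is at least `minimum`."""
    return parse_release(version) >= parse_release(minimum)


# "2.1.0" is a placeholder threshold, not the benchmark's real requirement.
print(meets_minimum("2.0.0a0+git96ca226", "2.1.0"))
```

If the power metrics are optional, a guard like this could let the benchmark skip them and still print the line that the sanity check looks for.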

And Cirrus:

rapostol@cirrus-login2:~/work/reframe_dev$ reframe -r -R -C configuration/cirrus.py -c tests/mlperf/deepcam/                                     
[ReFrame Setup]
  version:           4.6.0-dev.1
  command:           '/work/y07/shared/cirrus-software/reframe/bin/reframe -r -R -C configuration/cirrus.py -c tests/mlperf/deepcam/'
  launched by:       rapostol@cirrus-login2
  working directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev'
  settings files:    '<builtin>', 'configuration/cirrus.py'
  check search path: (R) '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/tests/mlperf/deepcam'
  stage directory:   '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/stage'
  output directory:  '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/output'
  log files:         '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'

[==========] Running 1 check(s)
[==========] Started on Thu Sep 26 17:27:09 2024+0100

[----------] start processing checks
[ RUN      ] DeepCamGPUBenchmark %num_gpus=4 /8347fc39 @cirrus:compute-gpu+Default
1
[       OK ] (1/1) DeepCamGPUBenchmark %num_gpus=4 /8347fc39 @cirrus:compute-gpu+Default
P: Throughput: 15.37205534047916 inputs/s (r:0, l:None, u:None)
P: Epoch Length: 33.30719208717346 s (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0, l:None, u:None)
P: Total IO Time: 24.079823663000003 s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Sep 26 17:30:26 2024+0100
Log file(s) saved in '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'




ResNet50, also on Cirrus:
rapostol@cirrus-login2:~/work/reframe_dev$ reframe -r -R -C configuration/cirrus.py -c tests/mlperf/resnet50/
[ReFrame Setup]
  version:           4.6.0-dev.1
  command:           '/work/y07/shared/cirrus-software/reframe/bin/reframe -r -R -C configuration/cirrus.py -c tests/mlperf/resnet50/'
  launched by:       rapostol@cirrus-login2
  working directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev'
  settings files:    '<builtin>', 'configuration/cirrus.py'
  check search path: (R) '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/tests/mlperf/resnet50'
  stage directory:   '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/stage'
  output directory:  '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/output'
  log files:         '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'

[==========] Running 1 check(s)
[==========] Started on Fri Sep 27 11:10:07 2024+0100

[----------] start processing checks
[ RUN      ] ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @cirrus:compute-gpu+Default
[       OK ] (1/1) ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @cirrus:compute-gpu+Default
P: Throughput: 45.628501892089844 inputs/s (r:0, l:None, u:None)
P: Epoch Length: 44.92460060119629 s (r:0, l:None, u:None)
P: Delta Loss: 0.08161067962646484  (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0, l:None, u:None)
P: Total IO Time: 0.2923150888 s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Fri Sep 27 11:19:33 2024+0100
Log file(s) saved in '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'
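To compare runs like the ones above across systems, the `P:` performance lines can be pulled out of the console output. The following is a rough sketch that assumes the line layout shown in these logs (`P: <name>: <value> <unit> (r:..., l:..., u:...)`); the regular expression and the function name are illustrative and not part of ReFrame.

```python
import re

# Matches lines such as:
#   P: Throughput: 45.628501892089844 inputs/s (r:0, l:None, u:None)
# The unit may be empty, as in the "Delta Loss" line.
PERF_LINE = re.compile(
    r"^P: (?P<name>[^:]+): (?P<value>[-+0-9.eE]+)\s*(?P<unit>\S*) \(r:"
)


def parse_perf_lines(text: str) -> dict:
    """Map each performance variable name to a (value, unit) pair."""
    results = {}
    for line in text.splitlines():
        match = PERF_LINE.match(line.strip())
        if match:
            results[match.group("name")] = (
                float(match.group("value")),
                match.group("unit"),
            )
    return results


sample = """\
P: Throughput: 45.628501892089844 inputs/s (r:0, l:None, u:None)
P: Delta Loss: 0.08161067962646484  (r:0, l:None, u:None)
"""
print(parse_perf_lines(sample))
```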
