Fix mlperf deepcam test #40
Labels
bug
Something isn't working
Comments
I've run the test on ARCHER2:
rapostol@ln03:~/work/epcc-reframe> reframe -R -r -C configuration/archer2.py -c tests/mlperf/deepcam/
[ReFrame Setup]
version: 4.2.1
command: '/work/y07/shared/utils/core/reframe/4.2.1/bin/reframe -R -r -C configuration/archer2.py -c tests/mlperf/deepcam/'
launched by: rapostol@ln03
working directory: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe'
settings files: '<builtin>', 'configuration/archer2.py'
check search path: (R) '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/tests/mlperf/deepcam'
stage directory: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage'
output directory: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/output'
log files: '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.out', '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.log'
[==========] Running 2 check(s)
[==========] Started on Thu Sep 26 17:28:33 2024
[----------] start processing checks
[ RUN ] DeepCamCPUCheck /226c6510 @archer2:compute+PrgEnv-gnu
[ RUN ] DeepCamGPUBenchmark %num_gpus=4 /8347fc39 @archer2:compute-gpu-torch+rocm-PrgEnv-gnu
[ FAIL ] (1/2) DeepCamGPUBenchmark %num_gpus=4 /8347fc39 @archer2:compute-gpu-torch+rocm-PrgEnv-gnu
==> test failed during 'sanity': test staged in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute-gpu-torch/rocm-PrgEnv-gnu/DeepCamGPUBenchmark_8347fc39'
[ FAIL ] (2/2) DeepCamCPUCheck /226c6510 @archer2:compute+PrgEnv-gnu
==> test failed during 'sanity': test staged in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute/PrgEnv-gnu/DeepCamCPUCheck'
[----------] all spawned checks have finished
[ FAILED ] Ran 2/2 test case(s) from 2 check(s) (2 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Sep 26 17:31:37 2024
==================================================================================================================================================================================================
SUMMARY OF FAILURES
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for DeepCamCPUCheck (run: 1/1)
* Description: DeepCam CPU Benchmark
* System partition: archer2:compute
* Environment: PrgEnv-gnu
* Stage directory: /mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute/PrgEnv-gnu/DeepCamCPUCheck
* Node list:
* Job type: batch job (id=7686033)
* Dependencies (conceptual): []
* Dependencies (actual): []
* Maintainers: []
* Failing phase: sanity
* Rerun with '-n /226c6510 -p PrgEnv-gnu --system archer2:compute -r'
* Reason: sanity error: pattern 'Processing Speed' not found in 'rfm_job.out'
--- rfm_job.out (first 10 lines) ---
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
Lmod is automatically replacing "cce/15.0.0" with "gcc/11.2.0".
Lmod is automatically replacing "PrgEnv-cray/8.3.3" with "PrgEnv-gnu/8.3.3".
Due to MODULEPATH changes, the following have been reloaded:
1) cray-mpich/8.1.23
--- rfm_job.err ---
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
FAILURE INFO for DeepCamGPUBenchmark %num_gpus=4 (run: 1/1)
* Description: Deepcam GPU Benchmark
* System partition: archer2:compute-gpu-torch
* Environment: rocm-PrgEnv-gnu
* Stage directory: /mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/stage/archer2/compute-gpu-torch/rocm-PrgEnv-gnu/DeepCamGPUBenchmark_8347fc39
* Node list:
* Job type: batch job (id=7686034)
* Dependencies (conceptual): []
* Dependencies (actual): []
* Maintainers: []
* Failing phase: sanity
* Rerun with '-n /8347fc39 -p rocm-PrgEnv-gnu --system archer2:compute-gpu-torch -r'
* Reason: sanity error: pattern 'Processing Speed' not found in 'rfm_job.out'
--- rfm_job.out (first 10 lines) ---
Torch Version Too Low for GPU Power Metrics
Torch Version Too Low for GPU Power Metrics
Torch Version Too Low for GPU Power Metrics2.0.0a0+git96ca226Torch Version Too Low for GPU Power Metrics
2.0.0a0+git96ca226
2.0.0a0+git96ca226
2.0.0a0+git96ca226
:::MLLOG {"namespace": "deepcam", "time_ms": 1727368225217, "event_type": "POINT_IN_TIME", "key": "opt_name", "value": "ADAM", "metadata": {"file": "/work/z043/shared/chris-ml-intern/ML_HPC/gc.py", "lineno": 115}}
:::MLLOG {"namespace": "deepcam", "time_ms": 1727368225694, "event_type": "POINT_IN_TIME", "key": "opt_adam_epsilon", "value": 1e-06, "metadata": {"file": "/work/z043/shared/chris-ml-intern/ML_HPC/gc.py", "lineno": 117}}
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
Lmod is automatically replacing "cce/15.0.0" with "gcc/11.2.0".
Lmod is automatically replacing "PrgEnv-cray/8.3.3" with "PrgEnv-gnu/8.3.3".
Due to MODULEPATH changes, the following have been reloaded:
1) cray-mpich/8.1.23
--- rfm_job.err ---
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Log file(s) saved in '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.out', '/mnt/lustre/a2fs-work2/work/z19/z19/rapostol/epcc-reframe/reframe.log'
And Cirrus:
rapostol@cirrus-login2:~/work/reframe_dev$ reframe -r -R -C configuration/cirrus.py -c tests/mlperf/deepcam/
[ReFrame Setup]
version: 4.6.0-dev.1
command: '/work/y07/shared/cirrus-software/reframe/bin/reframe -r -R -C configuration/cirrus.py -c tests/mlperf/deepcam/'
launched by: rapostol@cirrus-login2
working directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev'
settings files: '<builtin>', 'configuration/cirrus.py'
check search path: (R) '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/tests/mlperf/deepcam'
stage directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/stage'
output directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/output'
log files: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'
[==========] Running 1 check(s)
[==========] Started on Thu Sep 26 17:27:09 2024+0100
[----------] start processing checks
[ RUN ] DeepCamGPUBenchmark %num_gpus=4 /8347fc39 @cirrus:compute-gpu+Default
1
[ OK ] (1/1) DeepCamGPUBenchmark %num_gpus=4 /8347fc39 @cirrus:compute-gpu+Default
P: Throughput: 15.37205534047916 inputs/s (r:0, l:None, u:None)
P: Epoch Length: 33.30719208717346 s (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0, l:None, u:None)
P: Total IO Time: 24.079823663000003 s (r:0, l:None, u:None)
[----------] all spawned checks have finished
[ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Thu Sep 26 17:30:26 2024+0100
Log file(s) saved in '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'
resnet:
rapostol@cirrus-login2:~/work/reframe_dev$ reframe -r -R -C configuration/cirrus.py -c tests/mlperf/resnet50/
[ReFrame Setup]
version: 4.6.0-dev.1
command: '/work/y07/shared/cirrus-software/reframe/bin/reframe -r -R -C configuration/cirrus.py -c tests/mlperf/resnet50/'
launched by: rapostol@cirrus-login2
working directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev'
settings files: '<builtin>', 'configuration/cirrus.py'
check search path: (R) '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/tests/mlperf/resnet50'
stage directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/stage'
output directory: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/output'
log files: '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'
[==========] Running 1 check(s)
[==========] Started on Fri Sep 27 11:10:07 2024+0100
[----------] start processing checks
[ RUN ] ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @cirrus:compute-gpu+Default
[ OK ] (1/1) ResNet50GPUBenchmark %num_gpus=4 /4647c8f0 @cirrus:compute-gpu+Default
P: Throughput: 45.628501892089844 inputs/s (r:0, l:None, u:None)
P: Epoch Length: 44.92460060119629 s (r:0, l:None, u:None)
P: Delta Loss: 0.08161067962646484 (r:0, l:None, u:None)
P: Communication Time: 0.0 s (r:0, l:None, u:None)
P: Total IO Time: 0.2923150888 s (r:0, l:None, u:None)
[----------] all spawned checks have finished
[ PASSED ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Fri Sep 27 11:19:33 2024+0100
Log file(s) saved in '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.out', '/mnt/lustre/e1000/home/z04/z04/rapostol/reframe_dev/reframe.log'
The test was written with hard-coded paths to Python environments. These need to be moved to a central location or built as part of the test.
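One possible way to get rid of the hard-coded path, sketched here under assumptions: look the environment location up at run time and fall back to a central default. The variable name `DEEPCAM_PYTHON_ENV`, the placeholder default path, and both helper functions are hypothetical and do not come from the repository. The sketch also assumes a venv-style environment that has a `bin/activate` script.

```python
import os
from pathlib import PurePosixPath

# Hypothetical names: neither the variable nor the default path exists in the repo.
ENV_VAR = "DEEPCAM_PYTHON_ENV"
DEFAULT_ENV = "/path/to/shared/deepcam-env"  # placeholder for a central location


def resolve_python_env(environ=None):
    """Return the environment directory: the override if set, else the default."""
    environ = os.environ if environ is None else environ
    # An unset or empty variable falls through to the central default.
    return PurePosixPath(environ.get(ENV_VAR) or DEFAULT_ENV)


def activation_command(env_dir):
    """Build the shell line that activates a venv-style environment."""
    return f"source {env_dir / 'bin' / 'activate'}"
```

In a ReFrame test, the string returned by `activation_command` could be appended to `prerun_cmds` in place of a literal path, so each system only has to set one variable or share one default.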