-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] Refactor MIGraphX model testing with Jenkins credential access. #1671
base: develop
Are you sure you want to change the base?
[CI] Refactor MIGraphX model testing with Jenkins credential access. #1671
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall note: we want this to not be a shell script that's manually managing Docker containers
""" | ||
sh """ #!/bin/bash -x | ||
|
||
if [ \$(docker ps -a -q -f name=migraphx) ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... Jenkins should be managing the Docker starts/stops, not us
echo "sshfs is already installed." | ||
fi | ||
|
||
if ! command -v sshpass &> /dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in the Dockerfile
sshfs_port="22" | ||
|
||
known_hosts_file="/tmp/known_hosts" | ||
ssh-keyscan -p "$sshfs_port" "$sshfs_host" > "$known_hosts_file" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... I'd argue having the public key in by Jenkins credential is a better idea than trusting a keyscan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Tuna folks like to use Jenkins environment variables for these things, so they can set them up in the job and keep them out of github. IT may also complain, even just about visible hostnames.
Our codecov stage uses withCredentials
to fetch a Jenkins credential into an environment varlable.
My unpublished efforts to use an ssh tunnel for a Tuna database connection settled on sshagent (credentials: ['rocmlir-ixt-rack-15']) { ... ssh commands ... }
But note that ssh is weird on our CI nodes -- lockharts are different from 10.216.64.100 hosts are different from the rest.
echo "int8:$int8" | ||
echo "checkFor:$checkFor" | ||
|
||
docker pull rocm/mlir-migraphx-ci:latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a stage{}
in Jenkins etc.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## develop #1671 +/- ##
========================================
Coverage 78.04% 78.04%
========================================
Files 100 100
Lines 27998 27998
Branches 4091 4091
========================================
Hits 21851 21851
+ Misses 4489 4481 -8
- Partials 1658 1666 +8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
2661e23
to
d31ca83
Compare
} | ||
if (int8 == "true") { | ||
datatypes.add("--int8") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we test fp8 as well?
d31ca83
to
ab720b7
Compare
case "rocm-framework-38.amd.com": | ||
return (params.MI200 == true) | ||
case "aus-navi3x-13.amd.com": | ||
case "ixt-hq-ubb4-31": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of a mi200 machine that has a gpu problem. Omkar gave me a mi250 machine to use. Currently busy with another CI so I'm waiting to finish the job, and run the CI to test the models.
if (fp16 == "true") { | ||
datatypes.add("--fp16") | ||
} | ||
if (fp8 == "true") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added testing for fp8. So far it has been successfully completed on navi2x, navi3x and mi300 machines.
ab720b7
to
e8996f1
Compare
This commit implements server access via Jenkins credentials, replacing the need for the previously used model-testing.sh script. The new testing framework now supports five architectures: MI200, MI300, Navi2x, Navi3x, and Navi4x.
It allows for performance (perf) and verification (verify) testing, with performance set as the default option. For each architecture, dedicated log files are generated, containing detailed information about model performances.
The testing framework is configured to execute batch size tests of 1, 32, and 64 for each model, ensuring comprehensive coverage. Additionally, tests are scheduled to run once a week, providing regular updates on model performance metrics.
This update enhances the efficiency and maintainability of our model testing process, streamlining access and reporting for different architectures.
Currently, there are issues with mounting models on the Navi4x machine, and efforts are underway to resolve this. All other architectures are functioning correctly.