Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Refactor MIGraphX model testing with Jenkins credential access. #1671

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

stefankoncarevic
Copy link
Contributor

This commit implements server access via Jenkins credentials, replacing the need for the previously used model-testing.sh script. The new testing framework now supports five architectures: MI200, MI300, Navi2x, Navi3x, and Navi4x.

It allows for performance (perf) and verification (verify) testing, with performance set as the default option. For each architecture, dedicated log files are generated, containing detailed information about model performances.

The testing framework is configured to execute batch size tests of 1, 32, and 64 for each model, ensuring comprehensive coverage. Additionally, tests are scheduled to run once a week, providing regular updates on model performance metrics.

This update enhances the efficiency and maintainability of our model testing process, streamlining access and reporting for different architectures.

Currently, there are issues with mounting models on the Navi4x machine, and efforts are underway to resolve this. All other architectures are functioning correctly.

Copy link
Collaborator

@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall note: we want this to not be a shell script that's manually managing Docker containers

"""
sh """ #!/bin/bash -x

if [ \$(docker ps -a -q -f name=migraphx) ]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... Jenkins should be managing the Docker starts/stops, not us

echo "sshfs is already installed."
fi

if ! command -v sshpass &> /dev/null
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in the Dockerfile

sshfs_port="22"

known_hosts_file="/tmp/known_hosts"
ssh-keyscan -p "$sshfs_port" "$sshfs_host" > "$known_hosts_file"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I'd argue having the public key in by Jenkins credential is a better idea than trusting a keyscan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Tuna folks like to use Jenkins environment variables for these things, so they can set them up in the job and keep them out of github. IT may also complain, even just about visible hostnames.

Our codecov stage uses withCredentials to fetch a Jenkins credential into an environment varlable.

My unpublished efforts to use an ssh tunnel for a Tuna database connection settled on sshagent (credentials: ['rocmlir-ixt-rack-15']) { ... ssh commands ... }

But note that ssh is weird on our CI nodes -- lockharts are different from 10.216.64.100 hosts are different from the rest.

echo "int8:$int8"
echo "checkFor:$checkFor"

docker pull rocm/mlir-migraphx-ci:latest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a stage{} in Jenkins etc.

Copy link

codecov bot commented Oct 7, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.04%. Comparing base (45d830d) to head (e8996f1).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #1671   +/-   ##
========================================
  Coverage    78.04%   78.04%           
========================================
  Files          100      100           
  Lines        27998    27998           
  Branches      4091     4091           
========================================
  Hits         21851    21851           
+ Misses        4489     4481    -8     
- Partials      1658     1666    +8     
Flag Coverage Δ
mfma 78.04% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@stefankoncarevic stefankoncarevic force-pushed the migraphx-model-refactor branch 2 times, most recently from 2661e23 to d31ca83 Compare October 24, 2024 09:41
}
if (int8 == "true") {
datatypes.add("--int8")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we test fp8 as well?

case "rocm-framework-38.amd.com":
return (params.MI200 == true)
case "aus-navi3x-13.amd.com":
case "ixt-hq-ubb4-31":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of a mi200 machine that has a gpu problem. Omkar gave me a mi250 machine to use. Currently busy with another CI so I'm waiting to finish the job, and run the CI to test the models.

if (fp16 == "true") {
datatypes.add("--fp16")
}
if (fp8 == "true") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added testing for fp8. So far it has been successfully completed on navi2x, navi3x and mi300 machines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants