Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add /verifyruns cog #36

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

add /verifyruns cog #36

wants to merge 3 commits into from

Conversation

b9r5
Copy link
Collaborator

@b9r5 b9r5 commented Nov 24, 2024

Description

Adds a /verifyrun command that does the same work as the smoke test but is easier to use.

image

Checklist

Before submitting this PR, ensure the following steps have been completed:

  • Run the smoke test on your own server.
    • Run the cluster bot on your server:
      python discord-bot.py
    • Start a training run by with the slash command /run.
      You may need to exercise some judgement about the script and GPU type.
    • Wait for the training run to complete.
    • Copy the URL for the thread started by the cluster bot in response to
      your /run message ("Cluster Bot started a thread: ..."):
      • Click on the 3 dots (...) to the cluster bot's message.
      • Select Copy Message Link.
    • Using the copied URL, run the smoke test:
      python discord-bot-smoke-test.py copied_url
    • Verify that the smoke test script responds with:
      All tests passed!
      
    For more information on running a cluster bot on your own server, see
    README.md.

Copy link
Member

@msaroufim msaroufim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some time we could save in having a single command verify all of modal, nvidia and amd one after the other and letting us know how many of those 3 succeeded. Also as far as the UX goes, why not have verify actually trigger the bot 3 times instead of having a human manually verify in a thread

This cog provides functionality to verify that either a GitHub Actions or
Modal run completed successfully by checking for specific message patterns
in a Discord thread. It supports verification of two types of runs:
1. GitHub Actions runs - Identified by "GitHub Action triggered!" message
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this seems to test a trigger but we've had issues where this wouldn't have helped like for example when we had timeout issues with both the NVIDIA and AMD runner

@b9r5 b9r5 marked this pull request as draft November 27, 2024 16:03
@b9r5
Copy link
Collaborator Author

b9r5 commented Nov 27, 2024

OK, thanks, Mark. I'll switch /verifyrun to start all 3 types of runs. And I'll make sure I'm not waiting for "GitHub Action triggered!" messages that may never come.

I switched this PR to draft for now because I'm in the middle of re-doing this.

@b9r5
Copy link
Collaborator Author

b9r5 commented Nov 29, 2024

Latest changes launch 3 run jobs.

Screenshot below. I find the bot's messages like "Created thread ⁠GitHub Job (AMD) - 2024-11-29 1… for your GitHub job" a little distracting, given that the threads also show up as separate lines in the UI. Should we remove those "created thread" messages?

@msaroufim, I don't know if this addresses your comment from above ("we've had issues where this wouldn't have helped"). If a GitHub workflow times out, then I think the GitHub cog will just return, and the thread won't have the happy path messages in it, so verification will fail. So I think it addresses your comment, but I'm not fully sure.

Screenshot from 2024-11-29 11-58-22

@b9r5 b9r5 marked this pull request as ready for review November 29, 2024 20:07
@b9r5 b9r5 changed the title add /verifyrun cog add /verifyruns cog Nov 29, 2024
@msaroufim
Copy link
Member

Screenshot below. I find the bot's messages like "Created thread ⁠GitHub Job (AMD) - 2024-11-29 1… for your GitHub job" a little distracting, given that the threads also show up as separate lines in the UI. Should we remove those "created thread" messages?

Yeah should remove those in another PR

@msaroufim, I don't know if this addresses your comment from above ("we've had issues where this wouldn't have helped"). If a GitHub workflow times out, then I think the GitHub cog will just return, and the thread won't have the happy path messages in it, so verification will fail. So I think it addresses your comment, but I'm not fully sure.

Yeah i recently implemented timeouts so they should solve that specific problem

@b9r5
Copy link
Collaborator Author

b9r5 commented Nov 29, 2024

Mark, do you think this needs a final message at the end that says "✅ All tests passed"?

@msaroufim
Copy link
Member

Yeah that sounds nice

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants