Ideas for what to do next #6

Open
20 of 28 tasks
msaroufim opened this issue Nov 5, 2024 · 0 comments

msaroufim commented Nov 5, 2024

Discord based leaderboard

UX

  • How does the leaderboard get rendered @AndreSlavescu
  • Slash commands that make it clear a script is expected, and maybe prompt for more information like the kernel name or GPU type @msaroufim
  • Do not create an extra message, just thread user reply @msaroufim
  • Don't render rich links @msaroufim
  • Give clearer feedback on how long the job took and how long it has waited so far @msaroufim
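
One possible way to render the leaderboard is as a monospace table inside a Discord code block, so columns line up without any rich embeds. This is only a sketch: the `Entry` fields and the `render_leaderboard` name are made up for illustration, and nothing here reflects how the bot actually stores or ranks runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    user: str          # hypothetical field: Discord display name
    gpu: str           # hypothetical field: GPU type the run used
    runtime_ms: float  # hypothetical field: best measured runtime


def render_leaderboard(kernel_name: str, entries: List[Entry], top: int = 10) -> str:
    """Return a message body with the fastest entries first."""
    ranked = sorted(entries, key=lambda e: e.runtime_ms)[:top]
    width = max([len(e.user) for e in ranked] + [len("user")])
    fence = "`" * 3  # Discord code-block delimiter, built here to keep this snippet readable
    lines = [f"Leaderboard: {kernel_name}", fence]
    lines.append(f"{'#':>2}  {'user':<{width}}  {'gpu':<6}  {'ms':>10}")
    for rank, e in enumerate(ranked, start=1):
        lines.append(f"{rank:>2}  {e.user:<{width}}  {e.gpu:<6}  {e.runtime_ms:>10.3f}")
    lines.append(fence)
    return "\n".join(lines)
```

The returned string could be sent as a reply in the user's thread, which would also fit the "no extra message" and "no rich links" items above.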

Leaderboard infra

  • Where do we store all the run data? Do we use a DB like Postgres, and where do we host it?
  • Do we run each submission X times and warm up the machines first so that the benchmarks are reliable?
  • Let people run code from the leaderboard
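
A minimal sketch of the "run X times and warm up" idea, assuming the submission can be wrapped in a zero-argument callable. The `benchmark` name, the default counts, and the returned keys are all placeholders, not anything the bot does today.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> Dict[str, float]:
    """Time fn() over several runs after discarding a few warm-up runs."""
    # Warm-up runs absorb one-time costs (imports, JIT compilation, cache fills).
    for _ in range(warmup):
        fn()

    samples_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    return {
        "runs": repeats,
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        # stdev needs at least two samples
        "stdev_ms": statistics.stdev(samples_ms) if repeats > 1 else 0.0,
    }
```

For GPU code the callable would also need to wait for the device to finish (for example `torch.cuda.synchronize()` in a torch script), otherwise the wall-clock time only covers the kernel launch. Reporting the median rather than the mean is one choice that makes a single noisy run matter less.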

GPU infra

  • How do we plug in a new GPU - doc only change
  • How do we assign GPU in a round robin way so that a single job won't occupy 8 GPUs @msaroufim - not needed
  • How do we detect malicious jobs and kill them? Baseline is probably how long the job takes to run. @msaroufim added a 5 min timeout
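
A rough sketch of what a runtime-based kill could look like, assuming jobs are plain Python scripts launched as subprocesses. The `run_job` helper and its return shape are hypothetical; only the 5 minute limit comes from the item above.

```python
import subprocess
import sys

JOB_TIMEOUT_S = 5 * 60  # the 5 min timeout mentioned above


def run_job(script_path: str, timeout_s: float = JOB_TIMEOUT_S) -> dict:
    """Run a user script and report whether it finished, failed, or timed out."""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None, "stdout": ""}

    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "returncode": proc.returncode, "stdout": proc.stdout}
```

Note that this only kills the direct child process. A script that spawns its own children would need something stronger, such as running each job in a container and stopping the container.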

Startup times

The faster the startup time, the more interactive the bot becomes and the more popular it will be

  • NVIDIA container setup @msaroufim @liveaverage
  • AMD containers setup @msaroufim @saienduri
  • PyTorch takes too long to install - won't fix, just use Modal instead @msaroufim
  • Killing long running jobs so they don't bork up the queue
  • Set up Modal scheduler @msaroufim - main gap right now is passing in configs to the remote machine

Testing infra @S1ro1

Right now this is omega jank

  • We have a staging env but it's slow to test - improve the local development experience @b9r5
  • Merging PRs feels very yolo still, I'm never sure something works until I test it so some CI sanity tests would be nice @b9r5
  • Maybe we don't "test" but we fix fast because we work in different timezones
  • Modularize code with Discord cogs so it's easier to maintain and isolate breakages @S1ro1

What do people upload

  • numpy script
  • torch script
  • triton script @alexzhang13
  • CUDA script - this is the trickiest, since on our end we need to provide most of the boilerplate, including things like launch params. For advanced users we can get launch params from the slash command @alexzhang13
  • Basic cuda support @msaroufim - still jank and it breaks with things like \n
  • cuda support on modal @S1ro1
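
To make the CUDA boilerplate idea concrete, here is one way the bot could wrap a user kernel in host code and fill in launch params. Everything in it is an assumption for illustration: the template, the `build_cuda_source` name, and especially the kernel signature `(float*, int)`, which real submissions would not be limited to. Placeholders are filled with plain `str.replace` and the template is a raw string, so escape sequences like `\n` in the user's source reach the `.cu` file untouched; that may or may not be the cause of the `\n` breakage mentioned above.

```python
from typing import Optional

# Hypothetical host-side boilerplate. Raw string so "\n" stays a two-character
# escape in the generated CUDA source instead of becoming a real newline.
HOST_TEMPLATE = r"""
#include <cstdio>
#include <cuda_runtime.h>

// ---- user kernel, inserted verbatim ----
@USER_KERNEL@

int main() {
    const int n = @N@;
    float *x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    @KERNEL_NAME@<<<@GRID@, @BLOCK@>>>(x, n);
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));
    cudaFree(x);
    return err == cudaSuccess ? 0 : 1;
}
"""


def build_cuda_source(user_kernel: str, kernel_name: str, n: int = 1 << 20,
                      block: int = 256, grid: Optional[int] = None) -> str:
    """Fill launch params into the template; default grid covers all n elements."""
    if grid is None:
        grid = (n + block - 1) // block  # ceiling division
    source = HOST_TEMPLATE
    for key, value in [("@N@", str(n)), ("@GRID@", str(grid)),
                       ("@BLOCK@", str(block)), ("@KERNEL_NAME@", kernel_name)]:
        source = source.replace(key, value)
    # The user kernel goes in last so its text is never touched by the replacements above.
    return source.replace("@USER_KERNEL@", user_kernel)
```

The resulting string would be written to a `.cu` file and compiled with `nvcc`; `block` and `grid` are the values an advanced user could override from the slash command.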

Profiling/Ranking

  • Run ncu on user scripts - blocked on getting permissions to run ncu @msaroufim - this is what the PR would look like if we did have permissions 4da129b
  • Run some simple profiling of scripts