CI Pipeline which builds & tests the container #4

Open
5 tasks
Tracked by #2
philschmid opened this issue Jan 31, 2024 · 1 comment
Labels: GPU, pytorch, training

Comments


philschmid commented Jan 31, 2024

To make sure our Hugging Face DLCs are well tested, we need to create integration tests that run different kinds of training using the container. These tests should run automatically or on demand. We can use GitHub Actions as CI to run the tests, and Python + Docker to implement them.

Until #3 is implemented, we can use existing containers, e.g. from transformers, to run the tests. For the test scripts, I think we can use the existing "examples/" from transformers, peft, or trl. We could structure the tests/ folder into:

  • local/ (run on a local GPU machine)
  • vertex/ (run on Vertex AI)
  • gke/ (run on GKE)

Example of a test:

  0. Build a container
  1. Start a container on a GPU
  2. Run a training job in the container (a few steps)
  3. Validate the results
  4. Stop the container
    -> Repeat steps 1-4 with other tests.
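
The steps above could be sketched in Python roughly as follows. This is only a minimal sketch under several assumptions, not an agreed design: it drives the `docker` CLI through `subprocess`, it assumes the image has no entrypoint that would interfere with `sleep infinity`, and the function names, the container name, and the "expected marker" validation are all hypothetical placeholders.

```python
import subprocess
from typing import Callable, Sequence

# A runner executes one command and returns its combined output.
# Injecting it makes the flow testable without Docker or a GPU.
Runner = Callable[[Sequence[str]], str]


def shell_runner(cmd: Sequence[str]) -> str:
    """Run a command, raise on a non-zero exit code, return stdout+stderr."""
    result = subprocess.run(
        list(cmd),
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.stdout


def run_training_test(
    image: str,
    train_cmd: Sequence[str],
    expected_marker: str,  # hypothetical validation: a string expected in the logs
    runner: Runner = shell_runner,
    container_name: str = "dlc-integration-test",  # hypothetical name
    build_context: str = ".",
) -> bool:
    # 0. Build the container image.
    runner(["docker", "build", "-t", image, build_context])
    # 1. Start a detached container with access to all GPUs.
    runner(["docker", "run", "-d", "--rm", "--gpus", "all",
            "--name", container_name, image, "sleep", "infinity"])
    try:
        # 2. Run a short training job inside the running container.
        logs = runner(["docker", "exec", container_name, *train_cmd])
        # 3. Validate: here we only check that an expected string was logged.
        return expected_marker in logs
    finally:
        # 4. Stop the container, even if training or validation failed.
        runner(["docker", "stop", container_name])
```

For simplicity the sketch rebuilds the image on every call; to repeat steps 1-4 across several tests, the build step would more likely be moved out and run once. A real validation step would probably inspect metrics or saved checkpoints rather than a log string.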

In addition to "local" tests running on GPU instances, we should also run validation tests for GKE and Vertex AI.

  • We need to implement robust CI that runs several tests, including training smaller models like BERT and bigger models like Llama.
    • Test and validate PEFT
    • Distributed training
    • Flash attention support
  • Tests running directly on Vertex AI or GKE using the Vertex AI SDK
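
One possible way to express the test coverage listed above is a small matrix of cases that each expand into a training command. This is a hedged sketch only: `TrainingCase`, `build_command`, the `train.py` script name, the two model IDs, and the `--use_peft` / `--attn_implementation` flags are illustrative assumptions, since the issue does not specify which scripts or arguments the tests will use. `torchrun --nproc_per_node` is the standard PyTorch launcher for multi-GPU runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TrainingCase:
    """One entry in the (hypothetical) integration test matrix."""
    name: str
    model_id: str
    use_peft: bool = False
    num_gpus: int = 1
    flash_attention: bool = False


def build_command(case: TrainingCase, script: str = "train.py",
                  max_steps: int = 10) -> List[str]:
    # Use torchrun for multi-GPU (distributed) cases, plain python otherwise.
    if case.num_gpus > 1:
        cmd = ["torchrun", f"--nproc_per_node={case.num_gpus}", script]
    else:
        cmd = ["python", script]
    # Keep the run short: only a few training steps.
    cmd += ["--model_name_or_path", case.model_id, "--max_steps", str(max_steps)]
    if case.use_peft:
        cmd.append("--use_peft")  # assumed flag of the training script
    if case.flash_attention:
        cmd += ["--attn_implementation", "flash_attention_2"]  # assumed flag
    return cmd


# Illustrative cases: a small model, and a larger one with PEFT,
# distributed training, and flash attention enabled.
CASES = [
    TrainingCase("bert-full", "bert-base-uncased"),
    TrainingCase("llama-peft-distributed-flash", "meta-llama/Llama-2-7b-hf",
                 use_peft=True, num_gpus=2, flash_attention=True),
]
```

Each case could then be passed to whatever runs the container test, for example via `pytest.mark.parametrize` over `CASES`.
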
philschmid added the pytorch label Jan 31, 2024

philschmid commented

For access to GCP you can ask @glegendre01.

philschmid changed the title from "[Pytorch][GPU] CI Pipeline which builds & tests the container" to "CI Pipeline which builds & tests the container" Jan 31, 2024
philschmid added the GPU and training labels Jan 31, 2024
ydshieh self-assigned this Jan 31, 2024