To make sure our Hugging Face DLCs are well tested, we need to create "integration" tests that run different kinds of training using the container. Those tests should run automatically or on demand. We can use GitHub Actions as CI for running the tests, and Python + Docker to implement the integration tests.
Until #3 is implemented, we can use existing containers from, e.g., transformers to run the tests. For the test scripts, I think we can use the existing "examples/" from transformers, peft, or trl. We could maybe structure the tests/ folder into:
local/ (run on a local GPU machine),
vertex/ (run on Vertex AI),
gke/ (run on GKE)
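As a rough illustration of that layout, here is a minimal sketch of a helper that groups test files by target folder. The `test_*.py` naming pattern and the `discover_tests` helper are assumptions for illustration only; nothing in this issue fixes them yet.

```python
from pathlib import Path
from typing import Dict, List

# Target folders proposed in this issue.
TARGETS = ("local", "vertex", "gke")


def discover_tests(tests_root: Path) -> Dict[str, List[Path]]:
    """Return the test files found under each target folder of tests_root.

    Assumes (hypothetically) that test files are named test_*.py.
    A missing target folder yields an empty list instead of an error.
    """
    found: Dict[str, List[Path]] = {}
    for target in TARGETS:
        target_dir = tests_root / target
        if target_dir.is_dir():
            found[target] = sorted(target_dir.glob("test_*.py"))
        else:
            found[target] = []
    return found
```

A CI job could then pick one target per workflow, e.g. only `discover_tests(Path("tests"))["local"]` on a GPU runner.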
Example for a test:
0. build a container
1. start a container on a GPU
2. run a training using the container (few steps)
3. validate the results
4. stop the container
-> repeat 1-4 with other tests.
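The steps above could be sketched in Python by driving the `docker` CLI through `subprocess`. This is only a sketch under stated assumptions: `run_training_test`, the container name `dlc-it`, and keeping the container alive with `sleep infinity` (which presumes the image has no conflicting entrypoint) are illustrative choices, not an agreed design. The command runner is injectable so the flow can be checked without Docker or a GPU.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_cli(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    result = subprocess.run(list(cmd), check=True, capture_output=True, text=True)
    return result.stdout


def run_training_test(
    image: str,
    build_context: str,
    train_cmd: Sequence[str],
    validate: Callable[[str], bool],
    run: Runner = run_cli,
    name: str = "dlc-it",  # hypothetical container name
) -> bool:
    # 0. build a container image
    run(["docker", "build", "-t", image, build_context])
    # 1. start a container on a GPU (assumes the NVIDIA container toolkit is installed)
    run(["docker", "run", "-d", "--gpus", "all", "--name", name, image, "sleep", "infinity"])
    try:
        # 2. run a short training inside the container
        logs = run(["docker", "exec", name, *train_cmd])
        # 3. validate the results (here: whatever check the caller passes in)
        return validate(logs)
    finally:
        # 4. stop and remove the container, even if training or validation failed
        run(["docker", "rm", "-f", name])
```

Repeating steps 1-4 for other tests would then be a loop over different `train_cmd` values against the same image, with the build step factored out.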
In addition to "local" tests running on GPU instances, we should also run validation tests for GKE and Vertex AI.
We need to implement robust CI that runs several tests, including training smaller models like BERT and bigger models like Llama.
We should test and validate:
PEFT
Distributed training
Flash Attention support
Tests running directly on Vertex AI or GKE using the Vertex AI SDK