ALEPH ALPHA
┏┓ ┓•
┗┓┏┏┓┃┓┏┓┏┓
┗┛┗┗┻┗┗┛┗┗┫
┛
Scaling is a distributed training library and installable dependency designed to scale up neural networks, with a dedicated module for training large language models.
Scaling consists of two primary components: a model-agnostic core module (scaling.core), which functions as the engine for distributed training workloads, and the scaling.transformer suite, which is specifically designed for LLM training.
The Scaling core module features various parallelization and partitioning techniques:
- Data parallelism: Distribute training data across multiple devices.
- Pipeline parallelism: Divide model layers into sequential stages across multiple devices.
- Tensor parallelism: Split individual tensors across multiple devices.
- 3D parallelism: Seamlessly combine data, pipeline, and tensor parallelism.
- ZeRO sharding: Support for optimizer state partitioning (ZeRO-1) in data parallel training regimens.
- Efficient training: Support for modern performance optimizations such as mixed precision training and activation checkpointing (see the sketch after this list).
- Code quality standards: Rigorous typing, Pydantic classes and extensive tests for ease of development and less potential for bugs.
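The efficient-training features build on standard PyTorch mechanisms. The following is a minimal conceptual sketch of mixed precision training combined with activation checkpointing in plain PyTorch; it is not Scaling's internal implementation, and the toy model, sizes, and hyperparameters are made up for illustration:

# Conceptual sketch: mixed precision + activation checkpointing in plain PyTorch.
# This is not Scaling's internal code; the model and sizes are illustrative only.
import torch
from torch.utils.checkpoint import checkpoint

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.amp.GradScaler("cuda")  # rescales the loss to avoid fp16 underflow

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    # Recompute the block's activations during backward instead of storing them.
    y = checkpoint(model, x, use_reentrant=False)
    loss = torch.nn.functional.mse_loss(y, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()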
Built upon the Scaling core components, the Transformer module implements a state-of-the-art transformer architecture and training loop. Featured architecture options include:
- Multi-query and grouped-query attention,
- Different MLP types (e.g., SwiGLU),
- Rotary positional embeddings,
- Parameter-efficient fine-tuning methods: BitFit, Adapters, LoRA (see the sketch after this list).
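To give a flavor of what parameter-efficient fine-tuning means in practice, here is a minimal LoRA-style linear layer in plain PyTorch. This is an illustrative sketch only (the class name, rank, and alpha are made up); it does not show how Scaling wires LoRA into its transformer:

# Minimal LoRA-style linear layer (illustrative sketch, not Scaling's API).
# The frozen base weight is augmented with a trainable low-rank update B @ A.
import torch


class LoRALinear(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = torch.nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        self.lora_a = torch.nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = torch.nn.Parameter(torch.zeros(out_features, rank))  # starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scaling * x A^T B^T; only A and B receive gradients.
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)


layer = LoRALinear(1024, 1024)
print(layer(torch.randn(2, 1024)).shape)  # torch.Size([2, 1024])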
The installation requires Linux with Python 3.10 and PyTorch 2.4.0. You will also need the appropriate CUDA dependencies and version installed on your system for GPU support. Clone this repository and install via poetry:
poetry install
See also the "Development" section below for additional, optional steps.
To install Flash Attention, make sure you have PyTorch installed already. Simply install the base dependencies with pip install . before installing Flash Attention.
Then install Flash Attention with:
poetry run pip install --no-build-isolation flash-attn==2.4.2
Ensure that your environment variables are set correctly. The CUDA_HOME variable should point to the location of your CUDA installation.
For additional information or troubleshooting, please refer to the official documentation.
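After installation, a quick import check can confirm that the package built correctly (a minimal sanity check; the printed version should match the pinned release above):

# Sanity check that flash-attn imports and reports the expected version.
import flash_attn

print(flash_attn.__version__)  # expected: 2.4.2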
You can then use Flash Attention in your Transformer architecture configuration:
{
  "transformer_architecture": {
    ...
    "masked_softmax": {
      "kernel": "flash_attention",
      "softmax_in_fp32": true,
      "deterministic_flash_attn_bwd": false,
      "scale": 1.0
    },
    ...
  }
}
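For reference, these options roughly map onto arguments of the kernel entry point exposed by the flash-attn package. The snippet below only illustrates that mapping, assuming flash_attn_func is called with fp16/bf16 tensors on a CUDA device; it is not code taken from Scaling, and the tensor shapes are made up:

# Illustration of the underlying flash-attn call (not Scaling's internal code).
# q, k, v have shape (batch, seqlen, num_heads, head_dim) in fp16/bf16 on GPU.
import torch
from flash_attn import flash_attn_func

q = torch.randn(1, 2048, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 2048, 16, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 2048, 16, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(
    q, k, v,
    causal=True,          # causal masking for autoregressive LM training
    softmax_scale=1.0,    # cf. "scale" in the config above
    deterministic=False,  # cf. "deterministic_flash_attn_bwd"
)
print(out.shape)  # torch.Size([1, 2048, 16, 64])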
Everything you need to start a full distributed transformer training run is contained in the example under examples/transformer_example.
You can start a training job by executing:
python3 -m examples.transformer_example.run examples/transformer_example/config.yml
Feel free to experiment with the example config, which controls all relevant training parameters, and modify it to suit your needs. In particular, update the topology configuration to reflect the number of GPU devices available. For instance, if you have a single GPU device available, set the topology parameters as follows:
{
  ...
  "topology": {
    ...
    "model_parallel_size": 1,
    "pipe_parallel_size": 1,
    "data_parallel_size": 1,
    ...
  }
}
Note: The number of available GPU devices needs to be equal to model_parallel_size * pipe_parallel_size * data_parallel_size. To control which devices are used, you can simply set the CUDA_VISIBLE_DEVICES environment variable to the desired GPU indices.
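A quick way to verify this before launching is a snippet like the following (illustrative only; substitute the parallel sizes from your own config):

# Check that the visible GPUs match the configured topology (illustrative).
import torch

model_parallel_size = 1
pipe_parallel_size = 1
data_parallel_size = 1

required = model_parallel_size * pipe_parallel_size * data_parallel_size
available = torch.cuda.device_count()  # honors CUDA_VISIBLE_DEVICES
assert available == required, f"topology needs {required} GPU(s), found {available}"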
If you want to run a large-scale job on a cluster, you can in principle use the same code, but you need to make sure the training script is executed in parallel; we also provide the tooling to do this. For an in-depth look, check out our more detailed guide on how to train a model on multiple nodes.

Scaling also features a basic inference module to generate outputs from model checkpoints.
If you are interested in learning more about how to build your own training library using Scaling, check out our MLP example. The MNIST MLP classifier is probably the most widely used example across deep learning tutorials, and Scaling is no exception. This is a self-contained codebase built on Scaling that implements a full 3D-parallel training loop for MNIST classification, kept as simple as possible while touching upon all important components in scaling.core. At the end of the day, our transformer training suite scaling.transformer is built in the very same fashion. The MLP example is the best way to start if you want to learn how to use the building blocks from scaling.core without getting lost in the details of a complex model architecture.
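For orientation, the model at the heart of that example is conceptually just a small classifier like the one below. This is a plain-PyTorch sketch without any parallelism (the layer sizes are made up); the point of the MLP example is precisely to show how scaling.core distributes a model of this kind across devices:

# A plain, non-parallel MNIST MLP for orientation only. The actual example
# builds an equivalent model with scaling.core building blocks and trains it
# with 3D parallelism.
import torch

mlp = torch.nn.Sequential(
    torch.nn.Flatten(),          # 28x28 grayscale image -> 784 features
    torch.nn.Linear(784, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),    # 10 digit classes
)

images = torch.randn(32, 1, 28, 28)  # a fake batch of MNIST-sized images
print(mlp(images).shape)             # torch.Size([32, 10])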
Please install pre-commit hooks:
pre-commit install
Run mypy to catch typing mistakes:
mypy src
mypy tests
Run tests with:
pytest tests/core
pytest tests/transformer