Commit

Merge branch 'main' into chuck/aws-docker
j316chuck authored Nov 10, 2023
2 parents 45c196e + 7c4d24a commit 6379634
Showing 34 changed files with 2,325 additions and 474 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pr-gpu.yaml
@@ -40,7 +40,7 @@ jobs:
if: github.repository_owner == 'mosaicml'
with:
container: ${{ matrix.container }}
mcloud-timeout: 1200
mcloud-timeout: 1800
name: ${{ matrix.name }}
pytest-command: ${{ matrix.pytest_command }}
pytest-markers: ${{ matrix.markers }}
49 changes: 38 additions & 11 deletions TUTORIAL.md
@@ -8,27 +8,42 @@ Forging LLMs can be quite complicated — you have to get your data prepared, se

This tutorial provides a brief intro to the repo’s structure and underlying tools (all courtesy of MosaicML, of course), walks through a few example workflows with pointers to the related resources within the repo, and closes with a number of FAQs we have encountered since release.

- [LLM Foundry Tutorial](#llm-foundry-tutorial)
- [Intro](#intro)
- [How this repo is structured](#how-this-repo-is-structured)
- [Key components](#key-components)
- [Composer](#composer)
- [StreamingDataset](#streamingdataset)
- [MCLI](#mcli)
- [How the YAMLs work](#how-the-yamls-work)
- [Example Workflows](#example-workflows)
- [Workflow 1: I want to play with a HF model like MPT-7B locally](#workflow-1-i-want-to-play-with-a-hf-model-like-mpt-7b-locally)
- [Workflow 2: I want to deploy an inference endpoint with a HF model like MPT-7B](#workflow-2-i-want-to-deploy-an-inference-endpoint-with-a-hf-model-like-mpt-7b)
- [Workflow 3: I want to finetune a HF model like MPT-7B](#workflow-3-i-want-to-finetune-a-hf-model-like-mpt-7b)
- [Supervised FineTuning and Instruction FineTuning](#supervised-finetuning-and-instruction-finetuning)
- [Domain Adaptation and Sequence Length Adaptation](#domain-adaptation-and-sequence-length-adaptation)
- [Data](#data)
- [Modeling](#modeling)
- [Workflow 4: I want to train a new HF model from scratch](#workflow-4-i-want-to-train-a-new-hf-model-from-scratch)
- [FAQs](#faqs)
- [Why is the script only using 1 out of N GPUs?](#why-is-the-script-only-using-1-out-of-n-gpus)
- [I’m running into an Out-Of-Memory (OOM) error. What do I do?](#im-running-into-an-out-of-memory-oom-error-what-do-i-do)
- [What hardware can I train on?](#what-hardware-can-i-train-on)
- [What hardware can I run eval on?](#what-hardware-can-i-run-eval-on)
- [What is FSDP?](#what-is-fsdp)
- [What are the different attention options `torch` / `flash` / `triton` for MPT and which one should I use?](#what-are-the-different-attention-options-torch--flash--triton-for-mpt-and-which-one-should-i-use)
- [Can I finetune using PEFT / LORA?](#can-i-finetune-using-peft--lora)
- [Can I quantize these models and/or run on CPU?](#can-i-quantize-these-models-andor-run-on-cpu)
- [How do I deploy with ONNX/FasterTransformer?](#how-do-i-deploy-with-onnxfastertransformer)
- [How expensive is it to build LLMs?](#how-expensive-is-it-to-build-llms)
- [Common installation issues](#common-installation-issues)
- [Why is the script only using 1 out of N GPUs?](#why-is-the-script-only-using-1-out-of-n-gpus)
- [I’m running into an Out-Of-Memory (OOM) error. What do I do?](#im-running-into-an-out-of-memory-oom-error-what-do-i-do)
- [What hardware can I train on?](#what-hardware-can-i-train-on)
- [What hardware can I run eval on?](#what-hardware-can-i-run-eval-on)
- [What hardware can I run inference on?](#what-hardware-can-i-run-inference-on)
- [What is FSDP?](#what-is-fsdp)
- [What are the different attention options `torch` / `flash` / `triton` for MPT and which one should I use?](#what-are-the-different-attention-options-torch--flash--triton--for-mpt-and-which-one-should-i-use)
- [Limitations](#limitations)
- [What is `triton-pre-mlir`?](#what-is-triton-pre-mlir)
- [Known issue with sm86+ GPUs](#known-issue-with-sm86-gpus)
- [Support for FlashAttention-2](#support-for-flashattention-2)
- [What kinds of positional embeddings does LLM Foundry support?](#what-kinds-of-positional-embeddings-does-llm-foundry-support)
- [Can I finetune using PEFT / LoRA?](#can-i-finetune-using-peft--lora)
- [Can I quantize these models and/or run on CPU?](#can-i-quantize-these-models-andor-run-on-cpu)
- [How do I deploy with ONNX/FasterTransformer?](#how-do-i-deploy-with-onnxfastertransformer)
- [TransformerEngine and amp\_fp8 support](#transformerengine-and-amp_fp8-support)
- [How expensive is it to build LLMs?](#how-expensive-is-it-to-build-llms)
- [Common installation issues](#common-installation-issues)

Let’s get started!

@@ -328,6 +343,18 @@ The majority of our training setups use `triton`. -->
Updating to LLVM14 (or LLVM15) is not possible because of breaking changes.
What does this mean in practice? Although sm89+ is not **formally** supported until LLVM15, our testing on H100 GPUs shows that `attn_impl=triton` still works well and runs fast. The only issue is that when the network starts running, LLVM may emit a warning like `'sm_90' is not a recognized processor for this target (ignoring processor)`. This warning does not appear to affect performance.

#### Support for FlashAttention-2
- [FlashAttention-2](https://arxiv.org/pdf/2307.08691.pdf) improves upon FlashAttention for even faster attention computation. LLM Foundry supports FlashAttention-2; please follow the instructions [here](https://github.com/mosaicml/llm-foundry/tree/main/scripts/train#flashattention).

### What kinds of positional embeddings does LLM Foundry support?
Currently we support [Learned Positional Embeddings](https://arxiv.org/pdf/1706.03762.pdf), [Attention with Linear Biases (ALiBi)](https://arxiv.org/pdf/2108.12409.pdf), and [Rotary Positional Embeddings (RoPE)](https://arxiv.org/pdf/2104.09864.pdf). All of these can also be switched off to train with [No Positional Embedding](https://arxiv.org/pdf/2203.16634.pdf).

| Name | YAML Config | Training MFU on MPT-7B trained on 8 A100 80GB GPUs | Notes |
|:-----------------------------------|:------------------------------------------------------------------|:---------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Learned Positional Embeddings | <pre>model:<br> learned_pos_emb:&nbsp;True</pre>| 65.7 | |
| ALiBi | <pre>model:<br> attn_config:<br> alibi:&nbsp;True</pre>| 64.5 | Requires Triton or Torch attention. |
| RoPE (Dao-AILab Implementation) | <pre>model:<br> attn_config:<br> rope:&nbsp;True<br> rope_impl:&nbsp;dail</pre>| 64.5 | Requires a CUDA GPU and the [flash-attn library](https://github.com/Dao-AILab/flash-attention) v2.0.1 or higher to be installed. Please see the instructions in the [paragraph above](#support-for-flashattention-2) on how to install flash-attn v2. Note that the attention implementation can still be `torch`, `triton`, or `flash`. |
| RoPE (Hugging<code>&nbsp;</code>Face Implementation) | <pre>model:<br> attn_config:<br> rope:&nbsp;True<br> rope_impl:&nbsp;hf</pre>| 62.3 | |
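To make the table concrete, here is a small editorial sketch (not taken from the repo) of how one of these options might be expressed as an omegaconf config. Only the `attn_config` keys mirror the table above; the surrounding model fields and values are illustrative assumptions.

```python
# Editorial sketch: expressing the RoPE (dail) option from the table as an omegaconf
# config. Only the attn_config keys come from the table; other fields are assumptions.
from omegaconf import OmegaConf

model_cfg = OmegaConf.create({
    'name': 'mpt_causal_lm',      # hypothetical model name, for illustration only
    'attn_config': {
        'rope': True,
        'rope_impl': 'dail',      # requires flash-attn >= 2.0.1 (see the paragraph above)
    },
})
print(OmegaConf.to_yaml(model_cfg))  # renders the YAML fragment you would place in a config
```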

### Can I finetune using PEFT / LoRA?
- The LLM Foundry codebase does not directly include examples of PEFT or LoRA workflows. However, our MPT model is a subclass of Hugging Face's `PreTrainedModel`, and https://github.com/mosaicml/llm-foundry/pull/346 added the features required to enable Hugging Face's [PEFT](https://huggingface.co/docs/peft/index) / [LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora) workflows for MPT. MPT models with LoRA modules can be trained either with LLM Foundry or with Hugging Face's [accelerate](https://huggingface.co/docs/accelerate/index). Within LLM Foundry, run the train script (`scripts/train/train.py`) and add `lora` arguments to the config `.yaml`, like so:
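The repo's actual `lora` example is collapsed in this diff view. As a purely hypothetical sketch of the shape such a section might take (field names mirror Hugging Face PEFT's `LoraConfig` and are not taken from the repo):

```python
# Hypothetical sketch only -- the repo's real `lora` example is collapsed above.
# Field names follow Hugging Face PEFT's LoraConfig; values and nesting are assumptions.
from omegaconf import OmegaConf

lora_cfg = OmegaConf.create({
    'lora': {
        'args': {
            'r': 16,
            'lora_alpha': 32,
            'lora_dropout': 0.05,
            'target_modules': ['Wqkv'],  # MPT fuses Q/K/V into a single Wqkv projection
        },
    },
})
print(OmegaConf.to_yaml(lora_cfg))
```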
33 changes: 14 additions & 19 deletions llmfoundry/callbacks/hf_checkpointer.py
@@ -14,9 +14,10 @@
from composer.core import Callback, Event, State, Time, TimeUnit
from composer.core.state import fsdp_state_dict_type_context
from composer.loggers import Logger, MLFlowLogger
from composer.loggers.remote_uploader_downloader import RemoteUploaderDownloader
from composer.models import HuggingFaceModel
from composer.utils import dist, format_name_with_dist_and_time, parse_uri
from composer.utils import (dist, format_name_with_dist_and_time,
maybe_create_remote_uploader_downloader_from_uri,
parse_uri)
from composer.utils.misc import create_interval_scheduler
from transformers import PreTrainedModel, PreTrainedTokenizerBase

@@ -53,12 +54,11 @@ def __init__(
save_interval: Union[str, int, Time],
huggingface_folder_name: str = 'ba{batch}',
precision: str = 'float32',
overwrite: bool = False,
overwrite: bool = True,
mlflow_registered_model_name: Optional[str] = None,
mlflow_logging_config: Optional[dict] = None,
):
self.backend, self.bucket_name, self.save_dir_format_str = parse_uri(
save_folder)
_, _, self.save_dir_format_str = parse_uri(save_folder)
self.overwrite = overwrite
self.precision = precision
self.dtype = {
@@ -93,13 +93,11 @@ def __init__(
self.save_interval = save_interval
self.check_interval = create_interval_scheduler(
save_interval, include_end_of_training=True)
self.upload_to_object_store = (self.backend != '')
if self.upload_to_object_store:
self.remote_ud = RemoteUploaderDownloader(
bucket_uri=f'{self.backend}://{self.bucket_name}',
num_concurrent_uploads=4)
else:
self.remote_ud = None

self.remote_ud = maybe_create_remote_uploader_downloader_from_uri(
save_folder, loggers=[])
if self.remote_ud is not None:
self.remote_ud._num_concurrent_uploads = 4

self.last_checkpoint_batch: Optional[Time] = None
self.mlflow_loggers = []
@@ -115,7 +113,7 @@ def run_event(self, event: Event, state: State, logger: Logger) -> None:
raise ValueError(
f'`HuggingFaceCheckpointer` is only compatible with `HuggingFaceModel`s. '
+ f'Got {type(state.model)} instead.')
if self.upload_to_object_store and self.remote_ud is not None:
if self.remote_ud is not None:
self.remote_ud.init(state, logger)
state.callbacks.append(self.remote_ud)

@@ -169,7 +167,7 @@ def _save_checkpoint(self, state: State, logger: Logger):
self.huggingface_folder_name_fstr), state.run_name,
state.timestamp)
dir_context_mgr = tempfile.TemporaryDirectory(
) if self.upload_to_object_store else contextlib.nullcontext(
) if self.remote_ud is not None else contextlib.nullcontext(
enter_result=save_dir)

with dir_context_mgr as temp_save_dir:
@@ -233,11 +231,8 @@ def _save_checkpoint(self, state: State, logger: Logger):
log.debug('Editing MPT files for HuggingFace compatibility')
edit_files_for_hf_compatibility(temp_save_dir)

if self.upload_to_object_store:
assert self.remote_ud is not None
log.info(
f'Uploading HuggingFace formatted checkpoint to {self.backend}://{self.bucket_name}/{save_dir}'
)
if self.remote_ud is not None:
log.info(f'Uploading HuggingFace formatted checkpoint')
for filename in os.listdir(temp_save_dir):
self.remote_ud.upload_file(
state=state,
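As an editorial aside, a minimal sketch of constructing this callback with a remote `save_folder`, so that `maybe_create_remote_uploader_downloader_from_uri` wires up uploads. The bucket URI is a placeholder, and `save_folder` is assumed to be the first argument (it is the value parsed by `parse_uri` above).

```python
# Editorial sketch (not from the diff): a remote save_folder leads to a
# RemoteUploaderDownloader being created via maybe_create_remote_uploader_downloader_from_uri.
from llmfoundry.callbacks.hf_checkpointer import HuggingFaceCheckpointer

hf_ckpt = HuggingFaceCheckpointer(
    save_folder='s3://my-bucket/my-run/hf-checkpoints',  # hypothetical object-store URI
    save_interval='1000ba',  # Composer Time string: save every 1000 batches
)
```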
2 changes: 2 additions & 0 deletions llmfoundry/data/__init__.py
@@ -2,6 +2,7 @@
# SPDX-License-Identifier: Apache-2.0

from llmfoundry.data.data import ConcatTokensDataset, NoConcatDataset
from llmfoundry.data.dataloader import build_dataloader
from llmfoundry.data.denoising import (MixtureOfDenoisersCollator,
build_text_denoising_dataloader)
from llmfoundry.data.finetuning import (Seq2SeqFinetuningCollator,
@@ -18,4 +19,5 @@
'build_text_dataloader',
'NoConcatDataset',
'ConcatTokensDataset',
'build_dataloader',
]
44 changes: 44 additions & 0 deletions llmfoundry/data/dataloader.py
@@ -0,0 +1,44 @@
# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0

"""Dataloader builder utilities."""

from composer import DataSpec
from omegaconf import DictConfig
from transformers import PreTrainedTokenizerBase

from llmfoundry.data.denoising import build_text_denoising_dataloader
from llmfoundry.data.finetuning.dataloader import build_finetuning_dataloader
from llmfoundry.data.text_data import build_text_dataloader


def build_dataloader(cfg: DictConfig, tokenizer: PreTrainedTokenizerBase,
device_batch_size: int) -> DataSpec:
"""Builds a dataloader from a config.
Args:
cfg (DictConfig): An omegaconf dictionary used to configure the loader.
tokenizer (PreTrainedTokenizerBase): The tokenizer that the model will use.
device_batch_size (int): The size of the batches (number of examples)
that the dataloader will produce.
"""
if cfg.name == 'text':
return build_text_dataloader(
cfg,
tokenizer,
device_batch_size,
)
elif cfg.name == 'text_denoising':
return build_text_denoising_dataloader(
cfg,
tokenizer,
device_batch_size,
)
elif cfg.name == 'finetuning':
return build_finetuning_dataloader(
cfg,
tokenizer,
device_batch_size,
)
else:
raise ValueError(f'Not sure how to build dataloader with config: {cfg}')
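As an editorial aside, a minimal usage sketch for the new `build_dataloader` helper, assuming a Hugging Face tokenizer and a local streaming text dataset; the dataset fields are illustrative placeholders rather than a tested configuration.

```python
# Usage sketch for build_dataloader (illustrative; dataset fields are placeholders).
from omegaconf import OmegaConf
from transformers import AutoTokenizer

from llmfoundry.data import build_dataloader

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # any HF tokenizer works here
cfg = OmegaConf.create({
    'name': 'text',                              # routes to build_text_dataloader
    'dataset': {
        'local': '/tmp/my-streaming-text-data',  # hypothetical local MDS path
        'split': 'train',
        'max_seq_len': 2048,
        'shuffle': True,
    },
    'drop_last': True,
    'num_workers': 8,
})
data_spec = build_dataloader(cfg, tokenizer, device_batch_size=8)
batch = next(iter(data_spec.dataloader))  # the DataSpec wraps the underlying DataLoader
```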
16 changes: 11 additions & 5 deletions llmfoundry/data/denoising.py
@@ -16,7 +16,7 @@
from torch.utils.data import DataLoader
from transformers import PreTrainedTokenizerBase

from llmfoundry.data.packing import BinPackWrapper
from llmfoundry.data.packing import BinPackCollator
from llmfoundry.data.text_data import (StreamingTextDataset,
get_tokens_per_batch_func)
from llmfoundry.models import utils
@@ -375,19 +375,25 @@ def build_text_denoising_dataloader(
cfg.dataset.max_seq_len (int): The maximum length of sequences
in the batch. See :class:`MixtureOfDenoisersCollator` docstring
for details.
cfg.dataset.packing_ratio (float, optional): If provided, this invokes
cfg.dataset.packing_ratio (Optional[float, Literal['auto']]): If provided, this invokes
a collator wrapper that packs device_batch_size*packing_ratio
raw examples into device_batch_size packed examples. This helps
minimize padding while preserving sequence integrity.
This adds `sequence_id` to the batch, which indicates which unique
sequence each token belongs to.
If set to 'auto', packing_ratio is profiled and the highest observed packing ratio with
zero waste is selected.
In practice, this may result in > 0 waste because profiling is done on only a portion
of the dataset.
Note: Using this feature will not change device_batch_size but it
will determine the number of raw examples consumed by the dataloader
per batch. Some examples may be discarded if they do not fit when
packing.
Select packing_ratio **carefully** based on the dataset
statistics, max_seq_len, and tolerance for discarding samples!
The packing code in `./packing.py` provides a script that can help
The script `scripts/misc/profile_packing.py` can help
you choose the best packing_ratio.
See :class:`StreamingTextDataset` for info on other standard config
options within `cfg.dataset`.
Expand Down Expand Up @@ -419,7 +425,7 @@ def build_text_denoising_dataloader(
that the dataloader will produce.
Note:
You can run the script inside `./packing.py` to quickly test the
You can use the script `scripts/misc/profile_packing.py` to quickly test the
padding/waste rates for different `cfg.dataset.packing_ratio` choices,
given a starting workload YAML.
"""
@@ -492,7 +498,7 @@ def build_text_denoising_dataloader(
raise NotImplementedError(
'On-the-fly packing is currently only supported for decoder-only formats.'
)
collate_fn = BinPackWrapper(
collate_fn = BinPackCollator(
collator=collate_fn,
target_batch_size=device_batch_size,
max_seq_len=cfg.dataset.max_seq_len,
