Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for chat formatted finetuning input data. #884

Merged
merged 28 commits into from
Jan 26, 2024

Conversation

milocress
Copy link
Contributor

@milocress milocress commented Jan 18, 2024

Manual tests:

I ran this yaml using composer scripts/train/train.py test.yaml on chat-formatted and instruction-formatted data.

Chat Formatted run

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /mnt/workdisk/iamroot/llmfoundry-venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /mnt/workdisk/iamroot/llmfoundry-venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...
/mnt/workdisk/iamroot/work/../llm-foundry/scripts/train/train.py:429: UserWarning: FSDP is not applicable for single-GPU training. Reverting to DDP.
  warnings.warn(
/mnt/workdisk/iamroot/llm-foundry/llmfoundry/utils/config_utils.py:102: UserWarning: Using `cfg.model.init_device='meta'` is only valid when using FSDP! Reverting to `cfg.model.init_device='cpu'`.
  warnings.warn(
2024-01-19 15:51:38,830: rank0[11406][MainThread]: INFO: llmfoundry.data.finetuning.tasks: No preprocessor was supplied and no preprocessing function is registered for dataset name "my_chat_data". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7973.96it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 783.84it/s]
Generating train split: 2 examples [00:00, 1027.51 examples/s]
num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.
2024-01-19 15:51:39,392: rank0[11406][MainThread]: WARNING: datasets.arrow_dataset: num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.
Tokenizing dataset (num_proc=2):   0%|                                                                                                                                  | 0/2 [00:00<?, ? examples/s]
No chat template is defined for this tokenizer - using the default template for the GPTNeoXTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.


No chat template is defined for this tokenizer - using the default template for the GPTNeoXTokenizerFast class. If the default is not appropriate for your model, please set `tokenizer.chat_template` to an appropriate template. See https://huggingface.co/docs/transformers/main/chat_templating for more information.

Tokenizing dataset (num_proc=2): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.23 examples/s]
num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.
2024-01-19 15:51:39,835: rank0[11406][MainThread]: WARNING: datasets.arrow_dataset: num_proc must be <= 2. Reducing num_proc to 2 for dataset of size 2.
Filtering out long prompts (num_proc=2): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 21.11 examples/s]
2024-01-19 15:51:39,967: rank0[11406][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Local rank 0 finished data prep
2024-01-19 15:51:40,872: rank0[11406][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: All ranks finished data prep
2024-01-19 15:51:40,886: rank0[11406][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: Instantiating an MPTForCausalLM model from /mnt/workdisk/iamroot/llm-foundry/llmfoundry/models/mpt/modeling_mpt.py
2024-01-19 15:51:41,498: rank0[11406][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: We recommend using config.init_device="meta" with Composer + FSDP for faster initialization.
2024-01-19 15:51:43,879: rank0[11406][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: MPTModel(
  (wte): SharedEmbedding(50368, 768)
  (wpe): Embedding(2048, 768)
  (emb_drop): Dropout(p=0.0, inplace=False)
  (blocks): ModuleList(
    (0-11): 12 x MPTBlock(
      (norm_1): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): MultiheadAttention(
        (Wqkv): Linear(in_features=768, out_features=2304, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
      )
      (norm_2): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (ffn): MPTMLP(
        (up_proj): Linear(in_features=768, out_features=3072, bias=True)
        (down_proj): Linear(in_features=3072, out_features=768, bias=True)
      )
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_ffn_dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (norm_f): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2024-01-19 15:51:43,879: rank0[11406][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: Using kaiming_normal_ initialization.
2024-01-19 15:51:44,264: rank0[11406][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2024-01-19 15:51:44,402: rank0[11406][MainThread]: INFO: composer.trainer.trainer: Run name: interactive-il0GeA
2024-01-19 15:51:45,555: rank0[11406][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2024-01-19 15:51:45,556: rank0[11406][MainThread]: DEBUG: composer.loggers.mosaicml_logger: Logging model initialized time to metadata
2024-01-19 15:51:45,566: rank0[11406][MainThread]: INFO: composer.loggers.mosaicml_logger: Logged 1 metadata to MosaicML, waiting on 1
2024-01-19 15:51:45,566: rank0[11406][MainThread]: INFO: composer.trainer.trainer: Setting seed to 17
2024-01-19 15:51:45,566: rank0[11406][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
max_seq_len: 2048
global_seed: 17
model:
  name: mpt_causal_lm
  init_device: meta
  d_model: 768
  n_heads: 12
  n_layers: 12
  expansion_ratio: 4
  max_seq_len: 2048
  vocab_size: 50368
  attn_config:
    attn_impl: triton
tokenizer:
  name: mosaicml/mpt-7b-chat
  kwargs:
    model_max_length: 2048
train_loader:
  name: finetuning
  dataset:
    split: train
    hf_name: my_chat_data
    shuffle: true
    max_seq_len: 2048
    shuffle_algo: py1e
    shuffle_seed: 17
    allow_pad_trimming: false
    decoder_only_format: true
  timeout: 0
  drop_last: false
  pin_memory: true
  num_workers: 8
  prefetch_factor: 2
  persistent_workers: true
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1
optimizer:
  name: decoupled_adamw
  lr: 0.0006
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
max_duration: 2ba
eval_interval: 0
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 256
seed: 17
device_eval_batch_size: 16
device_train_microbatch_size: 16
precision: amp_bf16
fsdp_config: null
log_to_console: true
console_log_interval: 1ba
n_gpus: 1
device_train_batch_size: 256
device_train_grad_accum: 16
merge: true
n_params: 125311488

2024-01-19 15:51:45,705: rank0[11406][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16
******************************
Config:
composer_commit_hash: None
composer_version: 0.17.2
node_name: a100-40sxm-h13-02
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 17

******************************
2024-01-19 15:51:45,706: rank0[11406][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
/mnt/workdisk/iamroot/llmfoundry-venv/lib/python3.10/site-packages/composer/core/data_spec.py:35: UserWarning: Cannot split tensor of length 2 into batches of size 16. As it is smaller, no splitting will be done. This may happen on the last batch of a dataset if it is a smaller size than the microbatch size.
  warnings.warn(f'Cannot split tensor of length {len(t)} into batches of size {microbatch_size}. '
/mnt/workdisk/iamroot/llm-foundry/llmfoundry/models/layers/attention.py:455: UserWarning: Propagating key_padding_mask to the attention module and applying it within the attention module can cause unnecessary computation/memory usage. Consider integrating into attn_bias once and passing that to each attention module instead.
  warnings.warn(
2024-01-19 15:51:47,665: rank0[11406][MainThread]: DEBUG: composer.loggers.mosaicml_logger: 
Logging training progress data to metadata:
        training_progress: [batch=1/2]
[batch=1/2]:
         Train time/epoch: 0
         Train time/batch: 0
         Train time/sample: 0
         Train time/batch_in_epoch: 0
         Train time/sample_in_epoch: 0
         Train time/token: 0
         Train time/token_in_epoch: 0
         Train trainer/device_train_microbatch_size: 16
         Train loss/train/total: 11.7901
         Train metrics/train/LanguageCrossEntropy: 11.7875
         Train metrics/train/LanguagePerplexity: 131597.1250
2024-01-19 15:51:47,738: rank0[11406][MainThread]: DEBUG: composer.loggers.mosaicml_logger: 
Logging training progress data to metadata:
        training_progress: [batch=2/2]
[batch=2/2]:
         Train metrics/train/LanguageCrossEntropy: 11.7875
         Train metrics/train/LanguagePerplexity: 131597.1250
         Train time/epoch: 1
         Train time/batch: 1
         Train time/sample: 2
         Train time/batch_in_epoch: 0
         Train time/sample_in_epoch: 0
         Train time/token: 24
         Train time/token_in_epoch: 0
         Train trainer/device_train_microbatch_size: 16
         Train loss/train/total: 11.7901
2024-01-19 15:51:47,739: rank0[11406][MainThread]: DEBUG: composer.loggers.mosaicml_logger: 
Logging FINAL training progress data to metadata:
        training_progress: [batch=2/2]
2024-01-19 15:51:47,750: rank0[11406][MainThread]: INFO: composer.loggers.mosaicml_logger: Logged 1 metadata to MosaicML, waiting on 1
2024-01-19 15:51:47,750: rank0[11406][MainThread]: DEBUG: composer.core.engine: Closing the engine.
2024-01-19 15:51:47,750: rank0[11406][MainThread]: DEBUG: composer.core.engine: Closing callback MosaicMLLogger
2024-01-19 15:51:47,760: rank0[11406][MainThread]: INFO: composer.loggers.mosaicml_logger: Logged 0 metadata to MosaicML, waiting on 2
2024-01-19 15:51:47,761: rank0[11406][MainThread]: DEBUG: composer.core.engine: Closing callback ConsoleLogger
2024-01-19 15:51:47,761: rank0[11406][MainThread]: DEBUG: composer.core.engine: Post-closing callback MosaicMLLogger
2024-01-19 15:51:47,761: rank0[11406][MainThread]: DEBUG: composer.core.engine: Post-closing callback ConsoleLogger
2024-01-19 15:51:47,795: rank0[11406][MainThread]: DEBUG: composer.core.engine: Engine closed.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.

Instruction formatted run


===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /mnt/workdisk/iamroot/llmfoundry-venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /mnt/workdisk/iamroot/llmfoundry-venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so...
/mnt/workdisk/iamroot/work/../llm-foundry/scripts/train/train.py:429: UserWarning: FSDP is not applicable for single-GPU training. Reverting to DDP.
  warnings.warn(
/mnt/workdisk/iamroot/llm-foundry/llmfoundry/utils/config_utils.py:102: UserWarning: Using `cfg.model.init_device='meta'` is only valid when using FSDP! Reverting to `cfg.model.init_device='cpu'`.
  warnings.warn(
2024-01-19 16:08:00,412: rank0[14067][MainThread]: INFO: llmfoundry.data.finetuning.tasks: No preprocessor was supplied and no preprocessing function is registered for dataset name "my_data". No additional preprocessing will be applied. If the dataset is already formatted correctly, you can ignore this message.
2024-01-19 16:08:00,730: rank0[14067][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: Local rank 0 finished data prep
2024-01-19 16:08:01,636: rank0[14067][MainThread]: DEBUG: llmfoundry.data.finetuning.tasks: All ranks finished data prep
2024-01-19 16:08:01,650: rank0[14067][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: Instantiating an MPTForCausalLM model from /mnt/workdisk/iamroot/llm-foundry/llmfoundry/models/mpt/modeling_mpt.py
2024-01-19 16:08:02,262: rank0[14067][MainThread]: INFO: llmfoundry.models.mpt.modeling_mpt: We recommend using config.init_device="meta" with Composer + FSDP for faster initialization.
2024-01-19 16:08:04,655: rank0[14067][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: MPTModel(
  (wte): SharedEmbedding(50368, 768)
  (wpe): Embedding(2048, 768)
  (emb_drop): Dropout(p=0.0, inplace=False)
  (blocks): ModuleList(
    (0-11): 12 x MPTBlock(
      (norm_1): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): MultiheadAttention(
        (Wqkv): Linear(in_features=768, out_features=2304, bias=True)
        (out_proj): Linear(in_features=768, out_features=768, bias=True)
      )
      (norm_2): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (ffn): MPTMLP(
        (up_proj): Linear(in_features=768, out_features=3072, bias=True)
        (down_proj): Linear(in_features=3072, out_features=768, bias=True)
      )
      (resid_attn_dropout): Dropout(p=0.0, inplace=False)
      (resid_ffn_dropout): Dropout(p=0.0, inplace=False)
    )
  )
  (norm_f): LPLayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
2024-01-19 16:08:04,655: rank0[14067][MainThread]: DEBUG: llmfoundry.models.mpt.modeling_mpt: Using kaiming_normal_ initialization.
2024-01-19 16:08:04,797: rank0[14067][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
2024-01-19 16:08:04,858: rank0[14067][MainThread]: INFO: composer.trainer.trainer: Run name: interactive-il0GeA
2024-01-19 16:08:06,015: rank0[14067][MainThread]: INFO: composer.trainer.trainer: Stepping schedulers every batch. To step schedulers every epoch, set `step_schedulers_every_batch=False`.
2024-01-19 16:08:06,016: rank0[14067][MainThread]: DEBUG: composer.loggers.mosaicml_logger: Logging model initialized time to metadata
2024-01-19 16:08:06,026: rank0[14067][MainThread]: INFO: composer.loggers.mosaicml_logger: Logged 1 metadata to MosaicML, waiting on 1
2024-01-19 16:08:06,026: rank0[14067][MainThread]: INFO: composer.trainer.trainer: Setting seed to 17
2024-01-19 16:08:06,026: rank0[14067][MainThread]: INFO: composer.utils.reproducibility: Setting seed to 17
max_seq_len: 2048
global_seed: 17
model:
  name: mpt_causal_lm
  init_device: meta
  d_model: 768
  n_heads: 12
  n_layers: 12
  expansion_ratio: 4
  max_seq_len: 2048
  vocab_size: 50368
  attn_config:
    attn_impl: triton
tokenizer:
  name: mosaicml/mpt-7b-chat
  kwargs:
    model_max_length: 2048
train_loader:
  name: finetuning
  dataset:
    split: train
    hf_name: my_data
    shuffle: true
    max_seq_len: 2048
    shuffle_algo: py1e
    shuffle_seed: 17
    allow_pad_trimming: false
    decoder_only_format: true
  timeout: 0
  drop_last: false
  pin_memory: true
  num_workers: 8
  prefetch_factor: 2
  persistent_workers: true
scheduler:
  name: cosine_with_warmup
  t_warmup: 100ba
  alpha_f: 0.1
optimizer:
  name: decoupled_adamw
  lr: 0.0006
  betas:
  - 0.9
  - 0.95
  eps: 1.0e-08
  weight_decay: 0.0
max_duration: 2ba
eval_interval: 0
eval_first: false
eval_subset_num_batches: -1
global_train_batch_size: 256
seed: 17
device_eval_batch_size: 16
device_train_microbatch_size: 16
precision: amp_bf16
fsdp_config: null
log_to_console: true
console_log_interval: 1ba
n_gpus: 1
device_train_batch_size: 256
device_train_grad_accum: 16
merge: true
n_params: 125311488

2024-01-19 16:08:06,194: rank0[14067][MainThread]: INFO: composer.trainer.trainer: Using precision Precision.AMP_BF16
******************************
Config:
composer_commit_hash: None
composer_version: 0.17.2
node_name: a100-40sxm-h13-02
num_gpus_per_node: 1
num_nodes: 1
rank_zero_seed: 17

******************************
2024-01-19 16:08:06,196: rank0[14067][MainThread]: DEBUG: composer.trainer.trainer: Spinning the dataloaders
/mnt/workdisk/iamroot/llm-foundry/llmfoundry/models/layers/attention.py:455: UserWarning: Propagating key_padding_mask to the attention module and applying it within the attention module can cause unnecessary computation/memory usage. Consider integrating into attn_bias once and passing that to each attention module instead.
  warnings.warn(
2024-01-19 16:08:12,390: rank0[14067][MainThread]: DEBUG: composer.loggers.mosaicml_logger: 
Logging training progress data to metadata:
        training_progress: [batch=1/2]
[batch=1/2]:
         Train time/epoch: 0
         Train time/batch: 0
         Train time/sample: 0
         Train time/batch_in_epoch: 0
         Train time/sample_in_epoch: 0
         Train time/token: 0
         Train time/token_in_epoch: 0
         Train trainer/device_train_microbatch_size: 16
         Train loss/train/total: 11.7774
         Train metrics/train/LanguageCrossEntropy: 11.7839
         Train metrics/train/LanguagePerplexity: 131123.0781
2024-01-19 16:08:16,439: rank0[14067][MainThread]: DEBUG: composer.loggers.mosaicml_logger: 
Logging training progress data to metadata:
        training_progress: [batch=2/2]
[batch=2/2]:
         Train time/batch: 1
         Train time/sample: 256
         Train time/batch_in_epoch: 1
         Train time/sample_in_epoch: 256
         Train time/token: 71426
         Train time/token_in_epoch: 71426
         Train trainer/device_train_microbatch_size: 16
         Train loss/train/total: 11.7663
         Train metrics/train/LanguageCrossEntropy: 11.7706
         Train metrics/train/LanguagePerplexity: 129387.9766
2024-01-19 16:08:16,439: rank0[14067][MainThread]: DEBUG: composer.loggers.mosaicml_logger: 
Logging FINAL training progress data to metadata:
        training_progress: [batch=2/2]
2024-01-19 16:08:16,450: rank0[14067][MainThread]: INFO: composer.loggers.mosaicml_logger: Logged 1 metadata to MosaicML, waiting on 1
2024-01-19 16:08:16,450: rank0[14067][MainThread]: DEBUG: composer.core.engine: Closing the engine.
2024-01-19 16:08:16,450: rank0[14067][MainThread]: DEBUG: composer.core.engine: Closing callback MosaicMLLogger
2024-01-19 16:08:16,461: rank0[14067][MainThread]: INFO: composer.loggers.mosaicml_logger: Logged 0 metadata to MosaicML, waiting on 2
2024-01-19 16:08:16,461: rank0[14067][MainThread]: DEBUG: composer.core.engine: Closing callback ConsoleLogger
2024-01-19 16:08:16,461: rank0[14067][MainThread]: DEBUG: composer.core.engine: Post-closing callback MosaicMLLogger
2024-01-19 16:08:16,461: rank0[14067][MainThread]: DEBUG: composer.core.engine: Post-closing callback ConsoleLogger
2024-01-19 16:08:16,512: rank0[14067][MainThread]: DEBUG: composer.core.engine: Engine closed.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.

@milocress milocress force-pushed the milo/use-chat-tokenizers branch from a97b865 to 5f5a144 Compare January 18, 2024 19:16
@irenedea
Copy link
Contributor

Can you add a manual test run that trains a model on a chat dataset with your changes?

llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
Copy link
Contributor

@irenedea irenedea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good! Had some minor comments + test requests.

tests/data/test_dataloader.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Outdated Show resolved Hide resolved
Copy link
Contributor

@irenedea irenedea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great work, thanks Milo! will let @dakinggg review as well

llmfoundry/data/finetuning/tasks.py Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
llmfoundry/data/finetuning/tasks.py Outdated Show resolved Hide resolved
tests/data/test_dataloader.py Show resolved Hide resolved
tests/data/test_template_tokenization.py Show resolved Hide resolved
@irenedea
Copy link
Contributor

Seeing some failures in CI/CD @milocress

Copy link
Collaborator

@dakinggg dakinggg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with two last comments! Thanks Milo!

@milocress milocress merged commit ac78354 into mosaicml:main Jan 26, 2024
10 checks passed
XiaohanZhangCMU added a commit that referenced this pull request Mar 14, 2024
* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* updat

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist  (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* Logs upload URI (#850)

* fix style etc.

* fix

* fix fix

* fix fix fix

* fix fix fix fix

* removed unused dummy func

* deleted tests to make the tests pass

* tried adding back some tests to see if it triggers the issue

* add test_hf_checkpointer.py but remove references to MPT

* fix?

* fixed test cases overlapping in strange side-effecty ways

* update

* Delta to JSONL conversion script cleanup and bug fix (#868)

* Small test change

* small cleanups

* lint and precommit

* lint and precommit

* comments

* another one

* pr suggestion and use input param not args

* fix mock (#872)

* Add response tokens

* update

* fix regex (#877)

* Precompute flash attention padding info (#880)

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* dummy data

* undoing last commit

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* ..

* ..

---------

Co-authored-by: Vitaliy Chiley <[email protected]>

* add missing import (#882)

* fsdp wrap refac (#883)

* fsdp wrap refac

* refac

* refac

* Update model download utils to support ORAS (#881)

* wip

* wip

* Accept registry file for hostname

* Make sure no sensitive info is surfaced in subprocess error

* Refactor model downloading

* Save HF hub files to local dir

* fallback

* Remove commented code

* Update logging

* Update HTP download args

* Use files for ORAS

* Update llmfoundry/utils/model_download_utils.py

Co-authored-by: Irene Dea <[email protected]>

---------

Co-authored-by: Irene Dea <[email protected]>

* Update license (#887)

Updates the license for 2024. New files will have a copyright year of 2024 inserted in the header. Existing files will not be changed.

* Fix tiktoken add generation prompt (#890)

* update

* Upgrade Datasets version (#892)

* Disable MDSWrite, return token counts

* Bump transformers version to support Mixtral (#894)

* Add `tokenizer-only` flag to only download tokenizers from HF or oras (#895)

* Foundational Model API eval wrapper (#849)

* FMAPI model wrapper

* add chat wrapper too

* revert

* end line

* formatting

* less verbose

* better error messages

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* Add better error for non-empty local output folder in convert_text_to_mds.py (#891)

* Allow bool input for loggers (#897)

* Allow bool input for loggers

* Convert earlier on

* Fix test case

* Enable QK Group Norm (#869)

* start qkgn

* attn defaults for qk_gn

* impl qk_gn

* Update attention.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* lint

* Update attention.py

* lint

* add avlue error

* Update attention.py

* updt to include low precision groupnorm;

* perf improvement

* Revert "perf improvement"

This reverts commit 2b62d5e.

* Revert "updt to include low precision groupnorm;"

This reverts commit bca1c33.

* patch (#905)

* Add new GC option (#907)

* No symlinks at all for HF download (#908)

* Adds support for chat formatted finetuning input data. (#884)

* fix conflicting formatting linting guidelines

* used older union operator for legacy support

* did the same thing in another place

* isort ignore specific lines

* fixes

* isort do not skip line

* address comments

* renamed some more things

* split tests and add some verification for tokenization split

* fix formatting

* added docstrings

* added end-to-end-test with HF dataset

* fix code style

* renamed file and fixed tests

* use chat template diff

* addressed comment

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* fixed type of TokenizedExample

* use cast

* use _ALLOWED_{PROMPT, RESPONSE}_KEYS

* updated tests

* fix

* fix?

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add flag to enable/disable param upload (#912)

* Add flag to enable/disable param upload

* Yapf

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

* Rename

* Add to eval

---------

Co-authored-by: Daniel King <[email protected]>

* Add support for eval_loader & eval_subset_num_batches in async callback (#834)

* Skip evalloader in training if using async eval

* add support for subset_num_batches

* remove todo

* eval first

* rename arg

* fix

* small updates

* om

* fix test

* eval run config

---------

Co-authored-by: Daniel King <[email protected]>

* Add the model license file for mlflow (#915)

* Warn instead of error on tokenizer-only with http (#904)

* Fix fmapi_chat for instruct models and custom tokenizers (#914)

* Fix fmapi_chat for instruct models and custom tokenizers

* remove from tiktoken

* fix

* add tests

* fix test, 0->1

* refactor

* Make yamllint consistent with Composer (#918)

* Create HF checkpointer model on meta device (#916)

* Tiktoken chat format fix (#893)

* sys prompt fix

* remove eos tokens from chat formatter

* fix dash issue (#919)

* fix dash issue

* fix

* fix?

* added unit test

* fix fix

* fix tests

* fix fix tests

* Fixes yaml linting (#920)

* Adding deprecation warning for Flash Attention 1 and user warning against using Triton attention. (#921)

* Add rich formatting to tracebacks (#927)

* added rich traceback

* sorted imports

* added rich to eval

* Changes to setup.py invalidate docker cache. Use branch name in dockerfile (#930)

Co-authored-by: Daniel King <[email protected]>

* Remove .ci folder and move FILE_HEADER (#931)

* Throw error when no EOS (#922)

* bump (#934)

* Update eval_gauntlet_callback.py with math.log2 (#821)

Saw an automated ruff flag this, seems like a strict improvement and is marginally faster.

Co-authored-by: Daniel King <[email protected]>

* Switch to the Composer integration of LoRA (works with FSDP) (#886)

* Refactoring the function to accept list of metric names instead of a dictionary of metrics. (#938)

* ..

* undoing prev commit

* Refactoring the  function to accept list of metric names instead of dictionary

* ..

* ..

* ..

* ..

* Remove extra call to .to and load_state_dict in hf checkpointer (#939)

* Fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking (#940)

* ..

* undoing prev commit

* fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking

* Update modeling_mpt.py

* ..

---------

Co-authored-by: Daniel King <[email protected]>

* Update lora docs (#941)

* fix (#942)

* Retrieve license information when local files are provided for a pretrained model (#943)

* Initial implementation to test

* Add log for license overwrite

* Use Path for input to _write_license_information

* Set default

---------

Co-authored-by: Daniel King <[email protected]>

* Add and use VersionedDeprecationWarning (#944)

* Add and use VersionedDeprecationWarning

* Use remove_version instead.

* Fix merge

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Bump llm-foundry version to 0.5.0 (#948)

* Bump version to 0.5.0

* Remove deprecated features

* Other cleanup

* code quality

* Fix chain-of-thought tasks (#824)

* Skip flaky lion8b test (#598)

* relax atol and add retries to reduce flakiness in lion8b timing test

* add eval output logging

* add back tasks

* foo

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* fix prompt

* fix prompt

* modify mcli

* test

* test

* fix

* added math dataset

* edit yaml

* prep gsm8k identically to eleuther

* prep gsm8k identically to eleuther

* add early stopping criteria

* finish

* debug

* fix

* bug

* remove eval output logging callback

* restore

* fix

* fix

* fix composer verion

* gauntlet v0.2.1

* gauntlet v0.2.1

* prep

* prep

* foo

* restore

* restore

* restore mcli

* fix precommit

* fix

* Update hf_eval.yaml

* fix

* fix

* remove programming

* update readme

---------

Co-authored-by: dblalock <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Add finetuning streaming dataset conversion (#933)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* review comments

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update scripts/data_prep/convert_finetuning_dataset.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add default signature to mlflow saved model (#952)

* allow te to use meta device with deferred init (#958)

* Update TUTORIAL.md (#957)

* Update TUTORIAL.md

fix indentation problem

* Update TUTORIAL.md

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Bump mcli yaml foundry version to v0.5.0 (#959)

* add finutuning with streaming dataset example (#945)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* fix yaml

* comments

* comments

* comments

* add unit test

* comments

---------

Co-authored-by: Daniel King <[email protected]>

* Add fully configurable activation checkpointing (#951)

* add fully configurable activation checkpointing

* fix format

* fix format

* add docstring to activation_checkpointing_fn

* add block id range option in act ckpt

* resolve conflict

* add a check for blocks ids overlap in mapping

* fix typo

* update docstring

* refactor

* fix test

* Apply suggestions from code review

Co-authored-by: Mihir Patel <[email protected]>

* address comments

* add build mapping as a helper func

* fix format

---------

Co-authored-by: Mihir Patel <[email protected]>

* Use create_model_version instead of register_model (#953)

* Add streams support (#946)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* add convert

* fix

* v0

* fix

* fix MDS write

* streams support

* fake commit

* fix setup

* format

* add back arxiv

* trigger test

* review comments

* temporarily trigger test

* test

* add convert

* fix

* fix

* fix MDS write

* format

* trigger test

* fix

* format

* resolve conflicts

* add back jsonl

* fix yaml

* comments

* format

* comments

* comments

* add unit test

* comments

* comments

* merge

* format

* typo

* Update llmfoundry/data/finetuning/dataloader.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Fix typo (#966)

* Fix eval.py with lora (#965)

* just remove it?

* or not

* fix

* fix up

* clean up

* fix example yaml

* precommit

* add test

* add memorysnapshot to callbacks (#810)

Co-authored-by: Daniel King <[email protected]>

* Adding curriculum learning callback (experimental) (#954)

* curriculum learning callback

* curriculum learning callback

* fixing types

* dataset config types correct

* dataset config retrieved correctly

* access train dataloader correctly

* load state dict defaults

* get that damn dataloader

* missed dat

* dataspec L

* dataset L

* no logging, print is my best friend

* save first dataset config

* don't save new dataset config every single time

* logging dataset state

* have to set the damn timestamp. rip

* remove logging

* linting

* pyright

* removing rope...

* Delete scripts/eval/local_data/.DS_Store

* trailing comma is bacc

* fixed docstring

* fixed docstrings

* no more funky stuff in save_dict

* refactored, assuming before_load event in composer

* lingint

* bumped composer and streaming min versions

* moved line

* strengthened chat formatting validation (#960)

* strengthened chat formatting validation

* fix types

* made assert messages more descriptive

* used raise instead of assert, added type checks

* added list type check

* type error if no string content

* add test case for new validation

* relaxed type constraints to interface minimum

* use Mapping and Iterable

* fix mapping in type aliases too

* iterable -> sequence

* sequence -> list

* Mapping -> Dict

* use mapping again

* fixed another one

* updated message

* factored out duplicate functions

* dict -> mapping

* add sequence

* Add new base images and remove fa1 images (#970)

* Add new ICL kwargs in eval.py and long_context yamls (#925)

* add yamls w/ old links

* load from max's public hf and parse hf datasets

* update rest of tasks

* add better logging

* implemented leval tasks

* move level

* add level yaml

* add str parsing to hf

* wip

* llm-foundry working with new parser

* working w/ new parsing

* fix old long context tasks

* wip

* wip

* wip

* wip

* update to hf_parsing_map

* rm defaults

* fix parsing vars

* update defaults again

* rm merge conflict

* fix gen_kwargs

* rm old code path

* fixups

* wip

* rm leval from pr

* fix comments in yamls

* add cot params

* add fewshot_random_seed

* fix early_stopping_criteria, fewshot_num_seed default

* undo rm hf_eval

* add fewshot_random_seed to test

* add 64k tasks

* add longer context, update composer versin

* address comments

* mixed

* use seed by default

* rm  long_context_eval_8k.yaml

* add longer context evals

* mv yamls

* eval gauntlet wip

* update niah and wikiqa

* fix linting

* add default option

* change defaults

* fix linting

* fix linting 2

---------

Co-authored-by: Daniel King <[email protected]>

* Make Composer pins consistent with each other (#972)

* Make turbo an optional dependency (#964)

* Fix fewshot_random_seed default setting (#974)

* del fewshot_random default, fix hf_eval, fix gauntlet readme

* set in cfg defaults area

* fix the fix i applied that was actually not a fix

* rm num_batch from hf_eval

* improve error msg when checking target_blocks in activation_checkpointing_target (#977)

* Torch 2.2 upgrade - Part 1 (#976)

* Torch 2.2 - Part 2 (#979)

* PyTorch 2.2 - Part 3 (#981)

* Remove torch 2.1 from docker build (#982)

* Async callback: Don't skip checkpoints, reliably only launch async eval when the checkpoint is ready (#813)

* working without sharded checkpointing..

* add more debugs

* try this

* more debugging

* yikes dumb bug

* add notes

* fixes

* remove prints

* small updates

* fix typo

* refactor

* fix docstring formatting

* fighting with docstrings

* try this

* add unit tests

* point to composer update

* values -> items

* serialize time

* fix merge

* nits

* warning, small comment update

* add error

---------

Co-authored-by: Daniel King <[email protected]>

* Token accuracy metrics (#983)

* do not mention 1.13 in readme (#988)

Co-authored-by: Daniel King <[email protected]>

* Patch test, lock mcli version (#990)

* Bump gha timeouts (#991)

* Fix readme typo (#993)

* if condition in tie weights added (#989)

* if condition in tie weights added

* unit test for tie weights

* bump composer version (#995)

* Trim examples ahead of time for auto packing (#994)

* add oom observer callback (#932)

* add oom observer callback

* fix format

* Change ci/cd to use ci-testing repo

* Revert "Change ci/cd to use ci-testing repo"

This reverts commit e3f214e.

* Use ci-testing repo (#1000)

Co-authored-by: Irene Dea <[email protected]>

* Make CodeEval respect device_eval_batch_size (#956)

* Remove try except around imports (#1004)

* Deprecate triton, prefix lm, llama attention patch, and text denoising; Make ComposerHFT5 experimental (#1007)

* Deprecate features and mark experimental

* fix typo

---------

Co-authored-by: Daniel King <[email protected]>

* add magic filename for sharded state dicts (#1001)

* add magic filename for sharded state dicts

* Update scripts/train/train.py

Co-authored-by: Daniel King <[email protected]>

* oops forgot to push this

* no shard if no fsdp

* default to full on foundry

---------

Co-authored-by: Daniel King <[email protected]>

* bump (#1009)

* Fix evaluators actually pulling eval metrics (#1006)

* fix bug on metrics

* lint

* lint

* add unit test

* lint

* Build torch 2.2.1 images (#1010)

* add 2.2.1 tests (#1011)

* Bump min torch pin (#1013)

Red button because CI running jobs it doesn't need. Tests passed on main.

* Fix extra BOS token in front of response for some tokenizers (#1003)

* Bump min composer pin (#1015)

* add default for eval interval (#987)

Co-authored-by: Daniel King <[email protected]>

* Add support for olmo (#1016)

* Add deeper support for multi-turn chats and loss-generating tokens in finetuning (#985)

The main purpose of this PR is to support training on non-terminal responses in multi-round chats. This is achieved by tokenizing at the level of conversation "turns" and exposing some options for what turns are used as training targets (i.e. generate loss). This also adds support for treating prompt tokens as loss-generating.

The script for converting a finetuning dataset to streaming has also been updated (with some bug fixes).

* Fix profiling packing ratio to explicitly say 1 (#1019)

* Bump transformers to 4.38.2 (#1018)

* that kwargs (#1020)

* Update readme with pytorch 2.2.1 (#1021)

* Add code import to train/eval scripts (#1002)

* finish (#1022)

Co-authored-by: Max Marion <[email protected]>

* Bump version to 0.6.0 (#1023)

* Fix typo in monolithic chkpt callback docs (#1024)

* Fix typo in monolithic chkpt callback docs

* reorder to match function signature

* update pip install link

* Change done file location

* Create the dest folder

* Allow code-quality workflow to be callable (#1026)

Reverts part of the change made in
https://github.com/mosaicml/llm-foundry/pull/1000/files#diff-4a2765c2cfcbd3804a66aab805cb92ddda74de1730923cc5bf53671d0beccf06L11

* update notebook

* update

* update notebook

---------

Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Milo Cress <[email protected]>
Co-authored-by: Nancy Hung <[email protected]>
Co-authored-by: Jerry Chen <[email protected]>
Co-authored-by: Shashank Rajput <[email protected]>
Co-authored-by: Vitaliy Chiley <[email protected]>
Co-authored-by: Irene Dea <[email protected]>
Co-authored-by: Brian <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: Nicholas Garcia <[email protected]>
Co-authored-by: Prithviraj Ammanabrolu <[email protected]>
Co-authored-by: Jane Zhang <[email protected]>
Co-authored-by: Vincent Chen <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: dblalock <[email protected]>
Co-authored-by: bigning <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Sebastián Donoso Bustos <[email protected]>
Co-authored-by: Saaketh Narayan <[email protected]>
Co-authored-by: Max Marion <[email protected]>
Co-authored-by: Megha Agarwal <[email protected]>
Co-authored-by: Jose Javier <[email protected]>
Co-authored-by: Alex Trott <[email protected]>
Co-authored-by: Sasha Doubov <[email protected]>
XiaohanZhangCMU added a commit that referenced this pull request Mar 14, 2024
* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* updat

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist  (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* Logs upload URI (#850)

* fix style etc.

* fix

* fix fix

* fix fix fix

* fix fix fix fix

* removed unused dummy func

* deleted tests to make the tests pass

* tried adding back some tests to see if it triggers the issue

* add test_hf_checkpointer.py but remove references to MPT

* fix?

* fixed test cases overlapping in strange side-effecty ways

* update

* Delta to JSONL conversion script cleanup and bug fix (#868)

* Small test change

* small cleanups

* lint and precommit

* lint and precommit

* comments

* another one

* pr suggestion and use input param not args

* fix mock (#872)

* Add response tokens

* update

* fix regex (#877)

* Precompute flash attention padding info (#880)

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* dummy data

* undoing last commit

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* ..

* ..

---------

Co-authored-by: Vitaliy Chiley <[email protected]>

* add missing import (#882)

* fsdp wrap refac (#883)

* fsdp wrap refac

* refac

* refac

* Update model download utils to support ORAS (#881)

* wip

* wip

* Accept registry file for hostname

* Make sure no sensitive info is surfaced in subprocess error

* Refactor model downloading

* Save HF hub files to local dir

* fallback

* Remove commented code

* Update logging

* Update HTP download args

* Use files for ORAS

* Update llmfoundry/utils/model_download_utils.py

Co-authored-by: Irene Dea <[email protected]>

---------

Co-authored-by: Irene Dea <[email protected]>

* Update license (#887)

Updates the license for 2024. New files will have a copyright year of 2024 inserted in the header. Existing files will not be changed.

* Fix tiktoken add generation prompt (#890)

* update

* Upgrade Datasets version (#892)

* Disable MDSWrite, return token counts

* Bump transformers version to support Mixtral (#894)

* Add `tokenizer-only` flag to only download tokenizers from HF or oras (#895)

* Foundational Model API eval wrapper (#849)

* FMAPI model wrapper

* add chat wrapper too

* revert

* end line

* formatting

* less verbose

* better error messages

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* Add better error for non-empty local output folder in convert_text_to_mds.py (#891)

* Allow bool input for loggers (#897)

* Allow bool input for loggers

* Convert earlier on

* Fix test case

* Enable QK Group Norm (#869)

* start qkgn

* attn defaults for qk_gn

* impl qk_gn

* Update attention.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* lint

* Update attention.py

* lint

* add avlue error

* Update attention.py

* updt to include low precision groupnorm;

* perf improvement

* Revert "perf improvement"

This reverts commit 2b62d5e.

* Revert "updt to include low precision groupnorm;"

This reverts commit bca1c33.

* patch (#905)

* Add new GC option (#907)

* No symlinks at all for HF download (#908)

* Adds support for chat formatted finetuning input data. (#884)

* fix conflicting formatting linting guidelines

* used older union operator for legacy support

* did the same thing in another place

* isort ignore specific lines

* fixes

* isort do not skip line

* address comments

* renamed some more things

* split tests and add some verification for tokenization split

* fix formatting

* added docstrings

* added end-to-end-test with HF dataset

* fix code style

* renamed file and fixed tests

* use chat template diff

* addressed comment

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* fixed type of TokenizedExample

* use cast

* use _ALLOWED_{PROMPT, RESPONSE}_KEYS

* updated tests

* fix

* fix?

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add flag to enable/disable param upload (#912)

* Add flag to enable/disable param upload

* Yapf

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

* Rename

* Add to eval

---------

Co-authored-by: Daniel King <[email protected]>

* Add support for eval_loader & eval_subset_num_batches in async callback (#834)

* Skip evalloader in training if using async eval

* add support for subset_num_batches

* remove todo

* eval first

* rename arg

* fix

* small updates

* om

* fix test

* eval run config

---------

Co-authored-by: Daniel King <[email protected]>

* Add the model license file for mlflow (#915)

* Warn instead of error on tokenizer-only with http (#904)

* Fix fmapi_chat for instruct models and custom tokenizers (#914)

* Fix fmapi_chat for instruct models and custom tokenizers

* remove from tiktoken

* fix

* add tests

* fix test, 0->1

* refactor

* Make yamllint consistent with Composer (#918)

* Create HF checkpointer model on meta device (#916)

* Tiktoken chat format fix (#893)

* sys prompt fix

* remove eos tokens from chat formatter

* fix dash issue (#919)

* fix dash issue

* fix

* fix?

* added unit test

* fix fix

* fix tests

* fix fix tests

* Fixes yaml linting (#920)

* Adding deprecation warning for Flash Attention 1 and user warning against using Triton attention. (#921)

* Add rich formatting to tracebacks (#927)

* added rich traceback

* sorted imports

* added rich to eval

* Changes to setup.py invalidate docker cache. Use branch name in dockerfile (#930)

Co-authored-by: Daniel King <[email protected]>

* Remove .ci folder and move FILE_HEADER (#931)

* Throw error when no EOS (#922)

* bump (#934)

* Update eval_gauntlet_callback.py with math.log2 (#821)

Saw an automated ruff flag this, seems like a strict improvement and is marginally faster.

Co-authored-by: Daniel King <[email protected]>

* Switch to the Composer integration of LoRA (works with FSDP) (#886)

* Refactoring the function to accept list of metric names instead of a dictionary of metrics. (#938)

* ..

* undoing prev commit

* Refactoring the  function to accept list of metric names instead of dictionary

* ..

* ..

* ..

* ..

* Remove extra call to .to and load_state_dict in hf checkpointer (#939)

* Fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking (#940)

* ..

* undoing prev commit

* fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking

* Update modeling_mpt.py

* ..

---------

Co-authored-by: Daniel King <[email protected]>

* Update lora docs (#941)

* fix (#942)

* Retrieve license information when local files are provided for a pretrained model (#943)

* Initial implementation to test

* Add log for license overwrite

* Use Path for input to _write_license_information

* Set default

---------

Co-authored-by: Daniel King <[email protected]>

* Add and use VersionedDeprecationWarning (#944)

* Add and use VersionedDeprecationWarning

* Use remove_version instead.

* Fix merge

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Bump llm-foundry version to 0.5.0 (#948)

* Bump version to 0.5.0

* Remove deprecated features

* Other cleanup

* code quality

* Fix chain-of-thought tasks (#824)

* Skip flaky lion8b test (#598)

* relax atol and add retries to reduce flakiness in lion8b timing test

* add eval output logging

* add back tasks

* foo

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* fix prompt

* fix prompt

* modify mcli

* test

* test

* fix

* added math dataset

* edit yaml

* prep gsm8k identically to eleuther

* prep gsm8k identically to eleuther

* add early stopping criteria

* finish

* debug

* fix

* bug

* remove eval output logging callback

* restore

* fix

* fix

* fix composer verion

* gauntlet v0.2.1

* gauntlet v0.2.1

* prep

* prep

* foo

* restore

* restore

* restore mcli

* fix precommit

* fix

* Update hf_eval.yaml

* fix

* fix

* remove programming

* update readme

---------

Co-authored-by: dblalock <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Add finetuning streaming dataset conversion (#933)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* review comments

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update scripts/data_prep/convert_finetuning_dataset.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add default signature to mlflow saved model (#952)

* allow te to use meta device with deferred init (#958)

* Update TUTORIAL.md (#957)

* Update TUTORIAL.md

fix indentation problem

* Update TUTORIAL.md

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Bump mcli yaml foundry version to v0.5.0 (#959)

* add finutuning with streaming dataset example (#945)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* fix yaml

* comments

* comments

* comments

* add unit test

* comments

---------

Co-authored-by: Daniel King <[email protected]>

* Add fully configurable activation checkpointing (#951)

* add fully configurable activation checkpointing

* fix format

* fix format

* add docstring to activation_checkpointing_fn

* add block id range option in act ckpt

* resolve conflict

* add a check for blocks ids overlap in mapping

* fix typo

* update docstring

* refactor

* fix test

* Apply suggestions from code review

Co-authored-by: Mihir Patel <[email protected]>

* address comments

* add build mapping as a helper func

* fix format

---------

Co-authored-by: Mihir Patel <[email protected]>

* Use create_model_version instead of register_model (#953)

* Add streams support (#946)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* add convert

* fix

* v0

* fix

* fix MDS write

* streams support

* fake commit

* fix setup

* format

* add back arxiv

* trigger test

* review comments

* temporarily trigger test

* test

* add convert

* fix

* fix

* fix MDS write

* format

* trigger test

* fix

* format

* resolve conflicts

* add back jsonl

* fix yaml

* comments

* format

* comments

* comments

* add unit test

* comments

* comments

* merge

* format

* typo

* Update llmfoundry/data/finetuning/dataloader.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Fix typo (#966)

* Fix eval.py with lora (#965)

* just remove it?

* or not

* fix

* fix up

* clean up

* fix example yaml

* precommit

* add test

* add memorysnapshot to callbacks (#810)

Co-authored-by: Daniel King <[email protected]>

* Adding curriculum learning callback (experimental) (#954)

* curriculum learning callback

* curriculum learning callback

* fixing types

* dataset config types correct

* dataset config retrieved correctly

* access train dataloader correctly

* load state dict defaults

* get that damn dataloader

* missed dat

* dataspec L

* dataset L

* no logging, print is my best friend

* save first dataset config

* don't save new dataset config every single time

* logging dataset state

* have to set the damn timestamp. rip

* remove logging

* linting

* pyright

* removing rope...

* Delete scripts/eval/local_data/.DS_Store

* trailing comma is bacc

* fixed docstring

* fixed docstrings

* no more funky stuff in save_dict

* refactored, assuming before_load event in composer

* lingint

* bumped composer and streaming min versions

* moved line

* strengthened chat formatting validation (#960)

* strengthened chat formatting validation

* fix types

* made assert messages more descriptive

* used raise instead of assert, added type checks

* added list type check

* type error if no string content

* add test case for new validation

* relaxed type constraints to interface minimum

* use Mapping and Iterable

* fix mapping in type aliases too

* iterable -> sequence

* sequence -> list

* Mapping -> Dict

* use mapping again

* fixed another one

* updated message

* factored out duplicate functions

* dict -> mapping

* add sequence

* Add new base images and remove fa1 images (#970)

* Add new ICL kwargs in eval.py and long_context yamls (#925)

* add yamls w/ old links

* load from max's public hf and parse hf datasets

* update rest of tasks

* add better logging

* implemented leval tasks

* move level

* add level yaml

* add str parsing to hf

* wip

* llm-foundry working with new parser

* working w/ new parsing

* fix old long context tasks

* wip

* wip

* wip

* wip

* update to hf_parsing_map

* rm defaults

* fix parsing vars

* update defaults again

* rm merge conflict

* fix gen_kwargs

* rm old code path

* fixups

* wip

* rm leval from pr

* fix comments in yamls

* add cot params

* add fewshot_random_seed

* fix early_stopping_criteria, fewshot_num_seed default

* undo rm hf_eval

* add fewshot_random_seed to test

* add 64k tasks

* add longer context, update composer versin

* address comments

* mixed

* use seed by default

* rm  long_context_eval_8k.yaml

* add longer context evals

* mv yamls

* eval gauntlet wip

* update niah and wikiqa

* fix linting

* add default option

* change defaults

* fix linting

* fix linting 2

---------

Co-authored-by: Daniel King <[email protected]>

* Make Composer pins consistent with each other (#972)

* Make turbo an optional dependency (#964)

* Fix fewshot_random_seed default setting (#974)

* del fewshot_random default, fix hf_eval, fix gauntlet readme

* set in cfg defaults area

* fix the fix i applied that was actually not a fix

* rm num_batch from hf_eval

* improve error msg when checking target_blocks in activation_checkpointing_target (#977)

* Torch 2.2 upgrade - Part 1 (#976)

* Torch 2.2 - Part 2 (#979)

* PyTorch 2.2 - Part 3 (#981)

* Remove torch 2.1 from docker build (#982)

* Async callback: Don't skip checkpoints, reliably only launch async eval when the checkpoint is ready (#813)

* working without sharded checkpointing..

* add more debugs

* try this

* more debugging

* yikes dumb bug

* add notes

* fixes

* remove prints

* small updates

* fix typo

* refactor

* fix docstring formatting

* fighting with docstrings

* try this

* add unit tests

* point to composer update

* values -> items

* serialize time

* fix merge

* nits

* warning, small comment update

* add error

---------

Co-authored-by: Daniel King <[email protected]>

* Token accuracy metrics (#983)

* do not mention 1.13 in readme (#988)

Co-authored-by: Daniel King <[email protected]>

* Patch test, lock mcli version (#990)

* Bump gha timeouts (#991)

* Fix readme typo (#993)

* if condition in tie weights added (#989)

* if condition in tie weights added

* unit test for tie weights

* bump composer version (#995)

* Trim examples ahead of time for auto packing (#994)

* add oom observer callback (#932)

* add oom observer callback

* fix format

* Change ci/cd to use ci-testing repo

* Revert "Change ci/cd to use ci-testing repo"

This reverts commit e3f214e.

* Use ci-testing repo (#1000)

Co-authored-by: Irene Dea <[email protected]>

* Make CodeEval respect device_eval_batch_size (#956)

* Remove try except around imports (#1004)

* Deprecate triton, prefix lm, llama attention patch, and text denoising; Make ComposerHFT5 experimental (#1007)

* Deprecate features and mark experimental

* fix typo

---------

Co-authored-by: Daniel King <[email protected]>

* add magic filename for sharded state dicts (#1001)

* add magic filename for sharded state dicts

* Update scripts/train/train.py

Co-authored-by: Daniel King <[email protected]>

* oops forgot to push this

* no shard if no fsdp

* default to full on foundry

---------

Co-authored-by: Daniel King <[email protected]>

* bump (#1009)

* Fix evaluators actually pulling eval metrics (#1006)

* fix bug on metrics

* lint

* lint

* add unit test

* lint

* Build torch 2.2.1 images (#1010)

* add 2.2.1 tests (#1011)

* Bump min torch pin (#1013)

Red button because CI running jobs it doesn't need. Tests passed on main.

* Fix extra BOS token in front of response for some tokenizers (#1003)

* Bump min composer pin (#1015)

* add default for eval interval (#987)

Co-authored-by: Daniel King <[email protected]>

* Add support for olmo (#1016)

* Add deeper support for multi-turn chats and loss-generating tokens in finetuning (#985)

The main purpose of this PR is to support training on non-terminal responses in multi-round chats. This is achieved by tokenizing at the level of conversation "turns" and exposing some options for what turns are used as training targets (i.e. generate loss). This also adds support for treating prompt tokens as loss-generating.

The script for converting a finetuning dataset to streaming has also been updated (with some bug fixes).

* Fix profiling packing ratio to explicitly say 1 (#1019)

* Bump transformers to 4.38.2 (#1018)

* that kwargs (#1020)

* Update readme with pytorch 2.2.1 (#1021)

* Add code import to train/eval scripts (#1002)

* finish (#1022)

Co-authored-by: Max Marion <[email protected]>

* Bump version to 0.6.0 (#1023)

* Fix typo in monolithic chkpt callback docs (#1024)

* Fix typo in monolithic chkpt callback docs

* reorder to match function signature

* update pip install link

* Change done file location

* Create the dest folder

* Allow code-quality workflow to be callable (#1026)

Reverts part of the change made in
https://github.com/mosaicml/llm-foundry/pull/1000/files#diff-4a2765c2cfcbd3804a66aab805cb92ddda74de1730923cc5bf53671d0beccf06L11

* update notebook

* update

* update notebook

---------

Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Milo Cress <[email protected]>
Co-authored-by: Nancy Hung <[email protected]>
Co-authored-by: Jerry Chen <[email protected]>
Co-authored-by: Shashank Rajput <[email protected]>
Co-authored-by: Vitaliy Chiley <[email protected]>
Co-authored-by: Irene Dea <[email protected]>
Co-authored-by: Brian <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: Nicholas Garcia <[email protected]>
Co-authored-by: Prithviraj Ammanabrolu <[email protected]>
Co-authored-by: Jane Zhang <[email protected]>
Co-authored-by: Vincent Chen <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: dblalock <[email protected]>
Co-authored-by: bigning <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Sebastián Donoso Bustos <[email protected]>
Co-authored-by: Saaketh Narayan <[email protected]>
Co-authored-by: Max Marion <[email protected]>
Co-authored-by: Megha Agarwal <[email protected]>
Co-authored-by: Jose Javier <[email protected]>
Co-authored-by: Alex Trott <[email protected]>
Co-authored-by: Sasha Doubov <[email protected]>
XiaohanZhangCMU added a commit that referenced this pull request Mar 14, 2024
* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* updat

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist  (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* Logs upload URI (#850)

* fix style etc.

* fix

* fix fix

* fix fix fix

* fix fix fix fix

* removed unused dummy func

* deleted tests to make the tests pass

* tried adding back some tests to see if it triggers the issue

* add test_hf_checkpointer.py but remove references to MPT

* fix?

* fixed test cases overlapping in strange side-effecty ways

* update

* Delta to JSONL conversion script cleanup and bug fix (#868)

* Small test change

* small cleanups

* lint and precommit

* lint and precommit

* comments

* another one

* pr suggestion and use input param not args

* fix mock (#872)

* Add response tokens

* update

* fix regex (#877)

* Precompute flash attention padding info (#880)

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* dummy data

* undoing last commit

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* ..

* ..

---------

Co-authored-by: Vitaliy Chiley <[email protected]>

* add missing import (#882)

* fsdp wrap refac (#883)

* fsdp wrap refac

* refac

* refac

* Update model download utils to support ORAS (#881)

* wip

* wip

* Accept registry file for hostname

* Make sure no sensitive info is surfaced in subprocess error

* Refactor model downloading

* Save HF hub files to local dir

* fallback

* Remove commented code

* Update logging

* Update HTP download args

* Use files for ORAS

* Update llmfoundry/utils/model_download_utils.py

Co-authored-by: Irene Dea <[email protected]>

---------

Co-authored-by: Irene Dea <[email protected]>

* Update license (#887)

Updates the license for 2024. New files will have a copyright year of 2024 inserted in the header. Existing files will not be changed.

* Fix tiktoken add generation prompt (#890)

* update

* Upgrade Datasets version (#892)

* Disable MDSWrite, return token counts

* Bump transformers version to support Mixtral (#894)

* Add `tokenizer-only` flag to only download tokenizers from HF or oras (#895)

* Foundational Model API eval wrapper (#849)

* FMAPI model wrapper

* add chat wrapper too

* revert

* end line

* formatting

* less verbose

* better error messages

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* Add better error for non-empty local output folder in convert_text_to_mds.py (#891)

* Allow bool input for loggers (#897)

* Allow bool input for loggers

* Convert earlier on

* Fix test case

* Enable QK Group Norm (#869)

* start qkgn

* attn defaults for qk_gn

* impl qk_gn

* Update attention.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* lint

* Update attention.py

* lint

* add avlue error

* Update attention.py

* updt to include low precision groupnorm;

* perf improvement

* Revert "perf improvement"

This reverts commit 2b62d5e.

* Revert "updt to include low precision groupnorm;"

This reverts commit bca1c33.

* patch (#905)

* Add new GC option (#907)

* No symlinks at all for HF download (#908)

* Adds support for chat formatted finetuning input data. (#884)

* fix conflicting formatting linting guidelines

* used older union operator for legacy support

* did the same thing in another place

* isort ignore specific lines

* fixes

* isort do not skip line

* address comments

* renamed some more things

* split tests and add some verification for tokenization split

* fix formatting

* added docstrings

* added end-to-end-test with HF dataset

* fix code style

* renamed file and fixed tests

* use chat template diff

* addressed comment

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* fixed type of TokenizedExample

* use cast

* use _ALLOWED_{PROMPT, RESPONSE}_KEYS

* updated tests

* fix

* fix?

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add flag to enable/disable param upload (#912)

* Add flag to enable/disable param upload

* Yapf

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

* Rename

* Add to eval

---------

Co-authored-by: Daniel King <[email protected]>

* Add support for eval_loader & eval_subset_num_batches in async callback (#834)

* Skip evalloader in training if using async eval

* add support for subset_num_batches

* remove todo

* eval first

* rename arg

* fix

* small updates

* om

* fix test

* eval run config

---------

Co-authored-by: Daniel King <[email protected]>

* Add the model license file for mlflow (#915)

* Warn instead of error on tokenizer-only with http (#904)

* Fix fmapi_chat for instruct models and custom tokenizers (#914)

* Fix fmapi_chat for instruct models and custom tokenizers

* remove from tiktoken

* fix

* add tests

* fix test, 0->1

* refactor

* Make yamllint consistent with Composer (#918)

* Create HF checkpointer model on meta device (#916)

* Tiktoken chat format fix (#893)

* sys prompt fix

* remove eos tokens from chat formatter

* fix dash issue (#919)

* fix dash issue

* fix

* fix?

* added unit test

* fix fix

* fix tests

* fix fix tests

* Fixes yaml linting (#920)

* Adding deprecation warning for Flash Attention 1 and user warning against using Triton attention. (#921)

* Add rich formatting to tracebacks (#927)

* added rich traceback

* sorted imports

* added rich to eval

* Changes to setup.py invalidate docker cache. Use branch name in dockerfile (#930)

Co-authored-by: Daniel King <[email protected]>

* Remove .ci folder and move FILE_HEADER (#931)

* Throw error when no EOS (#922)

* bump (#934)

* Update eval_gauntlet_callback.py with math.log2 (#821)

Saw an automated ruff flag this, seems like a strict improvement and is marginally faster.

Co-authored-by: Daniel King <[email protected]>

* Switch to the Composer integration of LoRA (works with FSDP) (#886)

* Refactoring the function to accept list of metric names instead of a dictionary of metrics. (#938)

* ..

* undoing prev commit

* Refactoring the  function to accept list of metric names instead of dictionary

* ..

* ..

* ..

* ..

* Remove extra call to .to and load_state_dict in hf checkpointer (#939)

* Fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking (#940)

* ..

* undoing prev commit

* fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking

* Update modeling_mpt.py

* ..

---------

Co-authored-by: Daniel King <[email protected]>

* Update lora docs (#941)

* fix (#942)

* Retrieve license information when local files are provided for a pretrained model (#943)

* Initial implementation to test

* Add log for license overwrite

* Use Path for input to _write_license_information

* Set default

---------

Co-authored-by: Daniel King <[email protected]>

* Add and use VersionedDeprecationWarning (#944)

* Add and use VersionedDeprecationWarning

* Use remove_version instead.

* Fix merge

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Bump llm-foundry version to 0.5.0 (#948)

* Bump version to 0.5.0

* Remove deprecated features

* Other cleanup

* code quality

* Fix chain-of-thought tasks (#824)

* Skip flaky lion8b test (#598)

* relax atol and add retries to reduce flakiness in lion8b timing test

* add eval output logging

* add back tasks

* foo

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* fix prompt

* fix prompt

* modify mcli

* test

* test

* fix

* added math dataset

* edit yaml

* prep gsm8k identically to eleuther

* prep gsm8k identically to eleuther

* add early stopping criteria

* finish

* debug

* fix

* bug

* remove eval output logging callback

* restore

* fix

* fix

* fix composer verion

* gauntlet v0.2.1

* gauntlet v0.2.1

* prep

* prep

* foo

* restore

* restore

* restore mcli

* fix precommit

* fix

* Update hf_eval.yaml

* fix

* fix

* remove programming

* update readme

---------

Co-authored-by: dblalock <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Add finetuning streaming dataset conversion (#933)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* review comments

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update scripts/data_prep/convert_finetuning_dataset.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add default signature to mlflow saved model (#952)

* allow te to use meta device with deferred init (#958)

* Update TUTORIAL.md (#957)

* Update TUTORIAL.md

fix indentation problem

* Update TUTORIAL.md

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Bump mcli yaml foundry version to v0.5.0 (#959)

* add finutuning with streaming dataset example (#945)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* fix yaml

* comments

* comments

* comments

* add unit test

* comments

---------

Co-authored-by: Daniel King <[email protected]>

* Add fully configurable activation checkpointing (#951)

* add fully configurable activation checkpointing

* fix format

* fix format

* add docstring to activation_checkpointing_fn

* add block id range option in act ckpt

* resolve conflict

* add a check for blocks ids overlap in mapping

* fix typo

* update docstring

* refactor

* fix test

* Apply suggestions from code review

Co-authored-by: Mihir Patel <[email protected]>

* address comments

* add build mapping as a helper func

* fix format

---------

Co-authored-by: Mihir Patel <[email protected]>

* Use create_model_version instead of register_model (#953)

* Add streams support (#946)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* add convert

* fix

* v0

* fix

* fix MDS write

* streams support

* fake commit

* fix setup

* format

* add back arxiv

* trigger test

* review comments

* temporarily trigger test

* test

* add convert

* fix

* fix

* fix MDS write

* format

* trigger test

* fix

* format

* resolve conflicts

* add back jsonl

* fix yaml

* comments

* format

* comments

* comments

* add unit test

* comments

* comments

* merge

* format

* typo

* Update llmfoundry/data/finetuning/dataloader.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Fix typo (#966)

* Fix eval.py with lora (#965)

* just remove it?

* or not

* fix

* fix up

* clean up

* fix example yaml

* precommit

* add test

* add memorysnapshot to callbacks (#810)

Co-authored-by: Daniel King <[email protected]>

* Adding curriculum learning callback (experimental) (#954)

* curriculum learning callback

* curriculum learning callback

* fixing types

* dataset config types correct

* dataset config retrieved correctly

* access train dataloader correctly

* load state dict defaults

* get that damn dataloader

* missed dat

* dataspec L

* dataset L

* no logging, print is my best friend

* save first dataset config

* don't save new dataset config every single time

* logging dataset state

* have to set the damn timestamp. rip

* remove logging

* linting

* pyright

* removing rope...

* Delete scripts/eval/local_data/.DS_Store

* trailing comma is bacc

* fixed docstring

* fixed docstrings

* no more funky stuff in save_dict

* refactored, assuming before_load event in composer

* lingint

* bumped composer and streaming min versions

* moved line

* strengthened chat formatting validation (#960)

* strengthened chat formatting validation

* fix types

* made assert messages more descriptive

* used raise instead of assert, added type checks

* added list type check

* type error if no string content

* add test case for new validation

* relaxed type constraints to interface minimum

* use Mapping and Iterable

* fix mapping in type aliases too

* iterable -> sequence

* sequence -> list

* Mapping -> Dict

* use mapping again

* fixed another one

* updated message

* factored out duplicate functions

* dict -> mapping

* add sequence

* Add new base images and remove fa1 images (#970)

* Add new ICL kwargs in eval.py and long_context yamls (#925)

* add yamls w/ old links

* load from max's public hf and parse hf datasets

* update rest of tasks

* add better logging

* implemented leval tasks

* move level

* add level yaml

* add str parsing to hf

* wip

* llm-foundry working with new parser

* working w/ new parsing

* fix old long context tasks

* wip

* wip

* wip

* wip

* update to hf_parsing_map

* rm defaults

* fix parsing vars

* update defaults again

* rm merge conflict

* fix gen_kwargs

* rm old code path

* fixups

* wip

* rm leval from pr

* fix comments in yamls

* add cot params

* add fewshot_random_seed

* fix early_stopping_criteria, fewshot_num_seed default

* undo rm hf_eval

* add fewshot_random_seed to test

* add 64k tasks

* add longer context, update composer versin

* address comments

* mixed

* use seed by default

* rm  long_context_eval_8k.yaml

* add longer context evals

* mv yamls

* eval gauntlet wip

* update niah and wikiqa

* fix linting

* add default option

* change defaults

* fix linting

* fix linting 2

---------

Co-authored-by: Daniel King <[email protected]>

* Make Composer pins consistent with each other (#972)

* Make turbo an optional dependency (#964)

* Fix fewshot_random_seed default setting (#974)

* del fewshot_random default, fix hf_eval, fix gauntlet readme

* set in cfg defaults area

* fix the fix i applied that was actually not a fix

* rm num_batch from hf_eval

* improve error msg when checking target_blocks in activation_checkpointing_target (#977)

* Torch 2.2 upgrade - Part 1 (#976)

* Torch 2.2 - Part 2 (#979)

* PyTorch 2.2 - Part 3 (#981)

* Remove torch 2.1 from docker build (#982)

* Async callback: Don't skip checkpoints, reliably only launch async eval when the checkpoint is ready (#813)

* working without sharded checkpointing..

* add more debugs

* try this

* more debugging

* yikes dumb bug

* add notes

* fixes

* remove prints

* small updates

* fix typo

* refactor

* fix docstring formatting

* fighting with docstrings

* try this

* add unit tests

* point to composer update

* values -> items

* serialize time

* fix merge

* nits

* warning, small comment update

* add error

---------

Co-authored-by: Daniel King <[email protected]>

* Token accuracy metrics (#983)

* do not mention 1.13 in readme (#988)

Co-authored-by: Daniel King <[email protected]>

* Patch test, lock mcli version (#990)

* Bump gha timeouts (#991)

* Fix readme typo (#993)

* if condition in tie weights added (#989)

* if condition in tie weights added

* unit test for tie weights

* bump composer version (#995)

* Trim examples ahead of time for auto packing (#994)

* add oom observer callback (#932)

* add oom observer callback

* fix format

* Change ci/cd to use ci-testing repo

* Revert "Change ci/cd to use ci-testing repo"

This reverts commit e3f214e.

* Use ci-testing repo (#1000)

Co-authored-by: Irene Dea <[email protected]>

* Make CodeEval respect device_eval_batch_size (#956)

* Remove try except around imports (#1004)

* Deprecate triton, prefix lm, llama attention patch, and text denoising; Make ComposerHFT5 experimental (#1007)

* Deprecate features and mark experimental

* fix typo

---------

Co-authored-by: Daniel King <[email protected]>

* add magic filename for sharded state dicts (#1001)

* add magic filename for sharded state dicts

* Update scripts/train/train.py

Co-authored-by: Daniel King <[email protected]>

* oops forgot to push this

* no shard if no fsdp

* default to full on foundry

---------

Co-authored-by: Daniel King <[email protected]>

* bump (#1009)

* Fix evaluators actually pulling eval metrics (#1006)

* fix bug on metrics

* lint

* lint

* add unit test

* lint

* Build torch 2.2.1 images (#1010)

* add 2.2.1 tests (#1011)

* Bump min torch pin (#1013)

Red button because CI running jobs it doesn't need. Tests passed on main.

* Fix extra BOS token in front of response for some tokenizers (#1003)

* Bump min composer pin (#1015)

* add default for eval interval (#987)

Co-authored-by: Daniel King <[email protected]>

* Add support for olmo (#1016)

* Add deeper support for multi-turn chats and loss-generating tokens in finetuning (#985)

The main purpose of this PR is to support training on non-terminal responses in multi-round chats. This is achieved by tokenizing at the level of conversation "turns" and exposing some options for what turns are used as training targets (i.e. generate loss). This also adds support for treating prompt tokens as loss-generating.

The script for converting a finetuning dataset to streaming has also been updated (with some bug fixes).

* Fix profiling packing ratio to explicitly say 1 (#1019)

* Bump transformers to 4.38.2 (#1018)

* that kwargs (#1020)

* Update readme with pytorch 2.2.1 (#1021)

* Add code import to train/eval scripts (#1002)

* finish (#1022)

Co-authored-by: Max Marion <[email protected]>

* Bump version to 0.6.0 (#1023)

* Fix typo in monolithic chkpt callback docs (#1024)

* Fix typo in monolithic chkpt callback docs

* reorder to match function signature

* update pip install link

* Change done file location

* Create the dest folder

* Allow code-quality workflow to be callable (#1026)

Reverts part of the change made in
https://github.com/mosaicml/llm-foundry/pull/1000/files#diff-4a2765c2cfcbd3804a66aab805cb92ddda74de1730923cc5bf53671d0beccf06L11

* update notebook

* update

* update notebook

* update token_counts

* update pip install list

---------

Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Milo Cress <[email protected]>
Co-authored-by: Nancy Hung <[email protected]>
Co-authored-by: Jerry Chen <[email protected]>
Co-authored-by: Shashank Rajput <[email protected]>
Co-authored-by: Vitaliy Chiley <[email protected]>
Co-authored-by: Irene Dea <[email protected]>
Co-authored-by: Brian <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: Nicholas Garcia <[email protected]>
Co-authored-by: Prithviraj Ammanabrolu <[email protected]>
Co-authored-by: Jane Zhang <[email protected]>
Co-authored-by: Vincent Chen <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: dblalock <[email protected]>
Co-authored-by: bigning <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Sebastián Donoso Bustos <[email protected]>
Co-authored-by: Saaketh Narayan <[email protected]>
Co-authored-by: Max Marion <[email protected]>
Co-authored-by: Megha Agarwal <[email protected]>
Co-authored-by: Jose Javier <[email protected]>
Co-authored-by: Alex Trott <[email protected]>
Co-authored-by: Sasha Doubov <[email protected]>
XiaohanZhangCMU added a commit that referenced this pull request Mar 14, 2024
* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* updat

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist  (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* Logs upload URI (#850)

* fix style etc.

* fix

* fix fix

* fix fix fix

* fix fix fix fix

* removed unused dummy func

* deleted tests to make the tests pass

* tried adding back some tests to see if it triggers the issue

* add test_hf_checkpointer.py but remove references to MPT

* fix?

* fixed test cases overlapping in strange side-effecty ways

* update

* Delta to JSONL conversion script cleanup and bug fix (#868)

* Small test change

* small cleanups

* lint and precommit

* lint and precommit

* comments

* another one

* pr suggestion and use input param not args

* fix mock (#872)

* Add response tokens

* update

* fix regex (#877)

* Precompute flash attention padding info (#880)

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* dummy data

* undoing last commit

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* ..

* ..

---------

Co-authored-by: Vitaliy Chiley <[email protected]>

* add missing import (#882)

* fsdp wrap refac (#883)

* fsdp wrap refac

* refac

* refac

* Update model download utils to support ORAS (#881)

* wip

* wip

* Accept registry file for hostname

* Make sure no sensitive info is surfaced in subprocess error

* Refactor model downloading

* Save HF hub files to local dir

* fallback

* Remove commented code

* Update logging

* Update HTP download args

* Use files for ORAS

* Update llmfoundry/utils/model_download_utils.py

Co-authored-by: Irene Dea <[email protected]>

---------

Co-authored-by: Irene Dea <[email protected]>

* Update license (#887)

Updates the license for 2024. New files will have a copyright year of 2024 inserted in the header. Existing files will not be changed.

* Fix tiktoken add generation prompt (#890)

* update

* Upgrade Datasets version (#892)

* Disable MDSWrite, return token counts

* Bump transformers version to support Mixtral (#894)

* Add `tokenizer-only` flag to only download tokenizers from HF or oras (#895)

* Foundational Model API eval wrapper (#849)

* FMAPI model wrapper

* add chat wrapper too

* revert

* end line

* formatting

* less verbose

* better error messages

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* Add better error for non-empty local output folder in convert_text_to_mds.py (#891)

* Allow bool input for loggers (#897)

* Allow bool input for loggers

* Convert earlier on

* Fix test case

* Enable QK Group Norm (#869)

* start qkgn

* attn defaults for qk_gn

* impl qk_gn

* Update attention.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* lint

* Update attention.py

* lint

* add avlue error

* Update attention.py

* updt to include low precision groupnorm;

* perf improvement

* Revert "perf improvement"

This reverts commit 2b62d5e.

* Revert "updt to include low precision groupnorm;"

This reverts commit bca1c33.

* patch (#905)

* Add new GC option (#907)

* No symlinks at all for HF download (#908)

* Adds support for chat formatted finetuning input data. (#884)

* fix conflicting formatting linting guidelines

* used older union operator for legacy support

* did the same thing in another place

* isort ignore specific lines

* fixes

* isort do not skip line

* address comments

* renamed some more things

* split tests and add some verification for tokenization split

* fix formatting

* added docstrings

* added end-to-end-test with HF dataset

* fix code style

* renamed file and fixed tests

* use chat template diff

* addressed comment

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* fixed type of TokenizedExample

* use cast

* use _ALLOWED_{PROMPT, RESPONSE}_KEYS

* updated tests

* fix

* fix?

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add flag to enable/disable param upload (#912)

* Add flag to enable/disable param upload

* Yapf

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

* Rename

* Add to eval

---------

Co-authored-by: Daniel King <[email protected]>

* Add support for eval_loader & eval_subset_num_batches in async callback (#834)

* Skip evalloader in training if using async eval

* add support for subset_num_batches

* remove todo

* eval first

* rename arg

* fix

* small updates

* om

* fix test

* eval run config

---------

Co-authored-by: Daniel King <[email protected]>

* Add the model license file for mlflow (#915)

* Warn instead of error on tokenizer-only with http (#904)

* Fix fmapi_chat for instruct models and custom tokenizers (#914)

* Fix fmapi_chat for instruct models and custom tokenizers

* remove from tiktoken

* fix

* add tests

* fix test, 0->1

* refactor

* Make yamllint consistent with Composer (#918)

* Create HF checkpointer model on meta device (#916)

* Tiktoken chat format fix (#893)

* sys prompt fix

* remove eos tokens from chat formatter

* fix dash issue (#919)

* fix dash issue

* fix

* fix?

* added unit test

* fix fix

* fix tests

* fix fix tests

* Fixes yaml linting (#920)

* Adding deprecation warning for Flash Attention 1 and user warning against using Triton attention. (#921)

* Add rich formatting to tracebacks (#927)

* added rich traceback

* sorted imports

* added rich to eval

* Changes to setup.py invalidate docker cache. Use branch name in dockerfile (#930)

Co-authored-by: Daniel King <[email protected]>

* Remove .ci folder and move FILE_HEADER (#931)

* Throw error when no EOS (#922)

* bump (#934)

* Update eval_gauntlet_callback.py with math.log2 (#821)

Saw an automated ruff flag this, seems like a strict improvement and is marginally faster.

Co-authored-by: Daniel King <[email protected]>

* Switch to the Composer integration of LoRA (works with FSDP) (#886)

* Refactoring the function to accept list of metric names instead of a dictionary of metrics. (#938)

* ..

* undoing prev commit

* Refactoring the  function to accept list of metric names instead of dictionary

* ..

* ..

* ..

* ..

* Remove extra call to .to and load_state_dict in hf checkpointer (#939)

* Fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking (#940)

* ..

* undoing prev commit

* fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking

* Update modeling_mpt.py

* ..

---------

Co-authored-by: Daniel King <[email protected]>

* Update lora docs (#941)

* fix (#942)

* Retrieve license information when local files are provided for a pretrained model (#943)

* Initial implementation to test

* Add log for license overwrite

* Use Path for input to _write_license_information

* Set default

---------

Co-authored-by: Daniel King <[email protected]>

* Add and use VersionedDeprecationWarning (#944)

* Add and use VersionedDeprecationWarning

* Use remove_version instead.

* Fix merge

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Bump llm-foundry version to 0.5.0 (#948)

* Bump version to 0.5.0

* Remove deprecated features

* Other cleanup

* code quality

* Fix chain-of-thought tasks (#824)

* Skip flaky lion8b test (#598)

* relax atol and add retries to reduce flakiness in lion8b timing test

* add eval output logging

* add back tasks

* foo

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* fix prompt

* fix prompt

* modify mcli

* test

* test

* fix

* added math dataset

* edit yaml

* prep gsm8k identically to eleuther

* prep gsm8k identically to eleuther

* add early stopping criteria

* finish

* debug

* fix

* bug

* remove eval output logging callback

* restore

* fix

* fix

* fix composer verion

* gauntlet v0.2.1

* gauntlet v0.2.1

* prep

* prep

* foo

* restore

* restore

* restore mcli

* fix precommit

* fix

* Update hf_eval.yaml

* fix

* fix

* remove programming

* update readme

---------

Co-authored-by: dblalock <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Add finetuning streaming dataset conversion (#933)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* review comments

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update scripts/data_prep/convert_finetuning_dataset.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add default signature to mlflow saved model (#952)

* allow te to use meta device with deferred init (#958)

* Update TUTORIAL.md (#957)

* Update TUTORIAL.md

fix indentation problem

* Update TUTORIAL.md

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Bump mcli yaml foundry version to v0.5.0 (#959)

* add finutuning with streaming dataset example (#945)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* fix yaml

* comments

* comments

* comments

* add unit test

* comments

---------

Co-authored-by: Daniel King <[email protected]>

* Add fully configurable activation checkpointing (#951)

* add fully configurable activation checkpointing

* fix format

* fix format

* add docstring to activation_checkpointing_fn

* add block id range option in act ckpt

* resolve conflict

* add a check for blocks ids overlap in mapping

* fix typo

* update docstring

* refactor

* fix test

* Apply suggestions from code review

Co-authored-by: Mihir Patel <[email protected]>

* address comments

* add build mapping as a helper func

* fix format

---------

Co-authored-by: Mihir Patel <[email protected]>

* Use create_model_version instead of register_model (#953)

* Add streams support (#946)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* add convert

* fix

* v0

* fix

* fix MDS write

* streams support

* fake commit

* fix setup

* format

* add back arxiv

* trigger test

* review comments

* temporarily trigger test

* test

* add convert

* fix

* fix

* fix MDS write

* format

* trigger test

* fix

* format

* resolve conflicts

* add back jsonl

* fix yaml

* comments

* format

* comments

* comments

* add unit test

* comments

* comments

* merge

* format

* typo

* Update llmfoundry/data/finetuning/dataloader.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Fix typo (#966)

* Fix eval.py with lora (#965)

* just remove it?

* or not

* fix

* fix up

* clean up

* fix example yaml

* precommit

* add test

* add memorysnapshot to callbacks (#810)

Co-authored-by: Daniel King <[email protected]>

* Adding curriculum learning callback (experimental) (#954)

* curriculum learning callback

* curriculum learning callback

* fixing types

* dataset config types correct

* dataset config retrieved correctly

* access train dataloader correctly

* load state dict defaults

* get that damn dataloader

* missed dat

* dataspec L

* dataset L

* no logging, print is my best friend

* save first dataset config

* don't save new dataset config every single time

* logging dataset state

* have to set the damn timestamp. rip

* remove logging

* linting

* pyright

* removing rope...

* Delete scripts/eval/local_data/.DS_Store

* trailing comma is bacc

* fixed docstring

* fixed docstrings

* no more funky stuff in save_dict

* refactored, assuming before_load event in composer

* lingint

* bumped composer and streaming min versions

* moved line

* strengthened chat formatting validation (#960)

* strengthened chat formatting validation

* fix types

* made assert messages more descriptive

* used raise instead of assert, added type checks

* added list type check

* type error if no string content

* add test case for new validation

* relaxed type constraints to interface minimum

* use Mapping and Iterable

* fix mapping in type aliases too

* iterable -> sequence

* sequence -> list

* Mapping -> Dict

* use mapping again

* fixed another one

* updated message

* factored out duplicate functions

* dict -> mapping

* add sequence

* Add new base images and remove fa1 images (#970)

* Add new ICL kwargs in eval.py and long_context yamls (#925)

* add yamls w/ old links

* load from max's public hf and parse hf datasets

* update rest of tasks

* add better logging

* implemented leval tasks

* move level

* add level yaml

* add str parsing to hf

* wip

* llm-foundry working with new parser

* working w/ new parsing

* fix old long context tasks

* wip

* wip

* wip

* wip

* update to hf_parsing_map

* rm defaults

* fix parsing vars

* update defaults again

* rm merge conflict

* fix gen_kwargs

* rm old code path

* fixups

* wip

* rm leval from pr

* fix comments in yamls

* add cot params

* add fewshot_random_seed

* fix early_stopping_criteria, fewshot_num_seed default

* undo rm hf_eval

* add fewshot_random_seed to test

* add 64k tasks

* add longer context, update composer versin

* address comments

* mixed

* use seed by default

* rm  long_context_eval_8k.yaml

* add longer context evals

* mv yamls

* eval gauntlet wip

* update niah and wikiqa

* fix linting

* add default option

* change defaults

* fix linting

* fix linting 2

---------

Co-authored-by: Daniel King <[email protected]>

* Make Composer pins consistent with each other (#972)

* Make turbo an optional dependency (#964)

* Fix fewshot_random_seed default setting (#974)

* del fewshot_random default, fix hf_eval, fix gauntlet readme

* set in cfg defaults area

* fix the fix i applied that was actually not a fix

* rm num_batch from hf_eval

* improve error msg when checking target_blocks in activation_checkpointing_target (#977)

* Torch 2.2 upgrade - Part 1 (#976)

* Torch 2.2 - Part 2 (#979)

* PyTorch 2.2 - Part 3 (#981)

* Remove torch 2.1 from docker build (#982)

* Async callback: Don't skip checkpoints, reliably only launch async eval when the checkpoint is ready (#813)

* working without sharded checkpointing..

* add more debugs

* try this

* more debugging

* yikes dumb bug

* add notes

* fixes

* remove prints

* small updates

* fix typo

* refactor

* fix docstring formatting

* fighting with docstrings

* try this

* add unit tests

* point to composer update

* values -> items

* serialize time

* fix merge

* nits

* warning, small comment update

* add error

---------

Co-authored-by: Daniel King <[email protected]>

* Token accuracy metrics (#983)

* do not mention 1.13 in readme (#988)

Co-authored-by: Daniel King <[email protected]>

* Patch test, lock mcli version (#990)

* Bump gha timeouts (#991)

* Fix readme typo (#993)

* if condition in tie weights added (#989)

* if condition in tie weights added

* unit test for tie weights

* bump composer version (#995)

* Trim examples ahead of time for auto packing (#994)

* add oom observer callback (#932)

* add oom observer callback

* fix format

* Change ci/cd to use ci-testing repo

* Revert "Change ci/cd to use ci-testing repo"

This reverts commit e3f214e.

* Use ci-testing repo (#1000)

Co-authored-by: Irene Dea <[email protected]>

* Make CodeEval respect device_eval_batch_size (#956)

* Remove try except around imports (#1004)

* Deprecate triton, prefix lm, llama attention patch, and text denoising; Make ComposerHFT5 experimental (#1007)

* Deprecate features and mark experimental

* fix typo

---------

Co-authored-by: Daniel King <[email protected]>

* add magic filename for sharded state dicts (#1001)

* add magic filename for sharded state dicts

* Update scripts/train/train.py

Co-authored-by: Daniel King <[email protected]>

* oops forgot to push this

* no shard if no fsdp

* default to full on foundry

---------

Co-authored-by: Daniel King <[email protected]>

* bump (#1009)

* Fix evaluators actually pulling eval metrics (#1006)

* fix bug on metrics

* lint

* lint

* add unit test

* lint

* Build torch 2.2.1 images (#1010)

* add 2.2.1 tests (#1011)

* Bump min torch pin (#1013)

Red button because CI running jobs it doesn't need. Tests passed on main.

* Fix extra BOS token in front of response for some tokenizers (#1003)

* Bump min composer pin (#1015)

* add default for eval interval (#987)

Co-authored-by: Daniel King <[email protected]>

* Add support for olmo (#1016)

* Add deeper support for multi-turn chats and loss-generating tokens in finetuning (#985)

The main purpose of this PR is to support training on non-terminal responses in multi-round chats. This is achieved by tokenizing at the level of conversation "turns" and exposing some options for what turns are used as training targets (i.e. generate loss). This also adds support for treating prompt tokens as loss-generating.

The script for converting a finetuning dataset to streaming has also been updated (with some bug fixes).

* Fix profiling packing ratio to explicitly say 1 (#1019)

* Bump transformers to 4.38.2 (#1018)

* that kwargs (#1020)

* Update readme with pytorch 2.2.1 (#1021)

* Add code import to train/eval scripts (#1002)

* finish (#1022)

Co-authored-by: Max Marion <[email protected]>

* Bump version to 0.6.0 (#1023)

* Fix typo in monolithic chkpt callback docs (#1024)

* Fix typo in monolithic chkpt callback docs

* reorder to match function signature

* update pip install link

* Change done file location

* Create the dest folder

* Allow code-quality workflow to be callable (#1026)

Reverts part of the change made in
https://github.com/mosaicml/llm-foundry/pull/1000/files#diff-4a2765c2cfcbd3804a66aab805cb92ddda74de1730923cc5bf53671d0beccf06L11

* update notebook

* update

* update notebook

* update token_counts

* update pip install list

* fix

* update

* fix token counts

* Expose validate chat

* Expose more

* update

* expose

* add collate

* Fix

---------

Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Milo Cress <[email protected]>
Co-authored-by: Nancy Hung <[email protected]>
Co-authored-by: Jerry Chen <[email protected]>
Co-authored-by: Shashank Rajput <[email protected]>
Co-authored-by: Vitaliy Chiley <[email protected]>
Co-authored-by: Irene Dea <[email protected]>
Co-authored-by: Brian <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: Nicholas Garcia <[email protected]>
Co-authored-by: Prithviraj Ammanabrolu <[email protected]>
Co-authored-by: Jane Zhang <[email protected]>
Co-authored-by: Vincent Chen <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: dblalock <[email protected]>
Co-authored-by: bigning <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Sebastián Donoso Bustos <[email protected]>
Co-authored-by: Saaketh Narayan <[email protected]>
Co-authored-by: Max Marion <[email protected]>
Co-authored-by: Megha Agarwal <[email protected]>
Co-authored-by: Jose Javier <[email protected]>
Co-authored-by: Alex Trott <[email protected]>
Co-authored-by: Sasha Doubov <[email protected]>
XiaohanZhangCMU added a commit that referenced this pull request Mar 14, 2024
* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* updat

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist  (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* Logs upload URI (#850)

* fix style etc.

* fix

* fix fix

* fix fix fix

* fix fix fix fix

* removed unused dummy func

* deleted tests to make the tests pass

* tried adding back some tests to see if it triggers the issue

* add test_hf_checkpointer.py but remove references to MPT

* fix?

* fixed test cases overlapping in strange side-effecty ways

* update

* Delta to JSONL conversion script cleanup and bug fix (#868)

* Small test change

* small cleanups

* lint and precommit

* lint and precommit

* comments

* another one

* pr suggestion and use input param not args

* fix mock (#872)

* Add response tokens

* update

* fix regex (#877)

* Precompute flash attention padding info (#880)

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* dummy data

* undoing last commit

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* ..

* ..

---------

Co-authored-by: Vitaliy Chiley <[email protected]>

* add missing import (#882)

* fsdp wrap refac (#883)

* fsdp wrap refac

* refac

* refac

* Update model download utils to support ORAS (#881)

* wip

* wip

* Accept registry file for hostname

* Make sure no sensitive info is surfaced in subprocess error

* Refactor model downloading

* Save HF hub files to local dir

* fallback

* Remove commented code

* Update logging

* Update HTP download args

* Use files for ORAS

* Update llmfoundry/utils/model_download_utils.py

Co-authored-by: Irene Dea <[email protected]>

---------

Co-authored-by: Irene Dea <[email protected]>

* Update license (#887)

Updates the license for 2024. New files will have a copyright year of 2024 inserted in the header. Existing files will not be changed.

* Fix tiktoken add generation prompt (#890)

* update

* Upgrade Datasets version (#892)

* Disable MDSWrite, return token counts

* Bump transformers version to support Mixtral (#894)

* Add `tokenizer-only` flag to only download tokenizers from HF or oras (#895)

* Foundational Model API eval wrapper (#849)

* FMAPI model wrapper

* add chat wrapper too

* revert

* end line

* formatting

* less verbose

* better error messages

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* Add better error for non-empty local output folder in convert_text_to_mds.py (#891)

* Allow bool input for loggers (#897)

* Allow bool input for loggers

* Convert earlier on

* Fix test case

* Enable QK Group Norm (#869)

* start qkgn

* attn defaults for qk_gn

* impl qk_gn

* Update attention.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* lint

* Update attention.py

* lint

* add avlue error

* Update attention.py

* updt to include low precision groupnorm;

* perf improvement

* Revert "perf improvement"

This reverts commit 2b62d5e.

* Revert "updt to include low precision groupnorm;"

This reverts commit bca1c33.

* patch (#905)

* Add new GC option (#907)

* No symlinks at all for HF download (#908)

* Adds support for chat formatted finetuning input data. (#884)

* fix conflicting formatting linting guidelines

* used older union operator for legacy support

* did the same thing in another place

* isort ignore specific lines

* fixes

* isort do not skip line

* address comments

* renamed some more things

* split tests and add some verification for tokenization split

* fix formatting

* added docstrings

* added end-to-end-test with HF dataset

* fix code style

* renamed file and fixed tests

* use chat template diff

* addressed comment

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* fixed type of TokenizedExample

* use cast

* use _ALLOWED_{PROMPT, RESPONSE}_KEYS

* updated tests

* fix

* fix?

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add flag to enable/disable param upload (#912)

* Add flag to enable/disable param upload

* Yapf

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

* Rename

* Add to eval

---------

Co-authored-by: Daniel King <[email protected]>

* Add support for eval_loader & eval_subset_num_batches in async callback (#834)

* Skip evalloader in training if using async eval

* add support for subset_num_batches

* remove todo

* eval first

* rename arg

* fix

* small updates

* om

* fix test

* eval run config

---------

Co-authored-by: Daniel King <[email protected]>

* Add the model license file for mlflow (#915)

* Warn instead of error on tokenizer-only with http (#904)

* Fix fmapi_chat for instruct models and custom tokenizers (#914)

* Fix fmapi_chat for instruct models and custom tokenizers

* remove from tiktoken

* fix

* add tests

* fix test, 0->1

* refactor

* Make yamllint consistent with Composer (#918)

* Create HF checkpointer model on meta device (#916)

* Tiktoken chat format fix (#893)

* sys prompt fix

* remove eos tokens from chat formatter

* fix dash issue (#919)

* fix dash issue

* fix

* fix?

* added unit test

* fix fix

* fix tests

* fix fix tests

* Fixes yaml linting (#920)

* Adding deprecation warning for Flash Attention 1 and user warning against using Triton attention. (#921)

* Add rich formatting to tracebacks (#927)

* added rich traceback

* sorted imports

* added rich to eval

* Changes to setup.py invalidate docker cache. Use branch name in dockerfile (#930)

Co-authored-by: Daniel King <[email protected]>

* Remove .ci folder and move FILE_HEADER (#931)

* Throw error when no EOS (#922)

* bump (#934)

* Update eval_gauntlet_callback.py with math.log2 (#821)

Saw an automated ruff flag this, seems like a strict improvement and is marginally faster.

Co-authored-by: Daniel King <[email protected]>

* Switch to the Composer integration of LoRA (works with FSDP) (#886)

* Refactoring the function to accept list of metric names instead of a dictionary of metrics. (#938)

* ..

* undoing prev commit

* Refactoring the  function to accept list of metric names instead of dictionary

* ..

* ..

* ..

* ..

* Remove extra call to .to and load_state_dict in hf checkpointer (#939)

* Fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking (#940)

* ..

* undoing prev commit

* fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking

* Update modeling_mpt.py

* ..

---------

Co-authored-by: Daniel King <[email protected]>

* Update lora docs (#941)

* fix (#942)

* Retrieve license information when local files are provided for a pretrained model (#943)

* Initial implementation to test

* Add log for license overwrite

* Use Path for input to _write_license_information

* Set default

---------

Co-authored-by: Daniel King <[email protected]>

* Add and use VersionedDeprecationWarning (#944)

* Add and use VersionedDeprecationWarning

* Use remove_version instead.

* Fix merge

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Bump llm-foundry version to 0.5.0 (#948)

* Bump version to 0.5.0

* Remove deprecated features

* Other cleanup

* code quality

* Fix chain-of-thought tasks (#824)

* Skip flaky lion8b test (#598)

* relax atol and add retries to reduce flakiness in lion8b timing test

* add eval output logging

* add back tasks

* foo

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* fix prompt

* fix prompt

* modify mcli

* test

* test

* fix

* added math dataset

* edit yaml

* prep gsm8k identically to eleuther

* prep gsm8k identically to eleuther

* add early stopping criteria

* finish

* debug

* fix

* bug

* remove eval output logging callback

* restore

* fix

* fix

* fix composer verion

* gauntlet v0.2.1

* gauntlet v0.2.1

* prep

* prep

* foo

* restore

* restore

* restore mcli

* fix precommit

* fix

* Update hf_eval.yaml

* fix

* fix

* remove programming

* update readme

---------

Co-authored-by: dblalock <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Add finetuning streaming dataset conversion (#933)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* review comments

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update scripts/data_prep/convert_finetuning_dataset.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add default signature to mlflow saved model (#952)

* allow te to use meta device with deferred init (#958)

* Update TUTORIAL.md (#957)

* Update TUTORIAL.md

fix indentation problem

* Update TUTORIAL.md

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Bump mcli yaml foundry version to v0.5.0 (#959)

* add finutuning with streaming dataset example (#945)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* fix yaml

* comments

* comments

* comments

* add unit test

* comments

---------

Co-authored-by: Daniel King <[email protected]>

* Add fully configurable activation checkpointing (#951)

* add fully configurable activation checkpointing

* fix format

* fix format

* add docstring to activation_checkpointing_fn

* add block id range option in act ckpt

* resolve conflict

* add a check for blocks ids overlap in mapping

* fix typo

* update docstring

* refactor

* fix test

* Apply suggestions from code review

Co-authored-by: Mihir Patel <[email protected]>

* address comments

* add build mapping as a helper func

* fix format

---------

Co-authored-by: Mihir Patel <[email protected]>

* Use create_model_version instead of register_model (#953)

* Add streams support (#946)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* add convert

* fix

* v0

* fix

* fix MDS write

* streams support

* fake commit

* fix setup

* format

* add back arxiv

* trigger test

* review comments

* temporarily trigger test

* test

* add convert

* fix

* fix

* fix MDS write

* format

* trigger test

* fix

* format

* resolve conflicts

* add back jsonl

* fix yaml

* comments

* format

* comments

* comments

* add unit test

* comments

* comments

* merge

* format

* typo

* Update llmfoundry/data/finetuning/dataloader.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Fix typo (#966)

* Fix eval.py with lora (#965)

* just remove it?

* or not

* fix

* fix up

* clean up

* fix example yaml

* precommit

* add test

* add memorysnapshot to callbacks (#810)

Co-authored-by: Daniel King <[email protected]>

* Adding curriculum learning callback (experimental) (#954)

* curriculum learning callback

* curriculum learning callback

* fixing types

* dataset config types correct

* dataset config retrieved correctly

* access train dataloader correctly

* load state dict defaults

* get that damn dataloader

* missed dat

* dataspec L

* dataset L

* no logging, print is my best friend

* save first dataset config

* don't save new dataset config every single time

* logging dataset state

* have to set the damn timestamp. rip

* remove logging

* linting

* pyright

* removing rope...

* Delete scripts/eval/local_data/.DS_Store

* trailing comma is bacc

* fixed docstring

* fixed docstrings

* no more funky stuff in save_dict

* refactored, assuming before_load event in composer

* lingint

* bumped composer and streaming min versions

* moved line

* strengthened chat formatting validation (#960)

* strengthened chat formatting validation

* fix types

* made assert messages more descriptive

* used raise instead of assert, added type checks

* added list type check

* type error if no string content

* add test case for new validation

* relaxed type constraints to interface minimum

* use Mapping and Iterable

* fix mapping in type aliases too

* iterable -> sequence

* sequence -> list

* Mapping -> Dict

* use mapping again

* fixed another one

* updated message

* factored out duplicate functions

* dict -> mapping

* add sequence

* Add new base images and remove fa1 images (#970)

* Add new ICL kwargs in eval.py and long_context yamls (#925)

* add yamls w/ old links

* load from max's public hf and parse hf datasets

* update rest of tasks

* add better logging

* implemented leval tasks

* move level

* add level yaml

* add str parsing to hf

* wip

* llm-foundry working with new parser

* working w/ new parsing

* fix old long context tasks

* wip

* wip

* wip

* wip

* update to hf_parsing_map

* rm defaults

* fix parsing vars

* update defaults again

* rm merge conflict

* fix gen_kwargs

* rm old code path

* fixups

* wip

* rm leval from pr

* fix comments in yamls

* add cot params

* add fewshot_random_seed

* fix early_stopping_criteria, fewshot_num_seed default

* undo rm hf_eval

* add fewshot_random_seed to test

* add 64k tasks

* add longer context, update composer versin

* address comments

* mixed

* use seed by default

* rm  long_context_eval_8k.yaml

* add longer context evals

* mv yamls

* eval gauntlet wip

* update niah and wikiqa

* fix linting

* add default option

* change defaults

* fix linting

* fix linting 2

---------

Co-authored-by: Daniel King <[email protected]>

* Make Composer pins consistent with each other (#972)

* Make turbo an optional dependency (#964)

* Fix fewshot_random_seed default setting (#974)

* del fewshot_random default, fix hf_eval, fix gauntlet readme

* set in cfg defaults area

* fix the fix i applied that was actually not a fix

* rm num_batch from hf_eval

* improve error msg when checking target_blocks in activation_checkpointing_target (#977)

* Torch 2.2 upgrade - Part 1 (#976)

* Torch 2.2 - Part 2 (#979)

* PyTorch 2.2 - Part 3 (#981)

* Remove torch 2.1 from docker build (#982)

* Async callback: Don't skip checkpoints, reliably only launch async eval when the checkpoint is ready (#813)

* working without sharded checkpointing..

* add more debugs

* try this

* more debugging

* yikes dumb bug

* add notes

* fixes

* remove prints

* small updates

* fix typo

* refactor

* fix docstring formatting

* fighting with docstrings

* try this

* add unit tests

* point to composer update

* values -> items

* serialize time

* fix merge

* nits

* warning, small comment update

* add error

---------

Co-authored-by: Daniel King <[email protected]>

* Token accuracy metrics (#983)

* do not mention 1.13 in readme (#988)

Co-authored-by: Daniel King <[email protected]>

* Patch test, lock mcli version (#990)

* Bump gha timeouts (#991)

* Fix readme typo (#993)

* if condition in tie weights added (#989)

* if condition in tie weights added

* unit test for tie weights

* bump composer version (#995)

* Trim examples ahead of time for auto packing (#994)

* add oom observer callback (#932)

* add oom observer callback

* fix format

* Change ci/cd to use ci-testing repo

* Revert "Change ci/cd to use ci-testing repo"

This reverts commit e3f214e.

* Use ci-testing repo (#1000)

Co-authored-by: Irene Dea <[email protected]>

* Make CodeEval respect device_eval_batch_size (#956)

* Remove try except around imports (#1004)

* Deprecate triton, prefix lm, llama attention patch, and text denoising; Make ComposerHFT5 experimental (#1007)

* Deprecate features and mark experimental

* fix typo

---------

Co-authored-by: Daniel King <[email protected]>

* add magic filename for sharded state dicts (#1001)

* add magic filename for sharded state dicts

* Update scripts/train/train.py

Co-authored-by: Daniel King <[email protected]>

* oops forgot to push this

* no shard if no fsdp

* default to full on foundry

---------

Co-authored-by: Daniel King <[email protected]>

* bump (#1009)

* Fix evaluators actually pulling eval metrics (#1006)

* fix bug on metrics

* lint

* lint

* add unit test

* lint

* Build torch 2.2.1 images (#1010)

* add 2.2.1 tests (#1011)

* Bump min torch pin (#1013)

Red button because CI running jobs it doesn't need. Tests passed on main.

* Fix extra BOS token in front of response for some tokenizers (#1003)

* Bump min composer pin (#1015)

* add default for eval interval (#987)

Co-authored-by: Daniel King <[email protected]>

* Add support for olmo (#1016)

* Add deeper support for multi-turn chats and loss-generating tokens in finetuning (#985)

The main purpose of this PR is to support training on non-terminal responses in multi-round chats. This is achieved by tokenizing at the level of conversation "turns" and exposing some options for what turns are used as training targets (i.e. generate loss). This also adds support for treating prompt tokens as loss-generating.

The script for converting a finetuning dataset to streaming has also been updated (with some bug fixes).

* Fix profiling packing ratio to explicitly say 1 (#1019)

* Bump transformers to 4.38.2 (#1018)

* that kwargs (#1020)

* Update readme with pytorch 2.2.1 (#1021)

* Add code import to train/eval scripts (#1002)

* finish (#1022)

Co-authored-by: Max Marion <[email protected]>

* Bump version to 0.6.0 (#1023)

* Fix typo in monolithic chkpt callback docs (#1024)

* Fix typo in monolithic chkpt callback docs

* reorder to match function signature

* update pip install link

* Change done file location

* Create the dest folder

* Allow code-quality workflow to be callable (#1026)

Reverts part of the change made in
https://github.com/mosaicml/llm-foundry/pull/1000/files#diff-4a2765c2cfcbd3804a66aab805cb92ddda74de1730923cc5bf53671d0beccf06L11

* update notebook

* update

* update notebook

* update token_counts

* update pip install list

* fix

* update

* fix token counts

* Expose validate chat

* Expose more

* update

* expose

* add collate

* Fix

* update notebook

* Fix conflict

---------

Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Milo Cress <[email protected]>
Co-authored-by: Nancy Hung <[email protected]>
Co-authored-by: Jerry Chen <[email protected]>
Co-authored-by: Shashank Rajput <[email protected]>
Co-authored-by: Vitaliy Chiley <[email protected]>
Co-authored-by: Irene Dea <[email protected]>
Co-authored-by: Brian <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: Nicholas Garcia <[email protected]>
Co-authored-by: Prithviraj Ammanabrolu <[email protected]>
Co-authored-by: Jane Zhang <[email protected]>
Co-authored-by: Vincent Chen <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: dblalock <[email protected]>
Co-authored-by: bigning <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Sebastián Donoso Bustos <[email protected]>
Co-authored-by: Saaketh Narayan <[email protected]>
Co-authored-by: Max Marion <[email protected]>
Co-authored-by: Megha Agarwal <[email protected]>
Co-authored-by: Jose Javier <[email protected]>
Co-authored-by: Alex Trott <[email protected]>
Co-authored-by: Sasha Doubov <[email protected]>
XiaohanZhangCMU added a commit that referenced this pull request Mar 15, 2024
* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* add validation script

* update

* change token count function

* reorganize cells

* Add unit tests

* Add a printout for CPT

* update question

* Add questions

* Fix lints

* update format

* update

* nb source

* Remove license insert for validation notebook

* Add validation utils

* Minor cleanups (#858)

* nits

* logger

* add log

* lint

* update utils/__init__.py to include extra validation functions

* update notebook

* update

* update

* Read UC delta table (#773)

* initial commit

* use databricks-sql to read delta table and convert to json

* update

* update

* update

* add mocked unittest

* Fix lints

* update

* update

* restructure code

* Add timer for optimizing

* Add db-connect

* add wrapper

* update

* add install dbconnect

* update

* update

* patch dbconnect to allow multiple return formats

* update

* add arrow

* use compression

* clean up

* Add cluster rt check

* Fix lints

* remove patch.py for CI

* update

* update

* updat

* update

* fix tests

* fix lint

* update

* update

* Add more tests

* update

* update

* update

* change to download_json

* update

* fix lints

* Add decompressed option for arrow

* format json to jsonl

* Add comments

* Make cf_collect_type global option

* fix comments

* fix lints

* fix comments

* Fix lints

* change to use workspaceclient

* Add CPT support

* Rewire method assignment logic

* Fix bug in stripping https

* Add tests for rewired method assignment logic

* Fix lints

* Fix lints

* Removed logger set_level

* Remove pyspark. It conflicts with databricks-connect

* Update the comment

* skip cluster version check when cluster_id is serverless

* Add use_serverless flag

* update tests with use_serverless flag

* Fix lints

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* Add download remote function to util

* update

* remove fused layernorm (#859)

* update

* update

* update

* update

* update

* update

* update

* update

* update

* Remove hardcoded combined.jsonl with a flag (#861)

* Remove hardcoded combined.jsonl with a flag

* update

* change output_json_path output_json_folder

---------

Co-authored-by: Xiaohan Zhang <[email protected]>

* bump (#828)

* Add dask and dataframe_to_mds

* update

* update

* update

* update

* Add notebook

* update

* update

* remove script and tests, keep notebook

* update

* update

* update

* update

* Always initialize dist  (#864)

* fix dev

* lint

* remove gpu

* updated notebook

* remove scripts keep notebook

* update notebook. rephrase.

* Logs upload URI (#850)

* fix style etc.

* fix

* fix fix

* fix fix fix

* fix fix fix fix

* removed unused dummy func

* deleted tests to make the tests pass

* tried adding back some tests to see if it triggers the issue

* add test_hf_checkpointer.py but remove references to MPT

* fix?

* fixed test cases overlapping in strange side-effecty ways

* update

* Delta to JSONL conversion script cleanup and bug fix (#868)

* Small test change

* small cleanups

* lint and precommit

* lint and precommit

* comments

* another one

* pr suggestion and use input param not args

* fix mock (#872)

* Add response tokens

* update

* fix regex (#877)

* Precompute flash attention padding info (#880)

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* dummy data

* undoing last commit

* ..

* ..

* Update llmfoundry/models/mpt/modeling_mpt.py

Co-authored-by: Vitaliy Chiley <[email protected]>

* ..

* ..

---------

Co-authored-by: Vitaliy Chiley <[email protected]>

* add missing import (#882)

* fsdp wrap refac (#883)

* fsdp wrap refac

* refac

* refac

* Update model download utils to support ORAS (#881)

* wip

* wip

* Accept registry file for hostname

* Make sure no sensitive info is surfaced in subprocess error

* Refactor model downloading

* Save HF hub files to local dir

* fallback

* Remove commented code

* Update logging

* Update HTP download args

* Use files for ORAS

* Update llmfoundry/utils/model_download_utils.py

Co-authored-by: Irene Dea <[email protected]>

---------

Co-authored-by: Irene Dea <[email protected]>

* Update license (#887)

Updates the license for 2024. New files will have a copyright year of 2024 inserted in the header. Existing files will not be changed.

* Fix tiktoken add generation prompt (#890)

* update

* Upgrade Datasets version (#892)

* Disable MDSWrite, return token counts

* Bump transformers version to support Mixtral (#894)

* Add `tokenizer-only` flag to only download tokenizers from HF or oras (#895)

* Foundational Model API eval wrapper (#849)

* FMAPI model wrapper

* add chat wrapper too

* revert

* end line

* formatting

* less verbose

* better error messages

* Change plot settings

* update notebook

* update

* update notebook

* update

* update notebook

* Add better error for non-empty local output folder in convert_text_to_mds.py (#891)

* Allow bool input for loggers (#897)

* Allow bool input for loggers

* Convert earlier on

* Fix test case

* Enable QK Group Norm (#869)

* start qkgn

* attn defaults for qk_gn

* impl qk_gn

* Update attention.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* Update test_flash_triton_torch.py

* Update attention.py

* lint

* Update attention.py

* lint

* add avlue error

* Update attention.py

* updt to include low precision groupnorm;

* perf improvement

* Revert "perf improvement"

This reverts commit 2b62d5e.

* Revert "updt to include low precision groupnorm;"

This reverts commit bca1c33.

* patch (#905)

* Add new GC option (#907)

* No symlinks at all for HF download (#908)

* Adds support for chat formatted finetuning input data. (#884)

* fix conflicting formatting linting guidelines

* used older union operator for legacy support

* did the same thing in another place

* isort ignore specific lines

* fixes

* isort do not skip line

* address comments

* renamed some more things

* split tests and add some verification for tokenization split

* fix formatting

* added docstrings

* added end-to-end-test with HF dataset

* fix code style

* renamed file and fixed tests

* use chat template diff

* addressed comment

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* fixed type of TokenizedExample

* use cast

* use _ALLOWED_{PROMPT, RESPONSE}_KEYS

* updated tests

* fix

* fix?

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add flag to enable/disable param upload (#912)

* Add flag to enable/disable param upload

* Yapf

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

* Rename

* Add to eval

---------

Co-authored-by: Daniel King <[email protected]>

* Add support for eval_loader & eval_subset_num_batches in async callback (#834)

* Skip evalloader in training if using async eval

* add support for subset_num_batches

* remove todo

* eval first

* rename arg

* fix

* small updates

* om

* fix test

* eval run config

---------

Co-authored-by: Daniel King <[email protected]>

* Add the model license file for mlflow (#915)

* Warn instead of error on tokenizer-only with http (#904)

* Fix fmapi_chat for instruct models and custom tokenizers (#914)

* Fix fmapi_chat for instruct models and custom tokenizers

* remove from tiktoken

* fix

* add tests

* fix test, 0->1

* refactor

* Make yamllint consistent with Composer (#918)

* Create HF checkpointer model on meta device (#916)

* Tiktoken chat format fix (#893)

* sys prompt fix

* remove eos tokens from chat formatter

* fix dash issue (#919)

* fix dash issue

* fix

* fix?

* added unit test

* fix fix

* fix tests

* fix fix tests

* Fixes yaml linting (#920)

* Adding deprecation warning for Flash Attention 1 and user warning against using Triton attention. (#921)

* Add rich formatting to tracebacks (#927)

* added rich traceback

* sorted imports

* added rich to eval

* Changes to setup.py invalidate docker cache. Use branch name in dockerfile (#930)

Co-authored-by: Daniel King <[email protected]>

* Remove .ci folder and move FILE_HEADER (#931)

* Throw error when no EOS (#922)

* bump (#934)

* Update eval_gauntlet_callback.py with math.log2 (#821)

Saw an automated ruff flag this, seems like a strict improvement and is marginally faster.

Co-authored-by: Daniel King <[email protected]>

* Switch to the Composer integration of LoRA (works with FSDP) (#886)

* Refactoring the function to accept list of metric names instead of a dictionary of metrics. (#938)

* ..

* undoing prev commit

* Refactoring the  function to accept list of metric names instead of dictionary

* ..

* ..

* ..

* ..

* Remove extra call to .to and load_state_dict in hf checkpointer (#939)

* Fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking (#940)

* ..

* undoing prev commit

* fixing the gen_attention_mask_in_length function to handle the case when sequence id is -1 due to attention masking

* Update modeling_mpt.py

* ..

---------

Co-authored-by: Daniel King <[email protected]>

* Update lora docs (#941)

* fix (#942)

* Retrieve license information when local files are provided for a pretrained model (#943)

* Initial implementation to test

* Add log for license overwrite

* Use Path for input to _write_license_information

* Set default

---------

Co-authored-by: Daniel King <[email protected]>

* Add and use VersionedDeprecationWarning (#944)

* Add and use VersionedDeprecationWarning

* Use remove_version instead.

* Fix merge

* Apply suggestions from code review

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Bump llm-foundry version to 0.5.0 (#948)

* Bump version to 0.5.0

* Remove deprecated features

* Other cleanup

* code quality

* Fix chain-of-thought tasks (#824)

* Skip flaky lion8b test (#598)

* relax atol and add retries to reduce flakiness in lion8b timing test

* add eval output logging

* add back tasks

* foo

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* add rlhf prompts

* fix prompt

* fix prompt

* modify mcli

* test

* test

* fix

* added math dataset

* edit yaml

* prep gsm8k identically to eleuther

* prep gsm8k identically to eleuther

* add early stopping criteria

* finish

* debug

* fix

* bug

* remove eval output logging callback

* restore

* fix

* fix

* fix composer verion

* gauntlet v0.2.1

* gauntlet v0.2.1

* prep

* prep

* foo

* restore

* restore

* restore mcli

* fix precommit

* fix

* Update hf_eval.yaml

* fix

* fix

* remove programming

* update readme

---------

Co-authored-by: dblalock <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Add finetuning streaming dataset conversion (#933)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* review comments

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update llmfoundry/data/finetuning/tasks.py

Co-authored-by: Daniel King <[email protected]>

* Update scripts/data_prep/convert_finetuning_dataset.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Add default signature to mlflow saved model (#952)

* allow te to use meta device with deferred init (#958)

* Update TUTORIAL.md (#957)

* Update TUTORIAL.md

fix indentation problem

* Update TUTORIAL.md

---------

Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Daniel King <[email protected]>

* Bump mcli yaml foundry version to v0.5.0 (#959)

* add finutuning with streaming dataset example (#945)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* fix yaml

* comments

* comments

* comments

* add unit test

* comments

---------

Co-authored-by: Daniel King <[email protected]>

* Add fully configurable activation checkpointing (#951)

* add fully configurable activation checkpointing

* fix format

* fix format

* add docstring to activation_checkpointing_fn

* add block id range option in act ckpt

* resolve conflict

* add a check for blocks ids overlap in mapping

* fix typo

* update docstring

* refactor

* fix test

* Apply suggestions from code review

Co-authored-by: Mihir Patel <[email protected]>

* address comments

* add build mapping as a helper func

* fix format

---------

Co-authored-by: Mihir Patel <[email protected]>

* Use create_model_version instead of register_model (#953)

* Add streams support (#946)

* add convert

* fix

* fix convert

* add jsonl

* revert setup

* test precommit

* pre-commit

* test pre-commit

* v0

* review comments

* temporarily trigger test

* test

* add convert

* fix

* v0

* fix

* fix MDS write

* streams support

* fake commit

* fix setup

* format

* add back arxiv

* trigger test

* review comments

* temporarily trigger test

* test

* add convert

* fix

* fix

* fix MDS write

* format

* trigger test

* fix

* format

* resolve conflicts

* add back jsonl

* fix yaml

* comments

* format

* comments

* comments

* add unit test

* comments

* comments

* merge

* format

* typo

* Update llmfoundry/data/finetuning/dataloader.py

Co-authored-by: Daniel King <[email protected]>

---------

Co-authored-by: Daniel King <[email protected]>

* Fix typo (#966)

* Fix eval.py with lora (#965)

* just remove it?

* or not

* fix

* fix up

* clean up

* fix example yaml

* precommit

* add test

* add memorysnapshot to callbacks (#810)

Co-authored-by: Daniel King <[email protected]>

* Adding curriculum learning callback (experimental) (#954)

* curriculum learning callback

* curriculum learning callback

* fixing types

* dataset config types correct

* dataset config retrieved correctly

* access train dataloader correctly

* load state dict defaults

* get that damn dataloader

* missed dat

* dataspec L

* dataset L

* no logging, print is my best friend

* save first dataset config

* don't save new dataset config every single time

* logging dataset state

* have to set the damn timestamp. rip

* remove logging

* linting

* pyright

* removing rope...

* Delete scripts/eval/local_data/.DS_Store

* trailing comma is bacc

* fixed docstring

* fixed docstrings

* no more funky stuff in save_dict

* refactored, assuming before_load event in composer

* lingint

* bumped composer and streaming min versions

* moved line

* strengthened chat formatting validation (#960)

* strengthened chat formatting validation

* fix types

* made assert messages more descriptive

* used raise instead of assert, added type checks

* added list type check

* type error if no string content

* add test case for new validation

* relaxed type constraints to interface minimum

* use Mapping and Iterable

* fix mapping in type aliases too

* iterable -> sequence

* sequence -> list

* Mapping -> Dict

* use mapping again

* fixed another one

* updated message

* factored out duplicate functions

* dict -> mapping

* add sequence

* Add new base images and remove fa1 images (#970)

* Add new ICL kwargs in eval.py and long_context yamls (#925)

* add yamls w/ old links

* load from max's public hf and parse hf datasets

* update rest of tasks

* add better logging

* implemented leval tasks

* move level

* add level yaml

* add str parsing to hf

* wip

* llm-foundry working with new parser

* working w/ new parsing

* fix old long context tasks

* wip

* wip

* wip

* wip

* update to hf_parsing_map

* rm defaults

* fix parsing vars

* update defaults again

* rm merge conflict

* fix gen_kwargs

* rm old code path

* fixups

* wip

* rm leval from pr

* fix comments in yamls

* add cot params

* add fewshot_random_seed

* fix early_stopping_criteria, fewshot_num_seed default

* undo rm hf_eval

* add fewshot_random_seed to test

* add 64k tasks

* add longer context, update composer versin

* address comments

* mixed

* use seed by default

* rm  long_context_eval_8k.yaml

* add longer context evals

* mv yamls

* eval gauntlet wip

* update niah and wikiqa

* fix linting

* add default option

* change defaults

* fix linting

* fix linting 2

---------

Co-authored-by: Daniel King <[email protected]>

* Make Composer pins consistent with each other (#972)

* Make turbo an optional dependency (#964)

* Fix fewshot_random_seed default setting (#974)

* del fewshot_random default, fix hf_eval, fix gauntlet readme

* set in cfg defaults area

* fix the fix i applied that was actually not a fix

* rm num_batch from hf_eval

* improve error msg when checking target_blocks in activation_checkpointing_target (#977)

* Torch 2.2 upgrade - Part 1 (#976)

* Torch 2.2 - Part 2 (#979)

* PyTorch 2.2 - Part 3 (#981)

* Remove torch 2.1 from docker build (#982)

* Async callback: Don't skip checkpoints, reliably only launch async eval when the checkpoint is ready (#813)

* working without sharded checkpointing..

* add more debugs

* try this

* more debugging

* yikes dumb bug

* add notes

* fixes

* remove prints

* small updates

* fix typo

* refactor

* fix docstring formatting

* fighting with docstrings

* try this

* add unit tests

* point to composer update

* values -> items

* serialize time

* fix merge

* nits

* warning, small comment update

* add error

---------

Co-authored-by: Daniel King <[email protected]>

* Token accuracy metrics (#983)

* do not mention 1.13 in readme (#988)

Co-authored-by: Daniel King <[email protected]>

* Patch test, lock mcli version (#990)

* Bump gha timeouts (#991)

* Fix readme typo (#993)

* if condition in tie weights added (#989)

* if condition in tie weights added

* unit test for tie weights

* bump composer version (#995)

* Trim examples ahead of time for auto packing (#994)

* add oom observer callback (#932)

* add oom observer callback

* fix format

* Change ci/cd to use ci-testing repo

* Revert "Change ci/cd to use ci-testing repo"

This reverts commit e3f214e.

* Use ci-testing repo (#1000)

Co-authored-by: Irene Dea <[email protected]>

* Make CodeEval respect device_eval_batch_size (#956)

* Remove try except around imports (#1004)

* Deprecate triton, prefix lm, llama attention patch, and text denoising; Make ComposerHFT5 experimental (#1007)

* Deprecate features and mark experimental

* fix typo

---------

Co-authored-by: Daniel King <[email protected]>

* add magic filename for sharded state dicts (#1001)

* add magic filename for sharded state dicts

* Update scripts/train/train.py

Co-authored-by: Daniel King <[email protected]>

* oops forgot to push this

* no shard if no fsdp

* default to full on foundry

---------

Co-authored-by: Daniel King <[email protected]>

* bump (#1009)

* Fix evaluators actually pulling eval metrics (#1006)

* fix bug on metrics

* lint

* lint

* add unit test

* lint

* Build torch 2.2.1 images (#1010)

* add 2.2.1 tests (#1011)

* Bump min torch pin (#1013)

Red button because CI running jobs it doesn't need. Tests passed on main.

* Fix extra BOS token in front of response for some tokenizers (#1003)

* Bump min composer pin (#1015)

* add default for eval interval (#987)

Co-authored-by: Daniel King <[email protected]>

* Add support for olmo (#1016)

* Add deeper support for multi-turn chats and loss-generating tokens in finetuning (#985)

The main purpose of this PR is to support training on non-terminal responses in multi-round chats. This is achieved by tokenizing at the level of conversation "turns" and exposing some options for what turns are used as training targets (i.e. generate loss). This also adds support for treating prompt tokens as loss-generating.

The script for converting a finetuning dataset to streaming has also been updated (with some bug fixes).

* Fix profiling packing ratio to explicitly say 1 (#1019)

* Bump transformers to 4.38.2 (#1018)

* that kwargs (#1020)

* Update readme with pytorch 2.2.1 (#1021)

* Add code import to train/eval scripts (#1002)

* finish (#1022)

Co-authored-by: Max Marion <[email protected]>

* Bump version to 0.6.0 (#1023)

* Fix typo in monolithic chkpt callback docs (#1024)

* Fix typo in monolithic chkpt callback docs

* reorder to match function signature

* update pip install link

* Change done file location

* Create the dest folder

* Allow code-quality workflow to be callable (#1026)

Reverts part of the change made in
https://github.com/mosaicml/llm-foundry/pull/1000/files#diff-4a2765c2cfcbd3804a66aab805cb92ddda74de1730923cc5bf53671d0beccf06L11

* update notebook

* update

* update notebook

* update token_counts

* update pip install list

* fix

* update

* fix token counts

* Expose validate chat

* Expose more

* update

* expose

* add collate

* Fix

* update notebook

* Fix conflict

* update notebook

---------

Co-authored-by: Xiaohan Zhang <[email protected]>
Co-authored-by: xiaohanzhan-db <xiaohanzhan-db>
Co-authored-by: Mihir Patel <[email protected]>
Co-authored-by: Milo Cress <[email protected]>
Co-authored-by: Nancy Hung <[email protected]>
Co-authored-by: Jerry Chen <[email protected]>
Co-authored-by: Shashank Rajput <[email protected]>
Co-authored-by: Vitaliy Chiley <[email protected]>
Co-authored-by: Irene Dea <[email protected]>
Co-authored-by: Brian <[email protected]>
Co-authored-by: Daniel King <[email protected]>
Co-authored-by: Anna <[email protected]>
Co-authored-by: Nicholas Garcia <[email protected]>
Co-authored-by: Prithviraj Ammanabrolu <[email protected]>
Co-authored-by: Jane Zhang <[email protected]>
Co-authored-by: Vincent Chen <[email protected]>
Co-authored-by: Aaron Gokaslan <[email protected]>
Co-authored-by: Jeremy D <[email protected]>
Co-authored-by: dblalock <[email protected]>
Co-authored-by: bigning <[email protected]>
Co-authored-by: Cheng Li <[email protected]>
Co-authored-by: Sebastián Donoso Bustos <[email protected]>
Co-authored-by: Saaketh Narayan <[email protected]>
Co-authored-by: Max Marion <[email protected]>
Co-authored-by: Megha Agarwal <[email protected]>
Co-authored-by: Jose Javier <[email protected]>
Co-authored-by: Alex Trott <[email protected]>
Co-authored-by: Sasha Doubov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants