Modernize MosaicBERT #440

Open: Skylion007 wants to merge 19 commits into main

@Skylion007 Skylion007 commented Jan 2, 2024

This PR modernizes the MosaicBERT codebase with Flash Attention 2, PyTorch 2 (torch==2.1.1), and an updated version of composer (mosaicml>=0.17).

In particular, this updates MosaicBERT to be compatible with Flash Attention 2 (flash-attn>=2.4.2), which now supports ALiBi slopes (PR#540).

Context:


See w&b runs here

Note that changes to files outside of examples/benchmarks/bert are simply formatting changes due to linting.

@Skylion007 Skylion007 force-pushed the skylion007/add-fa2-to-bert branch 3 times, most recently from 617db70 to c9ee668 Compare January 2, 2024 18:00
@Skylion007 Skylion007 force-pushed the skylion007/add-fa2-to-bert branch from c9ee668 to b809a7b Compare January 2, 2024 18:09
if convert_dtype:
    # Triton implementation only supports fp16 and bf16
    orig_dtype = qkv.dtype
    qkv = qkv.to(torch.float16)
Contributor

Do we need this to be in torch.float16?

Author

we do not, this code was here before though.

Author

How should we select between bfloat16 and float16 though?
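
One possible answer to the question above, as a minimal sketch and not what this PR does: leave tensors that are already fp16 or bf16 alone, and otherwise prefer bf16 when the GPU reports support for it. The helper name `pick_kernel_dtype` and its `bf16_supported` parameter are hypothetical.

```python
from typing import Optional

import torch


def pick_kernel_dtype(orig_dtype: torch.dtype,
                      bf16_supported: Optional[bool] = None) -> torch.dtype:
    """Choose a half-precision dtype for a kernel that only accepts fp16/bf16."""
    # Already half precision: no conversion needed.
    if orig_dtype in (torch.float16, torch.bfloat16):
        return orig_dtype
    if bf16_supported is None:
        # Guard with is_available() so the probe is safe on CPU-only machines.
        bf16_supported = (torch.cuda.is_available() and
                          torch.cuda.is_bf16_supported())
    return torch.bfloat16 if bf16_supported else torch.float16


# Toy usage: an fp32 tensor is cast down, a bf16 tensor would be left alone.
qkv = torch.randn(2, 4, dtype=torch.float32)
converted = qkv.to(pick_kernel_dtype(qkv.dtype, bf16_supported=True))
```

Passing `bf16_supported` explicitly keeps the choice testable without a GPU; leaving it as `None` asks PyTorch at call time.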

@@ -266,8 +261,6 @@ def build_text_dataloader(
cfg.dataset.get('validate_hash', None),
keep_zip=stream.get('keep_zip', None) or
cfg.dataset.get('keep_zip', False),
keep_raw=stream.get('keep_raw', None) or
Contributor

Just noting that this is correct and that keep_raw is no longer a flag in mosaicml-streaming (see Streaming docs)
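
For readers skimming the hunk: the `stream.get(...) or cfg.dataset.get(...)` pattern means "use the per-stream value, else the dataset-level value". A minimal sketch with plain dicts follows (the helper names are hypothetical). Note that `or` also falls through when the stream explicitly sets a falsy value such as `False`; the second helper shows a stricter variant in case that ever matters.

```python
def resolve_option(stream: dict, dataset: dict, key: str, default=None):
    """Same shape as `stream.get(key, None) or dataset.get(key, default)`."""
    return stream.get(key, None) or dataset.get(key, default)


def resolve_option_strict(stream: dict, dataset: dict, key: str, default=None):
    """Fall back to the dataset only when the stream leaves the key unset."""
    value = stream.get(key, None)
    return dataset.get(key, default) if value is None else value


# Toy configs: the stream explicitly disables keep_zip, the dataset enables it.
stream_cfg = {"keep_zip": False}
dataset_cfg = {"keep_zip": True}

loose = resolve_option(stream_cfg, dataset_cfg, "keep_zip", False)
strict = resolve_option_strict(stream_cfg, dataset_cfg, "keep_zip", False)
```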

Collaborator

can you check that the defaults here match the defaults currently set in llm foundry?

Contributor

The defaults in llm foundry are a bit different. Should we update this function whole-hog?

From llmfoundry text_data.py

def __init__(self,
                 tokenizer: PreTrainedTokenizerBase,
                 max_seq_len: int,
                 streams: Optional[Sequence[Stream]] = None,
                 remote: Optional[str] = None,
                 local: Optional[str] = None,
                 split: Optional[str] = None,
                 download_retry: int = 2,
                 download_timeout: float = 60,
                 validate_hash: Optional[str] = None,
                 keep_zip: bool = False,
                 epoch_size: Optional[Union[int, str]] = None,
                 predownload: Optional[int] = None,
                 cache_limit: Optional[Union[int, str]] = None,
                 partition_algo: str = 'relaxed',
                 num_canonical_nodes: Optional[int] = None,
                 batch_size: Optional[int] = None,
                 shuffle: bool = False,
                 shuffle_algo: str = 'py1e',
                 shuffle_seed: int = 9176,
                 shuffle_block_size: Optional[int] = None,
                 sampling_method: str = 'balanced',
                 sampling_granularity: int = 1,
                 batching_method: str = 'random',
                 **kwargs: Any):
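
A quick way to answer the "do the defaults match" question without eyeballing two signatures is to diff them with `inspect.signature`. This is only a sketch: `local_loader` and `foundry_loader` are toy stand-ins, not the real functions from either repo, and the `keep_raw=True` default is made up for illustration.

```python
import inspect
from typing import Any, Callable, Dict, Tuple

ABSENT = "<absent>"


def default_args(fn: Callable) -> Dict[str, Any]:
    """Map each parameter that has a default to that default."""
    params = inspect.signature(fn).parameters
    return {
        name: p.default
        for name, p in params.items()
        if p.default is not inspect.Parameter.empty
    }


def diff_defaults(ours: Callable, theirs: Callable) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (our_default, their_default)} wherever the two differ."""
    a, b = default_args(ours), default_args(theirs)
    return {
        name: (a.get(name, ABSENT), b.get(name, ABSENT))
        for name in sorted(set(a) | set(b))
        if a.get(name, ABSENT) != b.get(name, ABSENT)
    }


# Toy stand-ins for the two signatures being compared.
def local_loader(download_retry=2, download_timeout=60, keep_zip=False,
                 keep_raw=True): ...


def foundry_loader(download_retry=2, download_timeout=60, keep_zip=False,
                   shuffle_algo="py1e", **kwargs): ...


differences = diff_defaults(local_loader, foundry_loader)
```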

Since it's still text data, this should be good!

Contributor

This is just linting

@jacobfulano
Contributor

Should be close to done, @dakinggg. The two failed pytests were:

FAILED tests/test_classification.py::test_classification_script - RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
FAILED tests/test_glue.py::test_glue_script[mosaic_bert] - RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
============= 2 failed, 3 passed, 3 warnings in 147.01s (0:02:27) ==============
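
Both failures are the "no NVIDIA driver" RuntimeError, which suggests the tests ran on a machine without a GPU. Assuming that is the cause, one option is to skip GPU-only tests when CUDA is unavailable. The sketch below uses only the standard library (`pytest.mark.skipif` plays the same role in a pytest suite); `cuda_available` and `GpuScriptTest` are hypothetical names.

```python
import importlib.util
import unittest


def cuda_available() -> bool:
    """Return True only if torch is importable and reports a usable GPU."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


@unittest.skipUnless(cuda_available(), "requires an NVIDIA GPU and driver")
class GpuScriptTest(unittest.TestCase):

    def test_runs_on_gpu(self):
        # Placeholder for a GPU-only check such as the classification script.
        self.assertTrue(cuda_available())
```

On a CPU-only runner the test is reported as skipped instead of failed, so the summary line reflects only real regressions.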

@@ -425,6 +499,7 @@ def __init__(self, config):
(1, self.num_attention_heads, self._current_alibi_size,
self._current_alibi_size))
self.rebuild_alibi_tensor(size=config.alibi_starting_size)
self.slopes = None
@stefan-it stefan-it Jan 5, 2024

Hey @Skylion007 many thanks for this PR! I am currently testing it (with my own dataset) and training is working (8x H100).

I had to remove this line, because:

  • self.slopes is already set by the rebuild_alibi_tensor call on the line before
  • it is needed later, in line 583

Setting to None will then cause an error in line 583.
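
The ordering problem can be reproduced in isolation. This is a toy sketch, not the MosaicBERT encoder: `ToyEncoder` and its `reset_slopes` flag are hypothetical, and `alibi_slopes` uses the standard ALiBi geometric sequence for power-of-two head counts only.

```python
from typing import List, Optional


def alibi_slopes(n_heads: int) -> List[float]:
    """Slopes for a power-of-two head count: 2^(-8/n), 2^(-16/n), ..."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]


class ToyEncoder:

    def __init__(self, n_heads: int, reset_slopes: bool = False):
        self.slopes: Optional[List[float]] = None
        self.rebuild_alibi_tensor(n_heads)
        if reset_slopes:
            # Reproduces the bug: discards what rebuild just computed.
            self.slopes = None

    def rebuild_alibi_tensor(self, n_heads: int) -> None:
        self.slopes = alibi_slopes(n_heads)

    def forward(self) -> float:
        # Stand-in for the later use of self.slopes in the forward pass.
        return sum(self.slopes)
```

Declaring `self.slopes = None` before the rebuild call (as in the sketch) keeps linters happy without clobbering the computed value.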

Contributor

@jacobfulano jacobfulano Jan 5, 2024

This was on me trying to appease the linting gods. Thanks for catching! Should be removed now

@Taytay

Taytay commented Jan 6, 2024

UPDATE on 1/8/24: This was not an issue for me on a clean machine, so this is unlikely to be a real issue, and VERY unlikely to be an issue with this PR.

==============
ORIGINAL:
I don't think this error needs to hold up this PR, but FA2 was giving me some headaches as part of a clean requirements.txt installation. I fixed it by ensuring that packaging and torch were both installed BEFORE running the pip install for FA2.

Details:

Env: (This is in WSL for Windows, but most of the time that's equivalent to an Ubuntu environment, and I don't think it's the source of this error.)

I just checked out the branch and created a clean conda env. Then, I did the pip install -r requirements.txt and got an error:

❯ pip install -r requirements.txt
Collecting packaging (from -r requirements.txt (line 1))
  Using cached packaging-23.2-py3-none-any.whl.metadata (3.2 kB)
Collecting einops==0.5.0 (from -r requirements.txt (line 2))
  Using cached einops-0.5.0-py3-none-any.whl (36 kB)
Collecting torch==2.1.1 (from -r requirements.txt (line 3))
  Using cached torch-2.1.1-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting composer<0.18,>=0.17.0 (from composer[nlp,wandb]<0.18,>=0.17.0->-r requirements.txt (line 4))
  Using cached composer-0.17.2-py3-none-any.whl.metadata (27 kB)
Collecting mosaicml-streaming<=0.7 (from -r requirements.txt (line 5))
  Using cached mosaicml_streaming-0.7.0-py3-none-any.whl.metadata (20 kB)
Collecting omegaconf==2.3.0 (from -r requirements.txt (line 6))
  Using cached omegaconf-2.3.0-py3-none-any.whl (79 kB)
Collecting transformers==4.35.2 (from -r requirements.txt (line 7))
  Using cached transformers-4.35.2-py3-none-any.whl.metadata (123 kB)
Collecting flash_attn>=2.4.2 (from -r requirements.txt (line 9))
  Using cached flash_attn-2.4.2.tar.gz (2.4 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-snje5q4q/flash-attn_a0ad7b7eaf5e4b1bb1d9c8af1808da4b/setup.py", line 9, in <module>
          from packaging.version import parse, Version
      ModuleNotFoundError: No module named 'packaging'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

I tried adding packaging to the top of requirements.txt, but got the same error. I believe this happens because FA2's setup.py imports packaging while pip is still generating package metadata, before any of the sibling requirements have been installed.

So I ran pip install packaging on the command line:

Collecting packaging
  Using cached packaging-23.2-py3-none-any.whl.metadata (3.2 kB)
Using cached packaging-23.2-py3-none-any.whl (53 kB)
Installing collected packages: packaging
Successfully installed packaging-23.2

Then I re-ran pip install -r requirements.txt:

❯ pip install -r requirements.txt
Requirement already satisfied: packaging in /home/taytay/miniconda3/envs/mosaic_bert_fa2/lib/python3.10/site-packages (from -r requirements.txt (line 1)) (23.2)
Collecting einops==0.5.0 (from -r requirements.txt (line 2))
  Using cached einops-0.5.0-py3-none-any.whl (36 kB)
Collecting torch==2.1.1 (from -r requirements.txt (line 3))
  Using cached torch-2.1.1-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting composer<0.18,>=0.17.0 (from composer[nlp,wandb]<0.18,>=0.17.0->-r requirements.txt (line 4))
  Using cached composer-0.17.2-py3-none-any.whl.metadata (27 kB)
Collecting mosaicml-streaming<=0.7 (from -r requirements.txt (line 5))
  Using cached mosaicml_streaming-0.7.0-py3-none-any.whl.metadata (20 kB)
Collecting omegaconf==2.3.0 (from -r requirements.txt (line 6))
  Using cached omegaconf-2.3.0-py3-none-any.whl (79 kB)
Collecting transformers==4.35.2 (from -r requirements.txt (line 7))
  Using cached transformers-4.35.2-py3-none-any.whl.metadata (123 kB)
Collecting flash_attn>=2.4.2 (from -r requirements.txt (line 9))
  Using cached flash_attn-2.4.2.tar.gz (2.4 MB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-gv890oec/flash-attn_bd567b3ed4774a49a637dedaf268441f/setup.py", line 19, in <module>
          import torch
      ModuleNotFoundError: No module named 'torch'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

FA2 is assuming that torch is already installed, but it's being installed as a sibling, so it's not a module yet!
I moved the FA2 requirement to its own requirements_fa2.txt file and got the requirements.txt to succeed.

Then I installed FA2 by running that: pip install -r requirements_fa2.txt
and it worked like a champ.

This "No module named 'torch'" error is not unheard of with FA2: Dao-AILab/flash-attention#246
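
The two-file workaround above boils down to installing in two phases. Here is a rough sketch that builds the two pip commands from one requirements list without running them; `split_requirements` and `pip_commands` are hypothetical helpers, and the prefix list is an assumption about which packages import torch at build time.

```python
import sys
from typing import Iterable, List, Tuple

# Packages assumed to import torch/packaging from setup.py at build time.
BUILD_TIME_IMPORTERS = ("flash_attn", "flash-attn")


def split_requirements(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate ordinary requirements from ones that must be installed last."""
    first: List[str] = []
    deferred: List[str] = []
    for raw in lines:
        req = raw.strip()
        if not req or req.startswith("#"):
            continue  # skip blanks and comments
        target = deferred if req.lower().startswith(BUILD_TIME_IMPORTERS) else first
        target.append(req)
    return first, deferred


def pip_commands(lines: Iterable[str]) -> List[List[str]]:
    """Build one pip invocation per phase, in the order they should run."""
    first, deferred = split_requirements(lines)
    base = [sys.executable, "-m", "pip", "install"]
    return [base + group for group in (first, deferred) if group]


requirements = ["packaging", "einops==0.5.0", "torch==2.1.1", "",
                "# attention kernels", "flash_attn>=2.4.2"]
commands = pip_commands(requirements)
```

Each entry in `commands` can be handed to `subprocess.run(cmd, check=True)` in order, so torch and packaging are importable by the time FA2's setup.py runs.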

@Taytay

Taytay commented Jan 6, 2024

One more bug that I'll report here just in case it is not just a "my machine" thing. I didn't see NVIDIA Apex mentioned in the requirements, but when I get to the point where I am running this:

# This will pre-train a MosaicBERT that reaches the same downstream accuracy in roughly 1/3 the time.
composer main.py yamls/main/mosaic-bert-base-uncased.yaml

It looks like I need to have NVIDIA Apex installed:

/home/taytay/miniconda3/envs/mosaic_bert_fa2/lib/python3.10/site-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 6, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Building eval loader...
Traceback (most recent call last):
  File "/home/taytay/YNAB/ML/mosaicml_examples_skylion/examples/benchmarks/bert/main.py", line 271, in <module>
    main(cfg)
  File "/home/taytay/YNAB/ML/mosaicml_examples_skylion/examples/benchmarks/bert/main.py", line 210, in main
    algorithms = [
  File "/home/taytay/YNAB/ML/mosaicml_examples_skylion/examples/benchmarks/bert/main.py", line 211, in <listcomp>
    build_algorithm(name, algorithm_cfg)
  File "/home/taytay/YNAB/ML/mosaicml_examples_skylion/examples/benchmarks/bert/main.py", line 72, in build_algorithm
    return algorithms.FusedLayerNorm(**kwargs)
  File "/home/taytay/miniconda3/envs/mosaic_bert_fa2/lib/python3.10/site-packages/composer/algorithms/fused_layernorm/fused_layernorm.py", line 110, in __init__
    check_if_apex_installed()
  File "/home/taytay/miniconda3/envs/mosaic_bert_fa2/lib/python3.10/site-packages/composer/algorithms/fused_layernorm/fused_layernorm.py", line 30, in check_if_apex_installed
    raise ImportError(
ImportError: https://github.com/NVIDIA/apex is not installed. The Fused LayerNorm algorithm cannot be applied. The MosaicML Docker Images (https://hub.docker.com/r/mosaicml/pytorch) contain a copy of APEX for easy use.
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 13697) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 13697) exited with code 1

@Taytay

Taytay commented Jan 8, 2024

An update on the above: Once I installed Apex from source, the command worked.

You have already recommended the MosaicML PyTorch base image, which presumably comes with Apex pre-installed. I decided to ignore that handy tip and run from my existing WSL environment.

Something that would have helped me would be to clarify that if the user does not use the recommended PyTorch base image, they will need to install Apex after pip installing the requirements.txt. If I'm not the target audience, or this is opening you up to way too much config specification, I get it.

@Taytay

Taytay commented Jan 8, 2024

With regard to my earlier comment:

I don't think this error needs to hold up this PR, but FA2 was giving me some headaches as part of a clean requirements.txt installation. I fixed it by ensuring that packaging and torch were both installed BEFORE running the pip install for FA2

This was not an issue for me on a clean machine, so this is unlikely to be a real issue, and VERY unlikely to be an issue with this PR.

@Taytay

Taytay commented Jan 8, 2024

I believe that one of the test yamls is missing:

algorithms:
  fused_layernorm: {}

I say that because in the README, it explains you can do a test run of training a Mosaic model by running:

# Run the pre-training script with the test config and MosaicBERT
composer main.py yamls/test/main.yaml model.name=mosaic_bert

However, yamls/test/main.yaml doesn't have these lines:

algorithms:
  fused_layernorm: {}

But yamls/main/mosaic-bert-base-uncased.yaml DOES specify fused_layernorm.

That means that the first time it tries to load Apex's fused_layernorm is when you get to this section:

# This will pre-train a MosaicBERT that reaches the same downstream accuracy in roughly 1/3 the time.
composer main.py yamls/main/mosaic-bert-base-uncased.yaml

I noticed this because I got an error when it tried to load Apex and my environment didn't have it installed. I was surprised because all of my "tests" from the README worked.
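
A cheap way to catch this kind of drift is to compare the `algorithms` section of the test config against the full config. The sketch below works on already-parsed configs as plain dicts; the two dicts are toy stand-ins, not the actual contents of the yamls, and the helper names are hypothetical.

```python
from typing import List, Set


def algorithm_names(cfg: dict) -> Set[str]:
    """Names under the top-level `algorithms` key (empty if absent or null)."""
    return set((cfg.get("algorithms") or {}).keys())


def missing_algorithms(test_cfg: dict, main_cfg: dict) -> List[str]:
    """Algorithms the full config uses that the test config never exercises."""
    return sorted(algorithm_names(main_cfg) - algorithm_names(test_cfg))


# Toy stand-ins for the parsed test and full pre-training configs.
test_cfg = {"model": {"name": "mosaic_bert"}}
main_cfg = {"model": {"name": "mosaic_bert"},
            "algorithms": {"fused_layernorm": {}}}

untested = missing_algorithms(test_cfg, main_cfg)
```

Anything in `untested` is code that first runs during the real pre-training job, which is exactly how the Apex import surprised me.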

@jacobfulano
Contributor

Hi @Taytay,

Thanks for pointing this out. For a while, the MosaicML Composer library used Fused LayerNorm as a Composer "algorithm" to speed up pretraining. It relies on NVIDIA Apex and enables a faster kernel for LayerNorm.

More recently, we've been using Low Precision LayerNorm, which does not rely on Apex and works just as well as Fused LayerNorm. From the Composer docs:

Low Precision LayerNorm is meant to replace our Fused LayerNorm algorithm. The two algorithms achieve very similar throughput. Fused LayerNorm also runs in low precision, but it is a more complex algorithm, since it uses a custom kernel. Since the custom kernel provides no additional speedup, we have replaced it with this simpler algorithm.

In the yaml, you can replace fused_layernorm with

algorithms:
  low_precision_layernorm: {}

I've updated the MosaicBERT pretraining and finetuning yamls to use low_precision_layernorm.
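
For anyone with their own configs still pointing at the old algorithm, the swap can also be done programmatically on a parsed config. A minimal sketch on plain dicts follows; `use_low_precision_layernorm` is a hypothetical helper, and it deliberately does not carry over any kwargs, since the two algorithms may not accept the same ones.

```python
import copy


def use_low_precision_layernorm(cfg: dict) -> dict:
    """Return a copy of cfg with fused_layernorm swapped for low_precision_layernorm."""
    new_cfg = copy.deepcopy(cfg)  # leave the caller's config untouched
    algorithms = new_cfg.get("algorithms")
    if algorithms and "fused_layernorm" in algorithms:
        del algorithms["fused_layernorm"]
        algorithms.setdefault("low_precision_layernorm", {})
    return new_cfg


# Toy config standing in for a parsed pre-training yaml.
old_cfg = {"seed": 17, "algorithms": {"fused_layernorm": {}}}
new_cfg = use_low_precision_layernorm(old_cfg)
```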

@Taytay

Taytay commented Jan 10, 2024

Thanks @jacobfulano. That's good news. It's worth mentioning that I ran into a bug in this branch that is fixed by #443
