Config for Amberish experiments at 1B #621

Open · wants to merge 41 commits into base: main

Commits (41)
c5913aa
7b normal baseline scripts
AkshitaB Jun 12, 2024
e2cd59b
add new evals
AkshitaB Jun 12, 2024
3d02325
add 1b config
AkshitaB Jun 12, 2024
995247f
1b scripts
AkshitaB Jun 12, 2024
b71dff9
turn off fused_loss
AkshitaB Jun 12, 2024
0de7234
fix name
AkshitaB Jun 12, 2024
75ae73f
make executable
AkshitaB Jun 12, 2024
ed51f61
temporarily don't run new evals
AkshitaB Jun 12, 2024
3293cbb
switch to pete's torch2.3 image
AkshitaB Jun 12, 2024
d77add5
no clipping warmup
AkshitaB Jun 12, 2024
eff21ee
wait longer
AkshitaB Jun 12, 2024
c1075ce
priority
AkshitaB Jun 12, 2024
9028262
config for llamaish1 base run with amber data
drschwenk Jun 12, 2024
5d9dce5
launch scripts for llamaish1 with amber data
drschwenk Jun 12, 2024
a45fe68
fixed tokenizer def
drschwenk Jun 12, 2024
f8f530a
turn off perplexity eval
drschwenk Jun 12, 2024
0538bce
load last checkpoint
drschwenk Jun 12, 2024
15c5606
changeing sharding strategy to shard_grad_op
drschwenk Jun 12, 2024
0d2259e
change run names
drschwenk Jun 12, 2024
d74e26f
switch to huggyface tokenizer
drschwenk Jun 12, 2024
d909a98
initial config changes for amberish 1B
drschwenk Jun 12, 2024
9d1ad1f
initial launch script changes
drschwenk Jun 12, 2024
109e5b5
change rms_layernorm eps to match amber
drschwenk Jun 12, 2024
397af95
additional config changes
drschwenk Jun 12, 2024
93c88a8
move last couple of configs from launch script to config file
drschwenk Jun 12, 2024
d9c929d
adding rms_layer_norm eps to config
drschwenk Jun 12, 2024
7e4f0df
turn off fg activation checkpointing
drschwenk Jun 12, 2024
c41b2ed
rename config and scripts
drschwenk Jun 12, 2024
c63d821
removed files not intended for the amberish PR
drschwenk Jun 12, 2024
dc1c656
added one file I didn't want to remove
drschwenk Jun 12, 2024
0dfba0d
Update configs/llm-360-amber1.yaml
drschwenk Jun 12, 2024
c849064
clear out redundant settings
drschwenk Jun 12, 2024
6484000
Merge branch 'dustins/amberish1' of github.com:allenai/OLMo into dust…
drschwenk Jun 12, 2024
be47e5c
change opt eps key name
drschwenk Jun 12, 2024
412fdb4
reduce N nodes
drschwenk Jun 13, 2024
9e20d15
change layer norm eps, remove fused loss
drschwenk Jun 13, 2024
963576a
move to WEKA, turn on additional evals
drschwenk Jun 15, 2024
47c953d
fix hf_cache uri
drschwenk Jun 15, 2024
7a53674
change load path
drschwenk Jun 15, 2024
afd65d9
fix data path
drschwenk Jun 15, 2024
f7aa424
one last data path fix
drschwenk Jun 15, 2024
598 changes: 598 additions & 0 deletions configs/llm-360-amber1.yaml

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions olmo/config.py
@@ -349,6 +349,8 @@ class ModelConfig(BaseConfig):
     to ``False``.
     """
 
+    layer_norm_eps: float = 1e-05
+
     attention_layer_norm_with_affine: bool = True
     """
     Toggle affine transform for the QK norms.
9 changes: 3 additions & 6 deletions olmo/model.py
@@ -136,11 +136,10 @@ def __init__(
         *,
         size: Optional[int] = None,
         elementwise_affine: Optional[bool] = True,
-        eps: float = 1e-05,
     ):
         super().__init__()
         self.config = config
-        self.eps = eps
+        self.eps = config.layer_norm_eps
         self.normalized_shape = (size or config.d_model,)
         if elementwise_affine or (elementwise_affine is None and self.config.layer_norm_with_affine):
             self.weight = nn.Parameter(torch.ones(self.normalized_shape, device=config.init_device))
@@ -199,9 +198,8 @@ def __init__(
         size: Optional[int] = None,
         low_precision: bool = False,
         elementwise_affine: Optional[bool] = None,
-        eps: float = 1e-05,
     ):
-        super().__init__(config, size=size, elementwise_affine=elementwise_affine, eps=eps)
+        super().__init__(config, size=size, elementwise_affine=elementwise_affine)
         self.low_precision = low_precision
 
     def forward(self, x: torch.Tensor) -> torch.Tensor:
@@ -230,9 +228,8 @@ def __init__(
         config: ModelConfig,
         size: Optional[int] = None,
         elementwise_affine: Optional[bool] = None,
-        eps: float = 1e-5,
     ):
-        super().__init__(config, size=size, elementwise_affine=elementwise_affine, eps=eps)
+        super().__init__(config, size=size, elementwise_affine=elementwise_affine)
 
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         with torch.autocast(enabled=False, device_type=x.device.type):
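For reference, here is a minimal standalone sketch of the pattern this diff adopts: the layer-norm epsilon is read from the new ModelConfig.layer_norm_eps field instead of being passed as a per-class constructor argument. The classes below are simplified, hypothetical stand-ins for illustration only, not the actual olmo.config.ModelConfig or olmo.model.RMSLayerNorm implementations.

from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn


@dataclass
class ModelConfig:
    d_model: int = 2048
    layer_norm_eps: float = 1e-05  # new config field; default matches the old hard-coded value


class RMSLayerNorm(nn.Module):
    def __init__(self, config: ModelConfig, size: Optional[int] = None):
        super().__init__()
        # eps now comes from the config rather than from a per-class keyword argument.
        self.eps = config.layer_norm_eps
        self.normalized_shape = (size or config.d_model,)
        self.weight = nn.Parameter(torch.ones(self.normalized_shape))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard RMSNorm: scale by the reciprocal root-mean-square plus eps.
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)


config = ModelConfig(layer_norm_eps=1e-06)
norm = RMSLayerNorm(config)
print(norm.eps)  # 1e-06, picked up from the config

With this shape, every norm layer shares the single epsilon defined once in the config (or overridden from the YAML), which is the value the rms_layer_norm eps commits above adjust.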
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -14,7 +14,7 @@ requires-python = ">=3.8"
 license = { file = "LICENSE" }
 dependencies = [
   "numpy",
-  "torch>=2.1,<2.3",
+  "torch>=2.1,<=2.3",
   "ai2-olmo-core==0.1.0",
   "omegaconf",
   "rich",
33 changes: 33 additions & 0 deletions scripts/beaker/llamaish7-normal-launch.sh
@@ -0,0 +1,33 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=64

gantry run \
--workspace ai2/OLMo-training \
--task-name llamaish7-normal \
--description "OLMo medium - 7B - Llamaish Normal" \
--priority urgent \
--preemptible \
--beaker-image petew/olmo-torch23-gantry \
--cluster ai2/jupiter-cirrascale-2 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--propagate-failure \
--synchronized-start-timeout 15m \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env-secret WANDB_API_KEY=AKSHITAB_WANDB_API_KEY \
--env-secret AWS_ACCESS_KEY_ID=AKSHITAB_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=AKSHITAB_AWS_SECRET_ACCESS_KEY \
--shared-memory 10GiB \
--venv base \
--yes \
--timeout=-1 \
-- /bin/bash -c "scripts/beaker/llamaish7-normal.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
34 changes: 34 additions & 0 deletions scripts/beaker/llm-360-amber1-launch.sh
@@ -0,0 +1,34 @@
#!/usr/bin/env bash

set -ex

NUM_NODES=4

gantry run \
--workspace ai2/OLMo-training \
--task-name amberish1-base \
--description "OLMo small - 1B - Amberish with Amber data" \
--priority urgent \
--preemptible \
--beaker-image petew/olmo-torch23-gantry \
--cluster ai2/jupiter-cirrascale-2 \
--gpus 8 \
--replicas "${NUM_NODES}" \
--leader-selection \
--host-networking \
--budget ai2/oe-training \
--no-nfs \
--weka oe-training-default:/weka/oe-training-default \
--propagate-failure \
--synchronized-start-timeout 20m \
--env LOG_FILTER_TYPE=local_rank0_only \
--env OMP_NUM_THREADS=8 \
--env OLMO_TASK=model \
--env-secret WANDB_API_KEY=DUSTINS_WANDB_API_KEY \
--env-secret AWS_ACCESS_KEY_ID=DUSTINS_AWS_ACCESS_KEY_ID \
--env-secret AWS_SECRET_ACCESS_KEY=DUSTINS_AWS_SECRET_ACCESS_KEY \
--shared-memory 10GiB \
--venv base \
--yes \
--timeout=-1 \
-- /bin/bash -c "scripts/beaker/llm-360-amber1.sh \$BEAKER_LEADER_REPLICA_HOSTNAME ${NUM_NODES} \$BEAKER_REPLICA_RANK"
39 changes: 39 additions & 0 deletions scripts/beaker/llm-360-amber1.sh
@@ -0,0 +1,39 @@
#!/usr/bin/env bash
set -exuo pipefail
IFS=$'\n\t'

BEAKER_LEADER_REPLICA_HOSTNAME=$1
shift

NUM_NODES=$1
shift

BEAKER_REPLICA_RANK=$1
shift

# Warm HF cache
mkdir -p /root/.cache
pushd /root/.cache
curl "https://storage.googleapis.com/hf-cache/huggingface_cache_v4.tar.gz" | tar --keep-newer-files -xzf -
popd
export HF_DATASETS_OFFLINE=1


torchrun \
--nnodes ${NUM_NODES}:${NUM_NODES} \
--nproc-per-node 8 \
--rdzv_id=12347 \
--rdzv_backend=static \
--rdzv_endpoint=$BEAKER_LEADER_REPLICA_HOSTNAME:29400 \
--node_rank=$BEAKER_REPLICA_RANK \
--rdzv_conf="read_timeout=420" \
scripts/train.py \
configs/llm-360-amber1.yaml \
--gen1_gc_interval=null \
--save_folder=runs/ \
--save_interval=1000 \
--eval_interval=1000 \
--optimizer.metrics_log_interval=1 \
--save_overwrite \
--save_num_checkpoints_to_keep=3 \
'--load_path=s3://ai2-llm/checkpoints/OLMo-small/${run_name}/step69750/'
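The dotted flags passed to scripts/train.py above (--save_interval=1000, --optimizer.metrics_log_interval=1, and so on) are overrides applied on top of configs/llm-360-amber1.yaml. Below is a minimal sketch of that override pattern using OmegaConf, which pyproject.toml lists as a dependency; it only mirrors the general mechanism, is not the actual scripts/train.py argument handling, and the script name is hypothetical.

import sys

from omegaconf import OmegaConf

# Example invocation (hypothetical):
#   python override_sketch.py configs/llm-360-amber1.yaml --save_interval=1000 --save_overwrite
yaml_path, *overrides = sys.argv[1:]

base = OmegaConf.load(yaml_path)

# Turn "--a.b=c" into the dotlist entry "a.b=c"; treat bare flags like
# "--save_overwrite" as boolean true.
dotlist = []
for arg in overrides:
    arg = arg.lstrip("-")
    dotlist.append(arg if "=" in arg else f"{arg}=true")

cfg = OmegaConf.merge(base, OmegaConf.from_dotlist(dotlist))
print(OmegaConf.to_yaml(cfg))

Note also that --load_path is single-quoted in the script so the shell does not expand ${run_name}; it is left for config-level interpolation against the run_name defined in the YAML.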