From 4d72aa3811e487c57e90be19a044afabdac8b30c Mon Sep 17 00:00:00 2001 From: Joel Lamy-Poirier Date: Wed, 18 Dec 2024 21:14:16 -0500 Subject: [PATCH] Markdownlint pre-commit --- .github/ISSUE_TEMPLATE/bug_report.md | 24 ++--- .github/ISSUE_TEMPLATE/feature_request.md | 10 +- .github/PULL_REQUEST_TEMPLATE.md | 54 +++++----- docs/.markdownlint.yaml => .markdownlint.yaml | 20 +++- .pre-commit-config.yaml | 15 ++- CODE_OF_CONDUCT.md | 40 +++---- README.md | 102 +++++++++--------- SECURITY.md | 16 +-- docs/about-us.md | 14 +-- docs/developers/contributing.md | 46 ++++---- docs/developers/style-guide.md | 97 +++++++++++------ docs/index.md | 20 ++-- docs/join-us.md | 12 +-- docs/quick-start.md | 6 +- docs/recipes/data-preparation.md | 6 +- fast_llm/models/custom/readme.md | 27 +++-- 16 files changed, 281 insertions(+), 228 deletions(-) rename docs/.markdownlint.yaml => .markdownlint.yaml (69%) diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md index 72e68d69..f5879ce4 100644 --- a/.github/ISSUE_TEMPLATE/bug_report.md +++ b/.github/ISSUE_TEMPLATE/bug_report.md @@ -7,16 +7,16 @@ assignees: jlamypoirier --- -# 🐞 Describe the Bug +## 🐞 Describe the Bug Provide a clear and concise description of the bug. -# πŸ”„ Steps to Reproduce +## πŸ”„ Steps to Reproduce Steps to reproduce the behavior: -1. **Get the relevant Fast-LLM version** (e.g., git commit hash or Docker image tag) that you encountered the issue with. -2. **Run the following command** (modify or redact as needed): +1. **Get the relevant Fast-LLM version** (e.g., git commit hash or Docker image tag) that you encountered the issue with. +2. **Run the following command** (modify or redact as needed): ```bash torchrun --rdzv_backend=static \ @@ -31,14 +31,14 @@ Steps to reproduce the behavior: --config /path/to/your/config.yaml ``` -3. **Include relevant log excerpts** to help us diagnose the issue, with `NCCL_DEBUG=INFO` (or higher) enabled. Make sure the logs contain the full configuration of the run. -4. **Provide the configuration YAML** used for the Fast-LLM setup if logs are unavailable. +3. **Include relevant log excerpts** to help us diagnose the issue, with `NCCL_DEBUG=INFO` (or higher) enabled. Make sure the logs contain the full configuration of the run. +4. **Provide the configuration YAML** used for the Fast-LLM setup if logs are unavailable. -# 🎯 Expected Behavior +## 🎯 Expected Behavior Describe what you expected to happen. -# πŸ“œ Environment Information +## πŸ“œ Environment Information Run the following script in your environment and paste its output here: @@ -105,10 +105,10 @@ fi echo "=== END OF ENVIRONMENT INFORMATION ===" ``` -# πŸ“ Additional Context +## πŸ“ Additional Context Include any other information that may help us understand the issue, such as: -- Recent changes to the configuration or code. -- Whether the issue occurs consistently or intermittently. -- Any troubleshooting steps you have already tried. +- Recent changes to the configuration or code. +- Whether the issue occurs consistently or intermittently. +- Any troubleshooting steps you have already tried. diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index 1b434b9b..d258dedd 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -7,27 +7,27 @@ assignees: '' --- -# 🧐 Problem Description +## 🧐 Problem Description Is your feature request related to a specific problem? Please describe it clearly. 
For example: "I'm always frustrated when [...]" -# πŸ’‘ Proposed Solution +## πŸ’‘ Proposed Solution Describe the solution you would like to see. Be as specific as possible about how it would work or be implemented. -# πŸ”„ Alternatives Considered +## πŸ”„ Alternatives Considered Have you considered any alternative solutions or approaches? If so, please describe them and explain why they might not be ideal. -# πŸ“ˆ Potential Benefits +## πŸ“ˆ Potential Benefits Explain how this feature could benefit Fast-LLM users. Consider how it might improve performance, usability, scalability, etc. -# πŸ“ Additional Context +## πŸ“ Additional Context Add any other context or information that could help us understand the feature request better. If applicable, provide links to relevant references or examples. diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index 8b595407..6330048c 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -9,21 +9,21 @@ Closes # Select all that apply: -- [ ] πŸ› **Bug fix** (non-breaking change that addresses a specific issue) -- [ ] πŸš€ **New feature** (non-breaking change that adds functionality) -- [ ] ⚠️ **Breaking change** (a change that could affect existing functionality) -- [ ] πŸ“ˆ **Performance improvement/optimization** (improves speed, memory usage, or efficiency) -- [ ] πŸ› οΈ **Code refactor** (non-functional changes that improve code readability, structure, etc.) -- [ ] πŸ“¦ **Dependency bump** (updates dependencies, including Dockerfile or package changes) -- [ ] πŸ“ **Documentation change** (updates documentation, including new content or typo fixes) -- [ ] πŸ”§ **Infrastructure/Build change** (affects build process, CI/CD, or dependencies) +- [ ] πŸ› **Bug fix** (non-breaking change that addresses a specific issue) +- [ ] πŸš€ **New feature** (non-breaking change that adds functionality) +- [ ] ⚠️ **Breaking change** (a change that could affect existing functionality) +- [ ] πŸ“ˆ **Performance improvement/optimization** (improves speed, memory usage, or efficiency) +- [ ] πŸ› οΈ **Code refactor** (non-functional changes that improve code readability, structure, etc.) +- [ ] πŸ“¦ **Dependency bump** (updates dependencies, including Dockerfile or package changes) +- [ ] πŸ“ **Documentation change** (updates documentation, including new content or typo fixes) +- [ ] πŸ”§ **Infrastructure/Build change** (affects build process, CI/CD, or dependencies) ## πŸ“ Changes List the key changes introduced in this PR: -1. Change A -2. Change B +1. Change A +2. Change B ## βœ… Checklist @@ -31,32 +31,32 @@ Make sure the following tasks are completed before submitting the PR: ### General -- [ ] πŸ“œ I have read and followed the [contributing guidelines](https://servicenow.github.io/Fast-LLM/developers/contributing). -- [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced. -- [ ] πŸŽ‰ The functionality is complete, and I have tested the changes. -- [ ] πŸ“ I have updated the documentation if needed. -- [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases). -- [ ] 🧩 I have commented my code, especially in hard-to-understand areas. +- [ ] πŸ“œ I have read and followed the [contributing guidelines](https://servicenow.github.io/Fast-LLM/developers/contributing). +- [ ] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced. 
+- [ ] πŸŽ‰ The functionality is complete, and I have tested the changes. +- [ ] πŸ“ I have updated the documentation if needed. +- [ ] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases). +- [ ] 🧩 I have commented my code, especially in hard-to-understand areas. ### Dependencies and Configuration -- [ ] πŸ‹ I have updated the Docker configuration or dependencies, if applicable. -- [ ] πŸ”„ I have ensured compatibility with the existing setup after dependency changes. +- [ ] πŸ‹ I have updated the Docker configuration or dependencies, if applicable. +- [ ] πŸ”„ I have ensured compatibility with the existing setup after dependency changes. ### Testing -- [ ] πŸ§ͺ I have added or updated tests to cover my changes. -- [ ] βœ”οΈ New and existing tests pass locally with my changes. -- [ ] 🚦 I have tested these changes on GPUs and verified training stability. -- [ ] πŸ‹οΈ I have tested the changes on realistic training workloads, if applicable. +- [ ] πŸ§ͺ I have added or updated tests to cover my changes. +- [ ] βœ”οΈ New and existing tests pass locally with my changes. +- [ ] 🚦 I have tested these changes on GPUs and verified training stability. +- [ ] πŸ‹οΈ I have tested the changes on realistic training workloads, if applicable. ### Performance Impact -- [ ] πŸ“Š I have run benchmarks where applicable to evaluate the performance impact. -- [ ] βœ… The benchmarks show no performance regression. -- [ ] πŸš€ The benchmarks indicate a potential performance improvement. -- [ ] ⚠️ The benchmarks indicate a potential performance degradation. -- [ ] πŸ“ˆ I have provided benchmark results and detailed any performance impact below, if applicable. +- [ ] πŸ“Š I have run benchmarks where applicable to evaluate the performance impact. +- [ ] βœ… The benchmarks show no performance regression. +- [ ] πŸš€ The benchmarks indicate a potential performance improvement. +- [ ] ⚠️ The benchmarks indicate a potential performance degradation. +- [ ] πŸ“ˆ I have provided benchmark results and detailed any performance impact below, if applicable. ## πŸ“Š Performance Impact Details diff --git a/docs/.markdownlint.yaml b/.markdownlint.yaml similarity index 69% rename from docs/.markdownlint.yaml rename to .markdownlint.yaml index 44d5cf91..bdd8af70 100644 --- a/docs/.markdownlint.yaml +++ b/.markdownlint.yaml @@ -20,13 +20,23 @@ MD010: # MD013/line-length : Line length : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md013.md MD013: false +# MD024/no-duplicate-heading Multiple headings with the same content (disabled because we do it). +MD024: false + +# Temporarily disabled because not automatically fixed. 
# MD030/list-marker-space : Spaces after list markers : https://github.com/DavidAnson/markdownlint/blob/v0.32.1/doc/md030.md -MD030: +MD030: false # Spaces for single-line unordered list items - ul_single: 3 + # ul_single: 3 # Spaces for single-line ordered list items - ol_single: 2 + # ol_single: 2 # Spaces for multi-line unordered list items - ul_multi: 3 + # ul_multi: 3 # Spaces for multi-line ordered list items - ol_multi: 2 + # ol_multi: 2 + +# Code block style (disable because of interactions with mkdocs note blocks) +MD046: false + +# Link and image reference definitions (disable because of interactions with mkdocs footnotes) +MD053: false diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index f8465c52..c6d2671a 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -2,7 +2,7 @@ # See https://pre-commit.com/hooks.html for more hooks repos: - repo: https://github.com/pre-commit/pre-commit-hooks - rev: v4.6.0 + rev: v5.0.0 hooks: - id: trailing-whitespace - id: end-of-file-fixer @@ -11,7 +11,7 @@ repos: - --unsafe - id: check-added-large-files - repo: https://github.com/asottile/pyupgrade - rev: v3.17.0 + rev: v3.19.1 hooks: - id: pyupgrade args: @@ -42,9 +42,18 @@ repos: name: isort (pyi) types: [pyi] - repo: https://github.com/psf/black - rev: 24.8.0 + rev: 24.10.0 hooks: - id: black args: - "--config" - "./pyproject.toml" +- repo: https://github.com/DavidAnson/markdownlint-cli2 + rev: v0.16.0 + hooks: + - id: markdownlint-cli2 + name: markdownlint + entry: markdownlint-cli2 + args: ["--fix"] + language: node + types: [markdown] diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md index 4e623f9f..3a639c3f 100644 --- a/CODE_OF_CONDUCT.md +++ b/CODE_OF_CONDUCT.md @@ -6,15 +6,15 @@ This code of conduct provides guidelines for participation in ServiceNow-managed Communities thrive when members support each other and provide useful feedback. -- Be polite and courteous. Respect and treat others as you would expect to be treated yourself. -- Respect your audience. Posts should not upset, annoy, threaten, harass, abuse or embarrass other members. -- User Contributions must not include material that is defamatory, obscene, indecent, abusive, offensive, harassing, violent, hateful, inflammatory or otherwise objectionable. -- Lively and collegial discussions are always encouraged in a healthy community. It is okay to argue facts but not okay to argue personalities or personal beliefs. -- Do not use text formats such as all caps or bold that may be read as annoying, rude or send a strong message. -- Do not publish anyone's private personal information without their explicit consent. -- Avoid using abbreviations or terminology that others may not understand. An abbreviation may mean something to you but in another context or country, it may have another meaning. -- Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours. -- Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments. +- Be polite and courteous. Respect and treat others as you would expect to be treated yourself. +- Respect your audience. Posts should not upset, annoy, threaten, harass, abuse or embarrass other members. 
+- User Contributions must not include material that is defamatory, obscene, indecent, abusive, offensive, harassing, violent, hateful, inflammatory or otherwise objectionable. +- Lively and collegial discussions are always encouraged in a healthy community. It is okay to argue facts but not okay to argue personalities or personal beliefs. +- Do not use text formats such as all caps or bold that may be read as annoying, rude or send a strong message. +- Do not publish anyone's private personal information without their explicit consent. +- Avoid using abbreviations or terminology that others may not understand. An abbreviation may mean something to you but in another context or country, it may have another meaning. +- Be accountable for your actions by correcting your mistakes and indicating where you have changed a previous post of yours. +- Mark content as correct and helpful, and provide feedback. If you read a discussion post that you find helpful, we encourage you to leave a positive vote and comment in the replies. If you find a post that is unhelpful, please provide more information in the issue comments. ## Issue board guidelines @@ -22,20 +22,20 @@ Many open-source projects provide an Issues board, with similar functionality to ServiceNow suggests the following technical support pathways for open-source projects: -1. Clearly identify and document the issue or question you have. -2. View the Documentation. -3. Search the Discussions. -4. Search the project documentation for known errors, useful solutions, and troubleshooting tips. -5. Check the project contribution guidelines if you would like details on how you can submit a change. Community contributions are valued and appreciated! -6. Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users. -7. Contact the project team contributors of the project to see if they can help as a last resort only. +1. Clearly identify and document the issue or question you have. +2. View the Documentation. +3. Search the Discussions. +4. Search the project documentation for known errors, useful solutions, and troubleshooting tips. +5. Check the project contribution guidelines if you would like details on how you can submit a change. Community contributions are valued and appreciated! +6. Log an Issue if it hasn't already been logged. If the issue has already been logged by another user, vote it up, and add a comment with additional or missing information. Do your best to choose the correct category when logging a new issue. This will make it easier to differentiate bugs from new feature requests or ideas. If after logging an issue you find the solution, please close your issue and provide a comment with the solution. This will help the project owners and other users. +7. Contact the project team contributors of the project to see if they can help as a last resort only. ## Repositories -- Read and follow the license instructions -- Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference. 
-- β€˜Star' project repos to save for future reference. -- β€˜Watch' project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely. +- Read and follow the license instructions +- Remember to include citations if you use someone else's work in your own project. Use the [`CITATION.cff`](CITATION.cff) to find the correct project citation reference. +- β€˜Star' project repos to save for future reference. +- β€˜Watch' project repos to get notifications of changes – this can get noisy for some projects, so only watch the ones you really need to track closely. ## Enforcement and reporting diff --git a/README.md b/README.md index d02e7f95..91b04f38 100644 --- a/README.md +++ b/README.md @@ -1,12 +1,15 @@ + +
+ Fast-LLM [![Docker][ci-badge]][ci-workflow] [![Documentation][docs-badge]][docs-workflow] [![License][license-badge]][license] -*Accelerating your LLM training to full speed* +# Accelerating your LLM training to full speed Made with ❀️ by [ServiceNow Research][servicenow-research] @@ -25,36 +28,36 @@ As a truly open-source project, Fast-LLM allows full customization and extension ## Why Fast-LLM? -1. πŸš€ **Fast-LLM is Blazingly Fast**: - - ⚑️ Optimized kernel efficiency and reduced overheads. - - πŸ”‹ Optimized memory usage for best performance. - - ⏳ Minimizes training time and cost. - -2. πŸ“ˆ **Fast-LLM is Highly Scalable**: - - πŸ“‘ Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline). - - πŸ”— Supports sequence length parallelism to handle longer sequences effectively. - - 🧠 ZeRO-1, ZeRO-2, and ZeRO-3 implementations for improved memory efficiency. - - πŸŽ›οΈ Mixed precision training support for better performance. - - πŸ‹οΈβ€β™‚οΈ Large batch training and gradient accumulation support. - - πŸ”„ Reproducible training with deterministic behavior. - -3. 🎨 **Fast-LLM is Incredibly Flexible**: - - πŸ€– Compatible with all common language model architectures in a unified class. - - ⚑ Efficient dropless Mixture-of-Experts (MoE) implementation with SoTA performance. - - 🧩 Customizable language model architectures, data loaders, loss functions, and optimizers (in progress). - - πŸ€— Seamless integration with [Hugging Face Transformers][transformers]. - -4. 🎯 **Fast-LLM is Super Easy to Use**: - - πŸ“¦ [Pre-built Docker images](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) for quick deployment. - - πŸ“ Simple YAML configuration for hassle-free setup. - - πŸ’» Command-line interface for easy launches. - - πŸ“Š Detailed logging and real-time monitoring features. - - πŸ“š Extensive [documentation][docs] and practical tutorials (in progress). - -5. 🌐 **Fast-LLM is Truly Open Source**: - - βš–οΈ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research. - - πŸ’» Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues]. - - 🀝 Contributions and collaboration are always welcome! +1. πŸš€ **Fast-LLM is Blazingly Fast**: + - ⚑️ Optimized kernel efficiency and reduced overheads. + - πŸ”‹ Optimized memory usage for best performance. + - ⏳ Minimizes training time and cost. + +2. πŸ“ˆ **Fast-LLM is Highly Scalable**: + - πŸ“‘ Distributed training across multiple GPUs and nodes using 3D parallelism (Data, Tensor, and Pipeline). + - πŸ”— Supports sequence length parallelism to handle longer sequences effectively. + - 🧠 ZeRO-1, ZeRO-2, and ZeRO-3 implementations for improved memory efficiency. + - πŸŽ›οΈ Mixed precision training support for better performance. + - πŸ‹οΈβ€β™‚οΈ Large batch training and gradient accumulation support. + - πŸ”„ Reproducible training with deterministic behavior. + +3. 🎨 **Fast-LLM is Incredibly Flexible**: + - πŸ€– Compatible with all common language model architectures in a unified class. + - ⚑ Efficient dropless Mixture-of-Experts (MoE) implementation with SoTA performance. + - 🧩 Customizable language model architectures, data loaders, loss functions, and optimizers (in progress). + - πŸ€— Seamless integration with [Hugging Face Transformers][transformers]. + +4. 
🎯 **Fast-LLM is Super Easy to Use**: + - πŸ“¦ [Pre-built Docker images](https://github.com/ServiceNow/Fast-LLM/pkgs/container/fast-llm) for quick deployment. + - πŸ“ Simple YAML configuration for hassle-free setup. + - πŸ’» Command-line interface for easy launches. + - πŸ“Š Detailed logging and real-time monitoring features. + - πŸ“š Extensive [documentation][docs] and practical tutorials (in progress). + +5. 🌐 **Fast-LLM is Truly Open Source**: + - βš–οΈ Licensed under [Apache 2.0][license] for maximum freedom to use Fast-LLM at work, in your projects, or for research. + - πŸ’» Transparently developed on GitHub with public [roadmap][roadmap] and [issue tracking][issues]. + - 🀝 Contributions and collaboration are always welcome! ## Usage @@ -71,14 +74,14 @@ Expect to see a significant speedup in training time compared to other libraries #### Prerequisites -- A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each. -- CUDA 12.1 or higher. -- Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes. +- A [Slurm](https://slurm.schedmd.com/) cluster with at least 4 DGX nodes with 8 A100-80GB or H100-80GB GPUs each. +- CUDA 12.1 or higher. +- Dependencies: [PyTorch][pytorch], [Triton][triton], and [Apex](https://github.com/NVIDIA/apex) installed on all nodes. #### Steps -1. Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies. -2. Install Fast-LLM on all nodes: +1. Deploy the [nvcr.io/nvidia/pytorch:24.07-py3](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) Docker image to all nodes (recommended), because it contains all the necessary dependencies. +2. Install Fast-LLM on all nodes: ```bash sbatch < loss`). This gives other users a better chance to understand the code. +* Use meaningful, self-descriptive identifier names (ex. `x -> loss`). Abstract variable names such as `x` are however OK for generic methods where more descriptive names aren't appropriate (ex. `add(x, y)`). -* Avoid abbreviations, especially domain-specific ones. Ex. `bs -> batch_size`. -This gives everyone a chance to understand the code, regardless of their prior knowledge. -* Avoid redundancies especially for configuration parameters, ex. `data.data_type` -> `data.type`. -* Avoid name parts that refer to the data type, ex. `num`. Use type hints instead. - -Note that these conventions are enforced more strictly on user-facing names since they are more difficult to change, +* Please avoid abbreviations, especially domain-specific ones. +This gives everyone a chance to understand the code, regardless of their prior knowledge. Ex. `bs -> batch_size`. +* Try to keep names concise, for example by eliminating redundancies +and avoiding data type qualifiers such as `num` (covered by the type hint). +This is especially important for configuration parameters as the fully qualified names can get very long. +For example, `transformer.num_transformers_heads` can be simplified to `transformer.heads` without sacrificing clarity. + +Note that these conventions are especially important on user-facing names which are more difficult to change, for example configuration parameters and the public interface of core classes and modules. +!!! 
note "Why this matters" + Using explicit, self-explanatory names gives other users a better chance to understand the code, + regardless of their prior knowledge, which facilitates collaboration and maintenance. + Our conventions follow this principle, while attempting to avoid excessively long names. + ## πŸ›¬ Imports We use the following conventions for imports (other than those enforced by isort): -* Import standard library and third party modules by module (ex. `import package.module`, not `from package.module import method`). -In addition to keeping the code consistent, this keeps identifier's origin explicit so anyone can tell where it came from with just a quick glance at the code. This is especially useful for identifiers that with otherwise ambiguous source (ex. `float32` may come from torch, numpy, triton, etc.; Fast-LLM's configuration scheme has many identifiers in common with `dataclasses`, `omegaconf` and `pydantic`) -* Avoid renaming with `as`, except for some (arbitrarily chosen) common ones: `numpy as np`, `triton.language as tl`. -* Import first-party modules through specific identifiers (ex. `from fast_llm.module import method`, not `import fast_llm.module`). This keeps Fast-LLM identifiers to a manageable length and makes it easier to track what is used in a given file. -* Always use absolute imports (ex. no `from .module import method`) -* Include all explicitly-imported third-party module to `setup.cfg`. +* Import standard library and third party modules by module (ex. `import package.module`, not `from package.module import method`). +In addition to keeping the code consistent, this keeps identifier's origin explicit so anyone can tell where it came from with just a quick glance at the code. +* Avoid renaming with `as`, except for some (arbitrarily chosen) common ones: `numpy as np`, `triton.language as tl`. +* Import first-party modules through specific identifiers (ex. `from fast_llm.module import method`, not `import fast_llm.module`). This keeps Fast-LLM identifiers to a manageable length and makes it easier to track what is used in a given file. +* Always use absolute imports (ex. no `from .module import method`) +* Include all explicitly-imported third-party module to `setup.cfg`. Only add new requirements if they provide a substantial benefit, as we try to keep the requirements to a minimum. -* Prefer file-level imports over imports inside methods, unless they significantly slow down the import process +* Prefer file-level imports over imports inside methods, unless they significantly slow down the import process or concern an optional dependency that should not be absolutely required to import the module (ex. `transformers`). If an offending import is only required for a type hint, include it in a `if typing.TYPE_CHECKING:` block. -!!! warning "Configuration modules" +!!! note "Why this matters" + Most python conventions make no clear recommendation concerning imports, + which can easily lead to inconsistent import formats across a repo, and can make it harder to understand. + Our conventions aim to avoid these arbitrary choices by providing an explicit prescription, + which should be good enough nearly everywhere. Our choice is justified as follows: + + * For third-party and standard library packages, fully qualified identifiers are typically relatively short, + so it makes sense to keep them. + This also keeps identifier's origin explicit so anyone can tell where it came from with just a quick glance at the code. 

+!!! warning "Configuration modules"
     Fast-LLM supports instantiation and validation of configurations with a barebone installation.
     Because of this, modules that contain configuration classes (usually named `config.py`)
     should not include any top-level third-party import
     (except for those installed in the [barebone install](https://github.com/ServiceNow/Fast-LLM/blob/main/setup.cfg)),
@@ -71,29 +92,43 @@ If an offending import is only required for a type hint, include it in a `if typ

 ## πŸ”“ Public and Private Interface

-Although good practices of object-oriented programming are generally ignored in python,
-Fast-LLM attempts to follow them to an extent, while avoiding unnecessary bloat:
+We use the following conventions for class and module interfaces:

-* Mark private and protected variables with an underscore `_` prefix.
+* Mark private and protected variables with an underscore `_` prefix.
 As is customary in python, we make no distinction between the two and avoid the double-underscore `__` notation.
-* Keep public interfaces (methods and variables without underscore prefix) as lean as possible,
+* Keep public interfaces (methods and variables without underscore prefix) as lean as possible,
 i.e. mark everything as private/protected unless there is a clear need to make it public.
 We can always add to the public interface later, but removing from it is difficult.
-* Use accessors sparingly through the `@property` decorator or equivalent,
+* Use accessors sparingly through the `@property` decorator or equivalent,
 usually to define read-only public variables.

+!!! note "Why this matters"
+    Although good practices of object-oriented programming are generally ignored in python,
+    Fast-LLM attempts to follow them to an extent, while avoiding unnecessary bloat.
+    Public interfaces are expected to be stable,
+    which makes further modifications difficult as they could break external code.
+    On the other hand, private interfaces are freely modifiable,
+    which provides more freedom for fixes, improvements, refactoring, etc.
+    Therefore, having lean public interfaces is critical for us to keep maintaining and improving Fast-LLM.
+
 ## πŸ’‘ Type Hints

 Fast-LLM uses type hints for several reasons, including code readability, type checking in IDEs, and type validation for configurations:

-* Always use type hints for the public interface of a classes and modules.
-Type hints for method outputs may be omitted if they can be easily inferred.
-* Prefer using type hints in private interfaces, especially if it improves readability and/or static type checking.
-* Use newer type hint formats when possible, ex. `typing.List -> list`, `typing.Union(A,B) -> A | B`.
+* Always use type hints for the public interface of classes and modules.
+Type hints for method outputs may be omitted if they can be trivially inferred,
+ex. if they return the input, an explicitly typed variable, or nothing.
+* Prefer using type hints in private interfaces, especially if it improves readability and/or static type checking.
+* Prefer newer type hint formats over older ones, ex. `typing.List -> list`, `typing.Union[A, B] -> A | B`.
+
+!!! note "Why this matters"
+    We use type hints for various reasons. In addition to making the code more understandable,
+    they are used by IDEs such as VS Code or PyCharm to perform static type checking,
+    which speeds up development and is essential to keeping the code bug-free.
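+
+As a quick illustration of the interface and type hint conventions above (a hypothetical example, not taken from the Fast-LLM codebase):
+
+```python
+class RunningMean:
+    """Track a running mean with a lean, typed public interface."""
+
+    def __init__(self, values: list[float] | None = None):
+        # Private state: underscore prefix, not part of the public interface.
+        self._values: list[float] = [] if values is None else list(values)
+
+    def update(self, value: float) -> None:
+        self._values.append(value)
+
+    @property
+    def mean(self) -> float:
+        # Read-only accessor exposed through `@property`.
+        return sum(self._values) / max(len(self._values), 1)
+```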

 ## πŸ—‘οΈ Misc

-* Please add descriptions and comments as needed, especially for parts that would otherwise be difficult to understand.
-* Use `pathlib` rather than `os.path`.
-* We encourage the use of modern python features when beneficial, up to the minimum python version (3.12).
+* Please add descriptions and comments as needed, especially for parts that would otherwise be difficult to understand.
+* Use `pathlib` rather than `os.path`.
+* We encourage the use of modern python features when beneficial, up to the minimum python version (3.12).
diff --git a/docs/index.md b/docs/index.md
index d60b405b..9a543d6f 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -50,9 +50,9 @@ Fast-LLM offers all the capabilities you need to accelerate your LLM training an

 Fast-LLM powers the world's most advanced AI projects:

-- **NLP Research and Development:** Train state-of-the-art language models for natural language understanding, summarization, and conversational AI.
-- **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration.
-- **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning.
+- **NLP Research and Development:** Train state-of-the-art language models for natural language understanding, summarization, and conversational AI.
+- **Enterprise AI Solutions:** Accelerate time-to-market for AI products by reducing training costs and enabling faster iteration.
+- **Academic Collaborations:** Drive AI innovation with high-performance training capabilities that support cutting-edge research in machine learning.

 See how Fast-LLM has helped early adopters achieve faster results. [Explore use cases and success stories](success-stories/starcoder-2.md).

@@ -60,18 +60,18 @@ See how Fast-LLM has helped early adopters achieve faster results. [Explore use

 Fast-LLM is designed to be the **go-to solution** for those training the most sophisticated language models. Our objectives include:

-- **Accelerating Training Workflows:** Deliver the fastest LLM training experience with optimized kernel efficiency, parallelism, and memory management.
-- **Supporting a Broad Range of Architectures:** Offer built-in support for all major language model architectures, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models.
-- **Enabling Seamless Integration and Deployment:** Integrate effortlessly into existing ML pipelines, including [HuggingFace Transformers](https://huggingface.co/transformers) and [Kubernetes](https://kubernetes.io)-based clusters.
-- **Advancing LLM Research and Production-Readiness:** Be suitable for both cutting-edge research and mission-critical production workloads.
+- **Accelerating Training Workflows:** Deliver the fastest LLM training experience with optimized kernel efficiency, parallelism, and memory management.
+- **Supporting a Broad Range of Architectures:** Offer built-in support for all major language model architectures, with an architecture-agnostic approach that allows users to easily adapt the framework to emerging models. +- **Enabling Seamless Integration and Deployment:** Integrate effortlessly into existing ML pipelines, including [HuggingFace Transformers](https://huggingface.co/transformers) and [Kubernetes](https://kubernetes.io)-based clusters. +- **Advancing LLM Research and Production-Readiness:** Be suitable for both cutting-edge research and mission-critical production workloads. ## Collaboration and Contribution As Fast-LLM evolves, we invite the community to contribute and help shape its future. We welcome: -- **Testing and Bug Fixes:** Help us identify issues and improve stability. -- **Feature Development:** Contribute new models, new training features, and new optimizations. -- **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our documentation and writing practical guides. +- **Testing and Bug Fixes:** Help us identify issues and improve stability. +- **Feature Development:** Contribute new models, new training features, and new optimizations. +- **Documentation and Tutorials:** Make Fast-LLM more accessible by improving our documentation and writing practical guides. Fast-LLM is more than just software, it's a community. Get involved by exploring our [contribution guidelines](developers/contributing.md) and engaging with us on [GitHub Discussions](https://github.com/ServiceNow/Fast-LLM/discussions). diff --git a/docs/join-us.md b/docs/join-us.md index 31ff49ab..26154314 100644 --- a/docs/join-us.md +++ b/docs/join-us.md @@ -16,15 +16,15 @@ Want to keep up with the latest Fast-LLM updates and new opportunities to get in Fast-LLM thrives on collaboration, and we're excited to welcome new contributors! From fixing bugs to adding new features, every code contribution makes a difference. If you're just getting started, our [Good First Issues](https://github.com/ServiceNow/Fast-LLM/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) on GitHub are labeled to help newcomers find approachable tasks. To set up your development environment and get oriented with Fast-LLM, check out our **Developer's Corner** for everything you need: -- [**Contributing**](developers/contributing.md) – for setup instructions and contributing guidelines -- [**Best Practices**](developers/dev-practices.md) – for tips on writing clean, maintainable code +- [**Contributing**](developers/contributing.md) – for setup instructions and contributing guidelines +- [**Best Practices**](developers/dev-practices.md) – for tips on writing clean, maintainable code Here's a quick overview of the process: -1. **Fork & Clone**: Start by forking the repo and cloning it to your machine. -2. **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity. -3. **Write Awesome Code**: Make your changes, document them, and follow our best practices. -4. **Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community. +1. **Fork & Clone**: Start by forking the repo and cloning it to your machine. +2. **Set Up Your Dev Environment**: The Developer's Corner guides you through configuring your environment for maximum productivity. +3. **Write Awesome Code**: Make your changes, document them, and follow our best practices. +4. 
**Open a Pull Request**: Submit a PR to showcase your work and get feedback from our team and the community. Explore our [Developer's Corner](developers/contributing.md) for everything you need to get started! diff --git a/docs/quick-start.md b/docs/quick-start.md index 56189d0c..b4c208f4 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -10,9 +10,9 @@ To follow this guide, you'll need: - **Hardware**: At least one NVIDIA GPU, preferably with Ampere architecture or newer. Note that this tutorial is designed for 80 GB A100s or H100 GPUs, and some adjustments are needed to run it with less memory or an earlier architecture. - **Software**: Depending on your setup, you'll need one of the following: - - **Docker**: If you're using the prebuilt Docker image on your local machine. - - **Python 3.10**: If you're setting up a custom environment (virtual environment, bare-metal, etc.) on your local machine. - - **Cluster Setup**: Access to a Docker-enabled Slurm cluster or to a Kubernetes cluster with Kubeflow if you're using those environments. + - **Docker**: If you're using the prebuilt Docker image on your local machine. + - **Python 3.10**: If you're setting up a custom environment (virtual environment, bare-metal, etc.) on your local machine. + - **Cluster Setup**: Access to a Docker-enabled Slurm cluster or to a Kubernetes cluster with Kubeflow if you're using those environments. ## πŸ— Step 1: Initial Setup diff --git a/docs/recipes/data-preparation.md b/docs/recipes/data-preparation.md index be0f8ef0..412cafb2 100644 --- a/docs/recipes/data-preparation.md +++ b/docs/recipes/data-preparation.md @@ -11,9 +11,9 @@ For this guide, you would need: - **Hardware**: Just a machine with CPUs will do. But having a large numbers of CPUs and nodes helps distribute the data preparation job and significantly speed things up. - **Software**: Depending on your setup, you'll need one of the following: - - **Docker**: If you're using the prebuilt Docker image on your local machine. - - **Python 3.10**: If you're setting up a custom environment (virtual environment, bare-metal, etc.) on your local machine. - - **Cluster Setup**: Access to a Docker-enabled Slurm cluster or to a Kubernetes cluster with Kubeflow if you're using those environments. + - **Docker**: If you're using the prebuilt Docker image on your local machine. + - **Python 3.10**: If you're setting up a custom environment (virtual environment, bare-metal, etc.) on your local machine. + - **Cluster Setup**: Access to a Docker-enabled Slurm cluster or to a Kubernetes cluster with Kubeflow if you're using those environments. ## πŸ“š Step 1: Download the dataset from Huggingface diff --git a/fast_llm/models/custom/readme.md b/fast_llm/models/custom/readme.md index bb3330a3..ca005908 100644 --- a/fast_llm/models/custom/readme.md +++ b/fast_llm/models/custom/readme.md @@ -4,18 +4,17 @@ The "custom" model is a template for customized training of a GPT-style model, for example to fine-tune it for a particular class. This is typically done as follows: -1. Create a copy of the `custom` model, and rename it appropriately, ex. `my_model`, `MyModelTrainer`, etc. -2. If necessary, adjust the base classes to inherit from more abstract classes or another model. +1. Create a copy of the `custom` model, and rename it appropriately, ex. `my_model`, `MyModelTrainer`, etc. +2. If necessary, adjust the base classes to inherit from more abstract classes or another model. ex. `MyModelData(AbstractData)` to re-implement data processing from scratch. -3. 
Add custom configuration fields in `config.py`. -4. Adapt or re-implement the data loading scheme in `MyModelData`. -5. Adapt or re-implement the preprocessing scheme in `MyModelBaseModel`. -6. Adapt or re-implement the model head, ex. change the task and/or add a custom loss. -7. If needed, adapt the huggingface interface to return outputs for the desired task. -8. Apply other changes as needed. -9. Add the new model to the registry (`models.auto.py`) so it can be used through the cli. -10. Run training with the new model, ex. `fast-llm train my_model [...]`. - +3. Add custom configuration fields in `config.py`. +4. Adapt or re-implement the data loading scheme in `MyModelData`. +5. Adapt or re-implement the preprocessing scheme in `MyModelBaseModel`. +6. Adapt or re-implement the model head, ex. change the task and/or add a custom loss. +7. If needed, adapt the huggingface interface to return outputs for the desired task. +8. Apply other changes as needed. +9. Add the new model to the registry (`models.auto.py`) so it can be used through the cli. +10. Run training with the new model, ex. `fast-llm train my_model [...]`. ## Preprocessing variables and kwargs @@ -26,10 +25,10 @@ Those kwargs will be passed directly to the `forward` method of each layer and c In some cases, it may be desirable to modify the `kwargs` inside a layer, for example to pass additional data to other layers or to the backward pass. This possible with certain caveats: -* There is no direct support for autograd. Detaching tensors is recommended to prevent memory losses. -* Such modifications may be incompatible with pipeline parallelism, -as the data will not be transferred to pipeline-parallel devices. +* There is no direct support for autograd. Detaching tensors is recommended to prevent memory losses. +* Such modifications may be incompatible with pipeline parallelism, +as the data will not be transferred to pipeline-parallel devices. ## Disclaimer