Add doc for dedicated local storage in ilab cli #104

Open. Wants to merge 3 commits into base: main. Showing changes from 1 commit.
184 changes: 184 additions & 0 deletions docs/cli/ilab-local-config.md
@@ -0,0 +1,184 @@
# InstructLab CLI Local Storage Proposal

## Why is this needed?

For tools to be useful to the user, they generally need to work on and know about data pertinent to the user. For example, podman and docker track image blobs, kubectl tracks cluster credentials, etc.

The InstructLab CLI today manages a number of settings, such as the teacher model, taxonomy location, and inference model, using a local file known as `config.yaml`.

Review comment (Member):

The CLI itself doesn't manage any of this. It's user config. It generates an initial file for you, but never updates it from there. I only bring this up because user config isn't the same as data that is never touched directly by a user. (config vs data)

Review comment (Member Author):

This is a good point, I would say it makes a stronger point for the proposed structure of the ~/.local/share directory.

Review comment (Member):

That might be so @RobotSail but as @russellb mentioned the CLI doesn't manage this.


While this method spares ilab the complexity of managing a dedicated directory, it does not scale, because it constrains the user to running ilab from a single directory.
The new ilab CLI will need not only to manage datasets, downloaded models, and any intermediary files, but also to make it simple for the user to orchestrate them.

We presently store these paths directly in the config. However, this
isn't sustainable, because data is frequently moved around without the program knowing. This leaves the user with two choices: either update
the config to point to a relative path of the form
`../../uncle-dir/dataset.jsonl` or, worse, specify an absolute path.


## What's being proposed?

We propose the following:

1. `ilab` will have a dedicated config directory seated at `~/.config/ilab` on Linux/macOS systems, and the `%APPDATA%` equivalent if we decide to support Windows.

Review comment (Member):

presumably this works on windows with WSL? I know people have been using it on WSL, already.

Review comment (Member Author):

If it's WSL then we should still assume it's a linux system. The only WSL-specific bits are more on the CUDA drivers side. And even then, that works fine currently.

Review comment (Member Author):

I was thinking more of "raw windows", but you're right - WSL is a great alternative.

Review comment (Member):

@RobotSail Are we going to support native Windows? If not, we should be explicit that it is WSL.

2. `ilab` will have a dedicated program files directory seated at `~/.local/share/ilab` which will store model checkpoints, downloaded models, etc.

3. `ilab` will be expected to support config migrations in the future, as the config evolves on an as-needed basis.


## Config Directory

The `ilab` CLI will have a dedicated config directory rooted at
`~/.config/ilab`.

This will just house the `config.yaml` as-is, and any other
configuration files that should live here.
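
As a minimal sketch of how the config directory could be located: the helper names `config_dir` and `default_config_path` are hypothetical, and honoring `XDG_CONFIG_HOME` is an assumption on top of this proposal, which only names `~/.config/ilab`.

```python
import os
from pathlib import Path


def config_dir(env=None, home=None):
    """Return the ilab config directory (hypothetical helper, not an existing ilab API)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Assumption: honor XDG_CONFIG_HOME when it is set, else fall back to ~/.config.
    base = env.get("XDG_CONFIG_HOME") or str(home / ".config")
    return Path(base) / "ilab"


def default_config_path(env=None, home=None):
    """Return the path where config.yaml would live by default."""
    return config_dir(env, home) / "config.yaml"
```

Passing `env` and `home` explicitly keeps the lookup easy to test without touching the real environment.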


## Data Directory

`ilab` needs to generate, store, and manage different pieces of data over the course of its lifecycle. This includes model checkpoints, generated data, and various logs from processes such as evaluation.

Files such as base models are large and take a significant amount of time to download. Similarly, generated datasets are expensive to regenerate and should be easy to manage.

For these reasons, we propose that this data will have a permanent home at `~/.local/share/ilab`.

This is a rough outline of all of the data pieces consumed and generated by different points in the process:

| phase | consumed data | generated data |
|-------|---------------|----------------|
| SDG | QnA files from taxonomy, teacher model checkpoint | knowledge dataset for training, skills dataset for training |
| Training | Input dataset, base model | Model checkpoints, processed dataset to be consumed during training |
| Eval | Model checkpoint being evaluated | JSON data / logs containing the score |
| Publish | Model checkpoints to be published | n/a |


Based on this, it's evident that we need a few different subdirectories:

| path | description |
|------|-------------|
| `base_models/` | Stores the base models that we download |
| `checkpoints/` | Houses the checkpoints we generate during training. Can potentially split these out further into `checkpoints/phase00`, `checkpoints/phase05`, and `checkpoints/phase10` |
| `datasets/` | Would contain the generated datasets. Each item here would be a directory consisting of `train.jsonl` and `test.jsonl` |
| `logs/` | Would contain logs from various `ilab` runs |
| `eval_data/` | Contains scores from various evaluation runs for a given model ref |
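
A rough sketch of how this layout could be created on first use. The function name `ensure_data_dirs` is hypothetical; only the subdirectory names come from the table above.

```python
from pathlib import Path

# Subdirectory names taken from the table above.
SUBDIRS = ("base_models", "checkpoints", "datasets", "logs", "eval_data")


def ensure_data_dirs(data_dir):
    """Create the proposed subdirectories under data_dir and return their paths."""
    root = Path(data_dir).expanduser()
    created = []
    for name in SUBDIRS:
        path = root / name
        # exist_ok makes the call idempotent, so it is safe to run on every start.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `ensure_data_dirs("~/.local/share/ilab")` would set up the default location proposed here.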


## How ilab would manage this data

There should be a ***simple and intuitive*** way to view & manage all of this data.

The basic idea is that each piece of information can be discovered through a `list` command in some form. Each item in the listing will have the corresponding associated information, along with a unique reference that the user can use when interacting with subsequent commands.


### Managing model data


New model checkpoints would be generated during training and
saved to the `checkpoints/` directory along with a unique ref.

When base models are downloaded, they would be saved in the `base_models` directory with their basename being the unique ref.


An `ilab model list` command would be added which allows the user to list out existing models. This would allow users to view all models, base models, trained checkpoints, etc. Additionally, the user could further filter out trained checkpoints based on which training phase they were generated in, i.e. `phase00`, `phase05`, and `phase10`.

We can iterate on the behavior and optimize for UX, but the
idea is you should be able to run either: `ilab model list` or
`ilab list models` which produces a full list of all models on the local system.

The output of `ilab model list` would look like the following:

```
MODEL ID   BASE MODEL                    SIZE   DATA TYPE   DATASET TRAINED WITH
abcde      ibm-granite/granite-7b-base   28Gi   FP32        phase00-dataset-id
cbdea      abcde                         28Gi   FP32        phase05-dataset-id
```



Finally, we would add a new `ilab model delete <ref>` command
which would enable us to pass a reference to the model we want
to delete.
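
A minimal sketch of what could sit behind `ilab model list` and `ilab model delete <ref>`. The function names are hypothetical, and it assumes each entry under `base_models/` and `checkpoints/` is named after its ref, which this proposal states for base models but leaves open for checkpoints.

```python
import shutil
from pathlib import Path


def list_models(data_dir):
    """Return (ref, kind) pairs for every model found under data_dir."""
    root = Path(data_dir)
    found = []
    for kind in ("base_models", "checkpoints"):
        folder = root / kind
        if folder.is_dir():
            # Assumption: each entry's basename is its unique ref.
            found.extend((entry.name, kind) for entry in sorted(folder.iterdir()))
    return found


def delete_model(data_dir, ref):
    """Delete the model with the given ref; raise KeyError if the ref is unknown."""
    for name, kind in list_models(data_dir):
        if name == ref:
            target = Path(data_dir) / kind / name
            if target.is_dir():
                shutil.rmtree(target)
            else:
                target.unlink()
            return target
    raise KeyError(ref)
```

Metadata such as size, data type, and training dataset would need to be stored alongside each model; that part is not sketched here.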


### Managing datasets

New datasets would be generated through `ilab generate` (or whatever the current equivalent is) and the corresponding dataset would be saved under the `datasets/` directory. Each dataset would also have a unique ref associated with it.

Following this, we could list out the datasets with `ilab dataset list` and further filter them by `knowledge` or `skills`, or by `phase00`, `phase05`, or `phase10`, depending on which SDG scheme is being used.

Here's an example output from `ilab dataset list`:

```
DATASET ID           PHASE TYPE       SIZE    NUM SAMPLES   TEACHER MODEL
phase00-dataset-id   knowledge_base   500MB   650k          mistralai/mixtral-8x7B
phase05-dataset-id   knowledge        230MB   330k          mistralai/mixtral-8x7B
phase10-dataset-id   skills           500MB   550k          mistralai/mixtral-8x7B
```

An `ilab dataset delete <ref>` command would be added as well
so that the user can easily delete generated datasets.

Additionally, the `ilab model train` command would be able
to receive a ref to a dataset for easy access.
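
One way a dataset ref could be turned into files for training, sketched under the assumption that each dataset lives at `datasets/<ref>/` and holds `train.jsonl` and `test.jsonl` as described earlier. The function name `resolve_dataset` is hypothetical.

```python
from pathlib import Path


def resolve_dataset(data_dir, ref):
    """Map a dataset ref to its (train, test) files, failing loudly if either is missing."""
    dataset_dir = Path(data_dir) / "datasets" / ref
    train = dataset_dir / "train.jsonl"
    test = dataset_dir / "test.jsonl"
    missing = [p.name for p in (train, test) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"dataset {ref!r} is missing: {', '.join(missing)}")
    return train, test
```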


### Managing Eval Runs

When we run evaluations, we expect to receive a certain value which is tied to a model checkpoint.

Under this new framework, the evaluation outputs could be saved under the `eval_data` subdirectory, which `ilab` could later use to look up the score for a particular model.

When we evaluate a particular model with `ilab eval <model-ref>`, we would then save the corresponding score in this new directory and link it back to the model that was evaluated and the metric.

By linking a given model with an evaluation score, we could provide an `ilab eval list` command that can output scores
for particular models.

Here's a sample output of the `ilab eval list` command:

```
MODEL REF   BASE MODEL                    EVAL SCORE   METRIC     EVAL DATE
abcde       ibm-granite/granite-7b-base   84.2%        MMLU       2024-06-24
cbdea       abcde                         78.9%        MMLU       2024-06-23
fghij       ibm-granite/granite-7b-base   7.89         MT-Bench   2024-06-22
```

Since this would be JSON data, we don't expect it to be very expensive or consume a lot of space, so clearing it should not be a priority.



But if it's desired, we could provide an `ilab eval clear` function that erases all of the evaluation data. We could also generate refs for the eval runs as well, and clear them using `ilab eval clear <eval-ref>`.
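
A small sketch of how eval results could be stored, listed, and cleared as JSON files. The function names, the record fields, and the one-file-per-model-and-metric naming are all assumptions made for illustration; the proposal does not fix a storage format beyond JSON.

```python
import json
from pathlib import Path


def save_eval(data_dir, model_ref, metric, score, date):
    """Write one evaluation record as JSON under eval_data/ and return its path."""
    eval_dir = Path(data_dir) / "eval_data"
    eval_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: one file per (model, metric) pair; a rerun overwrites the old score.
    path = eval_dir / f"{model_ref}-{metric}.json"
    record = {"model_ref": model_ref, "metric": metric, "score": score, "date": date}
    path.write_text(json.dumps(record))
    return path


def list_evals(data_dir):
    """Return all stored evaluation records, ordered by file name."""
    eval_dir = Path(data_dir) / "eval_data"
    if not eval_dir.is_dir():
        return []
    return [json.loads(p.read_text()) for p in sorted(eval_dir.glob("*.json"))]


def clear_evals(data_dir):
    """Delete every stored evaluation record and return how many were removed."""
    eval_dir = Path(data_dir) / "eval_data"
    if not eval_dir.is_dir():
        return 0
    removed = 0
    for p in list(eval_dir.glob("*.json")):
        p.unlink()
        removed += 1
    return removed
```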

The output would look something like


## Evolving configs
Review comment (Member):

This seems like a distinct proposal, was there a reason you included it?

Review comment (Member Author):

Yes this is definitely distinct, but will be something we'll need as a feature in the future.



The config object is something we expect to change in the future. To maintain compatibility and decrease user frustration, we should have schema versions for our configs.

For instance, our current one would be a `v1` and if we decide
to add something in the future we would update it to a `v2`.


In the future, we will also want to have config migration as we evolve the library to avoid breaking things for existing users.
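
A minimal sketch of versioned config migration, assuming the config carries a `version` field and that unversioned files are treated as `v1`. The `data_dir` key added by the example migration is purely illustrative (it echoes a suggestion from the review thread) and is not part of any existing schema.

```python
CURRENT_VERSION = 2


def _v1_to_v2(cfg):
    """Hypothetical migration: v2 adds a data_dir option with the proposed default."""
    upgraded = dict(cfg)
    upgraded.setdefault("data_dir", "~/.local/share/ilab")
    upgraded["version"] = 2
    return upgraded


# Maps a schema version to the function that upgrades it by one step.
MIGRATIONS = {1: _v1_to_v2}


def migrate(cfg):
    """Apply migrations one step at a time until cfg reaches CURRENT_VERSION."""
    cfg = dict(cfg)
    version = cfg.get("version", 1)  # assumption: unversioned configs are v1
    while version < CURRENT_VERSION:
        cfg = MIGRATIONS[version](cfg)
        version = cfg["version"]
    return cfg
```

Chaining single-step migrations means a future `v3` only needs one new function, and older configs still upgrade through every intermediate version.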



## Scope of this proposal


This proposal lays out a long-term goal covering numerous components and how we expect them to interact with each other.

The main point here is to move the config to live in a single place so that the `ilab` command can be executed from any directory with confidence. In addition, we require the use of refs and CRUD-style commands to simplify managing and interacting with the data generated by `ilab`.

The steps outlined to implement this are:

1. Implement logic for the `~/.config/ilab` and `~/.local/share/ilab` directories, and expect the config to live in and be read from `~/.config/ilab` unless another config is manually specified. For simplicity, if the config is read from `~/.config/ilab`, then we assume that relative paths are relative to the data directory `~/.local/share/ilab`.

Review comment (Member):

A way this can work is have a data_dir that can be changed via a CLI flag or config file option, but always default to what you have specified here -- very similar to what you describe as a default location for the config dir unless otherwise specified via a flag

Review comment (Member Author):

Yes I agree with that.

1. Extend the existing commands so that datasets, models, and eval scores can be listed via `ilab list`
1. Implement ref logic and update all existing commands to be able to resolve objects from refs


Of these steps, #1 would be the trickiest, because we currently assume that the paths referenced in the config file are relative to the working directory of the calling process.

One approach here could be removing these paths from the default and forcing the user to specify them relatively.

Review comment (Member, @n1hility, Jun 25, 2024):

I am having trouble understanding what you mean on this sentence

Review comment (Member Author):

@n1hility My wording here is bad but the idea is simple:

  • When the config read is the default one ~/.config/ilab/config.yaml then parse paths relative to ~/.local/share/ilab.
  • When that's NOT the config, we would parse relative paths with respect to the calling process. E.g. if ilab is invoked in /foo and the config.model_path references a file in bar/granite_7b.gguf then it would resolve to /foo/bar/granite_7b.gguf.
  • All absolute paths will be parsed as-is


Another approach would be implementing the list commands at the same time as #1 and setting it up so that the relative paths passed to `ilab` are relative ***to the `~/.local/share/ilab` directory***.
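
The three resolution rules described in the review thread above can be sketched as follows. The function name `resolve_path` and its parameters are hypothetical; the default locations are the ones this proposal names.

```python
from pathlib import Path

DEFAULT_CONFIG = Path.home() / ".config" / "ilab" / "config.yaml"
DEFAULT_DATA_DIR = Path.home() / ".local" / "share" / "ilab"


def resolve_path(value, config_path, cwd,
                 default_config=DEFAULT_CONFIG, data_dir=DEFAULT_DATA_DIR):
    """Resolve a path from the config according to where the config was read from."""
    path = Path(value)
    if path.is_absolute():
        # Absolute paths are used as-is.
        return path
    if Path(config_path) == Path(default_config):
        # Default config: relative paths are anchored at the data directory.
        return Path(data_dir) / path
    # Any other config: relative paths are anchored at the caller's working directory.
    return Path(cwd) / path
```

With a custom config and `ilab` invoked in `/foo`, a `model_path` of `bar/granite_7b.gguf` resolves to `/foo/bar/granite_7b.gguf`, matching the example in the thread.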
