
LLM Finetuning Toolkit #67

Merged
benjaminye merged 24 commits into main on Jan 4, 2024

Conversation

@benjaminye (Contributor) commented Jan 3, 2024

This PR introduces a config-driven LLM fine-tuning experience.

Key Features

  • Launch fine-tuning by defining a single config.yaml file
  • Interactive and helpful print outputs
  • Experiment checkpointing that saves and loads experiment artefacts at various stages, enabling you to pick up where you left off

Key Modules

  • DatasetGenerator: loads data into a Hugging Face Dataset object based on file extension (csv, json, or a Hugging Face repo)
  • ModelLoader: manages loading the base model, attaching LoRA weights, and merging LoRA weights
  • InferenceRunner: runs the merged model on the test set, streams predictions live, and saves them to a csv

Road Map

  • adopt Poetry for dependency management
  • create a Docker container for predictable deployment
  • multi-GPU support
  • stratified sampling for dataset
  • ablation studies
  • documentation

@RohitSaha (Contributor):

@benjaminye Aren't we doing stratified sampling now?

Inline comment on a config snippet:

```yaml
prompt_stub:
  >- # Stub to add for training at the end of the prompt; for the test set or inference this is omitted. Make sure only one variable is present.
    {output}
test_size: 10 # Proportion of test as % of total; if an integer, the number of samples
```
Contributor:

How to differentiate between proportion and # of samples?

Contributor (Author):

This is uniquely determined by the data type passed in:

proportion: float
number of samples: int

This is built into the Hugging Face dataset's train_test_split() method; a sketch of the behaviour is below.
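
For illustration, a minimal sketch of the two cases (the toy data here is mine, not the toolkit's):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(100)]})

# A float test_size is read as a proportion of the dataset...
splits = ds.train_test_split(test_size=0.1)
print(len(splits["test"]))  # 10 rows (10% of 100)

# ...while an int test_size is read as an absolute number of samples.
splits = ds.train_test_split(test_size=10)
print(len(splits["test"]))  # 10 rows
```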

```yaml
bnb_4bit_quant_type: "nf4"

# LoRA Params -------------------
lora:
```
Contributor:

Did we include NEFTune functionality?

Contributor (Author):

Yes; you can enable it under sft_args.
You can see all supported config settings in config_model.py.

I've added explanations, expected types, and defaults for each setting (where applicable), and we can then use the model definitions to automatically generate documentation.
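
For reference, NEFTune is surfaced in transformers/TRL through the neftune_noise_alpha argument; a minimal sketch of what sft_args would pass through under the hood (config_model.py is the source of truth for the toolkit's exact key names):

```python
from transformers import TrainingArguments

# NEFTune adds noise to the embedding layer during fine-tuning; in
# transformers >= 4.35 it is switched on with a single argument.
args = TrainingArguments(output_dir="./out", neftune_noise_alpha=5.0)

# The SFT trainer then receives these args, e.g. (placeholders):
# trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds)
```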

@RohitSaha (Contributor):

@benjaminye Let's ensure the packages we use only carry permissive licenses, i.e., Apache 2.0 and MIT. We should remove any dependencies that have a copyleft license.

@benjaminye (Author):

> @benjaminye Aren't we doing stratified sampling now?

Unfortunately this isn't that simple. I used Hugging Face's recommended from_generator() method to read files directly into a Dataset object instead of loading them into a pandas DataFrame first (which is less memory- and compute-efficient).

Not loading into a DataFrame precludes us from using scikit-learn's train_test_split(); we have to use the Hugging Face datasets method instead. However, that method expects the stratify_by_column to be a ClassLabel (an integer representation). A somewhat hacky workaround is to cast the column to ClassLabel and then back to a string Value, an avenue I'm exploring (sketched below).

Moreover, since stratified sampling only applies to classification tasks, I decided to think about this a bit more and release it in the next update.
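
A sketch of that workaround (the stratified_split helper is illustrative, not part of this PR): class_encode_column() casts the string column to ClassLabel so stratify_by_column works, then a map() restores the original string labels.

```python
from datasets import Dataset, Features, Value

def stratified_split(ds: Dataset, label_col: str, test_size):
    # stratify_by_column requires a ClassLabel column, so encode the
    # string labels as integers first
    encoded = ds.class_encode_column(label_col)
    splits = encoded.train_test_split(
        test_size=test_size, stratify_by_column=label_col
    )
    # map the integer labels back to their original string values,
    # declaring the column as a plain string feature again
    label_feature = encoded.features[label_col]
    features = Features({**splits["train"].features, label_col: Value("string")})
    return splits.map(
        lambda ex: {label_col: label_feature.int2str(ex[label_col])},
        features=features,
    )
```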

@benjaminye (Author):

> @benjaminye Let's ensure the packages we use only carry permissive licenses, i.e., Apache 2.0 and MIT. We should remove any dependencies that have a copyleft license.

I've checked all the new dependencies and they all have permissive licenses.

@RohitSaha (Contributor):

Thanks for making the changes!

@benjaminye merged commit 8f53a54 into main on Jan 4, 2024
2 checks passed
@benjaminye added the enhancement (New feature or request) label on Feb 27, 2024