
LLM Finetuning Toolkit #67

Merged
benjaminye merged 24 commits into main on Jan 4, 2024

Conversation

@benjaminye (Contributor) commented Jan 3, 2024

This PR introduces a config-driven LLM fine-tuning experience.

Key Features

  • Launch fine-tuning by defining a single config.yaml file
  • Interactive and helpful print outputs
  • Experiment checkpointing that saves and loads experiment artefacts at various stages, enabling you to pick up where you left off

Key Modules

  • DatasetGenerator: loads data into a Hugging Face Dataset object based on file extension (csv, json, or a Hugging Face repo)
  • ModelLoader: manages loading the base model, attaching LoRA weights, and merging LoRA weights
  • InferenceRunner: runs the merged model on the test set, streams predictions live, and saves them to a csv

Road Map

  • adopt Poetry for dependency management
  • create a Docker container for predictable deployment
  • multi-GPU support
  • stratified sampling for dataset
  • ablation studies
  • documentation

@RohitSaha (Contributor):

@benjaminye Aren't we doing stratified sampling now?

Inline comment on a config snippet:

```yaml
prompt_stub:
  >- # Stub to add for training at the end of the prompt; for the test set or inference this is omitted. Make sure only one variable is present.
    {output}
test_size: 10 # Proportion of test as % of total; if an integer, the number of samples
```
Contributor:

How to differentiate between proportion and # of samples?

Contributor (Author):

This is uniquely determined by the data type passed in:

proportion: float
number of samples: int

This is built into the Hugging Face dataset's train_test_split() method; a sketch of the behaviour is below.
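
For illustration, a minimal sketch of the two cases (the toy data here is mine, not the toolkit's):

```python
from datasets import Dataset

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(100)]})

# A float test_size is read as a proportion of the dataset...
splits = ds.train_test_split(test_size=0.1)
print(len(splits["test"]))  # 10 rows (10% of 100)

# ...while an int test_size is read as an absolute number of samples.
splits = ds.train_test_split(test_size=10)
print(len(splits["test"]))  # 10 rows
```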

```yaml
bnb_4bit_quant_type: "nf4"

# LoRA Params -------------------
lora:
```
Contributor:

Did we include NEFTune functionality?

Contributor (Author):

Yes; you can enable it under sft_args.
You can see all supported config settings in config_model.py.

I've added explanations, expected types, and defaults for each setting (where applicable), and we can then use the model definitions to automatically generate documentation.
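
For reference, NEFTune is surfaced in transformers/TRL through the neftune_noise_alpha argument; a minimal sketch of what sft_args would pass through under the hood (config_model.py is the source of truth for the toolkit's exact key names):

```python
from transformers import TrainingArguments

# NEFTune adds noise to the embedding layer during fine-tuning; in
# transformers >= 4.35 it is switched on with a single argument.
args = TrainingArguments(output_dir="./out", neftune_noise_alpha=5.0)

# The SFT trainer then receives these args, e.g. (placeholders):
# trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds)
```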

@RohitSaha (Contributor):

@benjaminye Let's ensure the packages we use only carry permissive licenses, i.e., Apache 2.0 and MIT. We should remove any dependencies that have a copyleft license.

@benjaminye (Author):

> @benjaminye Aren't we doing stratified sampling now?

Unfortunately this isn't that simple. I used Hugging Face's recommended from_generator() method to read files directly into a Dataset object instead of loading them into a pandas DataFrame first (which is less memory- and compute-efficient).

Not loading into a DataFrame precludes us from using scikit-learn's train_test_split(); we have to use the Hugging Face datasets method instead. However, that method expects the stratify_by_column to be a ClassLabel (an integer representation). A somewhat hacky workaround is to cast the column to ClassLabel and then back to a string Value, an avenue I'm exploring (sketched below).

Moreover, since stratified sampling only applies to classification tasks, I decided to think about this a bit more and release it in the next update.
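
A sketch of that workaround (the stratified_split helper is illustrative, not part of this PR): class_encode_column() casts the string column to ClassLabel so stratify_by_column works, then a map() restores the original string labels.

```python
from datasets import Dataset, Features, Value

def stratified_split(ds: Dataset, label_col: str, test_size):
    # stratify_by_column requires a ClassLabel column, so encode the
    # string labels as integers first
    encoded = ds.class_encode_column(label_col)
    splits = encoded.train_test_split(
        test_size=test_size, stratify_by_column=label_col
    )
    # map the integer labels back to their original string values,
    # declaring the column as a plain string feature again
    label_feature = encoded.features[label_col]
    features = Features({**splits["train"].features, label_col: Value("string")})
    return splits.map(
        lambda ex: {label_col: label_feature.int2str(ex[label_col])},
        features=features,
    )
```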

@benjaminye (Author):

> @benjaminye Let's ensure the packages we use only carry permissive licenses, i.e., Apache 2.0 and MIT. We should remove any dependencies that have a copyleft license.

I've checked all the new dependencies and they all have permissive licenses.

@RohitSaha (Contributor):

Thanks for making the changes!

@benjaminye merged commit 8f53a54 into main on Jan 4, 2024
2 checks passed
@benjaminye added the enhancement (New feature or request) label on Feb 27, 2024