LLM Finetuning Toolkit #67

Merged 24 commits on Jan 4, 2024
48 changes: 48 additions & 0 deletions .gitignore
@@ -1 +1,49 @@
.DS_Store

# experiment files
*/experiments
*/experiment
*/archive
*/backup
*/baseline_results

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# Jupyter Notebook
.ipynb_checkpoints

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

6 changes: 6 additions & 0 deletions requirements.txt
@@ -24,3 +24,9 @@ protobuf==3.20.*
ai21==1.2.7
openai==0.28.1
ujson==5.8.0


# for toolkit
pyyaml
ijson
rich
68 changes: 68 additions & 0 deletions toolkit/config.yml
@@ -0,0 +1,68 @@
save_dir: "./experiment/"

# Data Ingestion -------------------
data:
  file_type: "huggingface" # one of 'json', 'csv', 'huggingface'
  path: "yahma/alpaca-cleaned"
  prompt:
    >- # prompt, make sure column inputs are enclosed in {} brackets and that they match your data
    Below is an instruction that describes a task.
    Write a response that appropriately completes the request.
    ### Instruction: {instruction}
    ### Input: {input}
    ### Output:
  prompt_stub:
    >- # Stub to add for training at the end of prompt, for test set or inference, this is omitted; make sure only one variable is present
    {output}
  test_size: 10 # Proportion of test as % of total; if integer then # of samples
Contributor

How to differentiate between proportion and # of samples?

Contributor Author

This is determined by the data type of the value passed in:

proportion: float
number of samples: int

This behavior is built into the train_test_split() method of Hugging Face datasets.
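
The float-versus-int rule described in this reply can be sketched in plain Python. The helper name `resolve_split_size` is hypothetical, and the bounds checks and rounding direction are assumptions made for illustration; the sketch only mirrors the stated rule (a float is a fraction of the dataset, an int is an absolute count) and is not the toolkit's or the library's code.

```python
import math


def resolve_split_size(size, n_total):
    """Hypothetical helper: turn a test_size/train_size setting into a sample count.

    float -> fraction of the dataset (strictly between 0 and 1)
    int   -> absolute number of samples
    """
    if isinstance(size, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("size must be an int or a float, not bool")
    if isinstance(size, float):
        if not 0.0 < size < 1.0:
            raise ValueError("a float size must be strictly between 0 and 1")
        # Rounding up is an assumption of this sketch.
        return math.ceil(size * n_total)
    if isinstance(size, int):
        if not 0 < size < n_total:
            raise ValueError("an int size must be positive and below the dataset size")
        return size
    raise TypeError("size must be an int or a float")


n_test_as_fraction = resolve_split_size(0.25, 1000)  # treated as a proportion
n_test_as_count = resolve_split_size(10, 1000)       # treated as a sample count
```

Under this rule, the values `test_size: 10` and `train_size: 100` in this config are YAML integers, so they would be read as 10 and 100 samples rather than as percentages.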

  train_size: 100 # Proportion of train as % of total; if integer then # of samples

# Model Definition -------------------
model:
  hf_model_ckpt: "NousResearch/Llama-2-7b-hf"
  quantize: true
  bitsandbytes:
    load_in_4bit: true
    bnb_4bit_compute_dtype: "bf16"
    bnb_4bit_quant_type: "nf4"

# LoRA Params -------------------
lora:
Contributor

Did we include NEFTune functionality?

Contributor Author

Yes; you can enable it under sft_args.
You can see all supported config settings under config_model.py.

I have added explanations, expected type, and defaults for each setting (where applicable), and we can then use the model definitions to automatically generate documentation.
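
The idea of generating documentation from typed model definitions could look roughly like the sketch below. It assumes plain standard-library dataclasses, which may differ from what config_model.py actually uses. The class `SftSettings`, the helper `render_docs`, and the description strings are hypothetical; only the setting names and default values are taken from the config in this PR.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class SftSettings:
    """Hypothetical model of the sft_args section (not the toolkit's class)."""

    max_seq_length: int = field(
        default=1024,
        metadata={"doc": "Maximum sequence length in tokens."},
    )
    neftune_noise_alpha: Optional[float] = field(
        default=None,
        metadata={"doc": "NEFTune noise scale; None leaves NEFTune disabled."},
    )


def render_docs(model_cls):
    """Build one documentation line per field: name, type, default, description."""
    lines = []
    for f in fields(model_cls):
        # Plain classes expose __name__; anything without it falls back to repr().
        type_name = getattr(f.type, "__name__", repr(f.type))
        lines.append(
            f"{f.name} ({type_name}, default={f.default!r}): {f.metadata['doc']}"
        )
    return "\n".join(lines)


docs = render_docs(SftSettings)
print(docs)
```

Keeping the explanation next to the type and default means the generated text cannot drift from the definition it documents, which is the benefit the reply points at.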

  task_type: "CAUSAL_LM"
  r: 32
  lora_dropout: 0.1
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - up_proj
    - down_proj
    - gate_proj

# Training -------------------
training:
  training_args:
    num_train_epochs: 1
    per_device_train_batch_size: 4
    gradient_accumulation_steps: 4
    gradient_checkpointing: True
    optim: "paged_adamw_32bit"
    logging_steps: 100
    learning_rate: 2.0e-4
    bf16: true # Set to true for mixed precision training on Newer GPUs
    tf32: true
    # fp16: false # Set to true for mixed precision training on Older GPUs
    max_grad_norm: 0.3
    warmup_ratio: 0.03
    lr_scheduler_type: "constant"
  sft_args:
    max_seq_length: 1024
    # neftune_noise_alpha: None

inference:
  max_new_tokens: 200
  use_cache: True
  do_sample: True
  top_p: 0.6
  temperature: 0.2
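
To illustrate how the `prompt` and `prompt_stub` templates in this config might be filled from a data row, here is a minimal sketch using `str.format`. The function `build_example` is hypothetical, the folded YAML scalar is approximated as one space-joined line, and the single space placed before the stub is an assumption. Whether the toolkit formats prompts exactly this way is not confirmed by the diff.

```python
# Templates copied from config.yml, with the folded lines joined by spaces.
PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "### Instruction: {instruction} "
    "### Input: {input} "
    "### Output:"
)
PROMPT_STUB = "{output}"


def build_example(row, include_stub):
    """Fill the prompt from a row; append the stub only for training examples."""
    text = PROMPT.format(**row)
    if include_stub:
        # The separator before the stub is an assumption of this sketch.
        text += " " + PROMPT_STUB.format(**row)
    return text


row = {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris.",
}
train_text = build_example(row, include_stub=True)   # prompt plus expected answer
infer_text = build_example(row, include_stub=False)  # prompt only
```

The column names inside the braces must match the keys of each row, which is the requirement the config comment states.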
57 changes: 57 additions & 0 deletions toolkit/examples/example_data.csv
@@ -0,0 +1,57 @@
instruction,input,output
Give three tips for staying healthy.,,"1. Eat a balanced diet and make sure to include plenty of fruits and vegetables.
2. Exercise regularly to keep your body active and strong.
3. Get enough sleep and maintain a consistent sleep schedule."
What are the three primary colors?,,"The three primary colors are red, blue, and yellow."
Describe the structure of an atom.,,"An atom is made up of a nucleus, which contains protons and neutrons, surrounded by electrons that travel in orbits around the nucleus. The protons and neutrons have a positive charge, while the electrons have a negative charge, resulting in an overall neutral atom. The number of each particle determines the atomic number and the type of atom."
How can we reduce air pollution?,,"There are a number of ways to reduce air pollution, such as shifting to renewable energy sources, encouraging the use of public transportation, prohibiting the burning of fossil fuels, implementing policies to reduce emissions from industrial sources, and implementing vehicle emissions standards. Additionally, individuals can do their part to reduce air pollution by reducing car use, avoiding burning materials such as wood, and changing to energy efficient appliances."
Pretend you are a project manager of a construction company. Describe a time when you had to make a difficult decision.,,"I had to make a difficult decision when I was working as a project manager at a construction company. I was in charge of a project that needed to be completed by a certain date in order to meet the client’s expectations. However, due to unexpected delays, we were not able to meet the deadline and so I had to make a difficult decision. I decided to extend the deadline, but I had to stretch the team’s resources even further and increase the budget. Although it was a risky decision, I ultimately decided to go ahead with it to ensure that the project was completed on time and that the client’s expectations were met. The project was eventually successfully completed and this was seen as a testament to my leadership and decision-making abilities."
Identify the odd one out.,"Twitter, Instagram, Telegram",Telegram
Explain why the following fraction is equivalent to 1/4,4/16,The fraction 4/16 is equivalent to 1/4 because both numerators and denominators are divisible by 4. Dividing both the top and bottom numbers by 4 yields the fraction 1/4.
Write a short story in third person narration about a protagonist who has to make an important career decision.,,"John was at a crossroads in his life. He had just graduated college and was now facing the big decision of what career to pursue. After much deliberation, he decided that he wanted to be an accountant and help the financially disadvantaged. He had always been good with numbers and enjoyed seeing the tangible results of his work.

John enrolled in accounting courses and initially found it quite challenging. He had to learn multiple systems and regulations quickly, but he worked hard and eventually excelled in his studies. After a few years, John started working at an accounting firm in his city. He was eager to put his knowledge of taxes and accounting to use in a real-world setting.

John loved his job, as it let him express his creativity in finding strategies to save his clients money. After a few years at the firm, he became a senior accountant and was asked to manage bigger and more challenging cases. He was now a respected figure in the financial industry, but he still remembers when he was just a recent college graduate, unsure of the direction in which his life would take him."
Evaluate this sentence for spelling and grammar mistakes,He finnished his meal and left the resturant,He finished his meal and left the restaurant.
How did Julius Caesar die?,,"Julius Caesar was assassinated by a group of up to 60 conspirators, led by Gaius Cassius Longinus and Marcus Junius Brutus, in the Senate House on the Ides of March (15 March) of 44 BC."
What is the capital of France?,,The capital of France is Paris.
Generate a list of ten items a person might need for a camping trip,,"1. Tent
2. Sleeping bags
3. Flashlight
4. Matches/lighter
5. Insect repellent
6. Extra clothes
7. First aid kit
8. Camp stove
9. Cooking utensils
10. Biodegradable soaps/shampoos"
Discuss the causes of the Great Depression,,"The Great Depression was a severe global economic downturn that lasted from 1929 to 1939. It was caused by several factors, including the stock market crash of 1929, structural weaknesses in the economy, low consumer spending, high unemployment, and a lack of confidence in the banking system. Other contributing factors included overproduction of commodities, reliance on borrowing to finance investments, agricultural overproduction, and a decrease in international trade."
"Classify the following into animals, plants, and minerals","Oak tree, copper ore, elephant","Oak tree: Plant
Copper ore: Mineral
Elephant: Animal"
Explain the use of word embeddings in Natural Language Processing,,"Word embeddings are one of the most powerful tools available for Natural Language Processing (NLP). They are mathematical representations of words or phrases in a vector space, allowing similarities between words and the context in which they are used to be measured. Word embeddings are useful for tasks such as sentiment analysis, text classification, predicting the next word in a sequence, and understanding synonyms and analogies. They allow for words to be processed as numerical values, giving machines an easier way to perform NLP tasks."
Describe the function of a computer motherboard,,"A computer motherboard is the main circuit board in a computer. It is the backbone of the system, connecting all other components such as the processor, RAM, storage, graphics card and peripheral ports. It provides communication pathways between all of these components, allowing them to interact and work together. The motherboard is responsible for regulating power, and also houses slots for expansion cards, as well as ports for USB, audio, Ethernet, and other connection types."
Refactor this code and add comments.,"def factorialize(num):
  factorial = 1
  for i in range(1, num):
    factorial *= i

  return factorial","```
# This is a function that calculates the factorial of a given number
def factorialize(num):
  # Set the variable 'factorial' to 1
  factorial = 1
  # Iterate from 1 to 'num' using the range function
  for i in range(1, num):
    # Multiply the 'factorial' variable by the current number in the iteration
    factorial *= i
  # Return the final value of 'factorial'
  return factorial
```"
Propose an ethical solution to the problem of data privacy,,"One ethical solution to the problem of data privacy is to implement an opt-in data collection policy. This means that companies cannot collect and use personal data without the explicit permission of the data owners. This approach gives users the power to determine which data companies can use and for what purpose. Companies should also be transparent about how they use the collected data, so that users can make informed decisions and have control over their own data."
"Generate three verbs that mean the same as ""to apologize""",,"Acknowledge, express regret, make amends."
What does DNA stand for?,,DNA stands for deoxyribonucleic acid.
Compare and contrast the Cuban Missile Crisis and the Vietnam War.,,"The Cuban Missile Crisis and the Vietnam War were two of the largest and deadliest conflicts of the 20th century. The Cuban Missile Crisis involved a standoff between the United States and the Soviet Union over Soviet nuclear missile bases in Cuba, while the Vietnam War was a prolonged military conflict in Indochina between the North, which was backed by the Soviet Union and the South, which was backed by the United States. The Cuban Missile Crisis unfolded over a period of 13 days in October 1962, with the resolution coming in the form of a diplomatic agreement. The Vietnam War lasted for 19 years, beginning in 1954 and ending in 1973, with the signing of the Paris Peace Accords. While both of these conflicts were Cold War related, they were also very different in terms of causes, objectives, strategic objectives and outcomes. The Cuban Missile Crisis did not result in any direct military conflict and was resolved primarily through diplomacy and the decision to withdraw nuclear weapons from Cuba. On the other hand, the Vietnam War was a prolonged series of battles resulting in significant losses for both sides. The Vietnam War also had a significant impact on the United States, as it highlighted the realities of a prolonged conflict and the limits of U.S. military intervention. Additionally, the Vietnam War also drastically changed the way public opinion viewed the military and the use of force."
Generate a list of random words.,,"sneeze, conflict, ancestor, thunder, companion, amulet, forge, serpent, magnify, jester, cascade, universe, surge, nausea, oasis, anomaly, pursue, rapture, pagoda, mosaic."
Transcribe the recording into text.,Recording about the history of Waterloo,"The Battle of Waterloo, fought on June 18, 1815, was the last major battle of the Napoleonic Wars. It marked the end of Napoleon Bonaparte's rule. The battle was fought between Napoleon Bonaparte's French forces and an allied force of British, Dutch, Belgian and Prussian troops, led by the Duke of Wellington. It was fought near the town of Waterloo in the former Austrian Netherlands, now present-day Belgium. The battle was a defeat for Napoleon's forces and ended Napoleon's rule over France. Napoleon himself was exiled to St. Helena, where he would die in 1821. The battle marked the end of French domination of Europe and the beginning of a period of peace in the region. It was also the last major battle of the Napoleonic Wars and is remembered as one of the most important battles in history. The victory of the allied forces at Waterloo helped to bring an end to French imperial ambitions in Europe. It also established the British as one of the most powerful nations in Europe. The battle was fought with a combination of infantry, cavalry, and artillery tactics, and showed the beginning of the changing nature of warfare. Although the French forces greatly outnumbered the allies, their strategic and tactical mistake led to the loss of the battle. This defeat signaled the end of French imperial power in Europe. The legacy of Waterloo still stands today, and it cemented the British Empire's position for the next one hundred years."
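
Several rows in this file contain quoted fields that span multiple lines or include commas, so splitting the file on line breaks would miscount records. A minimal sketch of reading such data with Python's standard `csv` module follows. The inline sample reuses two rows from the file, and `load_rows` is a hypothetical helper, not the toolkit's loader.

```python
import csv
import io

# Two rows from example_data.csv; the second has commas and line breaks
# inside quoted fields.
SAMPLE = '''instruction,input,output
What is the capital of France?,,The capital of France is Paris.
"Classify the following into animals, plants, and minerals","Oak tree, copper ore, elephant","Oak tree: Plant
Copper ore: Mineral
Elephant: Animal"
'''


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted fields itself.
    return list(csv.DictReader(io.StringIO(text, newline="")))


rows = load_rows(SAMPLE)
for row in rows:
    print(repr(row["instruction"]), "->", len(row["output"].splitlines()), "output line(s)")
```

An empty `input` column comes back as an empty string, so a prompt template that references `{input}` still formats without error.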