Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some code is missing #1

Open
GaryStack opened this issue Aug 18, 2024 · 1 comment
Open

Some code is missing #1

GaryStack opened this issue Aug 18, 2024 · 1 comment

Comments

@GaryStack
Copy link

The code for processing sharegpt in the script prepare_train_data.sh uses sharegpt conversation splitter (split_sharegpt_conversations.py) , but there is no such code in the corresponding directory. Where can I find it? As shown below.

echo "Downloading ShareGPT dataset..."
wget -nc -P data/raw_train/sharegpt/ https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json
wget -nc -P data/raw_train/sharegpt/ https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part2_html_cleaned.json
echo "Splitting the ShareGPT dataset with 2048 max tokens per conversation..."
python split_sharegpt_conversations.py \
    --in-files data/raw_train/sharegpt/sg_90k_part1_html_cleaned.json data/raw_train/sharegpt/sg_90k_part2_html_cleaned.json \
    --out-file data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_2048.json \
    --model-name-or-path /data3/MODELS/llama-7b-hf \
    --max-length 2048
echo "Splitting the ShareGPT dataset with 4096 max tokens per conversation..."
python split_sharegpt_conversations.py \
    --in-files data/raw_train/sharegpt/sg_90k_part1_html_cleaned.json data/raw_train/sharegpt/sg_90k_part2_html_cleaned.json \
    --out-file data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_4096.json \
    --model-name-or-path /data3/MODELS/llama-7b-hf \
    --max-length 4096
@chenjianhuii
Copy link
Collaborator

Sorry for the missing file. I have added the file in the newest commit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants