Some code is missing #1

GaryStack · 2024-08-18T12:27:01Z

The code for processing sharegpt in the script prepare_train_data.sh uses sharegpt conversation splitter (split_sharegpt_conversations.py) , but there is no such code in the corresponding directory. Where can I find it? As shown below.

echo "Downloading ShareGPT dataset..."
wget -nc -P data/raw_train/sharegpt/ https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part1_html_cleaned.json
wget -nc -P data/raw_train/sharegpt/ https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/HTML_cleaned_raw_dataset/sg_90k_part2_html_cleaned.json
echo "Splitting the ShareGPT dataset with 2048 max tokens per conversation..."
python split_sharegpt_conversations.py \
    --in-files data/raw_train/sharegpt/sg_90k_part1_html_cleaned.json data/raw_train/sharegpt/sg_90k_part2_html_cleaned.json \
    --out-file data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_2048.json \
    --model-name-or-path /data3/MODELS/llama-7b-hf \
    --max-length 2048
echo "Splitting the ShareGPT dataset with 4096 max tokens per conversation..."
python split_sharegpt_conversations.py \
    --in-files data/raw_train/sharegpt/sg_90k_part1_html_cleaned.json data/raw_train/sharegpt/sg_90k_part2_html_cleaned.json \
    --out-file data/raw_train/sharegpt/sharegpt_html_cleaned_and_split_4096.json \
    --model-name-or-path /data3/MODELS/llama-7b-hf \
    --max-length 4096

The text was updated successfully, but these errors were encountered:

chenjianhuii · 2024-09-21T03:08:35Z

Sorry for the missing file. I have added the file in the newest commit

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some code is missing #1

Some code is missing #1

GaryStack commented Aug 18, 2024

chenjianhuii commented Sep 21, 2024

Some code is missing #1

Some code is missing #1

Comments

GaryStack commented Aug 18, 2024

chenjianhuii commented Sep 21, 2024