Closes #583 | Add Dataloader multilingual-NLI-26lang-2mil7 #598

akhdanfadh · 2024-04-01T18:28:30Z

Closes #583

There are 10 subsets. Configs will look like this: multilingual_nli_26lang_id_anli_source, multilingual_nli_26lang_vi_ling_seacrowd_pairs, etc. When testing, pass multilingual_nli_26lang_<subset> to the --subset_id parameter.
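As a rough illustration of the naming scheme described above, the sketch below enumerates the config names it implies. It assumes the language codes (`id`, `vi`) and the five subset names used by the test script in this description, and it assumes that every subset has exactly the two schema suffixes seen in the examples (`source` and `seacrowd_pairs`). The `SCHEMAS` array and the `list_configs` function are hypothetical names introduced for illustration. A later commit in this PR renames the subsets to match ISO codes, so the names in the merged dataloader may differ.

```shell
#!/bin/bash
# Assumption: language codes and subset names mirror the PR's test script;
# the two schema suffixes are inferred from the example config names.
DATASET="multilingual_nli_26lang"
LANGS=("id" "vi")
SUBSETS=("anli" "fever" "ling" "mnli" "wanli")
SCHEMAS=("source" "seacrowd_pairs")  # hypothetical helper array

# list_configs is a hypothetical helper: it prints one config name per line.
list_configs() {
    local lang subset schema
    for lang in "${LANGS[@]}"; do
        for subset in "${SUBSETS[@]}"; do
            for schema in "${SCHEMAS[@]}"; do
                echo "${DATASET}_${lang}_${subset}_${schema}"
            done
        done
    done
}

list_configs
```

Dropping the schema suffix from any printed name gives the value expected by `--subset_id`, for example `multilingual_nli_26lang_id_anli`.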

Here is a useful script to test all subsets:

To run this script, save it to a file (e.g., mnli_tests.sh), make it executable with chmod +x mnli_tests.sh, and execute it with ./mnli_tests.sh. Ensure you run the script from the seacrowd root directory.

#!/bin/bash

DATASET="multilingual_nli_26lang"
LANGS=("id" "vi")
SUBSETS=("anli" "fever" "ling" "mnli" "wanli")

mkdir -p data/${DATASET}

success_count=0
fail_count=0
declare -a failed_tests

for lang in "${LANGS[@]}"; do
    for subset in "${SUBSETS[@]}"; do
        subset_id="${lang}_${subset}"
        python_command="python -m tests.test_seacrowd seacrowd/sea_datasets/${DATASET}/${DATASET}.py --subset_id=${DATASET}_${subset_id}"
        output_file="data/${DATASET}/${subset_id}.txt"
        temp_output_file="data/${DATASET}/${subset_id}_temp.txt"  # for cleaner cli output

        echo "Testing subset id: $subset_id"
        # run the test under `script` so the full terminal output is captured to a
        # temporary file, while keeping the console quiet
        script -q -c "$python_command" "$temp_output_file" > /dev/null
        mv "$temp_output_file" "$output_file"

        # check if the test was successful
        if grep -q "OK" "$output_file"; then
            echo "Test for $subset_id: SUCCESS"
            ((success_count++))
        else
            echo "Test for $subset_id: FAILURE"
            failed_tests+=("$subset_id")
            ((fail_count++))
        fi
    done
done

echo "-----------------------"
echo "SUMMARY: $((success_count + fail_count)) tests total"
echo "Success: $success_count"
echo "Failure: $fail_count"
if [ ${#failed_tests[@]} -gt 0 ]; then
    echo "Failed tests:"
    for test in "${failed_tests[@]}"; do
        echo "- $test"
    done
fi
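The script above decides success by searching the captured output for "OK". A possible alternative, sketched below, is to rely on the command's exit status instead. This assumes the test module exits with a non-zero status on failure, which is the default behaviour of `unittest.main()` but is not confirmed here for `tests.test_seacrowd`. The helper `run_and_record` is a hypothetical name, and `true` / `false` stand in for the real python invocation.

```shell
#!/bin/bash
# run_and_record is a hypothetical helper, not part of the PR's script.
# It runs a command, saves combined stdout/stderr to a log file,
# and returns the command's own exit status.
run_and_record() {
    local log_file="$1"
    shift
    "$@" > "$log_file" 2>&1
}

log_dir="$(mktemp -d)"
trap 'rm -rf "$log_dir"' EXIT  # remove the temporary logs when the shell exits

# Stand-in commands (true/false) replace the real python invocation here.
if run_and_record "$log_dir/pass.txt" true; then
    echo "pass case: SUCCESS"
else
    echo "pass case: FAILURE"
fi

if run_and_record "$log_dir/fail.txt" false; then
    echo "fail case: SUCCESS"
else
    echo "fail case: FAILURE"
fi
```

Unlike the `script`-based capture, plain redirection does not give the command a terminal, so any output that depends on a TTY (such as colours or progress bars) may look different in the log.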

Checkbox

Confirm that this PR is linked to the dataset issue.
Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
Implement _info(), _split_generators() and _generate_examples() in dataloader script.
Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
Confirm dataloader script works with datasets.load_dataset function.
Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

holylovenia

Hi @akhdanfadh, everything works well on my side! Thanks for your dataloader. I have one minor suggestion:

seacrowd/sea_datasets/multilingual_nli_26lang/multilingual_nli_26lang.py

akhdanfadh · 2024-04-25T23:51:08Z

@holylovenia Done!

holylovenia

Good work, @akhdanfadh! Thanks for the change. Let's wait for @yongzx's review.

yongzx · 2024-05-05T17:36:44Z

Everything runs on my end as well. I will merge this.

init commit

c8304e6

akhdanfadh requested review from holylovenia, SamuelCahyawijaya, sabilmakbar, jamesjaya, yongzx, gentaiscool, ljvmiranda921, jensan-1, danjohnvelasco, MJonibek and tellarin as code owners April 1, 2024 18:28

holylovenia removed request for tellarin, gentaiscool, jamesjaya, SamuelCahyawijaya, ljvmiranda921, MJonibek, danjohnvelasco, jensan-1 and sabilmakbar April 17, 2024 07:37

holylovenia assigned holylovenia and yongzx Apr 17, 2024

holylovenia requested changes Apr 21, 2024

seacrowd/sea_datasets/multilingual_nli_26lang/multilingual_nli_26lang.py (resolved)

akhdanfadh added 2 commits April 26, 2024 08:49

change subset name to match ISO

40ae1c4

run make check file

ac287c3

holylovenia approved these changes Apr 27, 2024

yongzx merged commit 9c671fa into SEACrowd:master May 5, 2024
1 check passed

akhdanfadh deleted the multilingual_nli_26lang branch May 6, 2024 23:01
