Remove CONTRIBUTING.md, update PR Message Template, and add bash to initialize dataset (SEACrowd#468)

* add bash to initialize dataset

* delete CONTRIBUTING.md since it's duplicated with DATALOADER.md

* update the docs slightly on suggesting new dataloader contributors to use template

* fix few wordings

* Add info on required vars '_LOCAL'

* Add checklist on __init__.py

* fix wording on 2nd checklist regarding 'my_dataset' that should've been a var instead of static val

* fix wordings on first section of PR msg

* add newline separator for better readability

* add info on some to-dos
sabilmakbar authored Mar 4, 2024
1 parent 2131edb commit 6f0a2e0
Showing 5 changed files with 42 additions and 222 deletions.
14 changes: 10 additions & 4 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -1,11 +1,17 @@
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Please name your PR title and the first line of your PR message after the issue it will close. You can use the following examples:

**Title**: Closes #{ISSUE_NUMBER} | Add/Update Dataloader {DATALOADER_NAME}

**First line PR Message**: Closes #{ISSUE_NUMBER}

where you replace the {ISSUE_NUMBER} with the one corresponding to your dataset.

### Checkbox
- [ ] Confirm that this PR is linked to the dataset issue.
- [ ] Create the dataloader script `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- [ ] Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
- [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- [ ] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [ ] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- [ ] Confirm dataloader script works with `datasets.load_dataset` function.
- [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.
- [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
- [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
208 changes: 0 additions & 208 deletions CONTRIBUTING.md

This file was deleted.

18 changes: 10 additions & 8 deletions DATALOADER.md
@@ -100,20 +100,21 @@ Make sure your `pip` package points to your environment's source.

### 3. Implement your dataloader

Make a new directory within the `SEACrowd/seacrowd-datahub/sea_datasets` directory:
Use this bash script to initialize your new dataloader folder, along with a template of your dataloader script, under the `SEACrowd/seacrowd-datahub/sea_datasets` directory:

mkdir seacrowd-datahub/sea_datasets/<dataset_name>
sh templates/initiate_seacrowd_dataloader.sh <YOUR_DATALOADER_NAME>
The value of `<YOUR_DATALOADER_NAME>` can be found on the issue ticket you were assigned to.

Please use lowercase letters and underscores when choosing a `<dataset_name>`.
e.g., for this [issue ticket](https://github.com/SEACrowd/seacrowd-datahub/issues/32), the issue indicates `Dataloader name: xl_sum/xl_sum.py`, hence the value of `<YOUR_DATALOADER_NAME>` is `xl_sum`.

Please use PascalCase when choosing a `<dataset_name>`.
To implement your dataset, there are three key methods that are important:

* `_info`: Specifies the schema of the expected dataloader
* `_split_generators`: Downloads and extracts data for each split (e.g. train/val/test) or associates local data with each split.
* `_generate_examples`: Creates examples from data that conform to each schema defined in `_info`.

To start, copy [templates/template.py](templates/template.py) to your `seacrowd/sea_datasets/<dataset_name>` directory with the name `<dataset_name>.py`. Within this file, fill out all the TODOs.

cp templates/template.py seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py
After the script above has been executed, your `seacrowd/sea_datasets/<dataset_name>` directory will exist, containing a dataloader script named `<dataset_name>.py`. Within this file, fill out all the TODOs based on the template.

For the `_info` function, you will need to define `features` for your
`DatasetInfo` object. For the `seacrowd` config, choose the right schema from our list of examples. You can find a description of these in the [Task Schemas Document](task_schemas.md). You can find the actual schemas in the [schemas directory](seacrowd/utils/schemas).
@@ -133,7 +134,7 @@ To help you implement a dataset, you can see the implementation of other datasets

#### Running & Debugging:
You can run your data loader script during development by appending the following
statement to your code ([templates/template.py](templates/template.py) already includes this):
statement to your code (if you initialized your dataloader folder using the bash script above, this is already included; otherwise, add it yourself):

```python
if __name__ == "__main__":
Expand All @@ -157,7 +158,7 @@ from datasets import load_dataset
data = load_dataset("seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py", name="<dataset_name>_seacrowd_<schema>")
```

Run these commands from the top level of the `nusa-crowd` repo (i.e. the same directory that contains the `requirements.txt` file).
Run these commands from the top level of the `seacrowd/seacrowd-datahub` repo (i.e. the same directory that contains the `requirements.txt` file).

Once this is done, please also check that your dataloader passes our unit tests by running this command in the terminal:

@@ -195,6 +196,7 @@ Then, run the following commands to incorporate any new changes in the master branch:
Or you can install the pre-commit hooks to run the checks automatically before each commit:

pre-commit install

**Run these commands in your custom branch**.

Push these changes to **your fork** with the following command:
20 changes: 20 additions & 0 deletions templates/initiate_seacrowd_dataloader.sh
@@ -0,0 +1,20 @@
#!/bin/bash

# this simple bash script creates the dataloader folder with the necessary files and copies the dataloader template script into the destination folder

if [[ "$1" == "" ]]; then
    echo "Error: Missing the dataset name to be created"
    echo "sh \${YOUR_SEACROWD_ROOT_PATH}/templates/initiate_seacrowd_dataloader.sh <dataset name>"
    exit 1
fi

if [[ "$2" == "" ]]; then
root_path=./
else
root_path=$2
fi

(cd "$root_path/seacrowd/sea_datasets" && mkdir "$1" && cd "$1" && touch __init__.py)
cp "$root_path/templates/template.py" "$root_path/seacrowd/sea_datasets/$1/$1.py"

echo "Initialization is done. Exiting..."
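To see what the script produces, here is a self-contained demonstration that mimics its effect in a throwaway directory (the dataset name `my_new_corpus` is hypothetical):

```shell
# Build a minimal fake repo root, then run the same steps the initializer performs.
root_path=$(mktemp -d)
mkdir -p "$root_path/seacrowd/sea_datasets" "$root_path/templates"
echo '# TODO: implement dataloader' > "$root_path/templates/template.py"

dataset=my_new_corpus
(cd "$root_path/seacrowd/sea_datasets" && mkdir "$dataset" && cd "$dataset" && touch __init__.py)
cp "$root_path/templates/template.py" "$root_path/seacrowd/sea_datasets/$dataset/$dataset.py"

ls "$root_path/seacrowd/sea_datasets/$dataset"
```

In the real repository you would instead run `sh templates/initiate_seacrowd_dataloader.sh my_new_corpus` from the repo root.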
4 changes: 2 additions & 2 deletions templates/template.py
@@ -68,7 +68,7 @@
# TODO: Add languages related to this dataset
_LANGUAGES = [] # We follow ISO639-3 language code (https://iso639-3.sil.org/code_tables/639/data)

# TODO: Add the licence for the dataset here
# TODO: Add the licence for the dataset here (see constant choices in https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/utils/constants.py)
# Note that this doesn't have to be a common open source license.
# In case the dataset is intentionally built without a license, please use `Licenses.UNLICENSE.value`
# In case it's not clear whether the dataset has a license or not, please use `Licenses.UNKNOWN.value`
@@ -90,7 +90,7 @@
_DATASETNAME: "url or list of urls or ... ",
}

# TODO: add supported task by dataset. One dataset may support multiple tasks
# TODO: add supported task by dataset. One dataset may support multiple tasks (see constant choices in https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/utils/constants.py)
_SUPPORTED_TASKS = [] # example: [Tasks.TRANSLATION, Tasks.NAMED_ENTITY_RECOGNITION, Tasks.RELATION_EXTRACTION]

# TODO: set this to a version that is associated with the dataset. if none exists use "1.0.0"
