Remove CONTRIBUTING.md, update PR Message Template, and add bash to initialize dataset (SEACrowd#468)

* add bash to initialize dataset

* delete CONTRIBUTING.md since it's duplicated with DATALOADER.md

* update the docs slightly on suggesting new dataloader contributors to use template

* fix few wordings

* Add info on required vars '_LOCAL'

* Add checklist on __init__.py

* fix wording on 2nd checklist regarding 'my_dataset' that should've been a var instead of static val

* fix wordings on first section of PR msg

* add newline separator for better readability

* add info on some to-dos
sabilmakbar authored Mar 4, 2024
1 parent 2131edb commit 6f0a2e0
Showing 5 changed files with 42 additions and 222 deletions.
14 changes: 10 additions & 4 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -1,11 +1,17 @@
Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
Please name your PR title and the first line of your PR message after the issue it will close. You can use the following examples:

**Title**: Closes #{ISSUE_NUMBER} | Add/Update Dataloader {DATALOADER_NAME}

**First line PR Message**: Closes #{ISSUE_NUMBER}

where you replace the {ISSUE_NUMBER} with the one corresponding to your dataset.

### Checkbox
- [ ] Confirm that this PR is linked to the dataset issue.
- [ ] Create the dataloader script `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- [ ] Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its `__init__.py` within `{my_dataset}` folder.
- [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- [ ] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [ ] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- [ ] Confirm dataloader script works with `datasets.load_dataset` function.
- [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.
- [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
- [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
208 changes: 0 additions & 208 deletions CONTRIBUTING.md

This file was deleted.

18 changes: 10 additions & 8 deletions DATALOADER.md
@@ -100,20 +100,21 @@ Make sure your `pip` package points to your environment's source.

### 3. Implement your dataloader

Make a new directory within the `SEACrowd/seacrowd-datahub/sea_datasets` directory:
Use this bash script to initialize your new dataloader folder, along with a template of your dataloader script, under the `SEACrowd/seacrowd-datahub/sea_datasets` directory:

mkdir seacrowd-datahub/sea_datasets/<dataset_name>
sh templates/initiate_seacrowd_dataloader.sh <YOUR_DATALOADER_NAME>
The value of `<YOUR_DATALOADER_NAME>` can be found on the issue ticket you were assigned to.

Please use lowercase letters and underscores when choosing a `<dataset_name>`.
e.g., for this [issue ticket](https://github.com/SEACrowd/seacrowd-datahub/issues/32), the issue indicates `Dataloader name: xl_sum/xl_sum.py`, hence the value of `<YOUR_DATALOADER_NAME>` is `xl_sum`.

Please use PascalCase when choosing a `<dataset_name>`.
To implement your dataset, there are three key methods that are important:

* `_info`: Specifies the schema of the expected dataloader
* `_split_generators`: Downloads and extracts data for each split (e.g. train/val/test) or associates local data with each split.
* `_generate_examples`: Creates examples from data that conform to each schema defined in `_info`.

To start, copy [templates/template.py](templates/template.py) to your `seacrowd/sea_datasets/<dataset_name>` directory with the name `<dataset_name>.py`. Within this file, fill out all the TODOs.

cp templates/template.py seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py
After the script above has been executed, your `seacrowd/sea_datasets/<dataset_name>` directory will exist, containing a dataloader script named `<dataset_name>.py`. Within this file, fill out all the TODOs based on the template.

For the `_info` function, you will need to define `features` for your
`DatasetInfo` object. For the `seacrowd` config, choose the right schema from our list of examples. You can find a description of these in the [Task Schemas Document](task_schemas.md). You can find the actual schemas in the [schemas directory](seacrowd/utils/schemas).
@@ -133,7 +134,7 @@ To help you implement a dataset, you can see the implementation of other datasets

#### Running & Debugging:
You can run your data loader script during development by appending the following
statement to your code ([templates/template.py](templates/template.py) already includes this):
statement to your code (if you initialized your dataloader folder using the bash script above, this is already included; otherwise, add it yourself):

```python
if __name__ == "__main__":
Expand All @@ -157,7 +158,7 @@ from datasets import load_dataset
data = load_dataset("seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py", name="<dataset_name>_seacrowd_<schema>")
```

Run these commands from the top level of the `nusa-crowd` repo (i.e. the same directory that contains the `requirements.txt` file).
Run these commands from the top level of the `seacrowd/seacrowd-datahub` repo (i.e. the same directory that contains the `requirements.txt` file).

Once this is done, please also check that your dataloader passes our unit tests by running this command in the terminal:

@@ -195,6 +196,7 @@ Then, run the following commands to incorporate any new changes in the master branch:
Or you can install the pre-commit hooks to run the checks automatically before each commit:

pre-commit install

**Run these commands in your custom branch**.

Push these changes to **your fork** with the following command:
20 changes: 20 additions & 0 deletions templates/initiate_seacrowd_dataloader.sh
@@ -0,0 +1,20 @@
#!/bin/bash

# this simple bash script creates the dataloader folder with the necessary files and copies the dataloader template script into the destination folder

if [[ "$1" == "" ]]; then
    echo "Error: Missing the dataset name to be created"
    echo "sh \${YOUR_SEACROWD_ROOT_PATH}/templates/initiate_seacrowd_dataloader.sh <dataset name>"
    exit 1
fi

if [[ "$2" == "" ]]; then
root_path=./
else
root_path=$2
fi

(cd "$root_path/seacrowd/sea_datasets" && mkdir "$1" && cd "$1" && touch __init__.py)
cp "$root_path/templates/template.py" "$root_path/seacrowd/sea_datasets/$1/$1.py"

echo "Initialization is done. Exiting..."
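To see what the script produces, here is a self-contained demonstration that mimics its effect in a throwaway directory (the dataset name `my_new_corpus` is hypothetical):

```shell
# Build a minimal fake repo root, then run the same steps the initializer performs.
root_path=$(mktemp -d)
mkdir -p "$root_path/seacrowd/sea_datasets" "$root_path/templates"
echo '# TODO: implement dataloader' > "$root_path/templates/template.py"

dataset=my_new_corpus
(cd "$root_path/seacrowd/sea_datasets" && mkdir "$dataset" && cd "$dataset" && touch __init__.py)
cp "$root_path/templates/template.py" "$root_path/seacrowd/sea_datasets/$dataset/$dataset.py"

ls "$root_path/seacrowd/sea_datasets/$dataset"
```

In the real repository you would instead run `sh templates/initiate_seacrowd_dataloader.sh my_new_corpus` from the repo root.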
4 changes: 2 additions & 2 deletions templates/template.py
@@ -68,7 +68,7 @@
# TODO: Add languages related to this dataset
_LANGUAGES = [] # We follow ISO639-3 language code (https://iso639-3.sil.org/code_tables/639/data)

# TODO: Add the licence for the dataset here
# TODO: Add the licence for the dataset here (see constant choices in https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/utils/constants.py)
# Note that this doesn't have to be a common open source license.
# In case the dataset is intentionally built without a license, please use `Licenses.UNLICENSE.value`
# In case it's not clear whether the dataset has a license or not, please use `Licenses.UNKNOWN.value`
@@ -90,7 +90,7 @@
_DATASETNAME: "url or list of urls or ... ",
}

# TODO: add supported task by dataset. One dataset may support multiple tasks
# TODO: add supported task by dataset. One dataset may support multiple tasks (see constant choices in https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/utils/constants.py)
_SUPPORTED_TASKS = [] # example: [Tasks.TRANSLATION, Tasks.NAMED_ENTITY_RECOGNITION, Tasks.RELATION_EXTRACTION]

# TODO: set this to a version that is associated with the dataset. if none exists use "1.0.0"
