From 6f0a2e0f2d686ed77f48ff1f2998b2e34ac0de13 Mon Sep 17 00:00:00 2001
From: Salsabil Maulana Akbar
Date: Mon, 4 Mar 2024 15:27:36 +0700
Subject: [PATCH] Remove `CONTRIBUTING.md`, update PR Message Template, and add
 bash to initialize dataset (#468)

* add bash to initialize dataset
* delete CONTRIBUTING.md since it's duplicated with DATALOADER.md
* update the docs slightly on suggesting new dataloader contributors to use template
* fix few wordings
* Add info on required vars '_LOCAL'
* Add checklist on __init__.py
* fix wording on 2nd checklist regarding 'my_dataset' that should've been a var instead of static val
* fix wordings on first section of PR msg
* add newline separator for better readability
* add info on some to-dos
---
 .github/PULL_REQUEST_TEMPLATE.md          |  14 +-
 CONTRIBUTING.md                           | 208 ----------------------
 DATALOADER.md                             |  18 +-
 templates/initiate_seacrowd_dataloader.sh |  20 +++
 templates/template.py                     |   4 +-
 5 files changed, 42 insertions(+), 222 deletions(-)
 delete mode 100644 CONTRIBUTING.md
 create mode 100644 templates/initiate_seacrowd_dataloader.sh

diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
index d60dc2956..a8f500579 100644
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,11 +1,17 @@
-Please name your PR after the issue it closes. You can use the following line: "Closes #ISSUE-NUMBER" where you replace the ISSUE-NUMBER with the one corresponding to your dataset.
+Please name your PR title and the first line of your PR message after the issue it will close. You can use the following examples:
+
+**Title**: Closes #{ISSUE_NUMBER} | Add/Update Dataloader {DATALOADER_NAME}
+
+**First line of PR message**: Closes #{ISSUE_NUMBER}
+
+where you replace {ISSUE_NUMBER} with the issue number corresponding to your dataset.

 ### Checkbox
 - [ ] Confirm that this PR is linked to the dataset issue.
-- [ ] Create the dataloader script `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
-- [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
+- [ ] Create the dataloader script `seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py` (please use only lowercase and underscores for the dataset folder name, as mentioned in the dataset issue) and its `__init__.py` within the `{my_dataset}` folder.
+- [ ] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_LOCAL`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
 - [ ] Implement `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
 - [ ] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
 - [ ] Confirm the dataloader script works with the `datasets.load_dataset` function.
-- [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py`.
+- [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_seacrowd seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py` or `python -m tests.test_seacrowd seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}`.
 - [ ] If my dataset is local, I have provided an output of the unit tests in the PR (please copy-paste).
 This is OPTIONAL for public datasets, as we can test these without access to the data files.

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
deleted file mode 100644
index 5093649f5..000000000
--- a/CONTRIBUTING.md
+++ /dev/null
@@ -1,208 +0,0 @@
-# Guideline for contributing a dataloader implementation
-
-## Pre-Requisites
-
-Please make a GitHub account prior to implementing a dataset; you can follow the instructions to install git [here](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
-
-You will also need at least Python 3.6+. If you are installing Python, we recommend downloading [anaconda](https://docs.anaconda.com/anaconda/install/index.html) to curate a Python environment with the necessary packages. **We strongly recommend Python 3.8+ for stability**.
-
-**Optional**: Set up your GitHub account with SSH ([instructions here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh)).
-
-### 1. **Assigning a dataloader**
-- Choose a dataset from the [list of SEACrowd datasets](https://github.com/orgs/SEACrowd/projects/1/views/1).
-
-- Assign yourself an issue by commenting `#self-assign` under the issue. **Please assign yourself to issues with no other collaborators assigned**. You should see your GitHub username associated with the issue within 1-2 minutes of making a comment.
-
-- Search to see if the dataset exists in the 🤗 [Hub](https://huggingface.co/datasets). If it exists, please use the current implementation as the `source` and focus on implementing the [task-specific seacrowd schema](https://github.com/SEACrowd/seacrowd-datahub/blob/master/task_schemas.md).
-
-- If not, find the dataset online; it is usually uploaded on GitHub or Google Drive.
-
-### 2. **Set up a local version of the SEACrowd repo**
-Fork the seacrowd-datahub [repository](https://github.com/SEACrowd/seacrowd-datahub) to your GitHub account. To do this, click the link to the repository and click "fork" in the upper-right corner.
-
-After you fork, clone the repository locally. You can do so as follows:
-
-    git clone git@github.com:<your_github_username>/seacrowd-datahub.git
-    cd seacrowd-datahub  # enter the directory
-
-Next, you want to set your `upstream` location to enable you to push/pull (add or receive updates). You can do so as follows:
-
-    git remote add upstream git@github.com:SEACrowd/seacrowd-datahub.git
-
-You can optionally check that this was set properly by running the following command:
-
-    git remote -v
-
-The output of this command should look as follows:
-
-    origin    git@github.com:<your_github_username>/seacrowd-datahub.git (fetch)
-    origin    git@github.com:<your_github_username>/seacrowd-datahub.git (push)
-    upstream  git@github.com:SEACrowd/seacrowd-datahub.git (fetch)
-    upstream  git@github.com:SEACrowd/seacrowd-datahub.git (push)
-
-If you do NOT have an `origin` for whatever reason, then run:
-
-    git remote add origin git@github.com:<your_github_username>/seacrowd-datahub.git
-
-The goal of `upstream` is to keep your repository up-to-date with any changes made officially to the datasets library. You can do this by running the following commands:
-
-    git fetch upstream
-    git pull
-
-Provided you have no *merge conflicts*, this will ensure the library stays up-to-date as you make changes. However, before you make changes, you should make a custom branch to implement your changes.
-
-You can make a new branch as follows:
-
-    git checkout -b <name_of_your_branch>
-

-Please do not make changes on the master branch!

-Always make sure you're on the right branch with the following command:
-
-    git branch
-
-The correct branch will have an asterisk \* in front of it.
-
-### 2. **Create a development environment**
-You can make an environment in any way you choose. We highlight two possible options:
-
-#### 2a) Create a conda environment
-
-The following instructions will create an Anaconda `env-seacrowd-datahub` environment.
-
-- Install [anaconda](https://docs.anaconda.com/anaconda/install/) for your appropriate operating system.
-- Run the following command while in the `sea_datasets` folder (you can pick your Python version):
-
-```
-conda env create -f conda.yml  # Creates a conda env
-conda activate env-seacrowd-datahub  # Activate your conda environment
-```
-
-You can deactivate your environment at any time by either exiting your terminal or using `conda deactivate`.
-
-#### 2b) Create a venv environment
-
-Python 3.3+ has venv automatically installed; official information is found [here](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/).
-
-```
-python3 -m venv <path_to_venv>
-source <path_to_venv>/bin/activate  # activate environment
-pip install -r requirements.txt  # Install this while in the datasets folder
-```
-Make sure your `pip` package points to your environment's source.
-
-### 3. Implement your dataloader
-
-Make a new directory within the `SEACrowd/seacrowd-datahub/sea_datasets` directory:
-
-    mkdir seacrowd-datahub/sea_datasets/<dataset_name>
-
-Please use lowercase letters and underscores when choosing a `<dataset_name>`.
-To implement your dataset, there are three key methods that are important:
-
-  * `_info`: Specifies the schema of the expected dataloader
-  * `_split_generators`: Downloads and extracts data for each split (e.g. train/val/test) or associates local data with each split.
-  * `_generate_examples`: Creates examples from data that conform to each schema defined in `_info`.
-
-To start, copy [templates/template.py](templates/template.py) to your `seacrowd/sea_datasets/<dataset_name>` directory with the name `<dataset_name>.py`. Within this file, fill out all the TODOs.
-
-    cp templates/template.py seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py
-
-For the `_info` function, you will need to define `features` for your
-`DatasetInfo` object. For the `seacrowd` config, choose the right schema from our list of examples. You can find a description of these in the [Task Schemas Document](task_schemas.md). You can find the actual schemas in the [schemas directory](seacrowd/utils/schemas).
-
-You will use this schema in the `_generate_examples` return value.
-
-Populate the information in the dataset according to this schema; some fields may be empty.
-
-To enable quality control, please add the following line in your file before the class definition:
-```python
-from seacrowd.utils.constants import Tasks
-_SUPPORTED_TASKS = [Tasks.NAMED_ENTITY_RECOGNITION, Tasks.DEPENDENCY_PARSING]
-```
-
-##### Example scripts:
-To help you implement a dataset, you can see the implementation of [other dataset scripts](seacrowd/sea_datasets).
-
-#### Running & Debugging:
-You can run your data loader script during development by appending the following
-statement to your code ([templates/template.py](templates/template.py) already includes this):
-
-```python
-if __name__ == "__main__":
-    datasets.load_dataset(__file__)
-```
-
-If you want to use an interactive debugger during development, you will have to use
-`breakpoint()` instead of setting breakpoints directly in your IDE.
-Most IDEs will recognize the `breakpoint()` statement and pause there during debugging. If your preferred
-IDE doesn't support this, you can always run the script in your terminal and debug with `pdb`.
-
-### 4. Check if your dataloader works
-
-Make sure your dataset is implemented correctly by running the following commands in Python:
-
-```python
-from datasets import load_dataset
-
-data = load_dataset("seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py", name="<dataset_name>_seacrowd_<schema>")
-```
-
-Run these commands from the top level of the `nusa-crowd` repo (i.e. the same directory that contains the `requirements.txt` file).
-
-Once this is done, please also check if your dataloader satisfies our unit tests by using this command in the terminal:
-
-```bash
-python -m tests.test_seacrowd seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py [--data_dir /path/to/local/data]
-```
-
-Your particular dataset may require use of some of the other command line args in the test script.
-To view full usage instructions, you can use the `--help` command:
-
-```bash
-python -m tests.test_seacrowd --help
-```
-
-### 5. Format your code
-
-From the main directory, run the Makefile via the following command:
-
-    make check_file=seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py
-
-This runs the black formatter, isort, and lints to ensure that the code is readable and looks nice. Flake8 linting errors may require manual changes.
-
-### 6. Commit your changes
-
-First, commit your changes to the branch to "add" the work:
-
-    git add seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py
-    git commit -m "A message describing your commits"
-
-Then, run the following commands to incorporate any new changes in the master branch:
-
-    git fetch upstream
-    git rebase upstream/master
-
-Or you can install the pre-commit hooks to automatically pre-check before committing:
-
-    pre-commit install
-**Run these commands in your custom branch**.
-
-Push these changes to **your fork** with the following command:
-
-    git push -u origin <name_of_your_branch>
-
-### 7. **Make a pull request**
-
-Make a Pull Request to implement your changes on the main repository [here](https://github.com/SEACrowd/seacrowd-datahub/pulls). To do so, click "New Pull Request". Then, choose your branch from your fork to push into "base:master".
-
-When opening a PR, please link the [issue](https://github.com/SEACrowd/seacrowd-datahub/issues) corresponding to your dataset using [closing keywords](https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue) in the PR's description, e.g. `resolves #17`.

diff --git a/DATALOADER.md b/DATALOADER.md
index 5093649f5..1a94104ee 100644
--- a/DATALOADER.md
+++ b/DATALOADER.md
@@ -100,20 +100,21 @@ Make sure your `pip` package points to your environment's source.
 ### 3. Implement your dataloader
 
-Make a new directory within the `SEACrowd/seacrowd-datahub/sea_datasets` directory:
+Use this bash script to initialize your new dataloader folder, along with a template of your dataloader script, under the `SEACrowd/seacrowd-datahub/sea_datasets` directory:
 
-    mkdir seacrowd-datahub/sea_datasets/<dataset_name>
+    sh templates/initiate_seacrowd_dataloader.sh <dataset_name>
 
+The value of `<dataset_name>` can be found on the issue ticket that you were assigned to.
 
-Please use lowercase letters and underscores when choosing a `<dataset_name>`.
+e.g., for this [issue ticket](https://github.com/SEACrowd/seacrowd-datahub/issues/32), the dataloader name is indicated as `Dataloader name: xl_sum/xl_sum.py`, hence the value of `<dataset_name>` is `xl_sum`.
+
+Please use PascalCase when choosing a ``.
 To implement your dataset, there are three key methods that are important:
 
   * `_info`: Specifies the schema of the expected dataloader
   * `_split_generators`: Downloads and extracts data for each split (e.g. train/val/test) or associates local data with each split.
   * `_generate_examples`: Creates examples from data that conform to each schema defined in `_info`.
 
-To start, copy [templates/template.py](templates/template.py) to your `seacrowd/sea_datasets/<dataset_name>` directory with the name `<dataset_name>.py`. Within this file, fill out all the TODOs.
-
-    cp templates/template.py seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py
+After the bash script above has been executed, your `seacrowd/sea_datasets/<dataset_name>` directory will exist, containing a script named `<dataset_name>.py`. Within this file, fill out all the TODOs based on the template.
 
 For the `_info` function, you will need to define `features` for your `DatasetInfo` object. For the `seacrowd` config, choose the right schema from our list of examples. You can find a description of these in the [Task Schemas Document](task_schemas.md). You can find the actual schemas in the [schemas directory](seacrowd/utils/schemas).

@@ -133,7 +134,7 @@ To help you implement a dataset, you can see the implementation of [other datase
 #### Running & Debugging:
 You can run your data loader script during development by appending the following
-statement to your code ([templates/template.py](templates/template.py) already includes this):
+statement to your code (if you initialized your dataloader folder using the previous bash script, this is already included; otherwise, you may add it yourself):
 
 ```python
 if __name__ == "__main__":
     datasets.load_dataset(__file__)
 ```

@@ -157,7 +158,7 @@ from datasets import load_dataset
 
 data = load_dataset("seacrowd/sea_datasets/<dataset_name>/<dataset_name>.py", name="<dataset_name>_seacrowd_<schema>")
 ```
 
-Run these commands from the top level of the `nusa-crowd` repo (i.e. the same directory that contains the `requirements.txt` file).
+Run these commands from the top level of the `seacrowd/seacrowd-datahub` repo (i.e. the same directory that contains the `requirements.txt` file).
 
 Once this is done, please also check if your dataloader satisfies our unit tests by using this command in the terminal:

@@ -195,6 +196,7 @@ Then, run the following commands to incorporate any new changes in the master br
 Or you can install the pre-commit hooks to automatically pre-check before committing:
 
     pre-commit install
+
 **Run these commands in your custom branch**.
 
 Push these changes to **your fork** with the following command:

diff --git a/templates/initiate_seacrowd_dataloader.sh b/templates/initiate_seacrowd_dataloader.sh
new file mode 100644
index 000000000..ec3014b3e
--- /dev/null
+++ b/templates/initiate_seacrowd_dataloader.sh
@@ -0,0 +1,20 @@
+#!/bin/bash
+
+# This simple bash script creates the dataloader folder, adds the necessary files, and copies the dataloader template script into the destination folder.
+
+if [[ "$1" == "" ]]; then
+    echo "Error: Missing the dataset name to be created"
+    echo "Usage: sh \${YOUR_SEACROWD_ROOT_PATH}/templates/initiate_seacrowd_dataloader.sh <dataset_name>"
+    exit
+fi
+
+if [[ "$2" == "" ]]; then
+    root_path=./
+else
+    root_path=$2
+fi
+
+(cd $root_path/seacrowd/sea_datasets && mkdir $1 && cd $1 && touch __init__.py)
+cp $root_path/templates/template.py $root_path/seacrowd/sea_datasets/$1/$1.py
+
+echo "Initialization is done. Exiting..."
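For reference, a plausible invocation of this helper script might look like the sketch below. Here `xl_sum` stands in for whatever dataloader name your issue ticket specifies, and `/path/to/seacrowd-datahub` is a hypothetical path; the second argument is only needed when running outside the repo root:

```bash
# Run from the repository root (the default when the second argument is omitted):
sh templates/initiate_seacrowd_dataloader.sh xl_sum

# Or point the script at the repo root explicitly:
sh templates/initiate_seacrowd_dataloader.sh xl_sum /path/to/seacrowd-datahub

# Either way, the script should leave behind:
#   seacrowd/sea_datasets/xl_sum/__init__.py   (empty marker file)
#   seacrowd/sea_datasets/xl_sum/xl_sum.py     (copied from templates/template.py)
```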
diff --git a/templates/template.py b/templates/template.py
index 94e2c8052..19270d11e 100644
--- a/templates/template.py
+++ b/templates/template.py
@@ -68,7 +68,7 @@
 # TODO: Add languages related to this dataset
 _LANGUAGES = []  # We follow ISO639-3 language code (https://iso639-3.sil.org/code_tables/639/data)
 
-# TODO: Add the licence for the dataset here
+# TODO: Add the licence for the dataset here (see constant choices in https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/utils/constants.py)
 # Note that this doesn't have to be a common open source license.
 # In the case the dataset is intentionally built without a license, please use `Licenses.UNLICENSE.value`
 # In the case that it's not clear whether the dataset has a license or not, please use `Licenses.UNKNOWN.value`
@@ -90,7 +90,7 @@
     _DATASETNAME: "url or list of urls or ... ",
 }
 
-# TODO: add supported task by dataset. One dataset may support multiple tasks
+# TODO: add supported task by dataset. One dataset may support multiple tasks (see constant choices in https://github.com/SEACrowd/seacrowd-datahub/blob/master/seacrowd/utils/constants.py)
 _SUPPORTED_TASKS = []  # example: [Tasks.TRANSLATION, Tasks.NAMED_ENTITY_RECOGNITION, Tasks.RELATION_EXTRACTION]
 
 # TODO: set this to a version that is associated with the dataset. if none exists use "1.0.0"
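To make the template's TODOs concrete, here is a sketch of how these header variables might look once filled in for a hypothetical, publicly downloadable dataset. All values are illustrative: `Licenses.UNKNOWN.value` and `Tasks.NAMED_ENTITY_RECOGNITION` are the members quoted elsewhere in this patch, while the dataset name, homepage, and language list are made up; check any other member names against `seacrowd/utils/constants.py`:

```python
# Illustrative header values for a hypothetical dataset; not a real dataloader.
from seacrowd.utils.constants import Licenses, Tasks

_DATASETNAME = "my_dataset"        # lowercase with underscores, per the PR checklist
_DESCRIPTION = "A short description of the dataset."
_HOMEPAGE = "https://example.com/my_dataset"
_LANGUAGES = ["ind"]               # ISO 639-3 codes
_LICENSE = Licenses.UNKNOWN.value  # use UNKNOWN when the license is unclear
_LOCAL = False                     # True only if the data cannot be downloaded publicly
_SUPPORTED_TASKS = [Tasks.NAMED_ENTITY_RECOGNITION]
_SOURCE_VERSION = "1.0.0"          # fall back to "1.0.0" when no version exists
_SEACROWD_VERSION = "1.0.0"
```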