forked from SEACrowd/seacrowd-datahub
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add Dataloader Reviewing SOP to REVIEW.md (SEACrowd#230)
- Loading branch information
1 parent
a3a6a84
commit c646fad
Showing
1 changed file
with
36 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -51,3 +51,39 @@ Check the following before approving: | |
3. Check the scoring guide and see which languages gets additional points (if any). | ||
4. Add the dataloader name (use Python snake case) | ||
5. Wait for a GitHub issue to be generated for the approved datasheet. | ||
|
||
|
||
# Dataloader Reviewer SOP | ||
|
||
The objective of datasheet review is to ensure that all dataloaders in SEACrowd has correctness to the HF Dataloader Structure & SEACrowd defined schema and config and follow similar code format and/or style. | ||
|
||
### Dataloader Check | ||
1. Metadata correctness. (ensure Tasks, Languages, HOME_URL, DATA_URL is used). | ||
2. All subsets are implemented correctly to respective dataloader issue and according to SEACrowd Schema definition (has both `source` and `seacrowd` schema -- if a given task has its SEACrowd Schema, else can raise it to reviewers/mods). | ||
3. Pass the test scripts defined in `tests` folder. | ||
4. Pass manual check. | ||
a. Perform a sampling of configs based on Lang and/or Task combinations | ||
b. Execute `datasets.load_dataset` check based on config list (a) | ||
c. Check on the dataset schema & few first examples for plausibility. | ||
5. Follows some general rules/conventions: | ||
1. `PascalCase` for dataloader class name (and “Dataset” is contained in the suffix of the class name). | ||
2. Lowercase word characters (regex identifier: `\w`) for schema column names, including the `source` schema if the original dataset doesn’t follow it. | ||
6. The code aligns with the `black` formatter: | ||
use this `make check_file=seacrowd/sea_datasets/{dataloader}/{dataloader}.py` | ||
7. Follows Dataloader Config Rule | ||
The dataset can be divided into different cases based on the compulsory Dataloader Configs: | ||
a. Single Language, Single Task (Type 1) | ||
b. Multiple Language, Single Task (Type 2) | ||
c. Single Language, Multiple Task (Type 3) | ||
d. Multiple Language, Multiple Task (Type 4) | ||
For a multilingual dataset of Lang Identification (or Linguistic Features/Unit Identification), it’s not considered Multilingual Dataset since the Lang itself is used as the label, or it doesn’t make sense to split the data based on the languages. | ||
|
||
Based on afromentioned types, the checklist for Config Correctness: | ||
1. For type 1 & 3, both config of `f”{_DATASETNAME}_source”` and `f”{_DATASETNAME}_seacrowd_{TASK_TO_SCHEMA}”` must be implemented. | ||
2. For type 2 & 4, the dataloader config in (1) shouldn't be implemented, consequently it must cover all possible languages as a replacement. The formatting for Multilingual are `f”{_DATASETNAME}_{ISO_639_3_LANG}_source”` and `f”{_DATASETNAME}_{ISO_639_3_LANG}_seacrowd_{TASK_TO_SCHEMA}”`. | ||
3. For point (2), since it won't pass the test-cases using the default args, a custom arg must be provided by the Dataloader PR creator to ensure the reproducibility of Testing among reviewers. | ||
|
||
## Approval process | ||
1. For every dataloader, it requires 2 reviewers per issue (the assignee must not review their own dataloader). | ||
2. Once the second reviewer approved, the PR should be merged to the `master` branch using `squash and merge` strategy for cleaner commit history. | ||
3. Reviewers will be assigned from the reviewer pool once/twice a week (by [email protected]) or any reviewers can take any unassigned review process as long as it can be done in timely manner. |