Add Dataloader Reviewing SOP to REVIEW.md (SEACrowd#230)

sabilmakbar · Dec 27, 2023 · c646fad · c646fad
1 parent a3a6a84
commit c646fad
Showing 1 changed file with 36 additions and 0 deletions.
diff --git a/REVIEWING.md b/REVIEWING.md
@@ -51,3 +51,39 @@ Check the following before approving:
 3. Check the scoring guide and see which languages gets additional points (if any).
 4. Add the dataloader name (use Python snake case)
 5. Wait for a GitHub issue to be generated for the approved datasheet.
+
+
+# Dataloader Reviewer SOP
+
+The objective of datasheet review is to ensure that all dataloaders in SEACrowd has correctness to the HF Dataloader Structure & SEACrowd defined schema and config and follow similar code format and/or style.
+
+### Dataloader Check
+1. Metadata correctness. (ensure Tasks, Languages, HOME_URL, DATA_URL is used).
+2. All subsets are implemented correctly to respective dataloader issue and according to SEACrowd Schema definition (has both `source` and `seacrowd` schema -- if a given task has its SEACrowd Schema, else can raise it to reviewers/mods).
+3. Pass the test scripts defined in `tests` folder.
+4. Pass manual check.
+ a. Perform a sampling of configs based on Lang and/or Task combinations
+ b. Execute `datasets.load_dataset` check based on config list (a)
+ c. Check on the dataset schema & few first examples for plausibility.
+5. Follows some general rules/conventions:
+    1. `PascalCase` for dataloader class name (and “Dataset” is contained in the suffix of the class name).
+    2. Lowercase word characters (regex identifier: `\w`) for schema column names, including the `source` schema if the original dataset doesn’t follow it.
+6. The code aligns with the `black` formatter:
+use this `make check_file=seacrowd/sea_datasets/{dataloader}/{dataloader}.py`
+7. Follows Dataloader Config Rule
+The dataset can be divided into different cases based on the compulsory Dataloader Configs:
+a. Single Language, Single Task (Type 1)
+b. Multiple Language, Single Task (Type 2)
+c. Single Language, Multiple Task (Type 3)
+d. Multiple Language, Multiple Task (Type 4)
+For a multilingual dataset of Lang Identification (or Linguistic Features/Unit Identification), it’s not considered Multilingual Dataset since the Lang itself is used as the label, or it doesn’t make sense to split the data based on the languages.
+
+   Based on afromentioned types, the checklist for Config Correctness:
+    1. For type 1 & 3, both config of `f”{_DATASETNAME}_source”` and `f”{_DATASETNAME}_seacrowd_{TASK_TO_SCHEMA}”` must be implemented.
+    2. For type 2 & 4, the dataloader config in (1) shouldn't be implemented, consequently it must cover all possible languages as a replacement. The formatting for Multilingual are `f”{_DATASETNAME}_{ISO_639_3_LANG}_source”` and `f”{_DATASETNAME}_{ISO_639_3_LANG}_seacrowd_{TASK_TO_SCHEMA}”`.
+    3. For point (2), since it won't pass the test-cases using the default args, a custom arg must be provided by the Dataloader PR creator to ensure the reproducibility of Testing among reviewers.
+
+## Approval process
+1. For every dataloader, it requires 2 reviewers per issue (the assignee must not review their own dataloader).
+2. Once the second reviewer approved, the PR should be merged to the `master` branch using `squash and merge` strategy for cleaner commit history.
+3. Reviewers will be assigned from the reviewer pool once/twice a week (by [email protected]) or any reviewers can take any unassigned review process as long as it can be done in timely manner.