Update README.md
holylovenia authored Jun 20, 2024
1 parent 7261ffb commit 3645340
Showing 1 changed file with 58 additions and 19 deletions.
77 changes: 58 additions & 19 deletions README.md
@@ -2,40 +2,79 @@

# Welcome to SEACrowd!

<!--
<h3>158 datasets registered</h3>

![Dataset claimed](https://progress-bar.dev/83/?title=Datasets%20Claimed%20(119%20Datasets%20Claimed))

<!-- milestone starts
![Milestone 1](https://progress-bar.dev/100/?title=Milestone%201%20(30%20Datasets%20Completed))

![Milestone 2](https://progress-bar.dev/100/?title=Milestone%202%20(60%20Datasets%20Completed))

![Milestone 3](https://progress-bar.dev/100/?title=Milestone%203%20(100%20Datasets%20Completed))

![Milestone 4](https://progress-bar.dev/84/?title=Milestone%204%20(150%20Datasets%20Completed))
<!-- milestone ends -->

Southeast Asia is home to more than 1,000 native languages. Nevertheless, Southeast Asian NLP, vision-language, and speech processing are underrepresented in the research community, and one of the reasons is the lack of access to public datasets ([Aji et al., 2022](https://aclanthology.org/2022.acl-long.500/)). To address this issue, we initiate **SEACrowd**, a joint collaboration to collect NLP datasets for Southeast Asian languages. Help us collect and centralize Southeast Asian datasets, and be a co-author of our upcoming paper.

## How to Use

### Library Installation

The `seacrowd` library (v0.1.3) is available at https://pypi.org/project/seacrowd/. (See our release notes [here](https://github.com/SEACrowd/seacrowd-datahub/releases/tag/0.1.3).)

To install SEACrowd, install the `seacrowd` package in your Python environment via `pip`.

```
pip install seacrowd
```
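
After installing, you may want to confirm which version ended up in your environment. The sketch below uses only the Python standard library (`importlib.metadata`); the helper name `installed_version` is our own and is not part of `seacrowd`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if `seacrowd` is not installed.
print(installed_version("seacrowd"))
```

If this prints `None`, the package is not visible to the interpreter you are running, which usually means `pip` installed it into a different environment.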

### Using the `seacrowd` library

To use the `seacrowd` package, import it in your code:
```
import seacrowd as sc
```
### List & Load Dataset
SEACrowd provides functions for listing and loading all of the datasets it implements.
```
# List all datasets
dset_names = sc.list_datasets()

# List all datasets with their config names
dset_configs_dict = sc.list_datasets(with_config=True)

# Load a single dataset based on the dataset name
khpos_dset = sc.load_dataset("khpos", schema="seacrowd")

# Load multiple datasets based on the dataset names
dsets = sc.load_datasets(["thai_sum", "vsolscsum"], schema="seacrowd_t2t")
```
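
With many datasets registered, it can help to narrow the list before loading anything. The following is a minimal sketch, assuming `sc.list_datasets()` returns an iterable of dataset name strings (we have not verified the exact return type). The helper `find_datasets` is hypothetical and works on any list of names.

```python
from typing import Iterable, List


def find_datasets(names: Iterable[str], keyword: str) -> List[str]:
    """Return the sorted dataset names that contain `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(name for name in names if keyword in name.lower())


# Offline demonstration using names taken from the examples above.
print(find_datasets(["thai_sum", "khpos", "vsolscsum"], "sum"))

# With the real library (assumption: list_datasets() yields name strings):
#   import seacrowd as sc
#   summarization_like = find_datasets(sc.list_datasets(), "sum")
```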
### List & Load Benchmark
In addition to the dataset functions, SEACrowd provides functions for listing and loading a selection of SEA benchmarks.
```
# List all benchmarks
benchmark_names = sc.list_benchmarks()

# Load all datasets in a benchmark
seacrowd_vl_dsets = sc.load_benchmark("SEACrowd-VL")
```
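
Since benchmark names are plain strings, a typo only surfaces when loading fails. Below is a small sketch that checks a name against the listed benchmarks first. It assumes `list_benchmarks()` returns an iterable of names that `load_benchmark()` accepts. `load_benchmark_checked` and `FakeSeacrowd` are hypothetical names of ours; the latter is an offline stand-in so the sketch runs without the package.

```python
from typing import Any


def load_benchmark_checked(sc_module: Any, name: str) -> Any:
    """Load a benchmark only if `name` appears in `sc_module.list_benchmarks()`."""
    available = list(sc_module.list_benchmarks())
    if name not in available:
        raise ValueError(f"Unknown benchmark {name!r}; available: {available}")
    return sc_module.load_benchmark(name)


class FakeSeacrowd:
    """Offline stand-in for the `seacrowd` module, used only to exercise the helper."""

    def list_benchmarks(self):
        return ["SEACrowd-VL"]

    def load_benchmark(self, name):
        return {"benchmark": name}


print(load_benchmark_checked(FakeSeacrowd(), "SEACrowd-VL"))

# With the real library, pass the imported module instead of the stand-in:
#   import seacrowd as sc
#   seacrowd_vl_dsets = load_benchmark_checked(sc, "SEACrowd-VL")
```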
### Load Metadata
Aside from loading datasets and benchmarks, `seacrowd` also supports loading the metadata (e.g., license, description, and citation) of the dataloaders.
```
# Load metadata of a dataloader
khpos_meta = sc.for_dataset("khpos")

# Load metadata of multiple dataloaders
meta_dsets = sc.for_datasets(["thai_sum", "vsolscsum"])

# Load metadata of a config name
nusaparagraph_meta = sc.for_config_name("nusaparagraph_emot_jav_seacrowd_text")

# Load metadata of multiple config names
meta_dsets = sc.for_config_names(["sentiment_nathasa_review_seacrowd_text", "indonli_seacrowd_pairs"])
```
You can also load a dataset directly from its metadata object.
```
# Load dataset from metadata
khpos_dset = khpos_meta.load_dataset()
```
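
To compare metadata across several dataloaders, it can be convenient to flatten each metadata object into a plain dictionary. The sketch below is an assumption-heavy illustration: the attribute names `license`, `description`, and `citation` are guessed from the field list above and may not match the real metadata object, so missing attributes fall back to `None`. The helper `summarize_metadata` is hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable


def summarize_metadata(
    meta: Any,
    fields: Iterable[str] = ("license", "description", "citation"),
) -> Dict[str, Any]:
    """Collect the requested attributes from a metadata object, using None for missing ones."""
    return {field: getattr(meta, field, None) for field in fields}


# Stand-in object with placeholder values; with the real library this would be
# the object returned by sc.for_dataset("khpos").
example_meta = SimpleNamespace(
    license="placeholder license",
    description="placeholder description",
)
print(summarize_metadata(example_meta))
```

Using `getattr` with a default keeps the helper from failing when a field is absent, at the cost of hiding misspelled field names.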
## How to Contribute
Check out our [CONTRIBUTING.md](https://github.com/SEACrowd/seacrowd-datahub/blob/master/CONTRIBUTING.md) for a gentle introduction to contributing to SEACrowd. If you have already decided to contribute by implementing dataloaders for our Data Hub, jump straight to [DATALOADER.md](https://github.com/SEACrowd/seacrowd-datahub/blob/master/DATALOADER.md)!
@@ -55,4 +94,4 @@ If you are using any resources from SEACrowd, including datasheets, dataloaders,
## Acknowledgements
Our initiative is heavily inspired by [NusaCrowd](https://github.com/IndoNLP/nusa-crowd/tree/master/nusacrowd), which provides open access to 100+ Indonesian NLP corpora. The NusaCrowd paper (published in ACL Findings 2023) is available at the following [link](https://aclanthology.org/2023.findings-acl.868/).
