Create dataset loader for Leipzig Corpora Collection #339

Open
SamuelCahyawijaya opened this issue Jan 22, 2024 · 3 comments · May be fixed by #483
Assignees
Labels
bonus +3, pr-ready (a PR that closes this issue is ready to be reviewed)

Comments

@SamuelCahyawijaya
Collaborator

SamuelCahyawijaya commented Jan 22, 2024

Dataloader name: `leipzig_corpora/leipzig_corpora.py`
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?leipzig_copora

| Dataset | leipzig_copora |
|---|---|
| Description | This is a collection of corpora in different languages, all built by randomly selecting sentences from web and newspaper sources. Each language has its own directory containing .txt files that list the words and sentences in the corpus, map words or sentences to their sources, and show the cooccurrence of words. The 2017 Community version of the collection contains text material crawled from different websites and contains data for 20 SEA languages. |
| Subsets | - |
| Languages | ban, bjn, bew, bcl, mya, ceb, hil, ind, khm, lao, zsm, min, pam, pag, ksw, tgl, tha, vie, war, jav, mad |
| Tasks | Language Modeling |
| License | Creative Commons Attribution 4.0 (cc-by-4.0) |
| Homepage | https://wortschatz.uni-leipzig.de/en/download |
| HF URL | - |
| Paper URL | http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf |
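
For whoever picks this up, here is a minimal sketch of how the sentence files described above might be read. It assumes each line of a sentences `.txt` file holds an ID and a sentence separated by a tab; that layout is an assumption and should be checked against the actual downloads. The function name `iter_sentences` and the sample text are hypothetical, and this is not the SEACrowd loader API.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_sentences(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (sentence_id, text) pairs from a tab-separated sentences file.

    Assumed layout (not confirmed by the issue): "<id>\t<sentence>" per line.
    """
    for raw in handle:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        sent_id, sep, text = line.partition("\t")
        if not sep:
            continue  # skip lines that do not contain a tab separator
        yield sent_id.strip(), text.strip()


# Hypothetical in-memory sample standing in for a real sentences file.
sample = io.StringIO(
    "1\tIni kalimat pertama.\n"
    "2\tIni kalimat kedua.\n"
    "\n"
    "malformed line\n"
)
rows = list(iter_sentences(sample))
```

A real dataloader would open the extracted file for each language and yield these pairs as examples for the language modeling task; the generator keeps memory use flat for large corpora.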
@SamuelCahyawijaya converted this from a draft issue on Jan 22, 2024
@sabilmakbar added the help wanted label on Jan 30, 2024
@TysonYu
Collaborator

TysonYu commented Feb 5, 2024

#self-assign

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@holylovenia linked a pull request on Mar 11, 2024 that will close this issue
@holylovenia added the pr-ready label and removed the help wanted and staled-issue labels on Mar 11, 2024
@holylovenia
Contributor

Not sure why this wasn't linked to the PR. I've linked it now.


4 participants