The first online catalogue for SEACrowd datasheets. This catalogue contains 498 datasets, each with metadata annotations. You can view the list of all datasets at seacrowd.github.io/seacrowd-catalogue.
No.: dataset number
Name: name of the dataset
Subsets: subsets of the dataset
Link: direct link to the dataset or instructions on how to download it
License: license of the dataset
Year: year the dataset or paper was published
Language: ar or multilingual
Dialect: region, e.g. ar-LEV (Arabic (Levant)); country, e.g. ar-EGY (Arabic (Egypt)); or type, e.g. ar-MSA (Arabic (Modern Standard Arabic))
Domain: social media, news articles, reviews, commentary, books, transcribed audio or other
Form: text, audio or sign language
Collection style: crawling, crawling and annotation (translation), crawling and annotation (other), machine translation, human translation, human curation or other
Description: short statement describing the dataset
Volume: the size of the dataset in numbers
Unit: unit of the volume (tokens, sentences, documents, MB, GB, TB, hours or other)
Provider: company or university providing the dataset
Related Datasets: any datasets that are related in content to this dataset
Paper Title: title of the paper
Paper Link: direct link to the paper PDF
Script: writing system (Arab, Latn, Arab-Latn or other)
Tokenized: whether the dataset is segmented using morphology (Yes or No)
Host: the website hosting the data, e.g. GitHub
Access: whether the data is free, upon-request or with-fee
Cost: cost of the data if access is with-fee
Test split: whether the data contains a test split (Yes or No)
Tasks: the tasks included in the dataset, separated by commas
Evaluation Set: the data included in the evaluation suite by BigScience
Venue Title: the venue title, e.g. ACL
Citations: the number of citations
Venue Type: conference, workshop, journal or preprint
Venue Name: full name of the venue, e.g. Association for Computational Linguistics
authors: list of the paper authors, separated by commas
affiliations: list of the paper authors' affiliations, separated by commas
abstract: abstract of the paper
Added by: name of the person who added the entry
Notes: any extra notes on the dataset
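Several of the fields above take values from a fixed vocabulary (Access, Test split, Tokenized, Venue Type), so an entry can be sanity-checked before it is added. The following is a minimal sketch in JavaScript, assuming an entry is a plain object keyed by the field names above; the `validateEntry` helper and the sample entry are hypothetical and are not part of the catalogue's code.

```javascript
// Allowed values for fields with a fixed vocabulary, copied from the field list above.
const ALLOWED = {
  "Access": ["free", "upon-request", "with-fee"],
  "Test split": ["Yes", "No"],
  "Tokenized": ["Yes", "No"],
  "Venue Type": ["conference", "workshop", "journal", "preprint"],
};

// Hypothetical helper: returns a list of problems found (empty if none).
function validateEntry(entry) {
  const problems = [];
  for (const [field, allowed] of Object.entries(ALLOWED)) {
    const value = entry[field];
    if (value === undefined) continue; // field not provided, nothing to check
    if (!allowed.includes(value)) {
      problems.push(`${field}: unexpected value "${value}"`);
    }
  }
  return problems;
}

// Made-up entry, for illustration only.
const sample = { "Name": "example-dataset", "Access": "with-fee", "Test split": "Maybe" };
console.log(validateEntry(sample));
```

Free-text fields such as Description or Notes are deliberately left unchecked here; only the fields whose allowed values are spelled out in the list are validated.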
You can access the annotated dataset using datasets
TO DO
which gives the following output
The catalogue will be updated regularly.
Prepare the development environment by editing the Next.js config (next.config.mjs) so that the output and basePath options are commented out:
/**
* Enable static exports for the App Router.
*
* @see https://nextjs.org/docs/app/building-your-application/deploying/static-exports
*/
// output: "export", // comment this line out for local development
/**
* Set base path. This is usually the slug of your repository.
*
* @see https://nextjs.org/docs/app/api-reference/next-config-js/basePath
*/
// basePath: "/seacrowd-catalogue", // comment this line out for local development
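With both options commented out, a development version of next.config.mjs might look like the sketch below. Only the `output` and `basePath` values come from the fragment above; the surrounding `nextConfig` object and the default export follow the usual Next.js config layout and are an assumption, since the repository's full file is not reproduced here.

```javascript
/** @type {import('next').NextConfig} */
const nextConfig = {
  // Static export is disabled for local development; restore before deploying.
  // output: "export",

  // Base path is disabled so the dev server serves the site from the root path.
  // basePath: "/seacrowd-catalogue",
};

export default nextConfig;
```

Remember to restore both lines before building the static site for deployment.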
Use Node.js version 20 and run the following commands:
npm install
npm run dev
Will be added