In this repository, we provide 96 publicly available labeled datasets.
The datasets were originally collected for our paper "Measuring the Validity of Clustering Validation Datasets" (previously entitled "Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measure") as potential candidates for external clustering validation. However, they can still be used for various other purposes (e.g., classification, dimensionality reduction). For better applicability, we provide the datasets in both numpy (`.npy`) and compressed (`.bin`) formats. We also provide reader code for the compressed files.
A full list of the datasets is available on this website and in the Appendix of our reference paper (TBA).
The reader for the compressed files is written in `reader.py`. We assume that the relative path between the reader file and the compressed datasets is identical to the one in this repository. The reader code depends on `numpy` and `zlib`.
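For reference, the following is a minimal sketch of the kind of decompression the `zlib` dependency implies: it inflates a compressed binary file and interprets the result as a numpy array. The `load_compressed_array` helper, its byte layout, and its default dtype are hypothetical; the actual format handling is defined in `reader.py`.

```python
import zlib
import numpy as np

def load_compressed_array(path, dtype=np.float32):
    """Hypothetical sketch: inflate a zlib-compressed file into a flat numpy array."""
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())  # inflate the compressed bytes
    # Interpret the decompressed buffer as a flat array; the real reader
    # additionally knows the dtype and shape of each dataset.
    return np.frombuffer(raw, dtype=dtype)
```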
`read_dataset(name)`

- returns the designated dataset as numpy arrays holding the data and labels
- (INPUT) `name`: str, the name of a dataset (directory name)
- (OUTPUT) `(data, labels)`:
  - `data`: ndarray, 2D numpy array holding the data values
  - `labels`: ndarray, 1D numpy array holding the class labels
`read_dataset_by_path(path)`

- returns the designated dataset as numpy arrays holding the data and labels (a usage sketch follows the list)
- (INPUT) `path`: str, the relative path to the directory containing the dataset
- (OUTPUT) `(data, labels)`: identical to `read_dataset`
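A short usage sketch, assuming a hypothetical relative path to one of the dataset directories in this repository:

```python
import reader as rd

# Read a single dataset by the relative path of its directory
# (the path below is hypothetical; use an actual dataset directory).
data, labels = rd.read_dataset_by_path("./cifar10")
print(data.shape, labels.shape)
```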
`read_multiple_datasets(names)`

- returns dictionaries holding the data and labels of the designated datasets (a usage sketch follows the list)
- (INPUT) `names`: list, the list holding the names of the datasets
- (OUTPUT) `(data, labels)`:
  - `data`: dict, dictionary holding the data values; the data of a certain dataset can be accessed by using the name of the dataset as a key
  - `labels`: dict, dictionary holding the labels; the labels of a certain dataset can be accessed by using the name of the dataset as a key
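A short usage sketch (the dataset names are illustrative; any names from the dataset list can be used):

```python
import reader as rd

# Read several datasets at once; both return values are dictionaries
# keyed by dataset name.
data, labels = rd.read_multiple_datasets(["cifar10", "mnist"])
print(data["cifar10"].shape, labels["cifar10"].shape)
```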
`read_all_datasets()`

- returns dictionaries holding the data and labels of all 96 datasets (a usage sketch follows the list)
- (OUTPUT) `(data, labels)`: identical to `read_multiple_datasets`
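A short usage sketch; note that this loads every dataset into memory at once:

```python
import reader as rd

# Load all 96 datasets into two dictionaries keyed by dataset name.
data, labels = rd.read_all_datasets()
print(len(data))  # expected to be 96
```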
```python
import reader as rd
import numpy as np

data, label = rd.read_dataset("cifar10")
```
If you have any issues using the datasets, feel free to contact us via [email protected].
TBA