Collection of free datasets hosted with ReductStore.
The goal of this repository is to provide a collection of free datasets that can be used for testing and benchmarking machine learning algorithms.
All datasets are hosted on ReductStore and can be downloaded using Reduct CLI or one of the client libraries.
Although ReductStore is a time series database, we use it to store each dataset as a collection of records and use the timestamp as a unique identifier. This approach has the following advantages:
- The database is fast and free, so you can mirror datasets on your own instance and use them locally.
- You can download partial datasets
- You can use the datasets directly from Python, Rust, C++, or Node.js
- You can use annotations as a dictionary of labels, no need to parse them manually.
Credentials to obtain the datasets:
- Host: https://play.reduct.store
- Bucket: datasets
- API Token: reductstore
You can export datasets to your local machine using Reduct CLI:
# Install the tool
wget https://github.com/reductstore/reduct-cli/releases/latest/download/reduct-cli.linux-amd64.tar.gz
tar -xvf reduct-cli.linux-amd64.tar.gz
chmod +x reduct-cli
sudo mv reduct-cli /usr/local/bin
# Add the ReductStore instance to aliases
reduct-cli alias add play -L https://play.reduct.store -t reductstore
# Download dataset(s) specified in --entries. Each sample will have a JSON document with metadata and annotations.
reduct-cli cp play/datasets . --entries=<Dataset Name> --with-meta
You can integrate ReductStore into your Python code and use the datasets directly:
import asyncio
from reduct import Client
HOST = "https://play.reduct.store"
API_TOKEN = "reductstore"
DATASET = "cats"
async def main():
    client = Client(HOST, API_TOKEN)
    bucket = await client.get_bucket("datasets")
    async for record in bucket.query(DATASET):
        print(record.labels)
        jpeg = await record.read_all()
        # Do something with the JPEG image
if __name__ == "__main__":
    asyncio.run(main())
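Since records are streamed one at a time, one simple way to fetch only part of a dataset is to stop iterating early. The sketch below shows a small helper for that. The names `take` and `fake_query` are hypothetical and not part of the ReductStore client; `fake_query` only stands in for `bucket.query(...)` so the example runs without a connection.

```python
import asyncio
from typing import AsyncIterator, List, TypeVar

T = TypeVar("T")


async def take(records: AsyncIterator[T], limit: int) -> AsyncIterator[T]:
    """Yield at most `limit` items from an async iterator, then stop."""
    if limit <= 0:
        return
    count = 0
    async for record in records:
        yield record
        count += 1
        if count >= limit:
            break


async def fake_query(total: int) -> AsyncIterator[int]:
    """Stand-in for bucket.query(...): yields `total` dummy records."""
    for index in range(total):
        yield index


async def demo() -> List[int]:
    # Only the first three of 1000 dummy records are consumed.
    return [item async for item in take(fake_query(1000), 3)]


first_three = asyncio.run(demo())
print(first_three)
```

With a real bucket, the same helper could wrap the query, for example `async for record in take(bucket.query(DATASET), 100)`, with each record still read inside the loop body.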
Entry Name | Description | Data Type | Labels | Original Source | Export Script |
---|---|---|---|---|---|
cats | Over 9,000 images of cats with annotated facial features | jpeg | left-eye-x,left-eye-y,right-eye-x,right-eye-y,mouth-x,mouth-y,left-ear-1-x,left-ear-1-y,left-ear-2-x,left-ear-2-y,left-ear-3-x,left-ear-3-y,right-ear-1-x,right-ear-1-y,right-ear-2-x,right-ear-2-y,right-ear-3-x,right-ear-3-y | kaggle | export.py |
mnist_training, mnist_test | MNIST handwritten digits | png | digit | MNIST | export.py |
imdb | ~50,000 photos from IMDB with face location, age and gender | jpeg | dob,photo_taken,gender,name,face_location_{x,y,w,h},face_score,second_face_score,celeb_names,celeb_id | IMDB-WIKI | export.py |
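As an illustration of working with the labels listed in the table, the following sketch groups the `-x` / `-y` label pairs of the `cats` entry into named points. It assumes that label values arrive as strings, as in `record.labels`; the helper name `labels_to_points` and the sample values are made up for the example and do not come from the dataset.

```python
from typing import Dict, Tuple


def labels_to_points(labels: Dict[str, str]) -> Dict[str, Tuple[float, float]]:
    """Pair every '<name>-x' label with its '<name>-y' counterpart."""
    points: Dict[str, Tuple[float, float]] = {}
    for key, value in labels.items():
        if not key.endswith("-x"):
            continue
        name = key[:-2]  # strip the trailing "-x"
        y_key = f"{name}-y"
        if y_key in labels:
            points[name] = (float(value), float(labels[y_key]))
    return points


# Illustrative values only, not taken from the real dataset.
sample_labels = {
    "left-eye-x": "175",
    "left-eye-y": "160",
    "mouth-x": "200",
    "mouth-y": "230",
}

points = labels_to_points(sample_labels)
print(points)
```

Inside the query loop, the same function could be called as `labels_to_points(record.labels)` before passing the points to a training pipeline.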