This repository gathers our work on dataset creation.
The main objectives are the following:
- Gather the datasets we have created / collected
- Gather several annotations for these datasets; annotating a smoke cloud is complex and can follow several strategies
- Visualize datasets and their annotations using fiftyone
- Create new datasets by combining available ones
- Benchmark our models on these datasets.
The annotations here are object detection annotations, so models can be evaluated on both classification and object detection.
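As an illustration of why detection annotations also support classification evaluation, here is a minimal sketch. It assumes each image maps to a list of annotated smoke boxes and that an image counts as "smoke" when it has at least one box. The box layout, the function names and the file names are hypothetical and are not taken from this repository.

```python
from typing import Dict, List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max). The repository's
# actual annotation format may differ.
Box = Tuple[float, float, float, float]


def to_image_labels(boxes_per_image: Dict[str, List[Box]]) -> Dict[str, bool]:
    """Collapse detection annotations into one smoke / no-smoke flag per image."""
    return {image: len(boxes) > 0 for image, boxes in boxes_per_image.items()}


def classification_accuracy(
    predicted: Dict[str, bool], boxes_per_image: Dict[str, List[Box]]
) -> float:
    """Share of images whose predicted flag matches the annotation-derived flag."""
    truth = to_image_labels(boxes_per_image)
    if not truth:
        raise ValueError("no annotated images to evaluate")
    # An image missing from the predictions is treated as predicted non-smoke.
    correct = sum(predicted.get(image, False) == label for image, label in truth.items())
    return correct / len(truth)


# Illustrative data only.
annotations = {
    "frame_001.jpg": [(0.10, 0.20, 0.40, 0.50)],  # one smoke box
    "frame_002.jpg": [],                          # no smoke annotated
}
predictions = {"frame_001.jpg": True, "frame_002.jpg": True}
accuracy = classification_accuracy(predictions, annotations)
```

The same annotations can still be used unchanged for box-level detection metrics; only the classification view discards the box coordinates.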
This repository uses DVC to store data. To use it fully, you need access to our DVC storage, which is currently reserved for Pyronear members. We hope to make it public soon. However, you can access all the public data listed below.
First, clone the repo and install the requirements:
git clone https://github.com/pyronear/pyro-dataset.git
cd pyro-dataset
pip install -r requirements.txt
Then pull the data using DVC:
dvc pull
FiftyOne is an open-source tool for building and visualizing datasets; please refer to the FiftyOne documentation for more information.
To load the datasets, run:
python fiftyone/create_datasets.py
Then go to http://localhost:5151 to use the FiftyOne app.
Once the datasets are created in FiftyOne, you can relaunch the app using:
python fiftyone/run.py
You can add a new dataset using:
python fiftyone/add_dataset.py
You can create a combination of available datasets using:
python datasets/make_dataset.py
This combination is defined by the configuration file dataset_config.yaml.
You can preview the combination with the --dry option:
python datasets/make_dataset.py --dry
Each dataset has 3 folders:
- images, with the images of the dataset
- Labels, with one or more subfolders holding the various annotations for this dataset
- Subset, text files containing a list of images of the dataset to extract a subset
Each dataset can have several annotations, which makes it possible to propose a new annotation when the task changes or a new annotation strategy is adopted.
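The folder layout described above can be inspected with a few lines of Python. This is a sketch under stated assumptions: the folder names `Labels` and `Subset` are taken from the description above (their exact casing on disk is not confirmed), the helper names are hypothetical, and subset files are assumed to list one image name per line.

```python
from pathlib import Path
from typing import List


def list_annotation_sets(dataset_dir: Path) -> List[str]:
    """Return the names of the annotation subfolders found under Labels."""
    labels_dir = dataset_dir / "Labels"  # assumed folder name
    if not labels_dir.is_dir():
        return []
    return sorted(p.name for p in labels_dir.iterdir() if p.is_dir())


def read_subset(subset_file: Path) -> List[str]:
    """Read a subset file, assuming one image name per line (blank lines skipped)."""
    lines = subset_file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]
```

For example, `list_annotation_sets(Path("some_dataset"))` would return one name per annotation strategy stored for that dataset, which is the value a configuration would then refer to when choosing a labeling.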
To create our combined dataset, we fill in the dataset_config file with three parameters for each dataset:
- Labels, the name of the labeling to use
- Ratio, the percentage of the dataset to use
- Subset, a potential subset to use
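To make the role of the Ratio and Subset parameters concrete, here is a small sketch of how such a selection could work. It is not the logic of `datasets/make_dataset.py`: the function name, the seeded random sampling and the interpretation of Ratio as a value between 0 and 100 are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def select_images(
    images: Sequence[str],
    ratio: float,
    subset: Optional[Sequence[str]] = None,
    seed: int = 0,
) -> List[str]:
    """Pick `ratio` percent of the images, optionally restricted to a subset."""
    if not 0 <= ratio <= 100:
        raise ValueError("ratio is assumed to be a percentage between 0 and 100")
    # Restrict the pool to the subset when one is given.
    if subset is None:
        pool = sorted(images)
    else:
        pool = sorted(set(images) & set(subset))
    count = round(len(pool) * ratio / 100)
    # A seeded generator keeps the selection reproducible between runs.
    return sorted(random.Random(seed).sample(pool, count))


# Illustrative file names only.
all_images = [f"img_{i:03d}.jpg" for i in range(10)]
half = select_images(all_images, ratio=50)
from_subset = select_images(all_images, ratio=100, subset=["img_001.jpg", "img_007.jpg"])
```

Sampling with a fixed seed is one design choice among several; taking the first images in sorted order would also be deterministic but could bias the selection toward the earliest frames of a sequence.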
So far we have identified 3 main data sources, two of which are publicly available (ALERTWildfire and HPWREN):
ALERTWildfire is a consortium of three universities -- The University of Nevada, Reno (UNR), University of California San Diego (UCSD), and the University of Oregon (UO) -- providing access to state-of-the-art Pan-Tilt-Zoom (PTZ) fire cameras and associated tools to help firefighters and first responders.
The High Performance Wireless Research and Education Network (HPWREN) is a network research program funded by the National Science Foundation. The program includes the creation, demonstration, and evaluation of a non-commercial, prototype, high-performance, wide-area, wireless network in its Southern California service area.
Our camera network is under development, which allows us to start building an image database. For the moment this database does not contain any fire images, but it does contain a large number of false-positive cases, which are quite challenging for a neural network.
Pyronear aims to one day become a public data source as important as the two presented above.
In addition to these 3 sources, we gather under the name UNKNOWN all other images coming from the internet whose source is not properly defined or that are available in too small a quantity. Among these images are those of the Center for Wildfire Research of the University of Split, Croatia.
From these data sources we have created or collected several datasets:
A dataset was created from the Nevada Seismological Laboratory YouTube channel by Rodrigue de Schaetzen, Raphael Chang Menoni, Yifu Chen, and Drijon Hasani of the University of British Columbia, Canada. Their research paper detailing this work is available here.
They semi-automatically labeled 1.3M frames (by video interpolation); you can download the whole dataset here. The code for their experiments is available here and can be used to extract a subset of 56K frames.
This repository includes an extract of that 56K-frame subset, limited to 2807 frames.
ALERTWildfire | Size | Smoke Images | Non Smoke Images |
---|---|---|---|
Nvseismolab_set1 | 2807 | 1375 | 1432 |
Five datasets have been created by AI For Mankind:
Two training datasets were created during two hackathons; we name these datasets AiForManKind_v1 (hackathon 1) and AiForManKind_v2 (hackathon 2).
To test the performance of their models on challenging false-positive examples, AI For Mankind also provides 3 small datasets, each containing one of the main error sources in automated forest fire detection. We call these datasets AiForManKind_sunrise, AiForManKind_fog and AiForManKind_clouds.
A dataset is also provided by the Fuego project.
HPWREN | Size | Smoke Images | Non Smoke Images |
---|---|---|---|
Fuego | 1739 | 1739 | 0 |
AiForManKind_v1 | 744 | 744 | 0 |
AiForManKind_v2 | 2191 | 2191 | 0 |
AiForManKind_cloud | 1080 | 0 | 1080 |
AiForManKind_sunrise | 180 | 0 | 180 |
AiForManKind_fog | 180 | 0 | 180 |
Pyronear is starting to deploy its network of cameras, which allows us to create new datasets. We propose here two datasets, ardeche_set0 and gironde_set0, named after the French regions where the cameras are located. These datasets do not contain any smoke images, but they do contain many potential false positives, which are quite challenging.
PYRONEAR | Size | Smoke Images | Non Smoke Images |
---|---|---|---|
ardeche_set0 | 20587 | 0 | 20587 |
gironde_set0 | 1205 | 0 | 1205 |
We propose here two datasets built from a mix of images collected on the internet: fog_clouds, to evaluate a model on challenging non-smoke images, and smoke, to test the ability of a model to detect a wildfire.
UNKNOWN | Size | Smoke Images | Non Smoke Images |
---|---|---|---|
fog_clouds | 453 | 0 | 453 |
smoke | 333 | 333 | 0 |
If you wish to cite this project, feel free to use this BibTeX reference:
@misc{pyrodataset2019,
title={Pyrodataset: wildfire early detection},
author={Pyronear contributors},
year={2019},
month={October},
publisher = {GitHub},
howpublished = {\url{https://github.com/pyronear/pyro-dataset}}
}
Please refer to CONTRIBUTING to help grow this project!
Distributed under the Apache 2 License. See LICENSE for more information.