The Bavard NLU Meta-Dataset

This intent classification dataset combines 6 task-oriented dialogue and intent classification datasets, all unified under a common, simple format. It is suitable for training few-shot intent classification models: it spans 78 dialogue domains and 1629 intents, and each intent has between 11 and 100 natural language example utterances. The dataset contains 84212 utterances in total. For more information on how this dataset was curated, please see our article announcing its release.

The following table lists the source datasets used to create this dataset:

Dataset Name	# Domains	# Intents	# Utterances	Original Paper
CLINC150	10	150	15000	An Evaluation Dataset for Intent Classification and Out-of-Scope Prediction.
DSTC8-SGD	46	1282	57582	Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset.
HINT3	3	59	1736	HINT3: Raising the Bar for Intent Detection in the Wild.
HWU64	14	51	4740	Benchmarking natural language understanding services for building conversational agents.
MultiWOZ 2.2	3	27	1246	MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines.
Taskmaster2	2	60	3908	Taskmaster-1: Toward a realistic and diverse dialog dataset.

Dataset Organization

The dataset lives in the ./data directory. Each source dataset has its own subdirectory, which contains a LICENSE.txt file and a nested data directory holding one JSON file per domain. For example:

data/
    dataset1/
        LICENSE.txt
        data/
            domainA.json
            domainB.json
            ...
    dataset2/
        LICENSE.txt
        data/
            domainC.json
            domainD.json
            ...
    ...
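
To inspect this layout before loading anything, a minimal sketch such as the following may help. It assumes only the directory structure shown above; the function name list_domain_files and its root parameter are hypothetical and not part of this repository.

```python
from collections import defaultdict
from pathlib import Path


def list_domain_files(root="data"):
    """Map each source dataset directory name to its sorted domain file names.

    Hypothetical helper: assumes the layout <root>/<dataset>/data/<domain>.json.
    """
    files_by_dataset = defaultdict(list)
    for path in sorted(Path(root).glob("*/data/*.json")):
        # path.parent is the nested data directory; its parent is the dataset directory.
        dataset_name = path.parent.parent.name
        files_by_dataset[dataset_name].append(path.name)
    return dict(files_by_dataset)
```

Calling list_domain_files() from the repository root should return one entry per source dataset, each with the names of its domain files.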

The Data Format

Each JSON file contains a list of intent classification examples, which look like this:

[
    {
        "intent": "rewards_balance",
        "utterance": "what is the amount of rewards points on my visa card",
        "origin": "CLINC150",
        "domain": "credit_cards"
    },
    {
        "intent": "CONSULT_START",
        "utterance": "Can you tell me what diet is perfect for me",
        "origin": "HINT3",
        "domain": "curekart"
    }
]

In the first example, the utterance "what is the amount of rewards points on my visa card" is associated with the intent rewards_balance. The origin field names the source dataset the example came from, and the domain field names the task domain the example belongs to.
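
As a rough illustration of this format, the sketch below checks that each record carries the four fields described above. It assumes every field is a non-empty string, which the examples suggest but this README does not state outright. The name find_invalid_examples is hypothetical, and the second sample record is deliberately malformed for demonstration.

```python
REQUIRED_FIELDS = ("intent", "utterance", "origin", "domain")


def find_invalid_examples(examples):
    """Return (index, reason) pairs for records that do not match the expected shape."""
    problems = []
    for index, example in enumerate(examples):
        if not isinstance(example, dict):
            problems.append((index, "not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            value = example.get(field)
            # Assumption: each field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append((index, f"missing or empty field: {field}"))
    return problems


sample = [
    {
        "intent": "rewards_balance",
        "utterance": "what is the amount of rewards points on my visa card",
        "origin": "CLINC150",
        "domain": "credit_cards",
    },
    # Deliberately broken record: empty utterance and no domain field.
    {"intent": "CONSULT_START", "utterance": "", "origin": "HINT3"},
]

print(find_invalid_examples(sample))
# [(1, 'missing or empty field: utterance'), (1, 'missing or empty field: domain')]
```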

Loading the Data

Here is a Python 3 example of loading all the data from all source datasets into a single list:

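import json
import os
from glob import glob

dataset = []
# Each domain file lives at data/<source dataset>/data/<domain>.json.
for domain_file_path in sorted(glob(os.path.join("data", "*", "data", "*.json"))):
    # Read as UTF-8 explicitly so the result does not depend on the platform default.
    with open(domain_file_path, encoding="utf-8") as f:
        dataset.extend(json.load(f))

print(len(dataset))
# 84212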
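
Once the examples are in a single list, per-source counts can be tallied for comparison with the table of source datasets. The sketch below is one way to do that. The function name summarize_by_origin is hypothetical, and it assumes an intent is identified by its (domain, intent) pair; if the published counts were computed differently, the intent totals may not match the table.

```python
from collections import defaultdict


def summarize_by_origin(dataset):
    """Count distinct domains, distinct intents, and utterances per source dataset."""
    domains = defaultdict(set)
    intents = defaultdict(set)
    utterances = defaultdict(int)
    for example in dataset:
        origin = example["origin"]
        domains[origin].add(example["domain"])
        # Assumption: the same intent name in two different domains counts twice.
        intents[origin].add((example["domain"], example["intent"]))
        utterances[origin] += 1
    return {
        origin: {
            "domains": len(domains[origin]),
            "intents": len(intents[origin]),
            "utterances": utterances[origin],
        }
        for origin in sorted(utterances)
    }
```

For example, summarize_by_origin(dataset) returns a dictionary keyed by origin, such as "CLINC150", whose values hold the three counts.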

Licensing and Attribution

The table below lists the licenses of the original datasets. To respect those licenses, we release each derivative dataset in this meta-dataset under the same license as its original. This should not inhibit use of the meta-dataset as a whole. Consult each individual license, but in general this meta-dataset can be used for any purpose, even commercially, so long as proper attribution is made, a link to the license is shared, and any derivative works are open-sourced under the same license.

Dataset Name	License
CLINC150	Attribution 3.0 Unported (CC BY 3.0)
DSTC8-SGD	Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
HINT3	Open Data Commons Open Database License (ODbL) v1.0
HWU64	Attribution 4.0 International (CC BY 4.0)
MultiWOZ 2.2	MIT
Taskmaster2	Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

To cite this dataset:

@misc{peterson_2021,
    title={Announcing The New Bavard NLU Intent Service},
    url={https://figshare.com/articles/preprint/Announcing_The_New_Bavard_NLU_Intent_Service/14403380/1},
    DOI={10.6084/m9.figshare.14403380.v1},
    publisher={figshare},
    author={Peterson, Evan},
    year={2021},
    month={Apr}
}
