
LC-QuAD

Large-Scale Complex Question Answering Dataset

📢 Announcement: LC-QuAD 2.0 is now released; check out our website http://lc-quad.sda.tech .

Download

🐣 Train, Test Data

Links

🌍 Webpage | 📄 Paper | 🏢 Lab

Introduction

We release and maintain a gold-standard KBQA (Question Answering over Knowledge Bases) dataset containing 5,000 questions and their corresponding SPARQL queries. LC-QuAD uses DBpedia v04-2016 as the target KB.

Usage

License: You can download the dataset (released under a GPL 3.0 license), or read on below to learn more.

Versioning: We use DBpedia version 04-2016 as our target KB. The public DBpedia endpoint (http://dbpedia.org/sparql) no longer uses this version, which may cause many SPARQL queries to return no answer. We strongly recommend hosting this version locally. To do so, see this guide.

Splits: We release the dataset split into training and test sets in an 80:20 ratio.

Format: The dataset is released as JSON dumps, where the key corrected_question contains the question and query contains the corresponding SPARQL query.

Each datapoint in the dataset has the following JSON structure:

{
    "_id": "Unique ID of this datapoint",
    "corrected_question": "Corrected, final question",
    "id": "Template ID",
    "query": "SPARQL query",
    "template": "Template used to create the SPARQL query",
    "intermediary_question": "Automatically generated, grammatically incorrect question"
}
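A dump in this format can be read with standard JSON tooling. The snippet below parses a hypothetical record mirroring the structure above and pulls out the question/query pairs; the values are placeholders for illustration, not actual dataset content.

```python
import json

# A hypothetical record with the same fields as an LC-QuAD datapoint
# (placeholder values, not real dataset content).
raw = '''
[
  {
    "_id": "0001",
    "corrected_question": "Which river flows through Berlin?",
    "id": 1,
    "query": "SELECT DISTINCT ?uri WHERE { ?uri <http://dbpedia.org/ontology/city> <http://dbpedia.org/resource/Berlin> }",
    "template": " ?uri <P> <E> ",
    "intermediary_question": "What is the river whose city is Berlin?"
  }
]
'''

data = json.loads(raw)

# Extract (question, SPARQL) pairs for training a KBQA model.
pairs = [(d["corrected_question"], d["query"]) for d in data]
print(pairs[0][0])
```

To load the released files themselves, replace `json.loads(raw)` with `json.load(open("train-data.json"))` (adjusting the filename to the actual dump you downloaded).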

Cite

@inproceedings{trivedi2017lc,
  title={Lc-quad: A corpus for complex question answering over knowledge graphs},
  author={Trivedi, Priyansh and Maheshwari, Gaurav and Dubey, Mohnish and Lehmann, Jens},
  booktitle={International Semantic Web Conference},
  pages={210--218},
  year={2017},
  organization={Springer}
}

Benchmarking/Leaderboard

We're in the process of automating the benchmarking process (and updating results on our webpage). In the meantime, please get in touch with us at [email protected] and we'll run it manually. Apologies for the inconvenience.

Methodology

Overview

  • Automatically create SPARQL queries.
  • Convert SPARQL queries to intermediary NLQs.
  • Manually correct the intermediary NLQs to create the final questions.

We start with a set of seed entities and a predicate whitelist. Using the whitelist, we generate 2-hop subgraphs around the seed entities. Treating a seed entity as the intended answer, we instantiate SPARQL templates over its subgraph to generate SPARQL queries.

Corresponding to each SPARQL template, and based on certain conditions, we assign hand-made natural-language question templates to the SPARQL queries. Refer to this diagram to understand the nomenclature used in the templates.

Finally, we follow a two-step (Correct, Review) process to produce a grammatically correct question for every template-generated one.
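The template-instantiation step above can be sketched as simple string filling. The templates, entity, and predicate below are illustrative placeholders, not the actual LC-QuAD templates or vocabulary.

```python
# Illustrative sketch of instantiating a SPARQL template and its
# paired NL question template (hypothetical templates, not LC-QuAD's).
sparql_template = "SELECT DISTINCT ?uri WHERE { <%(e)s> <%(p)s> ?uri }"
nl_template = "What is the %(p_label)s of %(e_label)s?"

# A seed entity and a whitelisted predicate from the 2-hop subgraph.
entity = {"uri": "http://dbpedia.org/resource/Berlin", "label": "Berlin"}
predicate = {"uri": "http://dbpedia.org/ontology/country", "label": "country"}

# Fill the SPARQL template with URIs and the NL template with labels.
sparql = sparql_template % {"e": entity["uri"], "p": predicate["uri"]}
intermediary = nl_template % {"p_label": predicate["label"],
                              "e_label": entity["label"]}

print(sparql)
print(intermediary)
```

The `intermediary` string corresponds to the automatically generated question that annotators later correct and review.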

Changelog

0.1.3 - 19-06-2018

  • Published train-test splits
  • Website Updated

0.1.2 - 28-01-2018

  • Updated public website
  • Dataset now available in QALD format
  • Leaderboard underway

0.1.1 - 27-10-2017

  • Fixed a bug with rdf:type filter in SPARQL
  • data_set.json updated
  • updated templates.py

0.1.0 - 01-05-2017