
Merge pull request #21 from LiberAI/master
Rebasing on LiberAI/NSpM.
mommi84 authored Jun 12, 2020
2 parents b61923f + 60dedaa commit de8de83
Showing 13 changed files with 20,078 additions and 234 deletions.
73 changes: 36 additions & 37 deletions README.md
@@ -1,9 +1,10 @@
# 🤖 Neural SPARQL Machines
An LSTM-based Machine Translation Approach for Question Answering.

![British flag.](http://www.liberai.org/img/flag-uk-160px.png "English")
![Seq2Seq neural network.](http://www.liberai.org/img/seq2seq-webexport-160px.png "seq2seq")
![Semantic triple flag.](http://www.liberai.org/img/flag-sparql-160px.png "SPARQL")
[![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-370/)

An LSTM-based Machine Translation Approach for Question Answering over Knowledge Graphs.

![What does a NSpM do?](http://www.liberai.org/img/NSpM-image.png "What does a NSpM do?")

## Code

@@ -15,76 +16,71 @@ git lfs checkout
git submodule update --init
```

### Python Setup
### Python setup

```bash
pip install -r requirements.txt
```
Note: the TensorFlow version must be >= 1.2.1.
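
For a quick sanity check of the installed version, something along these lines can be used (a minimal sketch; it only assumes that `tensorflow` is importable):

```python
# Minimal sketch: fail early if the installed TensorFlow is older than 1.2.1.
from distutils.version import LooseVersion

import tensorflow as tf

if LooseVersion(tf.__version__) < LooseVersion("1.2.1"):
    raise RuntimeError("TensorFlow >= 1.2.1 required, found %s" % tf.__version__)
```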

#### Make sure to use python2.7 for these steps to avoid errors
### The Generator module

### Data preparation
#### Pre-generated data

You can extract pre-generated data from `data/monument_300.zip` and `data/monument_600.zip` into folders with the respective names.
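
For instance, the archives can be extracted with a few lines of Python (paths as above; `unzip` on the command line works just as well):

```python
# Extract the pre-generated datasets into data/monument_300 and data/monument_600.
# Adjust the target directory if an archive already contains a top-level folder.
import zipfile

for name in ("monument_300", "monument_600"):
    with zipfile.ZipFile("data/%s.zip" % name) as archive:
        archive.extractall("data/%s" % name)
```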

#### Manual Generation (Alternative to using pre-generated data)

The template used in the paper can be found in a file such as `annotations_monument.tsv`. To generate the training data, launch the following command.
The template used in the paper can be found in a file such as `annotations_monument.tsv`. `data/monument_300` will be the ID of the working dataset used throughout the tutorial. To generate the training data, launch the following command.

<!-- Made monument_300 directory in data directory due to absence of monument_300 folder in data directory -->
```bash
mkdir data/monument_300
python generator.py --templates data/annotations_monument.csv --output data/monument_300
```
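
Conceptually, the Generator instantiates each question/query template pair with concrete entities to produce many aligned training examples. The sketch below illustrates that idea only; the column names, delimiter, and `<A>` placeholder are assumptions, and the actual logic lives in `generator.py`.

```python
# Illustrative sketch of template expansion: each template pairs an English question
# and a SPARQL query that share a placeholder, filled with entity labels and URIs.
import csv

def expand_templates(template_file, entities):
    """entities: iterable of (label, uri) pairs, e.g. ("Lincoln Memorial", "dbr:Lincoln_Memorial")."""
    examples = []
    with open(template_file) as f:
        # Assumed layout: a 'question' and a 'query' column, ';'-separated.
        for row in csv.DictReader(f, delimiter=";"):
            for label, uri in entities:
                question = row["question"].replace("<A>", label)
                query = row["query"].replace("<A>", uri)
                examples.append((question, query))
    return examples
```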

Build the vocabularies for the two languages (i.e., English and SPARQL) with:
Launch the command to build the vocabularies for the two languages (i.e., English and SPARQL) and split into train, dev, and test sets.

```bash
python build_vocab.py data/monument_300/data_300.en > data/monument_300/vocab.en
python build_vocab.py data/monument_300/data_300.sparql > data/monument_300/vocab.sparql
./generate.sh data/monument_300
```
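
`generate.sh` bundles vocabulary construction and the dataset split. Building a vocabulary essentially means listing the distinct tokens of a corpus file, most frequent first; a rough sketch of that step (not the actual `build_vocab.py`):

```python
# Rough sketch of vocabulary building: print one token per line, most frequent first.
import collections
import sys

def build_vocab(corpus_path):
    counter = collections.Counter()
    with open(corpus_path) as corpus:
        for line in corpus:
            counter.update(line.split())
    return [token for token, _ in counter.most_common()]

if __name__ == "__main__":
    for token in build_vocab(sys.argv[1]):
        print(token)
```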

Count lines in `data_.*`
<!-- Fixing the bash related error pertaining to assigning value to NUMLINES here -->
```bash
NUMLINES=$(wc -l < data/monument_300/data_300.sparql)
echo $NUMLINES
# 7097 (Don't worry if it varies)
```
### The Learner module

Split the `data_.*` files into `train_.*`, `dev_.*`, and `test_.*` (usually 80-10-10%).
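
The split keeps the English and SPARQL files aligned line by line; schematically (a sketch of the idea behind `split_in_train_dev_test.py`, not the script itself):

```python
# Schematic 80-10-10 split; apply the same index ranges to the .en and .sparql files.
def split_dataset(lines, train_frac=0.8, dev_frac=0.1):
    n_train = int(len(lines) * train_frac)
    n_dev = int(len(lines) * dev_frac)
    train = lines[:n_train]
    dev = lines[n_train:n_train + n_dev]
    test = lines[n_train + n_dev:]
    return train, dev, test
```
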
<!-- Just a simple note to go back to the initial directory.-->
Now go back to the initial directory and launch `train.sh` to train the model. The first parameter is the prefix of the data directory and the second parameter is the number of training epochs.

<!-- Making this instruction consistent with the previous instructions by changing data.sparql to data_300.sparql -->
```bash
cd data/monument_300/
python ../../split_in_train_dev_test.py --lines $NUMLINES --dataset data_300.sparql
./train.sh data/monument_300 12000
```

### Training
This command will create a model directory called `data/monument_300_model`.

### The Interpreter module

<!-- Just a simple note to go back to the initial directory.-->
Now go back to the initial directory and launch `train.sh` to train the model. The first parameter is the prefix of the data directory and the second parameter is the number of training epochs.
Predict the SPARQL query for a given question with a given model.

```bash
sh train.sh data/monument_300 12000
./ask.sh data/monument_300 "where is edward vii monument located in?"
```
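
The model does not emit raw SPARQL but a tokenised encoding of it, which is mapped back to a query afterwards. The mapping below is purely illustrative (the real token set and logic are in `generator_utils.decode`):

```python
# Purely illustrative decoding step: rewrite assumed encoding tokens back into SPARQL syntax.
ILLUSTRATIVE_MAPPING = [
    (" brack_open ", " { "),
    (" brack_close ", " } "),
    (" sep_dot ", " . "),
    (" dbr_", " dbr:"),
    (" dbo_", " dbo:"),
]

def decode_sketch(encoded):
    decoded = " %s " % encoded
    for token, sparql in ILLUSTRATIVE_MAPPING:
        decoded = decoded.replace(token, sparql)
    return " ".join(decoded.split())
```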

This command will create a model directory called `data/monument_300_model`.

### Inference
### Unit tests

Predict the SPARQL sentence for a given question with a given model.
Tests can be run, but only from the root directory.

```bash
sh ask.sh data/monument_300 "where is edward vii monument located in?"
py.test *.py
```

## Use cases & integrations

* The [Telegram NSpM chatbot](https://github.com/AKSW/NSpM/wiki/NSpM-Telegram-Bot) offers an integration of NSpM with the Telegram messaging platform.
* A [neural question answering model for DBpedia](https://github.com/dbpedia/neural-qa) is a project supported by the [Google Summer of Code](https://summerofcode.withgoogle.com/) program that relies on NSpM.
* A [question answering system](https://github.com/qasim9872/question-answering-system) was implemented on top of NSpM by [Muhammad Qasim](https://github.com/qasim9872).

## Papers

### Soru and Marx et al., 2017

* Permanent URI: http://w3id.org/neural-sparql-machines/soru-marx-semantics2017.html
* arXiv: https://arxiv.org/abs/1708.07624

```
@@ -93,13 +89,13 @@ sh ask.sh data/monument_300 "where is edward vii monument located in?"
title = "{SPARQL} as a Foreign Language",
year = "2017",
journal = "13th International Conference on Semantic Systems (SEMANTiCS 2017) - Posters and Demos",
url = "http://w3id.org/neural-sparql-machines/soru-marx-semantics2017.html",
url = "https://arxiv.org/abs/1708.07624",
}
```

### Soru et al., 2018

* NAMPI Website: https://uclmr.github.io/nampi/
* NAMPI Website: https://uclnlp.github.io/nampi/
* arXiv: https://arxiv.org/abs/1806.10478

```
@@ -116,4 +112,7 @@ sh ask.sh data/monument_300 "where is edward vii monument located in?"

* Primary contacts: [Tommaso Soru](http://tommaso-soru.it) and [Edgard Marx](http://emarx.org).
* Neural SPARQL Machines [mailing list](https://groups.google.com/forum/#!forum/neural-sparql-machines).
* Follow the [project on ResearchGate](https://www.researchgate.net/project/Neural-SPARQL-Machines).
* Follow [Liber AI Research](http://liberai.org) on [Twitter](https://twitter.com/theLiberAI).

![Liber AI logo.](http://www.liberai.org/img/Liber-AI-logo-name-200px.png "Liber AI")
61 changes: 30 additions & 31 deletions analyse.py
@@ -4,10 +4,9 @@
Neural SPARQL Machines - Analysis and validation of translated questions into queries.
'SPARQL as a Foreign Language' by Tommaso Soru and Edgard Marx et al., SEMANTiCS 2017
https://w3id.org/neural-sparql-machines/soru-marx-semantics2017.html
https://arxiv.org/abs/1708.07624
Version 0.1.0-akaha
Version 1.0.0
"""
import argparse
@@ -16,19 +15,21 @@
import os
import re
import sys
import urllib
import urllib.request, urllib.parse, urllib.error
from pyparsing import ParseException
from rdflib.plugins.sparql import parser

from generator_utils import decode, extract_entities, extract_predicates
from functools import reduce
import importlib


def analyse( translation ):
result = {}
for test in TESTS:
result[test] = TESTS[test](translation)

everything_okay = all(map(lambda test: result[test], TESTS))
everything_okay = all([result[test] for test in TESTS])
details['everything_okay'].update([everything_okay])

return result
@@ -41,18 +42,18 @@ def validate( translation ):
match = re.search(entity_with_attribute, query)
if match:
entity = match.group(0)
entity_encoded = re.sub(r'\(<?', '\(', entity)
entity_encoded = re.sub(r'>?\)', '\)', entity_encoded)
entity_encoded = re.sub(r'\(<?', r'\(', entity)
entity_encoded = re.sub(r'>?\)', r'\)', entity_encoded)
query = query.replace(entity, entity_encoded)
try:
parser.parseQuery(query)
except ParseException as exception:
print '{} in "{}", loc: {}'.format(exception.msg, exception.line, exception.loc)
print('{} in "{}", loc: {}'.format(exception.msg, exception.line, exception.loc))
details['parse_exception'].update([exception.msg])
return False
except Exception as exception:
msg = str(exception)
print '{}'.format(msg)
print('{}'.format(msg))
details['other_exception'].update([msg])
return False
else:
@@ -88,16 +89,16 @@ def check_entities ( translation ):
entities = extract_entities(target)
if not entities:
return False
entities_detected = map(lambda entity : entity in generated, entities)
entities_with_occurence_count = map(lambda entity: '{} [{}]'.format(entity, get_occurence_count(entity)), entities)
entities_detected = [entity in generated for entity in entities]
entities_with_occurence_count = ['{} [{}]'.format(entity, get_occurence_count(entity)) for entity in entities]
if all(entities_detected):
details['detected_entity'].update(entities_with_occurence_count)
return True

if any(entities_detected):
details['partly_detected_entities'].update([True])

details['undetected_entity'].update(map(lambda (entity, detected) : entity, filter(lambda (entity, detected) : not detected, zip(entities_with_occurence_count, entities_detected))))
details['undetected_entity'].update([entity_detected1[0] for entity_detected1 in [entity_detected for entity_detected in zip(entities_with_occurence_count, entities_detected) if not entity_detected[1]]])
return False


@@ -108,20 +109,18 @@ def check_predicates ( translation, ignore_prefix=True, ignore_case=True ):
if not predicates:
return False
if ignore_prefix:
predicates = map(strip_prefix, predicates)
predicates = list(map(strip_prefix, predicates))
if ignore_case:
predicates = map(str.lower, predicates)
predicates = list(map(str.lower, predicates))
generated = str.lower(generated)
predicates_detected = map(lambda predicate: predicate in generated, predicates)
predicates_detected = [predicate in generated for predicate in predicates]
if all(predicates_detected):
return True

if any(predicates_detected):
details['partly_detected_predicates'].update([True])

details['undetected_predicates'].update(map(lambda (predicate, detected): predicate,
filter(lambda (predicate, detected): not detected,
zip(predicates, predicates_detected))))
details['undetected_predicates'].update([predicate_detected2[0] for predicate_detected2 in [predicate_detected for predicate_detected in zip(predicates, predicates_detected) if not predicate_detected[1]]])
return False


@@ -133,15 +132,15 @@ def summarise( summary, current_evaluation ):


def log_summary( summary, details, org_file, ask_output_file ):
print '\n\nSummary\n'
print 'Analysis based on {} and {}'.format(org_file, ask_output_file)
print('\n\nSummary\n')
print('Analysis based on {} and {}'.format(org_file, ask_output_file))
for test in TESTS:
print '{:30}: {:6d} True / {:6d} False'.format(test, summary[test][True], summary[test][False])
print '{:30}: {:6d} True / {:6d} False'.format('everything_okay', details['everything_okay'][True], details['everything_okay'][False])
print '\n\nDetails\n'
print('{:30}: {:6d} True / {:6d} False'.format(test, summary[test][True], summary[test][False]))
print('{:30}: {:6d} True / {:6d} False'.format('everything_okay', details['everything_okay'][True], details['everything_okay'][False]))
print('\n\nDetails\n')
for detail in details:
for key in details[detail]:
print '{:30}: {:6d} {}'.format(detail, details[detail][key], key)
print('{:30}: {:6d} {}'.format(detail, details[detail][key], key))


def read( file_name ):
@@ -151,13 +150,13 @@


def get_occurence_count ( entity ):
key = unicode(entity)
key = str(entity)
occurence_count = used_entities_counter[key] if key in used_entities_counter else 0
if not occurence_count:
key += '.'
occurence_count = used_entities_counter[key] if key in used_entities_counter else 0
if not occurence_count:
print 'not found: {}'.format(entity)
print('not found: {}'.format(entity))
return occurence_count


@@ -171,7 +170,7 @@ def get_occurence_count ( entity ):
targets_file = args.target
ask_output_file = args.generated

reload(sys)
importlib.reload(sys)
sys.setdefaultencoding("utf-8")

TESTS = {
@@ -198,13 +197,13 @@ def get_occurence_count ( entity ):
encoded_generated = read(ask_output_file)

if len(encoded_targets) != len(encoded_generated):
print 'Some translations are missing'
print('Some translations are missing')
sys.exit(1)

targets = map(decode, encoded_targets)
generated = map(decode, encoded_generated)
translations = zip(targets, generated)
evaluation = map(analyse, translations)
targets = list(map(decode, encoded_targets))
generated = list(map(decode, encoded_generated))
translations = list(zip(targets, generated))
evaluation = list(map(analyse, translations))
summary_obj = {}
for test in TESTS:
summary_obj[test] = collections.Counter()
