
Merge pull request #21 from LiberAI/master
Rebasing on LiberAI/NSpM.
mommi84 authored Jun 12, 2020
2 parents b61923f + 60dedaa commit de8de83
Showing 13 changed files with 20,078 additions and 234 deletions.
73 changes: 36 additions & 37 deletions README.md
@@ -1,9 +1,10 @@
# 🤖 Neural SPARQL Machines
An LSTM-based Machine Translation Approach for Question Answering.

![British flag.](http://www.liberai.org/img/flag-uk-160px.png "English")
![Seq2Seq neural network.](http://www.liberai.org/img/seq2seq-webexport-160px.png "seq2seq")
![Semantic triple flag.](http://www.liberai.org/img/flag-sparql-160px.png "SPARQL")
[![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-370/)

An LSTM-based Machine Translation Approach for Question Answering over Knowledge Graphs.

![What does a NSpM do?](http://www.liberai.org/img/NSpM-image.png "What does a NSpM do?")

## Code

@@ -15,76 +16,71 @@ git lfs checkout
git submodule update --init
```

### Python Setup
### Python setup

```bash
pip install -r requirements.txt
```
Note: the TensorFlow version must be >= 1.2.1.
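
For a quick sanity check of the installed version, something along these lines can be used (a minimal sketch; it only assumes that `tensorflow` is importable):

```python
# Minimal sketch: fail early if the installed TensorFlow is older than 1.2.1.
from distutils.version import LooseVersion

import tensorflow as tf

if LooseVersion(tf.__version__) < LooseVersion("1.2.1"):
    raise RuntimeError("TensorFlow >= 1.2.1 required, found %s" % tf.__version__)
```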

#### Make sure to use python2.7 for these steps to avoid errors
### The Generator module

### Data preparation
#### Pre-generated data

You can extract pre-generated data from `data/monument_300.zip` and `data/monument_600.zip` into folders with the respective names.
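
For instance, the archives can be extracted with a few lines of Python (paths as above; `unzip` on the command line works just as well):

```python
# Extract the pre-generated datasets into data/monument_300 and data/monument_600.
# Adjust the target directory if an archive already contains a top-level folder.
import zipfile

for name in ("monument_300", "monument_600"):
    with zipfile.ZipFile("data/%s.zip" % name) as archive:
        archive.extractall("data/%s" % name)
```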

#### Manual Generation (Alternative to using pre-generated data)

The template used in the paper can be found in a file such as `annotations_monument.tsv`. To generate the training data, launch the following command.
The template used in the paper can be found in a file such as `annotations_monument.tsv`. `data/monument_300` will be the ID of the working dataset used throughout the tutorial. To generate the training data, launch the following command.

<!-- Made monument_300 directory in data directory due to absence of monument_300 folder in data directory -->
```bash
mkdir data/monument_300
python generator.py --templates data/annotations_monument.csv --output data/monument_300
```
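
Conceptually, the Generator instantiates each question/query template pair with concrete entities to produce many aligned training examples. The sketch below illustrates that idea only; the column names, delimiter, and `<A>` placeholder are assumptions, and the actual logic lives in `generator.py`.

```python
# Illustrative sketch of template expansion: each template pairs an English question
# and a SPARQL query that share a placeholder, filled with entity labels and URIs.
import csv

def expand_templates(template_file, entities):
    """entities: iterable of (label, uri) pairs, e.g. ("Lincoln Memorial", "dbr:Lincoln_Memorial")."""
    examples = []
    with open(template_file) as f:
        # Assumed layout: a 'question' and a 'query' column, ';'-separated.
        for row in csv.DictReader(f, delimiter=";"):
            for label, uri in entities:
                question = row["question"].replace("<A>", label)
                query = row["query"].replace("<A>", uri)
                examples.append((question, query))
    return examples
```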

Build the vocabularies for the two languages (i.e., English and SPARQL) with:
Launch the command to build the vocabularies for the two languages (i.e., English and SPARQL) and split into train, dev, and test sets.

```bash
python build_vocab.py data/monument_300/data_300.en > data/monument_300/vocab.en
python build_vocab.py data/monument_300/data_300.sparql > data/monument_300/vocab.sparql
./generate.sh data/monument_300
```
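
`generate.sh` bundles vocabulary construction and the dataset split. Building a vocabulary essentially means listing the distinct tokens of a corpus file, most frequent first; a rough sketch of that step (not the actual `build_vocab.py`):

```python
# Rough sketch of vocabulary building: print one token per line, most frequent first.
import collections
import sys

def build_vocab(corpus_path):
    counter = collections.Counter()
    with open(corpus_path) as corpus:
        for line in corpus:
            counter.update(line.split())
    return [token for token, _ in counter.most_common()]

if __name__ == "__main__":
    for token in build_vocab(sys.argv[1]):
        print(token)
```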

Count lines in `data_.*`
<!-- Fixing the bash related error pertaining to assigning value to NUMLINES here -->
```bash
NUMLINES=$(wc -l < data/monument_300/data_300.sparql)
echo $NUMLINES
# 7097 (Don't worry if it varies)
```
### The Learner module

Split the `data_.*` files into `train_.*`, `dev_.*`, and `test_.*` (usually 80-10-10%).
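
The split keeps the English and SPARQL files aligned line by line; schematically (a sketch of the idea behind `split_in_train_dev_test.py`, not the script itself):

```python
# Schematic 80-10-10 split; apply the same index ranges to the .en and .sparql files.
def split_dataset(lines, train_frac=0.8, dev_frac=0.1):
    n_train = int(len(lines) * train_frac)
    n_dev = int(len(lines) * dev_frac)
    train = lines[:n_train]
    dev = lines[n_train:n_train + n_dev]
    test = lines[n_train + n_dev:]
    return train, dev, test
```
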
<!-- Just a simple note to go back to the initial directory.-->
Now go back to the initial directory and launch `train.sh` to train the model. The first parameter is the prefix of the data directory and the second parameter is the number of training epochs.

<!-- Making this instruction consistent with the previous instructions by changing data.sparql to data_300.sparql -->
```bash
cd data/monument_300/
python ../../split_in_train_dev_test.py --lines $NUMLINES --dataset data_300.sparql
./train.sh data/monument_300 12000
```

### Training
This command will create a model directory called `data/monument_300_model`.

### The Interpreter module

<!-- Just a simple note to go back to the initial directory.-->
Now go back to the initial directory and launch `train.sh` to train the model. The first parameter is the prefix of the data directory and the second parameter is the number of training epochs.
Predict the SPARQL query for a given question with a given model.

```bash
sh train.sh data/monument_300 12000
./ask.sh data/monument_300 "where is edward vii monument located in?"
```
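
The model does not emit raw SPARQL but a tokenised encoding of it, which is mapped back to a query afterwards. The mapping below is purely illustrative (the real token set and logic are in `generator_utils.decode`):

```python
# Purely illustrative decoding step: rewrite assumed encoding tokens back into SPARQL syntax.
ILLUSTRATIVE_MAPPING = [
    (" brack_open ", " { "),
    (" brack_close ", " } "),
    (" sep_dot ", " . "),
    (" dbr_", " dbr:"),
    (" dbo_", " dbo:"),
]

def decode_sketch(encoded):
    decoded = " %s " % encoded
    for token, sparql in ILLUSTRATIVE_MAPPING:
        decoded = decoded.replace(token, sparql)
    return " ".join(decoded.split())
```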

This command will create a model directory called `data/monument_300_model`.

### Inference
### Unit tests

Predict the SPARQL sentence for a given question with a given model.
Tests can be run, but only from the root directory.

```bash
sh ask.sh data/monument_300 "where is edward vii monument located in?"
py.test *.py
```

## Use cases & integrations

* The [Telegram NSpM chatbot](https://github.com/AKSW/NSpM/wiki/NSpM-Telegram-Bot) offers an integration of NSpM with the Telegram messaging platform.
* A [neural question answering model for DBpedia](https://github.com/dbpedia/neural-qa) is a project supported by the [Google Summer of Code](https://summerofcode.withgoogle.com/) program that relies on NSpM.
* A [question answering system](https://github.com/qasim9872/question-answering-system) was implemented on top of NSpM by [Muhammad Qasim](https://github.com/qasim9872).

## Papers

### Soru and Marx et al., 2017

* Permanent URI: http://w3id.org/neural-sparql-machines/soru-marx-semantics2017.html
* arXiv: https://arxiv.org/abs/1708.07624

```
@@ -93,13 +89,13 @@ sh ask.sh data/monument_300 "where is edward vii monument located in?"
title = "{SPARQL} as a Foreign Language",
year = "2017",
journal = "13th International Conference on Semantic Systems (SEMANTiCS 2017) - Posters and Demos",
url = "http://w3id.org/neural-sparql-machines/soru-marx-semantics2017.html",
url = "https://arxiv.org/abs/1708.07624",
}
```

### Soru et al., 2018

* NAMPI Website: https://uclmr.github.io/nampi/
* NAMPI Website: https://uclnlp.github.io/nampi/
* arXiv: https://arxiv.org/abs/1806.10478

```
@@ -116,4 +112,7 @@ sh ask.sh data/monument_300 "where is edward vii monument located in?"

* Primary contacts: [Tommaso Soru](http://tommaso-soru.it) and [Edgard Marx](http://emarx.org).
* Neural SPARQL Machines [mailing list](https://groups.google.com/forum/#!forum/neural-sparql-machines).
* Follow the [project on ResearchGate](https://www.researchgate.net/project/Neural-SPARQL-Machines).
* Follow [Liber AI Research](http://liberai.org) on [Twitter](https://twitter.com/theLiberAI).

![Liber AI logo.](http://www.liberai.org/img/Liber-AI-logo-name-200px.png "Liber AI")
61 changes: 30 additions & 31 deletions analyse.py
@@ -4,10 +4,9 @@
Neural SPARQL Machines - Analysis and validation of translated questions into queries.
'SPARQL as a Foreign Language' by Tommaso Soru and Edgard Marx et al., SEMANTiCS 2017
https://w3id.org/neural-sparql-machines/soru-marx-semantics2017.html
https://arxiv.org/abs/1708.07624
Version 0.1.0-akaha
Version 1.0.0
"""
import argparse
@@ -16,19 +15,21 @@
import os
import re
import sys
import urllib
import urllib.request, urllib.parse, urllib.error
from pyparsing import ParseException
from rdflib.plugins.sparql import parser

from generator_utils import decode, extract_entities, extract_predicates
from functools import reduce
import importlib


def analyse( translation ):
result = {}
for test in TESTS:
result[test] = TESTS[test](translation)

everything_okay = all(map(lambda test: result[test], TESTS))
everything_okay = all([result[test] for test in TESTS])
details['everything_okay'].update([everything_okay])

return result
@@ -41,18 +42,18 @@ def validate( translation ):
match = re.search(entity_with_attribute, query)
if match:
entity = match.group(0)
entity_encoded = re.sub(r'\(<?', '\(', entity)
entity_encoded = re.sub(r'>?\)', '\)', entity_encoded)
entity_encoded = re.sub(r'\(<?', r'\(', entity)
entity_encoded = re.sub(r'>?\)', r'\)', entity_encoded)
query = query.replace(entity, entity_encoded)
try:
parser.parseQuery(query)
except ParseException as exception:
print '{} in "{}", loc: {}'.format(exception.msg, exception.line, exception.loc)
print('{} in "{}", loc: {}'.format(exception.msg, exception.line, exception.loc))
details['parse_exception'].update([exception.msg])
return False
except Exception as exception:
msg = str(exception)
print '{}'.format(msg)
print('{}'.format(msg))
details['other_exception'].update([msg])
return False
else:
@@ -88,16 +89,16 @@ def check_entities ( translation ):
entities = extract_entities(target)
if not entities:
return False
entities_detected = map(lambda entity : entity in generated, entities)
entities_with_occurence_count = map(lambda entity: '{} [{}]'.format(entity, get_occurence_count(entity)), entities)
entities_detected = [entity in generated for entity in entities]
entities_with_occurence_count = ['{} [{}]'.format(entity, get_occurence_count(entity)) for entity in entities]
if all(entities_detected):
details['detected_entity'].update(entities_with_occurence_count)
return True

if any(entities_detected):
details['partly_detected_entities'].update([True])

details['undetected_entity'].update(map(lambda (entity, detected) : entity, filter(lambda (entity, detected) : not detected, zip(entities_with_occurence_count, entities_detected))))
details['undetected_entity'].update([entity_detected1[0] for entity_detected1 in [entity_detected for entity_detected in zip(entities_with_occurence_count, entities_detected) if not entity_detected[1]]])
return False


@@ -108,20 +109,18 @@ def check_predicates ( translation, ignore_prefix=True, ignore_case=True ):
if not predicates:
return False
if ignore_prefix:
predicates = map(strip_prefix, predicates)
predicates = list(map(strip_prefix, predicates))
if ignore_case:
predicates = map(str.lower, predicates)
predicates = list(map(str.lower, predicates))
generated = str.lower(generated)
predicates_detected = map(lambda predicate: predicate in generated, predicates)
predicates_detected = [predicate in generated for predicate in predicates]
if all(predicates_detected):
return True

if any(predicates_detected):
details['partly_detected_predicates'].update([True])

details['undetected_predicates'].update(map(lambda (predicate, detected): predicate,
filter(lambda (predicate, detected): not detected,
zip(predicates, predicates_detected))))
details['undetected_predicates'].update([predicate_detected2[0] for predicate_detected2 in [predicate_detected for predicate_detected in zip(predicates, predicates_detected) if not predicate_detected[1]]])
return False


@@ -133,15 +132,15 @@ def summarise( summary, current_evaluation ):


def log_summary( summary, details, org_file, ask_output_file ):
print '\n\nSummary\n'
print 'Analysis based on {} and {}'.format(org_file, ask_output_file)
print('\n\nSummary\n')
print('Analysis based on {} and {}'.format(org_file, ask_output_file))
for test in TESTS:
print '{:30}: {:6d} True / {:6d} False'.format(test, summary[test][True], summary[test][False])
print '{:30}: {:6d} True / {:6d} False'.format('everything_okay', details['everything_okay'][True], details['everything_okay'][False])
print '\n\nDetails\n'
print('{:30}: {:6d} True / {:6d} False'.format(test, summary[test][True], summary[test][False]))
print('{:30}: {:6d} True / {:6d} False'.format('everything_okay', details['everything_okay'][True], details['everything_okay'][False]))
print('\n\nDetails\n')
for detail in details:
for key in details[detail]:
print '{:30}: {:6d} {}'.format(detail, details[detail][key], key)
print('{:30}: {:6d} {}'.format(detail, details[detail][key], key))


def read( file_name ):
@@ -151,13 +150,13 @@


def get_occurence_count ( entity ):
key = unicode(entity)
key = str(entity)
occurence_count = used_entities_counter[key] if key in used_entities_counter else 0
if not occurence_count:
key += '.'
occurence_count = used_entities_counter[key] if key in used_entities_counter else 0
if not occurence_count:
print 'not found: {}'.format(entity)
print('not found: {}'.format(entity))
return occurence_count


@@ -171,7 +170,7 @@ def get_occurence_count ( entity ):
targets_file = args.target
ask_output_file = args.generated

reload(sys)
importlib.reload(sys)
sys.setdefaultencoding("utf-8")

TESTS = {
@@ -198,13 +197,13 @@ def get_occurence_count ( entity ):
encoded_generated = read(ask_output_file)

if len(encoded_targets) != len(encoded_generated):
print 'Some translations are missing'
print('Some translations are missing')
sys.exit(1)

targets = map(decode, encoded_targets)
generated = map(decode, encoded_generated)
translations = zip(targets, generated)
evaluation = map(analyse, translations)
targets = list(map(decode, encoded_targets))
generated = list(map(decode, encoded_generated))
translations = list(zip(targets, generated))
evaluation = list(map(analyse, translations))
summary_obj = {}
for test in TESTS:
summary_obj[test] = collections.Counter()
