GSoC 2017: Chatbot for DBpedia

First of all, I would like to thank everyone in the DBpedia team for selecting me for this Chatbot project. I would also like to thank my mentor Ricardo for his valuable feedback during the proposal phase and for selecting me to work on this project.

Project Description

The project presents a conversational Chatbot for DBpedia which can be accessed through the following platforms:

A Web Interface
Slack
Facebook Messenger

There are three main challenges in this task. The first is understanding the query presented by the user. The second is fetching relevant information for the query from DBpedia or other sources. The third is tailoring the responses to the standards of each platform and developing subsequent user interactions with the Chatbot.

How to Run

Final Application Link

Core Capabilities

The bot is capable of responding to users in the form of simple short text messages or through more elaborate interactive messages. Users can communicate or respond to the bot through text and also through interactions (such as clicking on buttons/links).

There are 4 main purposes for the bot. They are:

Answering factual questions
Answering questions related to DBpedia
Exposing the research work being done in DBpedia as product features. For example:
- AKSW Genesis: We use APIs from the Genesis project to show similar and related information for a particular entity.
- WDAqua QANARY: We use WDAqua's QANARY Service to answer some of the factual questions that are posed to the bot.
Casual conversation/banter

Text Based Questions

The bot tries to answer text based questions of the following types:

Natural Language Questions

Give me the capital of Germany
Who is Obama?

Location Information

Where is the Eiffel Tower?
Where is France's capital?

Service Checks

Users can ask the bot to check if vital DBpedia services are operational.

Is DBpedia down?
Is lookup online?

Language Chapters

Users can ask basic information about specific DBpedia local chapters.

DBpedia Arabic
German DBpedia

Templates

These are predominantly questions related to DBpedia for which the bot provides predefined templatized answers. Some examples include:

What is DBpedia?
How can I contribute?
Where can I find the mapping tool?

Banter

Messages which are casual in nature fall under this category. For example:

Hi
What is your name?

Architecture

Chatbot Architecture

The process by which the Chatbot server handles requests can be divided into the following steps:

Incoming Request: Webhooks that handle incoming requests from each platform
Request Routing: Incoming requests are routed based on the type of request, which can be a pure text request or a parameterized request. Pure text requests are handled by the Text Handler and parameterized requests are handled by the Template Handler.
- Pure Text Requests: A pure text request is basically a text message from the user. We use RiveScript to identify the intent of the message and classify it into the following types:
  1. Natural Language Question
  2. Location Requests
  3. Service Checks
  4. Language Chapters
  5. Banter
  6. Prepared/Template Responses
- Parameterized Requests: These are generated when the user clicks on links in information already presented, for example clicking on a Learn More button when presented with information about Germany.
Generate Response: The response from either handler is converted to a format that is suitable for each platform.
Send Response: Finally the response is sent back to each platform. Additionally, for the web interface, the client-side code that handles the responses is written using standard frontend technologies such as HTML, CSS and JavaScript.
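The routing step described above can be sketched roughly as follows. This is an illustration in Python, not the project's Java code: the `Request` shape, the handler names, and the keyword matching that stands in for RiveScript intent detection are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical request shape: a platform delivers either free text or a
# parameterized payload (for example, from a "Learn More" button click).
@dataclass
class Request:
    platform: str                   # "web", "slack" or "messenger"
    text: Optional[str] = None      # set for pure text requests
    payload: Optional[dict] = None  # set for parameterized requests


def handle_text(text: str) -> dict:
    # Stand-in for the Text Handler. The real bot uses RiveScript to
    # classify intent; simple keyword checks are used here instead.
    lowered = text.lower()
    if lowered.startswith("where is"):
        intent = "location"
    elif "down" in lowered or "online" in lowered:
        intent = "service_check"
    else:
        intent = "natural_language_question"
    return {"handler": "text", "intent": intent}


def handle_template(payload: dict) -> dict:
    # Stand-in for the Template Handler.
    return {"handler": "template", "action": payload.get("action")}


def route(request: Request) -> dict:
    """Send parameterized requests to the template handler, all others to the text handler."""
    if request.payload is not None:
        result = handle_template(request.payload)
    else:
        result = handle_text(request.text or "")
    # Response generation is platform specific; this sketch only tags the platform.
    result["platform"] = request.platform
    return result
```

For instance, `route(Request("web", text="Where is the Eiffel Tower?"))` would be classified as a location request under these assumed rules.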

High Level Design

This section details the workflow for both text based requests and parameterized requests through flowcharts.

Pure Text Request Workflow

Text Message Workflow

Natural Language Question Workflow

Location Question Workflow

Template Workflow

Parameterized Request Workflow

Release Management (CI & CD)

Version Control

For version control we use Git and GitHub, with Git-Flow for overall branch management. We primarily use the develop branch for development and staging and the master branch for production deployments. More details on how Git-Flow works can be found here.

Continuous Integration & Continuous Deployment (CI & CD)

We use GitLab for Continuous Integration and Continuous Deployment. Once a commit is made either to the develop or master branch, a GitLab pipeline is executed which consists of three stages namely:

Test
Package
Deploy

In the test stage all tests associated with the project are executed as a single atomic operation. Only when all tests pass does the pipeline move to the next stage.

In the package stage we create a Docker image after downloading both Maven and Node dependencies using the mvn clean install and npm install commands. Finally the Docker image is uploaded to GitLab.

In the deploy stage we log in to the nc9 server and use the created Docker image to deploy the application with the appropriate environment configuration.

Data Engineering

Infobox Properties

The bot can show important attributes about an entity similar to the Infobox properties shown in Wikipedia. To develop this feature we took a list of all DBpedia classes (namespace http://dbpedia.org/ontology/) that could be potential rdf:types for a given entity.

For a given class we found the total number of occurrences of that class in the entire Knowledge Graph. Then we extracted all properties whose rdfs:domain is that class and calculated the number of distinct occurrences of each individual property in the Knowledge Graph. We used these two counts to develop a Relevance Score (between 0 and 1) for each property of the given class, which is basically:

Relevance Score Equation

where Np is the number of distinct occurrences of the Property and Nc is the number of distinct occurrences of the Class in the Knowledge Base.

For a given entity we take all the rdf:types in the http://dbpedia.org/ontology/ namespace and all available properties of the entity. We then find the top properties for each class and verify if they exist for the given entity. If they do then we shortlist those properties and display the top N properties to the user which are ranked by their Relevance Score.
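A minimal sketch of this ranking step is shown below. The equation itself is not reproduced in the text, so the code assumes the score is the ratio Np / Nc, which is consistent with the stated 0 to 1 range. The function names, the input dictionaries, and the example counts are hypothetical.

```python
def relevance_score(n_property: int, n_class: int) -> float:
    # ASSUMPTION: score = Np / Nc, clamped to [0, 1]. The original equation
    # is not available in the text, so this ratio is a guess.
    if n_class <= 0:
        return 0.0
    return min(n_property / n_class, 1.0)


def top_properties(entity_properties, class_counts, property_counts, n=5):
    """Rank an entity's properties by their best relevance score across its classes.

    entity_properties: set of property names present on the entity
    class_counts:      {class_name: Nc}
    property_counts:   {class_name: {property_name: Np}}
    """
    best = {}
    for cls, n_class in class_counts.items():
        for prop, n_prop in property_counts.get(cls, {}).items():
            if prop not in entity_properties:
                continue  # keep only properties the entity actually has
            score = relevance_score(n_prop, n_class)
            best[prop] = max(best.get(prop, 0.0), score)
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]
```

With made-up counts such as Nc = 100 for a class and Np = 90 for one of its properties, that property would score 0.9 and rank ahead of a property with Np = 20.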

Prepared/Templatized Responses (RiveScript)

For answering questions related to DBpedia we used DBpedia's mailing lists to craft rule-based responses with the help of RiveScript. The next few sections describe the process in detail.

Data Sources

DBpedia Discussion and Developers Mailing Lists: We collected the mailing list archives to find interesting question-answer threads that could be used for creating conversational scenarios for the bot.

Data Cleanup Tasks

The mailing list dump (mbox file) was taken as input and pre-processed to remove undesired messages based on the criteria mentioned in subsequent sections. The result from pre-processing was stored in a JSON file with the key being the subject and all associated messages were stored as an array for further processing.
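The grouping step could look roughly like the sketch below. It is not the project's actual script: the function names are hypothetical, and only plain-text message bodies are kept for simplicity.

```python
import json
from collections import defaultdict


def group_by_subject(messages):
    """Group message bodies under their subject line.

    `messages` is any iterable of email.message.Message objects, such as
    the messages yielded by the standard library's mailbox.mbox.
    """
    threads = defaultdict(list)
    for message in messages:
        subject = (message["subject"] or "").strip()
        body = message.get_payload()
        # Multipart messages return a list of parts; this sketch keeps
        # plain-text bodies only.
        if isinstance(body, str):
            threads[subject].append(body.strip())
    return dict(threads)


def save_threads(threads, path):
    # The subject is the key and the list of message bodies is the value.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(threads, handle, indent=2)
```

Passing `mailbox.mbox(path)` for a real dump file to `group_by_subject` would iterate over every message in the archive.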

Exclusions

Removed all messages that are requests for comments, calls for papers, announcements, etc.
Removed messages that do not have question words in their subject or body. Question words considered are:
- What
- When
- Why
- Which
- Who
- How
- Whose
- Whom

Message Subject

Removed words such as reply, fwd etc.

Message Body

Removed reply sections to reduce redundancy
Removed unnecessary HTML tags, whitespace, newlines, etc.
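The exclusion and cleanup rules above can be approximated with a few regular expressions. The sketch below is an assumption about how such rules might be written: the subject prefixes handled and the use of a leading ">" to detect quoted reply sections are guesses, not the project's exact criteria.

```python
import re

QUESTION_WORDS = {"what", "when", "why", "which", "who", "how", "whose", "whom"}

# ASSUMPTION: these prefixes and the ">" quoting convention approximate
# the cleanup rules described above.
SUBJECT_PREFIX = re.compile(r"^\s*((re|fwd?|reply)\s*:\s*)+", re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")


def has_question_word(subject: str, body: str) -> bool:
    # True when any question word appears in the subject or the body.
    words = set(re.findall(r"[a-z]+", f"{subject} {body}".lower()))
    return not QUESTION_WORDS.isdisjoint(words)


def clean_subject(subject: str) -> str:
    # Strip leading "Re:", "Fwd:" and similar markers, even when repeated.
    return SUBJECT_PREFIX.sub("", subject).strip()


def clean_body(body: str) -> str:
    # Drop quoted reply lines, strip HTML tags, then collapse whitespace.
    kept = [line for line in body.splitlines() if not line.lstrip().startswith(">")]
    text = HTML_TAG.sub(" ", "\n".join(kept))
    return re.sub(r"\s+", " ", text).strip()
```

A message would be kept only if `has_question_word` returns true, after which its subject and body are normalized with the two cleanup functions.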

Tf-idf Vectorization

The messages were converted to CSV and loaded into a Pandas DataFrame. Then the subject of each message was tokenized and stemmed using Porter's Stemmer. This stemmed output was used as input to a Tf-idf Vectorizer to convert the text into a matrix containing the Tf-idf weight of each term in every message. The total number of features extracted was approximately 135.
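To illustrate what the resulting matrix contains, here is a from-scratch sketch of the classic Tf-idf weighting using only the standard library. It is not the vectorizer used in the project, it omits the stemming step, and library implementations typically apply additional smoothing and normalization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphabetic tokens only; stemming is omitted in this sketch.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] is the Tf-idf weight of term j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocabulary:
            tf = counts[term] / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / doc_freq[term])
            row.append(tf * idf)
        rows.append(row)
    return vocabulary, rows
```

A term that occurs in every subject line (such as "dbpedia") receives a weight of zero under this formulation, while terms specific to a few subjects receive higher weights, which is what makes the vectors useful for clustering.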

K-Means Clustering

The Tf-idf Vector was passed as input to the K-Means algorithm to cluster interesting topics or categories of questions which we could program into the bot. Some of the major categories that were identified and clustered through the algorithm are:

About DBpedia
DBpedia Lookup
DBpedia Datasets Download/Dump
DBpedia Release
DBpedia Extraction Framework
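For readers unfamiliar with K-Means, the sketch below shows the basic algorithm (Lloyd's iteration) in plain Python. It is only an illustration: in the project the input rows would be the Tf-idf vectors of message subjects, and seeding the centroids with the first k points is an assumption chosen to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, labels).

    ASSUMPTION: the first k points seed the centroids so the run is
    deterministic; library implementations use better seeding.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels
```

Each resulting cluster can then be inspected manually to decide which topic it represents, as was done for the categories listed above.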

Tools & Technologies

The following tools and technologies were finalized for the project.

Server Side Technologies

Java: Web Server Language
Spring: REST/Web Framework
Maven: Java Dependency Management

Chat Libraries

RiveScript: Chat Library
Eliza: Conversational Bot Library

Front End Technologies

Node & NPM: Installing and managing front end packages
Bootstrap: Responsive CSS Framework
React: For JavaScript Interactions and building interactive UI Components
Webpack: Bundler used for compiling React JSX to browser compatible and minified JS as well as LESS to CSS.

Platform Wrappers

Messenger4j: Facebook Messenger Wrapper
jSlack: Slack Wrapper

APIs

Jena: For Querying DBpedia using SPARQL
Genesis: For entity summarization and fetching related and similar entities
DBpedia Lookup: Resolving text to DBpedia Entities
DBpedia Spotlight: Resolving text to DBpedia Entities
TMDB: For fetching Movie and TV Show information

Tools

IntelliJ: Java IDE

DevOps

Git: Version Control
GitHub: Version Control Management
GitLab: Continuous Integration
Docker: Containerization
JUnit: Testing
CouchDB: Logging

Weekly Updates

The following section tracks the weekly progress that was completed.

Week 1: May 4 to May 10

Touched base with mentor (Ricardo)
Subscribed to DBpedia Developer and Discussion Mailing Lists
Created GitHub Repository
Determined the initial system architecture and the technologies needed. The following were chosen:
- Java with Spring for the Server Side Language
- RiveScript as a Chat framework for canned responses
- Git for version control along with GitHub for managing repo
- GitLab for Continuous Integration

Week 2: May 11 to May 17

Uploaded progress page
Created initial REST application using Java and Spring and deployed a simple echo bot on Facebook.
Migrated from Gradle to Maven
Added support for static pages. Created index page
Integrated node, npm and webpack as part of maven since it is needed for frontend support.
Modified code to be compatible with Heroku which is used for initial testing

Week 3: May 18 to May 24

Created Chat UI based on Bootstrap Material Design
Added Favicon
Made chat interface mobile compatible
Styling of Chat Bubbles and Animations
Migrated LESS compilation to WebPack and removed Grunt completely from the project. Grunt was initially used for LESS compilation.
Added Starter conversation template for the Chatbot so as to set initial expectations for the user
Created a general library for handling text and carousel responses across platforms

Week 4: May 25 to May 31

Received mailbox dump of dbpedia-discussion mailing list.
Wrote pre-processing scripts in Python to extract interesting question answering threads that can be used for Machine Learning.
The pre-processed data is stored in JSON with the subject of the messages as the key and the corresponding messages as an array.

Week 5: Jun 1 to Jun 7

Integrated QANARY API
Passed incoming requests to QANARY and used the responses to query DBpedia using Jena
Created basic generic responses using results from DBpedia based on common properties such as abstract, label, Wikipedia link, etc.
Created corresponding card and button interface

Week 6: Jun 8 to Jun 14

Performed clustering on subjects using Tf-idf
Identified interesting clusters which can be converted to RiveScript
Created RiveScript for handling DBpedia queries such as:
- What is DBpedia
- Check if DBpedia is live or not
Created new type of component called ButtonText which combines text with button
Generalized RiveScript responses to include JSON objects as well as text messages to support more sophisticated functionality
Added UUID support to uniquely identify a user in Web Interface
Added more bot substitutions

Week 7: Jun 15 to Jun 21

Added React Constants for front end which are mirrors of Java constants
Modified width of bubbles depending on device. For smaller screens bubble size is relatively larger
Added DBpedia card to helper template shown when the bot starts
Now asking bot if DBpedia is live makes multiple checks (DBpedia, Resource, SPARQL)
NL queries are pre-processed in RiveScript, for example: tell [me] [about] * => *
Handled Disambiguation Scenario
Loading Animation for Web Interface
Added similar entities using Genesis
Improved test coverage and added Test Runner
Added Learn More option which shows Similar and Related as Quick Reply bubbles

Week 8: Jun 22 to Jun 28

Added Spring Data Repository Support
Added Smart Replies to both Web and FB
Added Feedback for every interaction for fine grained user feedback
Created UI for Feature Request or Feedback
Added Tests and more RiveScript Scenarios
UI changes to make options menu more presentable in Web Interface by implementing an overlay
Minor Bug Fixes

Week 9: Jun 29 to Jul 5

CouchDB Integration for Feedback and Chat History
Integrated WolframAlpha API for Question Answering
Integrated DBpedia Lookup and Spotlight for grounding entities
Integrated TMDB API for Movie and TV Shows

Week 10: Jul 6 to Jul 12

Slack Integration
Standalone Feedback Page
Login & Admin Pages
RiveScript for DBpedia Lookup, Datasets

Week 11: Jul 13 to Jul 19

Chat Reporting Interface in Admin Section
Adding Tests and fixing issues

Week 12: Jul 20 to Jul 26

Added RiveScript for Mappings & GSoC
Added dct:description for cards
Added icon for Slack
Added Infobox Properties
Added Test Cases

Week 13: Jul 27 to Aug 2

Added Location Card based on Nominatim API and OpenStreetMap
Added About Section
Added Spell Check and Ignore Words

Week 14: Aug 3 to Aug 9

Started integration with GitLab CI
Updated Tests to be compliant with GitLab
Added Embed Functionality

Week 15: Aug 10 to Aug 16

Writing Final Documentation
