GSoC 2017: Chatbot for DBpedia

Ram G Athreya edited this page Aug 27, 2017 · 87 revisions

First of all, I would like to thank everyone on the DBpedia team for selecting me for this Chatbot project. I would also like to thank my mentor Ricardo for his valuable feedback during the proposal phase and for selecting me to work on this project.

Project Description

DBpedia Chatbot is a conversational chatbot for DBpedia which is accessible through the following platforms:

  • A Web Interface
  • Slack
  • Facebook Messenger

There are three main challenges in this task. The first is understanding the query presented by the user. The second is fetching relevant information for that query from DBpedia or other sources. The third is tailoring the responses to the standards of each platform and developing subsequent user interactions with the Chatbot.

Core Capabilities

The bot is capable of responding to users in the form of simple short text messages or through more elaborate interactive messages. Users can communicate or respond to the bot through text and also through interactions (such as clicking on buttons/links).

The bot serves four main purposes:

  1. Answering factual questions
  2. Answering questions related to DBpedia
  3. Exposing the research work being done in DBpedia as product features. For example:
    • AKSW Genesis: We use APIs from the Genesis project to show similar and related information for a particular entity.
    • WDAqua QANARY: We use WDAqua's QANARY Service to answer some of the factual questions that are posed to the bot.
  4. Casual conversation/banter

Text Based Questions

The bot tries to answer text based questions of the following types:

Natural Language Questions

  • Give me the capital of Germany
  • Who is Obama?

Location Information

  • Where is the Eiffel Tower?
  • Where is France's capital?

Service Checks

Users can ask the bot to check if vital DBpedia services are operational.

  • Is DBpedia down?
  • Is lookup online?

Language Chapters

Users can ask for basic information about specific DBpedia local chapters.

  • DBpedia Arabic
  • German DBpedia

Templates

These are predominantly questions related to DBpedia for which the bot provides predefined templatized answers. Some examples include:

  • What is DBpedia?
  • How can I contribute?
  • Where can I find the mapping tool?

Banter

Messages which are casual in nature fall under this category. For example:

  • Hi
  • What is your name?

Architecture

[Figure: Chatbot Architecture]

The process by which the Chatbot server handles requests can be divided into the following steps:

  • Incoming Request: Webhooks handle incoming requests from each platform.
  • Request Routing: Incoming requests are routed based on their type, which is either a pure text request or a parameterized request. Pure text requests are handled by the Text Handler and parameterized requests are handled by the Template Handler.
    • Pure Text Requests: A pure text request is a text message from the user. We use RiveScript to identify the intent of the message and classify it into the following types:
      1. Natural Language Question
      2. Location Requests
      3. Service Checks
      4. Language Chapters
      5. Banter
      6. Prepared/Template Responses
    • Parameterized Requests: These are generated when the user clicks on links or buttons in information already presented, for example clicking on a Learn More button when presented with information about Germany.
  • Generate Response: The response from either handler is converted to a format that is suitable for each platform.
  • Send Response: Finally, the response is sent back to each platform. Additionally, for the web interface, the client-side code that handles the responses is written using standard frontend technologies such as HTML, CSS and JavaScript.
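
The routing steps above can be illustrated with a minimal sketch. The actual server is written in Java with Spring and uses RiveScript for intent detection; the Python below only mirrors the control flow. The regular expressions stand in for RiveScript rules, and every name in it (Request, classify_intent, route, format_for_platform) is hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns standing in for the RiveScript rules; order matters.
INTENT_PATTERNS = [
    ("service_check", re.compile(r"\bis\s+(dbpedia|lookup)\s+(down|online|up)\b", re.I)),
    ("location", re.compile(r"^where\s+is\b", re.I)),
    ("language_chapter", re.compile(r"^(dbpedia\s+\w+|\w+\s+dbpedia)$", re.I)),
    ("banter", re.compile(r"^(hi|hello|what is your name)\b", re.I)),
    ("template", re.compile(r"\b(what is dbpedia|how can i contribute)\b", re.I)),
]


@dataclass
class Request:
    platform: str                    # e.g. "web", "slack" or "messenger"
    text: Optional[str] = None       # set for pure text requests
    payload: Optional[dict] = None   # set for parameterized requests (clicks)


def classify_intent(text: str) -> str:
    """Return the first matching intent, else treat the text as a question."""
    normalized = text.strip().rstrip("?!.")
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(normalized):
            return intent
    return "natural_language_question"


def handle_text(text: str) -> dict:
    """Text Handler: classify the message and return a platform-neutral result."""
    return {"handler": "text", "intent": classify_intent(text), "text": text}


def handle_template(payload: dict) -> dict:
    """Template Handler: respond to a click on previously presented content."""
    return {"handler": "template", "action": payload.get("action"), "params": payload}


def route(request: Request) -> dict:
    """Send parameterized requests to the Template Handler, text to the Text Handler."""
    if request.payload is not None:
        return handle_template(request.payload)
    return handle_text(request.text or "")


def format_for_platform(response: dict, platform: str) -> dict:
    """Wrap the neutral response; real platform formats differ and are not shown."""
    return {"platform": platform, "body": response}
```

For example, `route(Request("web", text="Where is the Eiffel Tower?"))` is classified as a location request, while `route(Request("slack", payload={"action": "learn_more", "entity": "Germany"}))` goes to the Template Handler.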

High Level Design

This section details the workflow for both text based requests and parameterized requests through flowcharts.

Pure Text Request Workflow

[Figure: Text Message Workflow flowchart]

Natural Language Question Workflow

[Figure: Natural Language Question Workflow flowchart]

Location Question Workflow

[Figure: Location Question Workflow flowchart]

Template Workflow

[Figure: Template Workflow flowchart]

Parameterized Request Workflow

[Figure: Parameterized Request Workflow flowchart]

Release Management (CI & CD)

Version Control

For version control we use Git and GitHub, along with Git-Flow for overall branch management. We primarily use the develop branch for development and staging and the master branch for production deployments. More details on how Git-Flow works can be found here.

Continuous Integration & Continuous Deployment (CI & CD)

We use GitLab for Continuous Integration and Continuous Deployment. Once a commit is made to either the develop or the master branch, a GitLab pipeline is executed, which consists of three stages:

  1. Test
  2. Package
  3. Deploy

In the test stage, all tests associated with the project are executed as an atomic operation. Only when all tests pass does the pipeline move to the next stage.

In the package stage we create a Docker image after downloading both Maven and Node dependencies using the mvn clean install and npm install commands. Finally the Docker image is uploaded to GitLab.

In the deploy stage, we log in to the nc9 server and use the created Docker image to deploy the application with the appropriate environment configuration.

Data Engineering

Infobox Properties

The bot can show important attributes about an entity similar to the Infobox properties shown in Wikipedia. To develop this feature we took a list of all DBpedia classes (namespace http://dbpedia.org/ontology/) that could be potential rdf:types for a given entity.

For a given class, we found the total number of occurrences of that class in the entire Knowledge Graph. Then we extracted all properties that have that class as their rdfs:domain and calculated the number of distinct occurrences of each individual property in the Knowledge Graph. We used both of these counts to compute a Relevance Score (between 0 and 1) for each property of the given class, which is:

[Equation image: Relevance Score as a function of Np and Nc]

where Np is the number of distinct occurrences of the Property and Nc is the number of distinct occurrences of the Class in the Knowledge Base.

For a given entity, we take all the rdf:types in the http://dbpedia.org/ontology/ namespace and all available properties of the entity. We then find the top properties for each class and check whether they exist for the given entity. Those that do are shortlisted, ranked by their Relevance Score, and the top N are displayed to the user.
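
The ranking described above can be sketched as follows. The equation itself is not reproduced in the text, so this sketch assumes the score is the ratio Np / Nc, which is consistent with the definitions of Np and Nc and with the stated 0 to 1 range. The function names and the counts in the usage note are purely illustrative, not taken from the project.

```python
from typing import Dict, List, Set, Tuple


def relevance_scores(class_count: int, property_counts: Dict[str, int]) -> Dict[str, float]:
    """Score each property of one class as Np / Nc (assumed formula), capped at 1."""
    if class_count <= 0:
        return {}
    return {prop: min(n / class_count, 1.0) for prop, n in property_counts.items()}


def top_properties(
    entity_properties: Set[str],
    entity_types: List[str],
    scores_by_class: Dict[str, Dict[str, float]],
    n: int = 5,
) -> List[Tuple[str, float]]:
    """Keep scored properties the entity actually has and return the top n."""
    best: Dict[str, float] = {}
    for cls in entity_types:
        for prop, score in scores_by_class.get(cls, {}).items():
            if prop in entity_properties:
                # A property may be scored under several classes; keep the highest.
                best[prop] = max(score, best.get(prop, 0.0))
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```

With made-up counts, `relevance_scores(1000, {"dbo:capital": 900, "dbo:anthem": 100})` scores dbo:capital above dbo:anthem, and `top_properties` then drops any scored property that the entity itself does not carry.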

Prepared/Templatized Responses (RiveScript)

To answer questions related to DBpedia, we used DBpedia's mailing lists to craft rule-based responses with the help of RiveScript. The next few sections describe the process in detail.

Data Sources

  • DBpedia Discussion and Developers Mailing Lists: The mailing list archives were collected to find interesting question-answer threads that could be used for creating conversational scenarios for the bot.

Data Cleanup Tasks

The mailing list dump (mbox file) was taken as input and pre-processed to remove undesired messages based on the criteria listed in the subsequent sections. The result of pre-processing was stored in a JSON file, keyed by subject, with all messages associated with a subject stored as an array for further processing.

Exclusions

  • Removed all messages that are requests for comments, calls for papers, announcements, etc.
  • Removed messages that do not have question words in their subject or body. Question words considered are:
    • What
    • When
    • Why
    • Which
    • Who
    • How
    • Whose
    • Whom

Message Subject

  • Removed words such as reply, fwd etc.

Message Body

  • Removed reply sections to reduce redundancy
  • Removed unnecessary HTML tags, Whitespaces, Newlines, etc.
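
The cleanup rules above can be sketched as a small filter over already-parsed messages. This is an illustration, not the project's actual script. It assumes each message is a dict with "subject" and "body" strings (for instance, loaded with Python's standard mailbox module), and that quoted reply sections are lines starting with ">". The excluded markers and the subject prefix pattern are hypothetical heuristics.

```python
import re
from collections import defaultdict

QUESTION_WORDS = {"what", "when", "why", "which", "who", "how", "whose", "whom"}
# Hypothetical markers for announcement-style messages.
EXCLUDED_MARKERS = ("call for papers", "request for comments", "announcement")

PREFIX_RE = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.I)  # "Re:", "Fwd:", "Fw:"
TAG_RE = re.compile(r"<[^>]+>")                            # crude HTML tag matcher


def clean_subject(subject: str) -> str:
    """Strip leading reply/forward prefixes from a subject line."""
    return PREFIX_RE.sub("", subject).strip()


def clean_body(body: str) -> str:
    """Drop quoted reply lines and HTML tags, then collapse whitespace."""
    kept = [line for line in body.splitlines() if not line.lstrip().startswith(">")]
    text = TAG_RE.sub(" ", "\n".join(kept))
    return re.sub(r"\s+", " ", text).strip()


def has_question_word(text: str) -> bool:
    """True if any of the listed question words appears as a whole word."""
    return any(word in QUESTION_WORDS for word in re.findall(r"[a-z]+", text.lower()))


def preprocess(messages):
    """Group cleaned message bodies by cleaned subject, applying the exclusions."""
    threads = defaultdict(list)
    for msg in messages:
        subject = clean_subject(msg["subject"])
        body = clean_body(msg["body"])
        combined = f"{subject} {body}".lower()
        if any(marker in combined for marker in EXCLUDED_MARKERS):
            continue  # announcements, calls for papers, and similar
        if not has_question_word(combined):
            continue  # no question word in subject or body
        threads[subject].append(body)
    return dict(threads)
```

The returned dict has the shape described above (subject as key, messages as an array), so it can be written out directly with `json.dump`.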

Tf-idf Vectorization

The messages were converted to CSV and loaded into a Pandas DataFrame. The subject of each message was then tokenized and stemmed using Porter's Stemmer. The stemmed output was used as input to a Tf-idf Vectorizer, which converts the text into a matrix containing the Tf-idf weight of each term in every message. Approximately 135 features were extracted.
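
To make the vectorization step concrete, here is a dependency-free sketch of one common Tf-idf variant: raw term counts multiplied by a smoothed inverse document frequency, with each row L2-normalized. It is not the project's code, which used Pandas and a library vectorizer. The crude_stem function is only a rough stand-in for Porter's Stemmer, and all names are hypothetical.

```python
import math
import re
from collections import Counter


def crude_stem(token: str) -> str:
    """Rough stand-in for Porter's Stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(subject: str):
    """Lowercase, split on non-alphanumerics, and stem each token."""
    return [crude_stem(t) for t in re.findall(r"[a-z0-9]+", subject.lower())]


def tfidf_matrix(subjects):
    """Return (vocabulary, matrix) with one L2-normalized row per subject."""
    docs = [Counter(tokenize(s)) for s in subjects]
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of subjects in which each term occurs.
    doc_freq = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}
    matrix = []
    for doc in docs:
        row = [doc[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row))
        matrix.append([v / norm if norm else 0.0 for v in row])
    return vocabulary, matrix
```

Each column of the matrix corresponds to one vocabulary term, so the number of columns plays the role of the feature count mentioned above. Terms that appear in fewer subjects receive a higher weight than terms that appear in most of them.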

K-Means Clustering

The Tf-idf matrix was passed as input to the K-Means algorithm to cluster interesting topics or categories of questions that we could program into the bot. Some of the major categories that were identified and clustered through the algorithm are:

  • About DBpedia
  • DBpedia Lookup
  • DBpedia Datasets Download/Dump
  • DBpedia Release
  • DBpedia Extraction Framework

Tools & Technologies

The following tools and technologies were finalized.

Server Side Technologies

  • Java: Web Server Language
  • Spring: REST/Web Framework
  • Maven: Java Dependency Management

Chat Libraries

  • RiveScript: Chat Library
  • Eliza: Conversational Bot Library

Front End Technologies

  • Node & NPM: Installing and managing front end packages
  • Bootstrap: Responsive CSS Framework
  • React: For JavaScript Interactions and building interactive UI Components
  • WebPack: Bundler used for compiling React JSX to browser compatible and minified JS as well as LESS to CSS.

Platform Wrappers

  • Messenger4j: Facebook Messenger Wrapper
  • jSlack: Slack Wrapper

APIs

  • Jena: For Querying DBpedia using SPARQL
  • Genesis: For entity summarization and fetching related and similar entities
  • DBpedia Lookup: Resolving text to DBpedia Entities
  • DBpedia Spotlight: Resolving text to DBpedia Entities
  • TMDB: For fetching Movie and TV Show information

Tools

  • IntelliJ: Java IDE

DevOps

  • Git: Version Control
  • GitHub: Version Control Management
  • GitLab: Continuous Integration
  • Docker: Containerization
  • JUnit: Testing
  • CouchDB: Logging

Weekly Updates

The following section tracks the weekly progress that was completed.

Week 1: May 4 to May 10

  • Touched base with mentor (Ricardo)
  • Subscribed to DBpedia Developer and Discussion Mailing Lists
  • Created GitHub Repository
  • Determined Initial System Architecture and Technologies needed. The following were chosen:
    • Java with Spring for the Server Side Language
    • RiveScript as a Chat framework for canned responses
    • Git for version control along with GitHub for managing repo
    • GitLab for Continuous Integration

Week 2: May 11 to May 17

  • Uploaded progress page
  • Created initial REST application using Java and Spring and deployed a simple echo bot on Facebook.
  • Migrated from Gradle to Maven
  • Added support for static pages. Created index page
  • Integrated node, npm and webpack as part of maven since it is needed for frontend support.
  • Modified code to be compatible with Heroku which is used for initial testing

Week 3: May 18 to May 24

  • Created Chat UI based on Bootstrap Material Design
  • Added Favicon
  • Made chat interface mobile compatible
  • Styling of Chat Bubbles and Animations
  • Migrated LESS compilation to WebPack and removed Grunt completely from the project. Grunt was initially used for LESS compilation.
  • Added Starter conversation template for the Chatbot so as to set initial expectations for the user
  • Created a general library for handling text and carousel responses across platforms

Week 4: May 25 to May 31

  • Received mailbox dump of dbpedia-discussion mailing list.
  • Wrote pre-processing scripts in Python to extract interesting question answering threads that can be used for Machine Learning.
  • The pre-processed data is stored in JSON with the subject of the messages as the key and the corresponding messages as an array.

Week 5: Jun 1 to Jun 7

  • Integrated QANARY API
  • Passed incoming requests to QANARY and used the responses to query DBpedia using Jena
  • Created basic generic responses using result from DBpedia based on common properties such as abstract, label, wikipedia link etc
  • Created corresponding card and button interface

Week 6: Jun 8 to Jun 14

  • Performed clustering on subjects using Tf-idf
  • Identified interesting clusters which can be converted to RiveScript
  • Created RiveScript for handling DBpedia queries such as:
    • What is DBpedia
    • Check if DBpedia is live or not
  • Created new type of component called ButtonText which combines text with button
  • Generalized RiveScript responses to include JSON objects as well as text messages to support more sophisticated functionality
  • Added UUID support to uniquely identify a user in Web Interface
  • Added more bot substitutions

Week 7: Jun 15 to Jun 21

  • Added React Constants for front end which are mirrors of Java constants
  • Modified width of bubbles depending on device. For smaller screens bubble size is relatively larger
  • Added DBpedia card to helper template shown when the bot starts
  • Now asking bot if DBpedia is live makes multiple checks (DBpedia, Resource, SPARQL)
  • NL Queries are pre-processed in RiveScript for example tell [me] [about] * => *
  • Handled Disambiguation Scenario
  • Loading Animation for Web Interface
  • Added similar entities using Genesis
  • Improved test coverage and added Test Runner
  • Added Learn More option which shows Similar and Related as Quick Reply bubbles

Week 8: Jun 22 to Jun 28

  • Added Spring Data Repository Support
  • Added Smart Replies to both Web and FB
  • Added Feedback for every interaction for fine grained user feedback
  • Created UI for Feature Request or Feedback
  • Added Tests and more RiveScript Scenarios
  • UI changes to make options menu more presentable in Web Interface by implementing an overlay
  • Minor Bug Fixes

Week 9: Jun 29 to Jul 5

  • CouchDB Integration for Feedback and Chat History
  • Integrated WolframAlpha API for Question Answering
  • Integrated DBpedia Lookup and Spotlight for grounding entities
  • Integrated TMDB API for Movie and TV Shows

Week 10: Jul 6 to Jul 12

  • Slack Integration
  • Standalone Feedback Page
  • Login & Admin Pages
  • RiveScript for DBpedia Lookup, Datasets

Week 11: Jul 13 to Jul 19

  • Chat Reporting Interface in Admin Section
  • Adding Tests and fixing issues

Week 12: Jul 20 to Jul 26

  • Added RiveScript for Mappings & GSoC
  • Added dct:description for cards
  • Added icon for Slack
  • Added Infobox Properties
  • Added Test Cases

Week 13: Jul 27 to Aug 2

  • Added Location Card based on Nominatim API and OpenStreetMap
  • Added About Section
  • Added Spell Check and Ignore Words

Week 14: Aug 3 to Aug 9

  • Started integration with GitLab CI
  • Updated Tests to be compliant with GitLab
  • Added Embed Functionality

Week 15: Aug 10 to Aug 16

  • Writing Final Documentation
  • Deployment to DBpedia's AKSW NC9 Servers