GSoC 2017: Chatbot for DBpedia
First of all, I would like to thank everyone in the DBpedia team for selecting me for this Chatbot project. I would also like to thank my mentor Ricardo for his valuable feedback during the proposal phase and for selecting me to work on this project.
The project presents a conversational Chatbot for DBpedia which can be accessed through the following platforms:
- A Web Interface
- Slack
- Facebook Messenger
There are three main challenges in this task: first, understanding the query presented by the user; second, fetching relevant information for the query from DBpedia or other sources; and finally, tailoring the responses to the standards of each platform and developing subsequent user interactions with the Chatbot.
The bot is capable of responding to users in the form of simple short text messages or through more elaborate interactive messages. Users can communicate or respond to the bot through text and also through interactions (such as clicking on buttons/links).
The bot has four main purposes:
- Answering factual questions
- Answering questions related to DBpedia
- Exposing the research work being done in DBpedia as product features. For example:
- AKSW Genesis: We use APIs from the Genesis project to show similar and related information for a particular entity.
- WDAqua QANARY: We use WDAqua's QANARY Service to answer some of the factual questions that are posed to the bot.
- Casual conversation/banter
The bot tries to answer text-based questions of the following types:
- Give me the capital of Germany
- Who is Obama?
- Where is the Eiffel Tower?
- Where is France's capital?
Users can ask the bot to check if vital DBpedia services are operational.
- Is DBpedia down?
- Is lookup online?
Users can ask for basic information about specific DBpedia local chapters.
- DBpedia Arabic
- German DBpedia
These are predominantly questions related to DBpedia for which the bot provides predefined templatized answers. Some examples include:
- What is DBpedia?
- How can I contribute?
- Where can I find the mapping tool?
Messages which are casual in nature fall under this category. For example:
- Hi
- What is your name?
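The request categories above can be sketched as a small rule-based classifier. This is only an illustration: the bot itself relies on RiveScript for intent detection, and the regular expressions, intent names and the choice to treat "where is" questions as location requests are assumptions made for this example.

```python
import re

# Hypothetical patterns, checked in order. The real bot uses RiveScript
# triggers; these regular expressions only mimic the idea.
INTENT_PATTERNS = [
    ("service_check", re.compile(r"\bis\s+(dbpedia|lookup|sparql)\s+(down|up|online|live)\b")),
    ("language_chapter", re.compile(r"^(dbpedia\s+\w+|\w+\s+dbpedia)$")),
    ("template", re.compile(r"\b(what is dbpedia|how can i contribute|mapping tool)\b")),
    ("banter", re.compile(r"^(hi|hello|hey|what is your name)$")),
    # Assumption: "where is ..." questions are treated as location requests.
    ("location", re.compile(r"^where\s+is\b")),
]


def classify(message: str) -> str:
    """Return the first matching intent, else fall back to a natural language question."""
    # Lowercase and drop punctuation (apostrophes are kept for possessives).
    text = re.sub(r"[^\w\s']", "", message.lower()).strip()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return "natural_language_question"


if __name__ == "__main__":
    for sample in ["Is DBpedia down?", "German DBpedia", "Hi", "Who is Obama?"]:
        print(sample, "->", classify(sample))
```

Anything that matches no rule falls through to the natural language question category, which mirrors the idea of handing unmatched text to a question answering service.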
The process by which the Chatbot server handles requests can be divided into 6 steps as follows:
- Incoming Request: Webhooks that handle incoming requests from each platform
- Request Routing: Incoming requests are routed based on the type of request, which can be a pure text request or a parameterized request. Pure text requests are handled by the Text Handler and parameterized requests are handled by the Template Handler.
- Pure Text Requests: A pure text request is a text message from the user. We use RiveScript to identify the intent of the message and classify it into the following types:
- Natural Language Question
- Location Requests
- Service Checks
- Language Chapters
- Banter
- Prepared/Template Responses
- Parameterized Requests: Sent when the user clicks on links in information already presented, for example clicking on a Learn More button after being presented with information about Germany.
- Generate Response: The response from either handler is converted to a format that is suitable for each platform.
- Send Response: Finally the response is sent back to each platform. Additionally, for the web interface, the client-side code that handles the responses is written using standard frontend technologies such as HTML, CSS and JavaScript.
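The routing and response generation steps above can be sketched as follows. All class, function and field names here are hypothetical, and the per-platform envelopes are heavily simplified stand-ins rather than the actual Slack or Messenger payloads used by the bot.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    platform: str                                 # "web", "slack" or "messenger"
    text: str = ""                                # set for pure text requests
    payload: dict = field(default_factory=dict)   # set for parameterized requests


def text_handler(request: Request) -> dict:
    # Placeholder: the real handler classifies the intent with RiveScript.
    return {"type": "text", "body": f"You said: {request.text}"}


def template_handler(request: Request) -> dict:
    # Placeholder: the real handler renders a template for the clicked action.
    action = request.payload.get("action", "unknown")
    return {"type": "text", "body": f"Template for action: {action}"}


def route(request: Request) -> dict:
    """Send parameterized requests to the template handler, all others to the text handler."""
    handler = template_handler if request.payload else text_handler
    return handler(request)


def generate_response(request: Request, response: dict) -> dict:
    """Wrap a platform-neutral response in a platform-specific envelope (illustrative only)."""
    if request.platform == "slack":
        return {"text": response["body"]}
    if request.platform == "messenger":
        return {"message": {"text": response["body"]}}
    return {"messages": [response]}  # web interface


if __name__ == "__main__":
    click = Request(platform="web", payload={"action": "learn_more", "entity": "Germany"})
    print(generate_response(click, route(click)))
```

Keeping the handlers platform-neutral and converting only in the last step means a new platform needs a new envelope, not a new handler.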
This section details the workflow for both text based requests and parameterized requests through flowcharts.
For version control we use Git + GitHub along with Git-Flow for overall branch management. We primarily use the develop branch for development and staging and the master branch for production deployments. More details on how GitFlow works can be found here.
We use GitLab for Continuous Integration and Continuous Deployment. Once a commit is made either to the develop or master branch, a GitLab pipeline is executed which consists of three stages namely:
- Test
- Package
- Deploy
In the test stage, all tests associated with the project are executed as an atomic operation. Only when all tests pass does the pipeline move to the next stage.
In the package stage we create a Docker image after downloading both Maven and Node dependencies using the `mvn clean install` and `npm install` commands. Finally the Docker image is uploaded to GitLab.
In the deploy stage we log in to the nc9 server and use the created Docker image to deploy the application with the appropriate environment configuration.
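The gating behaviour of the pipeline, where a stage runs only if every earlier stage succeeded, can be modelled in a few lines. This is a toy model of the control flow, not the GitLab configuration used by the project; the stage callables are hypothetical stand-ins for the real jobs.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a job that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and return the names of the stages that succeeded."""
    completed = []
    for name, job in stages:
        if not job():
            break  # a failed stage stops everything after it
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Hypothetical jobs: here the test stage fails, so nothing is packaged or deployed.
    stages = [
        ("test", lambda: False),
        ("package", lambda: True),
        ("deploy", lambda: True),
    ]
    print(run_pipeline(stages))
```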
The bot can show important attributes of an entity, similar to the Infobox properties shown in Wikipedia. To develop this feature we took a list of all DBpedia classes (namespace `http://dbpedia.org/ontology/`) that could be potential `rdf:type` values for a given entity.
For a given class we found the total number of occurrences of that class in the entire Knowledge Graph. Then we extracted all properties that declare the class as their `rdfs:domain` and calculated the number of distinct occurrences of each individual property in the Knowledge Graph. We used both counts to derive a Relevance Score (between 0 and 1) for each property of the given class, where Np is the number of distinct occurrences of the property and Nc is the number of distinct occurrences of the class in the Knowledge Base.
For a given entity we take all the `rdf:type` values in the `http://dbpedia.org/ontology/` namespace and all available properties of the entity. We then find the top properties for each class and check whether they exist for the given entity. If they do, we shortlist those properties and display the top N to the user, ranked by their Relevance Score.
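A minimal sketch of the ranking described above, assuming the Relevance Score is the simple ratio Np / Nc (an assumption: the exact formula is not reproduced in this text). The function names and the counts in the usage example are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def relevance_scores(class_count: int, property_counts: Dict[str, int]) -> Dict[str, float]:
    """Score each property as Np / Nc, capped at 1.0 (assumed formula)."""
    if class_count <= 0:
        return {}
    return {prop: min(n / class_count, 1.0) for prop, n in property_counts.items()}


def top_properties(
    entity_properties: Iterable[str], scores: Dict[str, float], n: int
) -> List[Tuple[str, float]]:
    """Keep the scored properties the entity actually has, best score first."""
    present = set(entity_properties)
    ranked = sorted(
        ((prop, score) for prop, score in scores.items() if prop in present),
        key=lambda item: (-item[1], item[0]),  # ties are broken by property name
    )
    return ranked[:n]


if __name__ == "__main__":
    # Made-up counts for a class with Nc = 1000 instances.
    scores = relevance_scores(1000, {"dbo:capital": 900, "dbo:currency": 800, "dbo:anthem": 400})
    print(top_properties(["dbo:anthem", "dbo:capital", "dbo:leader"], scores, 2))
```

Properties that score highly for the class but are missing on the entity are dropped before ranking, so the user only sees attributes that have values.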
For answering questions related to DBpedia we used DBpedia's mailing lists to craft rule-based responses with the help of RiveScript. The next few sections describe the process in detail.
- DBpedia Discussion and Developers Mailing Lists: Collected the mailing list archives to find interesting question-answer threads that could be used for creating conversational scenarios for the bot.
The mailing list dump (mbox file) was taken as input and pre-processed to remove undesired messages based on the criteria mentioned in subsequent sections. The result from pre-processing was stored in a JSON file with the key being the subject and all associated messages were stored as an array for further processing.
- Removed all messages that are requests for comments, calls for papers, announcements, etc.
- Removed messages that do not have question words in their subject or body. Question words considered are:
- What
- When
- Why
- Which
- Who
- How
- Whose
- Whom
- Removed words such as reply, fwd etc.
- Removed reply sections to reduce redundancy
- Removed unnecessary HTML tags, Whitespaces, Newlines, etc.
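The filtering rules above can be sketched in plain Python. This is not the project's actual script: messages are represented as simple dictionaries with `subject` and `body` keys (reading the mbox file is left out), and the marker list, regular expressions and function names are assumptions made for illustration.

```python
import json
import re
from collections import defaultdict

QUESTION_WORDS = {"what", "when", "why", "which", "who", "how", "whose", "whom"}
# Hypothetical markers for messages that are not question threads.
ANNOUNCEMENT_MARKERS = ("call for papers", "request for comments", "announcement")
REPLY_PREFIX = re.compile(r"^\s*((re|fwd?)\s*:\s*)+", re.IGNORECASE)


def clean_subject(subject: str) -> str:
    """Strip leading 'Re:' / 'Fwd:' prefixes so replies group with the original."""
    return REPLY_PREFIX.sub("", subject).strip()


def clean_body(body: str) -> str:
    """Drop quoted reply lines, HTML tags and redundant whitespace."""
    lines = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    text = re.sub(r"<[^>]+>", " ", "\n".join(lines))
    return re.sub(r"\s+", " ", text).strip()


def has_question_word(text: str) -> bool:
    return any(word in QUESTION_WORDS for word in re.findall(r"[a-z]+", text.lower()))


def preprocess(messages):
    """Group cleaned message bodies by cleaned subject, skipping unwanted messages."""
    threads = defaultdict(list)
    for msg in messages:
        subject = clean_subject(msg["subject"])
        body = clean_body(msg["body"])
        combined = f"{subject} {body}".lower()
        if any(marker in combined for marker in ANNOUNCEMENT_MARKERS):
            continue
        if not has_question_word(combined):
            continue
        threads[subject].append(body)
    return dict(threads)


def to_json(threads) -> str:
    """Serialize with the subject as key and its messages as an array."""
    return json.dumps(threads, indent=2)
```

Stripping the reply prefixes before grouping is what lets a question and its answers end up under the same subject key.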
The messages were converted to CSV and loaded into a Pandas DataFrame. The subject of each message was then tokenized and stemmed using Porter's Stemmer. The stemmed output was fed to a Tf-idf Vectorizer, which converts the text into a matrix of tf-idf weights for each term in every message. The total number of features extracted was ~135.
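To make the vectorization step concrete, here is a hand-rolled tf-idf computation using only the standard library. It is a stand-in for the library vectorizer, not a reproduction of it: stemming is skipped, and the textbook weighting used here (term frequency times log(N / document frequency)) will generally give different numbers than a library that applies smoothing or normalization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (no stemming here)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the weight of vocabulary[j] in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty subjects
        rows.append([counts[term] / total * idf[term] for term in vocabulary])
    return vocabulary, rows


if __name__ == "__main__":
    subjects = ["dbpedia lookup", "dbpedia release", "dbpedia lookup service"]
    vocab, rows = tfidf_matrix(subjects)
    print(vocab)
    for row in rows:
        print([round(weight, 3) for weight in row])
```

A term that appears in every subject gets weight zero, which is why very common words contribute nothing to the clustering that follows.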
The Tf-idf Vector was passed as input to the K-Means algorithm to cluster interesting topics or categories of questions which we could program into the bot. Some of the major categories that were identified and clustered through the algorithm are:
- About DBpedia
- DBpedia Lookup
- DBpedia Datasets Download/Dump
- DBpedia Release
- DBpedia Extraction Framework
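The clustering step can be illustrated with a small pure-Python version of K-Means (Lloyd's algorithm). This is a teaching sketch rather than the code used in the project: it seeds the centroids with the first k points so the result is deterministic, whereas library implementations usually use randomized or smarter initialization.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=100):
    """Return a cluster label per point; the first k points seed the centroids."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels


if __name__ == "__main__":
    vectors = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0]]
    print(kmeans(vectors, 2))
```

In the setting described above, each input vector would be one row of the tf-idf matrix, and the messages sharing a label form one candidate question category.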
The following tools and technologies were finalized:
- Java: Web Server Language
- Spring: REST/Web Framework
- Maven: Java Dependency Management
- RiveScript: Chat Library
- Eliza: Conversational Bot Library
- Node & NPM: Installing and managing front end packages
- Bootstrap: Responsive CSS Framework
- React: For JavaScript interactions and building interactive UI components
- WebPack: Bundler used for compiling React JSX to browser compatible and minified JS as well as LESS to CSS.
- Messenger4j: Facebook Messenger Wrapper
- jSlack: Slack Wrapper
- Jena: For Querying DBpedia using SPARQL
- Genesis: For entity summarization and fetching related and similar entities
- DBpedia Lookup: Resolving text to DBpedia Entities
- DBpedia Spotlight: Resolving text to DBpedia Entities
- TMDB: For fetching Movie and TV Show information
- IntelliJ: Java IDE
- Git: Version Control
- GitHub: Version Control Management
- GitLab: Continuous Integration
- Docker: Containerization
- JUnit: Testing
- CouchDB: Logging
The following section tracks the progress completed each week.
- Touch base with mentor (Ricardo)
- Subscribed to DBpedia Developer and Discussion Mailing Lists
- Created GitHub Repository
- Determine Initial System Architecture and Technologies needed. Following were chosen:
- Java with Spring for the Server Side Language
- RiveScript as a Chat framework for canned responses
- Git for version control along with GitHub for managing repo
- GitLab for Continuous Integration
- Uploaded progress page
- Created initial REST application using Java and Spring and deployed a simple echo bot on Facebook.
- Migrated from Gradle to Maven
- Added support for static pages. Created index page
- Integrated node, npm and webpack as part of maven since it is needed for frontend support.
- Modified code to be compatible with Heroku which is used for initial testing
- Created Chat UI based on Bootstrap Material Design
- Added Favicon
- Made chat interface mobile compatible
- Styling of Chat Bubbles and Animations
- Migrated LESS compilation to WebPack and removed Grunt completely from the project. Grunt was initially used for LESS compilation.
- Added Starter conversation template for the Chatbot so as to set initial expectations for the user
- Created a general library for handling text and carousel responses across platforms
- Received mailbox dump of dbpedia-discussion mailing list.
- Wrote pre-processing scripts in Python to extract interesting question answering threads that can be used for Machine Learning.
- The pre-processed data is stored in JSON with the subject of the messages as the key and the corresponding messages as an array.
- Integrated QANARY API
- Passed incoming requests to QANARY and used the responses to query DBpedia using Jena
- Created basic generic responses using results from DBpedia based on common properties such as abstract, label, Wikipedia link, etc.
- Created corresponding card and button interface
- Performed clustering on subjects using TFIDF
- Identified interesting clusters which can be converted to RiveScript
- Created RiveScript for handling DBpedia queries such as:
- What is DBpedia
- Check if DBpedia is live or not
- Created new type of component called ButtonText which combines text with button
- Generalized RiveScript responses to include JSON objects as well as text messages to support more sophisticated functionality
- Added UUID support to uniquely identify a user in Web Interface
- Added more bot substitutions
- Added React Constants for front end which are mirrors of Java constants
- Modified width of bubbles depending on device. For smaller screens bubble size is relatively larger
- Added DBpedia card to helper template shown when the bot starts
- Now asking bot if DBpedia is live makes multiple checks (DBpedia, Resource, SPARQL)
- NL queries are pre-processed in RiveScript, for example `tell [me] [about] *` => `*`
- Handled Disambiguation Scenario
- Loading Animation for Web Interface
- Added similar entities using Genesis
- Improved test coverage and added Test Runner
- Added Learn More option which shows Similar and Related as Quick Reply bubbles
- Added Spring Data Repository Support
- Added Smart Replies to both Web and FB
- Added Feedback for every interaction for fine grained user feedback
- Created UI for Feature Request or Feedback
- Added Tests and more RiveScript Scenarios
- UI changes to make options menu more presentable in Web Interface by implementing an overlay
- Minor Bug Fixes
- CouchDB Integration for Feedback and Chat History
- Integrated WolframAlpha API for Question Answering
- Integrated DBpedia Lookup and Spotlight for grounding entities
- Integrated TMDB API for Movie and TV Shows
- Slack Integration
- Standalone Feedback Page
- Login & Admin Pages
- RiveScript for DBpedia Lookup, Datasets
- Chat Reporting Interface in Admin Section
- Adding Tests and fixing issues
- Added RiveScript for Mappings & GSoC
- Added dct:description for cards
- Added icon for Slack
- Added Infobox Properties
- Added Test Cases
- Added Location Card based on the Nominatim API and OpenStreetMap
- Added About Section
- Added Spell Check and Ignore Words
- Started integration with GitLab CI
- Updated Tests to be compliant with GitLab
- Added Embed Functionality
- Writing Final Documentation
If you have any questions about your project or related issues you are encouraged to pose them via our support page.