TutorialSearch

Kentico Kontent Docs - Tutorial Search

Overview

Along with the Indexing Service, Tutorial Search is a microservice responsible for indexing tutorials of the Kentico Kontent documentation portal on the Algolia search engine. It responds to events sent by Dispatcher, fetches content from Kentico Kontent, and converts that content into Algolia-compatible records. Finally, it stores the records in Azure Blob Storage, where the downstream Indexing Service can access them.

Specification

Triggers

The service has two endpoints, which differ in their trigger type:

Initialize

HTTP trigger accepting either a GET or POST request. This endpoint serves as a "restart" button: the index of all tutorials gets cleared and then repopulated with current data from the Kentico Kontent project.

Update

Event Grid trigger that consumes events published by Dispatcher. This allows the service to react to webhooks from the Kentico Kontent project.

How it works

In order to understand how the tutorials are split into records, read the introduction to Algolia beforehand.

Indexed content types

As described on the Content Models page, Tutorials consist of multiple content types with text elements that are supposed to be searchable:

Article - Introduction, Content (Rich Text elements), Title (Text element)
Scenario - Introduction, Content (Rich Text elements), Title (Text element)
Content chunk - Content (Rich Text element)
Callout - Content (Rich Text element)
Code sample - Code (Text element)

In Kentico Kontent, a tutorial is represented by a single item of the article or scenario content type. Therefore, items of these two content types are referred to as tutorials or root items in this guide.

Indexable records

Description

Each content item of an indexed content type described above corresponds to at least one record. To further improve the search experience, <h2> tags within the rich text elements of an article, scenario, and content chunk also act as splitters. Each record stores the most recently encountered heading for its piece of text (the heading attribute), so the website can redirect the user directly to the anchored heading.

Ideally, a single tutorial should not produce multiple search hits. To achieve this, each record stores the codename of its root item, and codename is set as the distinct attribute.
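To make the effect of a distinct attribute concrete, here is a minimal sketch that mimics it locally on a ranked list of hits. In Algolia itself, this behavior is configured through the attributeForDistinct and distinct index settings; the function name and sample hits below are made up for illustration and are not part of the service.

```javascript
// Illustration only: mimics the effect of Algolia's distinct setting
// on an already ranked list of hits. Hypothetical helper, not service code.
function keepBestHitPerTutorial(rankedHits) {
  const seenCodenames = new Set();
  return rankedHits.filter((hit) => {
    // Only the first (best ranked) record of each tutorial survives.
    if (seenCodenames.has(hit.codename)) return false;
    seenCodenames.add(hit.codename);
    return true;
  });
}

// Sample hits, best match first.
const rankedHits = [
  { objectID: 'building_your_first_react_app_32', codename: 'building_your_first_react_app' },
  { objectID: 'building_your_first_react_app_4', codename: 'building_your_first_react_app' },
  { objectID: 'viewing_all_your_project_s_assets_2', codename: 'viewing_all_your_project_s_assets' },
];

const distinctHits = keepBestHitPerTutorial(rankedHits);
console.log(distinctHits.map((hit) => hit.objectID));
```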

The metadata attributes codename, id, and section are used by Index Sync for index management.

objectID is computed by Tutorial Search and consists of the tutorial codename and the order of the record. It is required by Algolia and always has to be unique.
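As a rough sketch, the objectID could be assembled like this. The helper name is hypothetical, and the underscore-joined shape is inferred from the example records in this document rather than taken from the service's source.

```javascript
// Hypothetical helper: joins the tutorial codename and the record order.
// The "<codename>_<order>" shape is an assumption based on the example records.
function createObjectId(codename, order) {
  return `${codename}_${order}`;
}

const exampleObjectId = createObjectId('building_your_first_react_app', 32);
console.log(exampleObjectId);
```

Because the order is unique within a tutorial and codenames are unique within a project, the combination stays unique across the index.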

Article, content chunk, and code sample contain a platform taxonomy element that specifies the list of platforms the tutorial relates to. Storing this information on each record further improves the search experience: a user can be redirected straight to a tutorial targeting the requested platform.

Format

{
  "content": piece of text (searchable attribute),
  "heading": heading relevant to the piece of text (searchable attribute),
  "title": title of a tutorial (searchable attribute),
  "id": id of a tutorial root item (metadata),
  "codename": codename of a tutorial root item (metadata),
  "order": order of the record within the tutorial,
  "platforms": list of platforms the tutorial relates to,
  "section": "tutorials" (metadata) - constant value,
  "objectID": id of the record (metadata) - required by Algolia
}

Example

{
  "content": "Because your application doesn't require any server-side code, you can host it as a collection of static files for free on Surge, Github pages, or a similar service. ...",
  "heading": "Build and deploy",
  "title": "Building your first React app",
  "id": "<GUID>",
  "codename": "building_your_first_react_app",
  "order": 32,
  "platforms": [
    "react"
  ],
  "section": "tutorials",
  "objectID": "building_your_first_react_app_32"
}

Algorithm

(0.) Initialize endpoint start

To ensure that all the indexed data is correct, the service first calls the clear index HTTP endpoint of Index Sync. Then it continues by fetching all the root items from the Kentico Kontent project.

(0.) Update endpoint start

Being subscribed to an Event Grid topic, the service reacts to any change of published content in the Kentico Kontent project. The forwarded webhook from Kentico Kontent contains the following information:

operation type (subject)
array of affected content items and their language, codename, and type

Unfortunately, when a published content item that is nested in other items changes, the webhook specifies only the direct "parents" of the changed item. Therefore, the service itself has to compute the codenames of all affected tutorials. It fetches all content items from the Kentico Kontent project and iteratively traverses them until it finds all the changed root items.
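The traversal can be sketched as a breadth-first walk from the changed items up through their parents. This is a simplified illustration, not the service's implementation: the item shape (codename, type, linkedCodenames) and the function name are assumptions, and the real service works with items returned by the Delivery SDK.

```javascript
// Content types that represent a whole tutorial (root items).
const ROOT_TYPES = new Set(['article', 'scenario']);

// Hypothetical helper. Each item is assumed to look like:
// { codename: string, type: string, linkedCodenames: string[] }
function findAffectedRootCodenames(allItems, changedCodenames) {
  // Reverse index: child codename -> codenames of the items that nest it.
  const parentsOf = new Map();
  for (const item of allItems) {
    for (const child of item.linkedCodenames) {
      if (!parentsOf.has(child)) parentsOf.set(child, []);
      parentsOf.get(child).push(item.codename);
    }
  }
  const typeOf = new Map(allItems.map((item) => [item.codename, item.type]));

  const roots = new Set();
  const visited = new Set();
  const queue = [...changedCodenames];
  while (queue.length > 0) {
    const codename = queue.shift();
    if (visited.has(codename)) continue; // guards against cycles and repeats
    visited.add(codename);
    if (ROOT_TYPES.has(typeOf.get(codename))) roots.add(codename);
    for (const parent of parentsOf.get(codename) || []) queue.push(parent);
  }
  return [...roots];
}

// A callout nested in a content chunk that is used by two articles.
const sampleItems = [
  { codename: 'first_article', type: 'article', linkedCodenames: ['shared_chunk'] },
  { codename: 'second_article', type: 'article', linkedCodenames: ['shared_chunk'] },
  { codename: 'other_article', type: 'article', linkedCodenames: [] },
  { codename: 'shared_chunk', type: 'content_chunk', linkedCodenames: ['tip_callout'] },
  { codename: 'tip_callout', type: 'callout', linkedCodenames: [] },
];
const affectedRoots = findAffectedRootCodenames(sampleItems, ['tip_callout']);
console.log(affectedRoots);
```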

1. Resolving items from Kentico Kontent

The service uses the JavaScript Delivery SDK for fetching and resolving data from Kentico Kontent. Using the SDK's RichTextResolver, it labels nested content items inside tutorials and inserts their text into the parent's rich text element. Platform elements and any headings of nested content items are also labelled inside the rich text. As a result, each tutorial is reduced to a single string (containing the content of all its nested items) that is then split into multiple records.

Additionally, the service filters out the root items that are excluded from search according to their visibility element.

2. Splitting into records

The ItemRecordsCreator class is responsible for splitting the tutorials into separate records. Given the root content item and the text to index (representing the whole tutorial), it proceeds as follows:

1. Split the text by <h2> tags, extract the headings from the text and save them separately.
2. Split the text by content chunk labels. Extract and save any labelled h2 heading inside a content chunk, and resolve any labelled platform element.
3. Split the text by inner items (callout and code sample). When a labelled h2 heading is encountered, do not save it for records, because headings on this level are no longer anchored on the web. Resolve any labelled platform element.
4. Strip HTML tags from the split piece of text using the striptags npm package, as tags are not intended to be searchable and should not be returned from Algolia. There is one exception: text that corresponds to a code sample keeps its HTML tags. That is why the indexInnerItem method of ItemRecordsCreator does not call striptags.
5. Create a record using the split piece of text, the currently saved heading (from either step 1 or step 2) and the platforms. Use the title, codename and id of the tutorial, and increment the order attribute.
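The following is a simplified sketch of the heading-level split only. It ignores content chunk labels, inner items and platform resolution, and it uses a naive regular expression in place of the striptags package so that it has no dependencies. The function names and the tutorial object shape are assumptions, not the ItemRecordsCreator API.

```javascript
// Naive stand-in for the striptags package (assumption: good enough for simple markup).
function stripTags(html) {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Hypothetical helper: splits resolved tutorial text by <h2> and builds records.
function splitIntoRecords(tutorial, html) {
  // With a capturing group, split() returns [text, heading, text, heading, text, ...].
  const parts = html.split(/<h2[^>]*>(.*?)<\/h2>/);
  const records = [];
  let heading = ''; // text before the first <h2> has no heading
  let order = 0;

  parts.forEach((part, index) => {
    if (index % 2 === 1) {
      heading = stripTags(part); // odd positions hold the captured headings
      return;
    }
    const content = stripTags(part);
    if (content === '') return; // skip empty pieces between headings
    order += 1;
    records.push({
      content,
      heading,
      title: tutorial.title,
      id: tutorial.id,
      codename: tutorial.codename,
      order,
      platforms: tutorial.platforms,
      section: 'tutorials',
      objectID: `${tutorial.codename}_${order}`,
    });
  });
  return records;
}

const sampleTutorial = {
  title: "Viewing all your project's assets",
  id: '<GUID>',
  codename: 'viewing_all_your_project_s_assets',
  platforms: [],
};
const sampleHtml =
  '<p>Your asset list gives you a complete overview.</p>' +
  '<h2>Searching for assets</h2>' +
  '<p>Find the <strong>assets</strong> you want.</p>';
const sampleRecords = splitIntoRecords(sampleTutorial, sampleHtml);
console.log(sampleRecords);
```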

3. Saving to blob storage

Finally, the service saves the records to Blob Storage. It creates a new blob file for each tutorial that will be indexed. Alongside the list of records, the blob contains the tutorial's codename and id (necessary in case the root content item has been archived in Kentico Kontent). The blob also specifies whether the initialize endpoint has been called.

Example output - blob

{
  "itemRecords": [
    {
      "content": "Your asset list gives you a complete overview of the files uploaded in your project. ...",
      "id": "<GUID>",
      "title": "Viewing all your project's assets",
      "heading": "",
      "codename": "viewing_all_your_project_s_assets",
      "order": 1,
      "objectID": "viewing_all_your_project_s_assets_1",
      "platforms": [],
      "section": "tutorials"
    },
    {
      "content": "Find the assets you want by typing your query into the filter field. ...",
      "id": "<GUID>",
      "title": "Viewing all your project's assets",
      "heading": "Searching for assets",
      "codename": "viewing_all_your_project_s_assets",
      "order": 2,
      "objectID": "viewing_all_your_project_s_assets_2",
      "platforms": [],
      "section": "tutorials"
    }
  ],
  "codename": "viewing_all_your_project_s_assets",
  "id": "<GUID>",
  "initialize": false
}
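A blob body with this shape could be assembled roughly as follows. The function name and its parameters are hypothetical; only the resulting JSON structure is taken from the example above, and uploading to storage is left out.

```javascript
// Hypothetical helper: serializes one tutorial's records into the blob body.
function createBlobBody(rootItem, itemRecords, initialize) {
  return JSON.stringify(
    {
      itemRecords,
      codename: rootItem.codename, // kept so an archived tutorial can still be identified
      id: rootItem.id,
      initialize, // true when triggered by the initialize endpoint
    },
    null,
    2
  );
}

const blobBody = createBlobBody(
  { codename: 'viewing_all_your_project_s_assets', id: '<GUID>' },
  [], // an archived tutorial would carry no records
  false
);
console.log(blobBody);
```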

Configuration for integration tests

Receiving an event from Dispatcher with the "test": "enabled" attribute makes the service run with an alternative set of environment variables that is used for integration tests.
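One way such a switch could work is sketched below. The variable names (PROJECT_ID, PROJECT_ID_TEST, and so on), the location of the test attribute on the event data, and the function name are all assumptions made for illustration; the service's real variable names are not listed in this document.

```javascript
// Hypothetical helper: picks regular or test settings based on the event.
// "env" would typically be process.env; it is a parameter here to keep the sketch testable.
function selectConfiguration(eventData, env) {
  const isTest = Boolean(eventData) && eventData.test === 'enabled';
  const suffix = isTest ? '_TEST' : ''; // assumed naming scheme for test variables
  return {
    isTest,
    projectId: env[`PROJECT_ID${suffix}`],
    storageContainer: env[`STORAGE_CONTAINER${suffix}`],
  };
}

const sampleEnv = {
  PROJECT_ID: 'live-project',
  PROJECT_ID_TEST: 'test-project',
  STORAGE_CONTAINER: 'records',
  STORAGE_CONTAINER_TEST: 'records-test',
};
const testConfiguration = selectConfiguration({ test: 'enabled' }, sampleEnv);
const liveConfiguration = selectConfiguration({}, sampleEnv);
console.log(testConfiguration, liveConfiguration);
```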
