TutorialSearch
Along with the Indexing Service, Tutorial Search is a microservice responsible for indexing the tutorials of the Kentico Kontent documentation portal in the Algolia search engine. It responds to events sent by Dispatcher, fetches content from Kentico Kontent, and converts that content into Algolia-compatible records. Finally, it stores the records in Azure Blob Storage, where the Indexing Service, the next service in the pipeline, can access them.
The service has 2 endpoints, which differ in their trigger type:
- HTTP trigger accepting either a GET or POST request. This endpoint serves as a "restart" button: the index of all tutorials gets cleared and then repopulated with current data from the Kentico Kontent project.
- Event Grid trigger that consumes events published by Dispatcher. This allows the service to react to webhooks from the Kentico Kontent project.
In order to understand how the tutorials are split into records, read the introduction to Algolia beforehand.
As described on the Content Models page, Tutorials consist of multiple content types with text elements that are supposed to be searchable:
- Article - Introduction, Content (Rich Text elements), Title (Text element)
- Scenario - Introduction, Content (Rich Text elements), Title (Text element)
- Content chunk - Content (Rich Text element)
- Callout - Content (Rich Text element)
- Code sample - Code (Text element)
In Kentico Kontent, a tutorial is represented by a single content item of the article or scenario content type. Therefore, this guide refers to items of these 2 content types as tutorials or root items.
Each content item of one of the indexed content types described above always corresponds to at least one record. Additionally, to further improve the search experience, <h2> tags within rich text elements of an article, scenario and content chunk also act as splitters. The most recently encountered heading for each piece of text is saved in each record (heading attribute), so the website can redirect the user directly to the anchored heading.
Ideally, there should not be multiple search hits for a single tutorial. This is accomplished by saving the codename of the root item on each record and setting the codename as a distinct attribute.
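As a rough illustration of the distinct attribute, the sketch below shows the two Algolia index settings involved (attributeForDistinct and distinct) and a local function that mimics their effect on a ranked list of hits. The settings values, the Hit type and the collapseByCodename helper are assumptions made for this example; the settings actually applied to the index are not shown in this guide.

```typescript
// Sketch of the deduplication-related index settings (values are assumptions).
const distinctSettings = {
  // Records sharing the same codename belong to the same tutorial.
  attributeForDistinct: "codename",
  // Return only the best-ranked record of each group.
  distinct: true,
};

// Hypothetical, minimal shape of a search hit.
interface Hit {
  codename: string;
  objectID: string;
}

// Local illustration of what "distinct" does: keep the first
// (best-ranked) hit for each codename and drop the rest.
function collapseByCodename(rankedHits: Hit[]): Hit[] {
  const seen: Record<string, boolean> = {};
  const result: Hit[] = [];
  for (const hit of rankedHits) {
    if (seen[hit.codename] !== true) {
      seen[hit.codename] = true;
      result.push(hit);
    }
  }
  return result;
}
```

With these settings, a query that matches several records of one tutorial yields a single hit for that tutorial.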
Metadata attributes codename, id and section are used by Index Sync for index management.
objectID is computed by Tutorial Search and consists of the tutorial codename and the order of the record. It is required by Algolia and always has to be unique.
Article, content chunk and code sample contain a platform taxonomy element that specifies a list of platforms the tutorial relates to. Having this information stored for each record further improves the search experience: a user can be instantly redirected to a tutorial targeting the requested platform.
{
"content": piece of text (searchable attribute),
"heading": heading relevant to the piece of text (searchable attribute),
"title": title of a tutorial (searchable attribute),
"id": id of a tutorial root item (metadata),
"codename": codename of a tutorial root item (metadata),
"order": order of the record within the tutorial,
"platforms": list of platforms the tutorial relates to,
"section": "tutorials" (metadata) - constant value,
"objectID": id of the record (metadata) - required by Algolia,
}
{
"content": "Because your application doesn't require any server-side code, you can host it as a collection of static files for free on Surge, Github pages, or a similar service. ...",
"heading": "Build and deploy",
"title": "Building your first React app",
"id": "<GUID>",
"codename": "building_your_first_react_app",
"order": 32,
"platforms": [
"react"
],
"section": "tutorials",
"objectID": "building_your_first_react_app_32"
}
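The record structure above can be expressed as a type, together with a small helper that derives objectID. This is only a sketch: the TutorialRecord, RootItem, buildObjectId and createRecord names are hypothetical, and the underscore separator in objectID is inferred from the example record rather than taken from the service's code.

```typescript
// Type mirroring the record attributes listed above.
interface TutorialRecord {
  content: string;
  heading: string;
  title: string;
  id: string;
  codename: string;
  order: number;
  platforms: string[];
  section: "tutorials";
  objectID: string;
}

// Hypothetical subset of a root item needed to build a record.
interface RootItem {
  id: string;
  codename: string;
  title: string;
}

// objectID = tutorial codename + order (separator assumed from the example).
function buildObjectId(codename: string, order: number): string {
  return `${codename}_${order}`;
}

function createRecord(
  root: RootItem,
  content: string,
  heading: string,
  platforms: string[],
  order: number
): TutorialRecord {
  return {
    content,
    heading,
    title: root.title,
    id: root.id,
    codename: root.codename,
    order,
    platforms,
    section: "tutorials",
    objectID: buildObjectId(root.codename, order),
  };
}
```

Because the order is unique within a tutorial and the codename is unique within the project, the combined objectID is unique across the index.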
To ensure that all the indexed data is correct, the service first has to call the clear index HTTP endpoint of Index Sync. Then it continues by fetching all the root items from the Kentico Kontent project.
Because it is subscribed to an Event Grid topic, the service reacts to any change of published content in the Kentico Kontent project. The webhook forwarded from Kentico Kontent contains the following information:
- operation type (subject)
- array of affected content items and their language, codename, and type
Unfortunately, when a published content item that is nested in other items changes, the webhook specifies only the direct "parents" of the changed content item. Therefore, the service has to compute the codenames of all altered tutorials by itself. It fetches all content items from the Kentico Kontent project and iteratively traverses them until it finds all the changed root items.
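The traversal could look roughly like the following sketch. It repeatedly marks every item that directly nests an already affected item, until no new items are found, and then keeps only the root items. The ContentItem shape, the linkedCodenames field and the "article" and "scenario" type codenames are assumptions made for this example, not the service's actual data model.

```typescript
// Hypothetical, simplified view of a fetched content item.
interface ContentItem {
  codename: string;
  type: string;
  // Codenames of items nested directly inside this item (assumed field).
  linkedCodenames: string[];
}

// Assumed type codenames of the root content types.
const ROOT_TYPES = ["article", "scenario"];

function findAffectedRootItems(
  allItems: ContentItem[],
  changedCodenames: string[]
): string[] {
  const affected: Record<string, boolean> = {};
  for (const codename of changedCodenames) {
    affected[codename] = true;
  }

  // Propagate the "affected" mark from children to parents
  // until a full pass over the items finds nothing new.
  let grew = true;
  while (grew) {
    grew = false;
    for (const item of allItems) {
      if (affected[item.codename] === true) continue;
      if (item.linkedCodenames.some((child) => affected[child] === true)) {
        affected[item.codename] = true;
        grew = true;
      }
    }
  }

  // Only root items (tutorials) are re-indexed.
  return allItems
    .filter((item) => affected[item.codename] === true && ROOT_TYPES.indexOf(item.type) !== -1)
    .map((item) => item.codename);
}
```

For example, if a callout nested in a content chunk changes, and that content chunk is nested in an article, the function returns the codename of the article.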
The service uses the JavaScript Delivery SDK for fetching and resolving data from Kentico Kontent. Using the SDK's RichTextResolver, it labels nested content items inside tutorials and inserts their text into the parent's rich text element. Platform elements and any headings of nested content items are also labelled inside the rich text. As a result, there is only a single string (with the content of all the nested items) for each tutorial, which is then split into multiple records.
Additionally, the service filters out the root items that are excluded from search according to their visibility element.
The ItemRecordsCreator class is responsible for splitting the tutorials into separate records. Having received the root content item and the text to index (representing the whole tutorial), it proceeds as follows:
1. Split the text by <h2> tags, extract the headings from the text and save them separately.
2. Split the text by content chunk labels. Extract and save any labelled h2 heading inside a content chunk. Also resolve any labelled platform element.
3. Split the text by inner items (callout and code sample). When a labelled h2 heading is encountered, do not save it for records, because headings on this level are no longer anchored on the web. Resolve any labelled platform element.
4. Strip HTML tags from the split piece of text using the striptags npm package, as tags are not intended to be searched, nor should they be returned from Algolia. There is one exception to this rule: when the text corresponds to a code sample, it is supposed to keep its HTML tags. That's why the indexInnerItem method of ItemRecordsCreator does not call striptags.
5. Create a record using the split piece of text, the currently saved heading (either from step 1 or 2) and the platforms. Utilize the title, codename and id of the tutorial. Increment the order attribute.
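A heavily simplified sketch of the heading split, tag stripping and record creation is shown below. It does not handle content chunk labels, inner items, platforms or the code sample exception, and it uses a plain regular expression as a stand-in for the striptags package so that the example is self-contained. The SplitRecord, stripTags and splitByH2 names are hypothetical and do not come from ItemRecordsCreator.

```typescript
// Hypothetical, reduced record containing only the attributes derived here.
interface SplitRecord {
  content: string;
  heading: string;
  order: number;
  objectID: string;
}

// Naive stand-in for the striptags npm package (assumption for this sketch).
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, "").trim();
}

function splitByH2(codename: string, html: string): SplitRecord[] {
  // The capturing group keeps the heading text in the result:
  // [textBeforeFirstH2, heading1, text1, heading2, text2, ...]
  const parts = html.split(/<h2[^>]*>([\s\S]*?)<\/h2>/);
  const records: SplitRecord[] = [];
  let heading = "";
  let order = 0;

  for (let i = 0; i < parts.length; i++) {
    if (i % 2 === 1) {
      // Odd positions hold headings; remember the most recent one.
      heading = stripTags(parts[i]);
      continue;
    }
    const content = stripTags(parts[i]);
    if (content === "") continue; // Nothing to index in this piece.
    order += 1;
    records.push({ content, heading, order, objectID: `${codename}_${order}` });
  }
  return records;
}
```

Text that precedes the first heading ends up in a record with an empty heading, which corresponds to the first record in the blob example in this guide.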
Finally, the service saves the records to Blob Storage. It creates a new blob file for each tutorial that will be indexed. Alongside the list of records, the blob contains the tutorial's codename and id (necessary in case the root content item has been archived in Kentico Kontent). The blob also specifies whether the initialize endpoint has been called.
{
"itemRecords": [
{
"content": "Your asset list gives you a complete overview of the files uploaded in your project. ...",
"id": "<GUID>",
"title": "Viewing all your project's assets",
"heading": "",
"codename": "viewing_all_your_project_s_assets",
"order": 1,
"objectID": "viewing_all_your_project_s_assets_1",
"platforms": [],
"section": "tutorials"
},
{
"content": "Find the assets you want by typing your query into the filter field. ...",
"id": "<GUID>",
"title": "Viewing all your project's assets",
"heading": "Searching for assets",
"codename": "viewing_all_your_project_s_assets",
"order": 2,
"objectID": "viewing_all_your_project_s_assets_2",
"platforms": [],
"section": "tutorials"
}
],
"codename": "viewing_all_your_project_s_assets",
"id": "<GUID>",
"initialize": false
}
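The blob content shown above could be assembled along these lines. This is a sketch only: the TutorialBlob type and the buildBlobPayload function are hypothetical names, and the upload to Blob Storage itself is left out.

```typescript
// Shape of the blob content, following the example above.
interface TutorialBlob {
  itemRecords: object[];
  codename: string;
  id: string;
  // True when the records were produced by the initialize (HTTP) endpoint.
  initialize: boolean;
}

// Serializes the records of one tutorial into the blob content.
function buildBlobPayload(
  codename: string,
  id: string,
  itemRecords: object[],
  initialize: boolean
): string {
  const blob: TutorialBlob = { itemRecords, codename, id, initialize };
  return JSON.stringify(blob);
}
```

An archived tutorial would produce a blob with an empty itemRecords list, which is why the codename and id are stored at the top level as well.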
Receiving an event from Dispatcher with the "test": "enabled" attribute makes the service run with an alternative set of environment variables that is used for integration tests.