ShardKeyCookbook

Nathan Leach edited this page Dec 18, 2020 · 4 revisions

Shard Key Cookbook

If you have come to this documentation, you likely need to store a large amount of data and access it in a scalable way. The MongoDB documentation about choosing a shard key is a good read if you are not familiar with sharding concepts as they relate to document databases. In the context of CxAnalytix, sharding is discussed primarily in terms of storage scalability. Scalability for efficient reads (e.g. avoiding queries across multiple shards) is beyond the scope of CxAnalytix.

Shard Keys and Vulnerability Data

Choosing a shard key from vulnerability data is difficult because it is hard to predict whether a field has low cardinality (i.e. the field has very few unique values) or high cardinality (i.e. the field has mostly unique values). Harder still is predicting how frequently a field's value changes.

ProjectName, ProjectId, and TeamName are examples of low cardinality fields within any given collection. As scans are executed against a project, the project's name repeats in every record those scans produce. Projects are also unlikely to change teams very often.

ScanId is an example of a high cardinality field. ScanId, however, may not be an ideal selection as a shard key considering it changes for every scan. In an extreme scenario, imagine a data storage shard being allocated for each scan. This may result in a system where the size of the allocated data storage is difficult to manage.

Combining several fields to form a composite key can often achieve sufficient cardinality. When choosing fields, it is important to include fields that have a sufficient change frequency. The required frequency may depend on the volume of scanning performed. If the frequency of change is very low and scan volume very high, the shard storage space may reach maximum capacity.
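
To make the effect of combining fields concrete, the following Python sketch counts the distinct values of single fields and of composite keys over a handful of records. It is only an illustration: the record values are invented, and `FinishedMonth` is a hypothetical field standing in for a date-derived value; it is not a CxAnalytix field name.

```python
# Hypothetical scan records; all values are invented for illustration.
# "FinishedMonth" is an assumed, date-derived field, not a real CxAnalytix field.
records = [
    {"TeamName": "TeamA", "ProjectName": "Web", "ScanId": 101, "FinishedMonth": "2020-10"},
    {"TeamName": "TeamA", "ProjectName": "Web", "ScanId": 102, "FinishedMonth": "2020-11"},
    {"TeamName": "TeamA", "ProjectName": "Api", "ScanId": 103, "FinishedMonth": "2020-11"},
    {"TeamName": "TeamB", "ProjectName": "Mobile", "ScanId": 104, "FinishedMonth": "2020-11"},
    {"TeamName": "TeamB", "ProjectName": "Mobile", "ScanId": 105, "FinishedMonth": "2020-12"},
    {"TeamName": "TeamB", "ProjectName": "Mobile", "ScanId": 106, "FinishedMonth": "2020-12"},
]


def cardinality(rows, fields):
    """Count the distinct composite values formed from the given fields."""
    return len({tuple(row[f] for f in fields) for row in rows})


print(cardinality(records, ["TeamName"]))                                  # 2
print(cardinality(records, ["TeamName", "ProjectName"]))                   # 3
print(cardinality(records, ["TeamName", "ProjectName", "FinishedMonth"]))  # 5
print(cardinality(records, ["ScanId"]))                                    # 6
```

In this sample, the team and project fields alone stay at a handful of values no matter how many scans run, while adding a field that changes over time keeps producing new key values without creating one key per scan.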

Generated Shard/Partition Keys

The MongoDB configuration can optionally specify a calculated shard key that is added to each record written to a collection. This is primarily for use with cloud-based document storage systems that dynamically expand the storage based on a value at the root of the document. It can be used as a generic method for automating the calculation of shard affinity if desired, but MongoDB's shard indexing capability is more flexible than using this option.

Each record type has its own fields and cardinality considerations.

For those that wish to experiment with the key format specifier, an interactive programming example is available.

Key Format Specifier

The key format specifier is a string composed of alphanumeric text, field specifiers, and field format specifiers.

The syntax of the format specifier is:

{field key[:format value]}

Here, the field key element is the name of the field in the record from which a value is extracted when composing the shard key. The following example creates a shard key from the data found in ScanType, TeamName, and ProjectName.

<Spec KeyName="pkey" CollectionName="SAST_Scan_Summary" FormatSpec="SHARD-{ScanType}-{TeamName}-{ProjectName}"  />

When the field referenced by the field key element is a dictionary type, dotted notation may be used to reference a key within that dictionary. The following example creates a shard key using the value of a custom field:

<Spec KeyName="pkey" CollectionName="SAST_Scan_Summary" FormatSpec="SHARD-{ScanType}-{TeamName}-{ProjectName}-{CustomFields.MyCustomField}"  />

(CustomFields is currently the only element containing a dictionary)

Literal curly braces ({ and }) can be embedded in the format spec by escaping each one with a backslash (\). The following example shows the shard key contents surrounded by curly braces:

<Spec KeyName="pkey" CollectionName="SAST_Scan_Summary" FormatSpec="\{{ScanType}-{TeamName}-{ProjectName}\}"  />
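
The expansion rules described above (field keys, dotted notation for dictionary values, and backslash-escaped braces) can be sketched in a few lines of Python. This is a minimal illustration of the described behavior, not the CxAnalytix implementation: the function names `lookup` and `expand_key` and the record values are assumptions, and the optional format value is deliberately left unhandled.

```python
def lookup(record, field_key):
    """Resolve a field key, following dots into nested dictionaries."""
    value = record
    for part in field_key.split("."):
        value = value[part]
    return value


def expand_key(format_spec, record):
    """Expand {field key} placeholders; backslash-escaped braces stay literal."""
    out = []
    i = 0
    n = len(format_spec)
    while i < n:
        ch = format_spec[i]
        if ch == "\\" and i + 1 < n and format_spec[i + 1] in "{}":
            # Escaped brace: emit the brace itself and skip the backslash.
            out.append(format_spec[i + 1])
            i += 2
        elif ch == "{":
            end = format_spec.index("}", i)
            field_key, _, fmt = format_spec[i + 1:end].partition(":")
            if fmt:
                # Format values follow .NET conventions; not covered by this sketch.
                raise NotImplementedError("format values are not handled here")
            out.append(str(lookup(record, field_key)))
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Hypothetical record; the values are invented for illustration.
record = {
    "ScanType": "Full",
    "TeamName": "TeamA",
    "ProjectName": "Web",
    "CustomFields": {"MyCustomField": "gold"},
}

print(expand_key("SHARD-{ScanType}-{TeamName}-{ProjectName}", record))
print(expand_key("SHARD-{ScanType}-{CustomFields.MyCustomField}", record))
print(expand_key(r"\{{ScanType}-{TeamName}-{ProjectName}\}", record))
```

With the sample record, the three calls yield SHARD-Full-TeamA-Web, SHARD-Full-gold, and {Full-TeamA-Web} respectively.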

Field Format Specifier

The field format specifier follows the same convention as a .NET string format specifier. The format string used depends on the data type of the field.

Example Shard Key Format Specifiers

This section documents new examples of shard keys as they are chosen in field implementations.

The examples below are given for consideration of a suitable shard key. The volume of scans in an organization should be taken into consideration when selecting a shard key. The given examples are likely suitable for a moderate scan volume.

The fields TeamName and ProjectName are common to all records, and adding either or both is often an easy way to increase cardinality. Date fields are also generally a good choice for increasing cardinality. Using the field format specifier, the cardinality generally increases as the time span covered by the formatted value decreases (e.g. year, then month, then day-of-month and so on); note that a day-of-week name yields only seven distinct values, fewer than the twelve of a month.
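
As a rough illustration of how the chosen date component affects cardinality, the Python sketch below counts the distinct values each component produces when one scan finishes on every day of 2020. The sample dates are invented, and the `strftime` patterns are assumed stand-ins for the .NET-style format values used by the field format specifier; this is not CxAnalytix code.

```python
from datetime import date, timedelta

# One hypothetical scan per day across 2020 (a leap year, so 366 days).
days = [date(2020, 1, 1) + timedelta(days=n) for n in range(366)]

# strftime patterns assumed to stand in for .NET-style date format values.
patterns = {
    "year": "%Y",
    "month": "%m",
    "day-of-week": "%A",
    "day-of-month": "%d",
}


def distinct_values(pattern):
    """Number of distinct strings the pattern yields over the sample days."""
    return len({d.strftime(pattern) for d in days})


counts = {name: distinct_values(p) for name, p in patterns.items()}
print(counts)
# {'year': 1, 'month': 12, 'day-of-week': 7, 'day-of-month': 31}
```

Components can also be combined in a single format value, which is how the examples below pair the year with the day of the week.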

SAST Scan Summary

This example uses the scan type, the year, and the full name of the day of the week on which the scan finished.

<add KeyName="pkey" CollectionName="SAST_Scan_Summary" FormatSpec="{ScanType}-{ScanFinished:yyyy-dddd}" NoHash="true" />
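
To show the shape of the key this format spec would produce, the Python sketch below renders it for one sample record. The record values are invented (including the ScanType value), and `%Y-%A` is assumed to be the `strftime` counterpart of the .NET custom format `yyyy-dddd`; day names depend on the locale or culture in both environments.

```python
from datetime import datetime

# Hypothetical record; the values are invented for illustration.
record = {"ScanType": "Full", "ScanFinished": datetime(2020, 12, 18, 14, 30)}

# "%Y-%A" is assumed to correspond to the "yyyy-dddd" format value.
key = f"{record['ScanType']}-{record['ScanFinished'].strftime('%Y-%A')}"
print(key)  # Full-2020-Friday
```

Because the date part takes at most seven distinct values per year, the number of distinct keys per year is bounded by seven times the number of distinct scan types.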

SAST Vulnerability Details

This example uses the scan type, the query group, the year and full name of the day of the week when the scan finished.

<add KeyName="pkey" CollectionName="SAST_Scan_Detail" FormatSpec="{ScanType}-{QueryGroup}-{ScanFinished:yyyy-dddd}" NoHash="true" />

OSA Scan Summary

TBD

OSA Vulnerability Details

TBD

Project Information

TBD

Policy Violations Details

TBD