Data Analysis FAQ

Nathan Leach edited this page Aug 9, 2021 · 5 revisions

Frequently Asked Questions Related to Data Analysis

CxAnalytix outputs data related to vulnerabilities as they are detected and remediated over time. This means it is technically time-series data. Time-series data, by definition, consists of values recorded periodically as they change over time. Each recorded value can be referred to as a "sample".

This generally causes some confusion, as most people are accustomed to analyzing business data that records only the current state of the business (e.g. "Give me an A/R report showing a list of customers with outstanding balances, grouping the past-due amounts into 30/60/90-day buckets."). Most such data is organized relationally and queried by understanding the linkage between multiple entities. The prerequisite for extracting meaning from relationally organized data is understanding which entities are related.

The CxAnalytix data is generally "flat" output (with a few exceptions); each record (or "sample") carries all values required for its context, so no knowledge of relationships between entities is needed. The technique for deriving meaning from the data is to understand the filtering criteria needed to reduce the data set to only what is relevant for the analysis. This filtering is often performed as a pipeline, starting with the widest criteria and sending data through progressively narrower criteria.
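As a sketch, the progressive-filtering pipeline might look like this in Python, assuming the flat records have been loaded as dictionaries (the specific field values and the Status value "Recurrent" are illustrative, not taken from a real data set):

```python
# Flat CxAnalytix-style records; each sample is self-contained (hypothetical values).
records = [
    {"ProjectName": "ShoppingCart_master", "QueryName": "SQL_Injection", "Status": "New"},
    {"ProjectName": "ShoppingCart_master", "QueryName": "XSS", "Status": "Recurrent"},
    {"ProjectName": "Billing_main", "QueryName": "SQL_Injection", "Status": "New"},
]

# Widest filter first: one project...
project_records = [r for r in records if r["ProjectName"] == "ShoppingCart_master"]
# ...then progressively narrower criteria: one query, then one status.
sql_records = [r for r in project_records if r["QueryName"] == "SQL_Injection"]
new_sql = [r for r in sql_records if r["Status"] == "New"]
# new_sql now holds the single matching sample
```

Each stage only ever narrows the data set, so the stages can be reordered or extended without changing the meaning of the result.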

Understanding Sampling Behavior

Performing analysis on vulnerability data requires a bit of knowledge about the circumstances by which data arrives. Most time series data is collected from sensors that are emitting samples on a somewhat predictable periodic basis; vulnerability data is not collected in the same manner.

Sample Collection

The first thing to understand is that scans are not necessarily performed with a predictable cadence. Why is this?

  • Scanning code on commit happens only when commits occur; developers don't always work on code in a particular repository, and they certainly do not commit code on a regular pattern.
  • Scheduled scans may fail or be delayed by lack of scanning resources.
  • Ad-hoc scans may be interleaved between scans scheduled on a regular cadence.
  • Code that is not under active development may not get scanned regularly.

Factors for Changing Results

There are several variables that affect how vulnerabilities can change over time. The most obvious one is that vulnerabilities appear and disappear based on code changes over time. If this were the only factor that caused changes in detected vulnerabilities, analysis would be easy. Consider:

  • Upgrades to the SAST software can increase the coverage of languages, frameworks, APIs, and even coding techniques. Vulnerability count trends may reflect the level of scan accuracy that product changes introduce in the upgrade.
  • Tuning of the Preset, Exclusions, and/or CxQL query can change what is detected in a scan.
  • The submitted code can be changed to affect the scan results.
    • In some integration scenarios, it is possible for developers to submit code to exclude files containing vulnerable code. The issue will appear to have been remediated due to a change in the build process that is not detected by static code analysis.
    • Similarly, it is possible to inadvertently submit code that should not be scanned thus increasing the number of results.
  • Errors in the management of the code may cause vulnerabilities that were previously fixed to reappear.
  • Incremental scan results will likely differ significantly from full scan results.

FAQs

How do I identify a SAST vulnerability across multiple scans?

This would require the use of the SAST Vulnerability Details record set. The samples in this record set contain the flattened path of each vulnerability and would therefore need to be filtered to reduce the number of samples for analysis. Filtering the samples where NodeId is 1 is usually sufficient to reduce the records for this type of analysis.
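A minimal sketch of that filtering step, assuming the SAST Vulnerability Details samples are dictionaries (the ScanId and NodeLine fields shown here are illustrative):

```python
# Each vulnerability path is flattened into one record per node; keeping only
# NodeId == 1 yields one record per vulnerability per scan.
details = [
    {"ScanId": 100, "SimilarityId": -123, "NodeId": 1, "NodeLine": 10},
    {"ScanId": 100, "SimilarityId": -123, "NodeId": 2, "NodeLine": 25},
    {"ScanId": 100, "SimilarityId": 777, "NodeId": 1, "NodeLine": 3},
]
first_nodes = [d for d in details if d["NodeId"] == 1]
# first_nodes contains one record per vulnerability in the scan
```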

Most approaches incorrectly assume that SimilarityId alone can identify a vulnerability across scans. This does not work because:

  • Vulnerability paths for files that are copied to different paths will have the same SimilarityId.
  • Vulnerability paths for files that are scanned in multiple projects will have the same SimilarityId.
  • Code that is copy/pasted multiple times in the same file may have the same SimilarityId.
  • Different vulnerabilities with the same start and end node in the data flow path will have the same SimilarityId.

How to identify a specific vulnerability across scans depends on what is needed for your particular analysis. This may seem counterintuitive, but it depends on how the SAST system is being used and what the analysis is trying to achieve. Generating a compound identifier from multiple data components allows the vulnerability to be tracked across multiple scans.

To understand which components to select for the compound identifier, some explanation of the data elements is required.

Project Identification

Scans are executed under the context of a SAST project. It is possible that each project represents a unique code repository or multiple branches of a single code repository. ProjectId is a unique value for the concatenation of TeamName and ProjectName as a path. For example, "/CxServer/US/FinancialServices/ShoppingCart_master" has a team name of "/CxServer/US/FinancialServices" and a project name of "ShoppingCart_master".
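The team/project split described above can be illustrated with a small helper (the function itself is hypothetical, not part of CxAnalytix; it simply splits the path at the final separator):

```python
def split_project_path(path: str) -> tuple[str, str]:
    """Split a full project path into (team name, project name)."""
    team, _, project = path.rpartition("/")
    return team, project

team, project = split_project_path("/CxServer/US/FinancialServices/ShoppingCart_master")
# team == "/CxServer/US/FinancialServices"
# project == "ShoppingCart_master"
```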

The SAST server treats vulnerabilities with the same SimilarityId as a single vulnerability across all projects in the same team. Setting a vulnerability with the status of Not Exploitable in one project, for example, would result in the vulnerability being marked as Not Exploitable if the same file (or a copy of it) were scanned in another project on the same team.

Vulnerability Classification

Since vulnerabilities may have the same start and end node, a SimilarityId value may appear under multiple vulnerability categories (or even multiple times per category). The category roughly corresponds to the QueryName. Using QueryName as a component in a compound identifier is often sufficient for classification, since most queries won't report results with the same QueryName under a different QueryGroup with the same SimilarityId. This is usually the case because the result path is limited to a single language across all nodes of the data flow path.

It is possible, however, to have language agnostic results (such as TruffleHog) that give the same result for the same QueryName under each QueryGroup. Using both the QueryGroup and QueryName as part of the compound identifier would increase the identification accuracy.

One caveat is that QueryGroup is mainly a composition of QueryLanguage and the default QuerySeverity. Consider:

  • The QuerySeverity value can be adjusted via CxAudit. The QueryGroup value will not change to reflect the adjusted QuerySeverity value.
  • The ResultSeverity value defaults to the QuerySeverity value but can be adjusted on a per-result basis by users of the system. This is often done to reflect vulnerability remediation priority.

Using fields whose meanings can change over time may produce unexpected analysis results.

Aggregation of Data from Multiple SAST Instances

If you have multiple SAST instances and are aggregating the data into a single store, add InstanceId as part of the unique identifier.

Examples of Compound Identifiers for Tracking Vulnerabilities Across Scans

ProjectId + QueryName + SinkFileName + SinkLine + SimilarityId

This identifier will track the vulnerability across scans. It will potentially result in duplicate vulnerabilities in projects under the same team when counting total vulnerabilities.

For greater accuracy in tracking, consider adding the following fields:

  • QueryGroup
  • QueryLanguage + QuerySeverity
  • QueryLanguage + ResultSeverity
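One way to build the base identifier above is a simple tuple of the named fields, which can then serve as a dictionary key or grouping key (the record values are illustrative):

```python
def vuln_key(record: dict) -> tuple:
    """Compound identifier: ProjectId + QueryName + SinkFileName + SinkLine + SimilarityId."""
    return (
        record["ProjectId"],
        record["QueryName"],
        record["SinkFileName"],
        record["SinkLine"],
        record["SimilarityId"],
    )

r = {"ProjectId": 42, "QueryName": "SQL_Injection",
     "SinkFileName": "src/db.py", "SinkLine": 88, "SimilarityId": -123}
key = vuln_key(r)
```

A tuple key makes it straightforward to group samples from many scans by vulnerability before any time-based analysis; adding QueryGroup (or the other field pairs listed above) is a one-line change.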

TeamName + QueryName + SinkFileName + SinkLine + SimilarityId

Using TeamName in place of ProjectId will effectively allow vulnerabilities to be assessed once for all projects on the team. There are some potential drawbacks:

  • The same code in unrelated projects may be counted as one vulnerability for all projects in the team.
  • Projects can be moved to different teams. Moving a project to a new team will change the timeline for the vulnerability, since historical samples reflect the team name at the time each sample was recorded.
  • It may not be possible to determine when a vulnerability was resolved since it will require all projects in the team that report the vulnerability to perform a scan that no longer contains the vulnerability.

How do I detect when a vulnerability first appeared?

One method is to find the vulnerability where the Status field is New. This works if and only if a sample was recorded the first time the vulnerability was detected. There are various scenarios where this may not happen:

  • The report for the scan could not be retrieved at the time CxAnalytix performed the crawl for scans.
  • Data retention has been run and the first scan was purged prior to CxAnalytix crawling the scans.

A more general method may be to use the compound identifier for tracking vulnerabilities across scans and determine which scan is associated with the sample containing the earliest value in the ScanFinished field.
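A sketch of that more general method, assuming the samples for one tracked vulnerability have already been grouped by the compound identifier (the ScanId field and timestamp values are illustrative):

```python
from datetime import datetime

# Samples of one tracked vulnerability across scans (hypothetical data).
samples = [
    {"ScanId": 101, "ScanFinished": "2021-03-01T10:00:00"},
    {"ScanId": 100, "ScanFinished": "2021-02-01T10:00:00"},
    {"ScanId": 102, "ScanFinished": "2021-04-01T10:00:00"},
]

# The sample with the earliest ScanFinished marks the first appearance.
first_seen = min(samples, key=lambda s: datetime.fromisoformat(s["ScanFinished"]))
# first_seen["ScanId"] == 100
```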

FirstDetectionDate

As of SAST 9.3 and CxAnalytix 1.3.1, the field FirstDetectionDate is part of the data output specification. Scans executed prior to 9.3 will not have a valid value for FirstDetectionDate.

How do I detect when a vulnerability was resolved?

This depends on how your organization defines the criteria for a "resolved vulnerability".

First, some variable definitions:

  • Let VT be the vulnerability that is tracked across multiple scans using the chosen composite identifier.
  • Let 𝕊 be the set of scans having the same ProjectId field value where at least one scan reports VT.
  • Let 𝕊found ⊆ 𝕊 be the subset of scans where VT is reported: 𝕊found = {S ∈ 𝕊 | S reports VT}.

Finding the date VT first appeared means finding the scan Sfound ∈ 𝕊found with the earliest value for ScanFinished.

The Easy Answer

Given the subset of scans not reporting VT, 𝕊fixed = {S ∈ 𝕊 | S does not report VT}, we know that if 𝕊fixed = ∅ the vulnerability is still outstanding.

If the most recent scan Slatest ∈ 𝕊 is also in 𝕊fixed (Slatest ∈ 𝕊fixed), then the scan Sfixed ∈ 𝕊fixed with the earliest ScanFinished date gives the date the vulnerability was remediated.
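The easy answer can be sketched in a few lines, assuming the per-scan samples carry a flag for whether VT was reported (the reports_vt field is hypothetical; in practice it would be derived by checking the compound identifier against each scan's results):

```python
# Scans 𝕊 for one project, ordered data is not required (hypothetical data).
scans = [
    {"ScanId": 1, "ScanFinished": "2021-01-01", "reports_vt": True},
    {"ScanId": 2, "ScanFinished": "2021-02-01", "reports_vt": True},
    {"ScanId": 3, "ScanFinished": "2021-03-01", "reports_vt": False},
]

s_fixed = [s for s in scans if not s["reports_vt"]]           # 𝕊fixed
s_latest = max(scans, key=lambda s: s["ScanFinished"])        # Slatest

if not s_fixed:
    resolution = None  # 𝕊fixed is empty: still outstanding
elif s_latest in s_fixed:
    # Earliest fixed scan gives the remediation date.
    resolution = min(s_fixed, key=lambda s: s["ScanFinished"])["ScanFinished"]
else:
    resolution = None  # VT re-introduced; handled under "The Hard Answer"
# resolution == "2021-03-01"
```

ISO-8601 date strings compare correctly as plain strings, which keeps the sketch free of date parsing.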

The Hard Answer

Note that it is possible for VT to be re-introduced to the code; while it may be rare, the result is that there are potentially multiple resolution dates. If Slatest ∉ 𝕊fixed, it can be assumed that the vulnerability was re-introduced and is still outstanding.

The detection method presented above will technically work for all cases at the expense of the accuracy of dates related to appearance and resolution. Your organization can decide how they would like to approach analysis for this case. If there is a need to find a more exact date of resolution, more advanced logic is needed.

For a basic method of dealing with vulnerability reappearance, the ScanFinished date for Sfound may still be considered the date VT first appeared for most tracking purposes. It must still hold that Slatest ∈ 𝕊fixed to indicate the vulnerability has been resolved.

The scan Smost-recent-found ∈ 𝕊found with the most recent ScanFinished value is the point where the search for the latest fix date can begin.

Find the scan Smost-recent-fixed ∈ 𝕊fixed where VT was most recently fixed by selecting the scan in 𝕊fixed with the earliest ScanFinished value that is greater than the ScanFinished value of Smost-recent-found. The ScanFinished value of Smost-recent-fixed is the latest date on which VT was resolved.
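The hard answer can be sketched the same way, again assuming a hypothetical reports_vt flag derived from the compound identifier:

```python
# Scans where VT was found, fixed, re-introduced, then fixed again (hypothetical data).
scans = [
    {"ScanFinished": "2021-01-01", "reports_vt": True},   # found
    {"ScanFinished": "2021-02-01", "reports_vt": False},  # first fix
    {"ScanFinished": "2021-03-01", "reports_vt": True},   # re-introduced
    {"ScanFinished": "2021-04-01", "reports_vt": False},  # latest fix
]

s_found = [s for s in scans if s["reports_vt"]]           # 𝕊found
s_fixed = [s for s in scans if not s["reports_vt"]]       # 𝕊fixed
s_latest = max(scans, key=lambda s: s["ScanFinished"])    # Slatest

latest_resolution = None
if s_fixed and s_latest in s_fixed:
    # Smost-recent-found: the latest scan that still reported VT.
    most_recent_found = max(s_found, key=lambda s: s["ScanFinished"])
    # Smost-recent-fixed: earliest fixed scan finishing after Smost-recent-found.
    candidates = [s for s in s_fixed
                  if s["ScanFinished"] > most_recent_found["ScanFinished"]]
    latest_resolution = min(candidates, key=lambda s: s["ScanFinished"])["ScanFinished"]
# latest_resolution == "2021-04-01"
```

Note that the first fix (2021-02-01) is correctly skipped because it precedes the re-introduction; only the fix after the most recent appearance counts as the latest resolution date.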

Why do I see duplicate projects in the Project Information data?

The Project Information record set is a sample of the current state of a project. The fields indicate the state of the project at the time CxAnalytix crawled the scans for each project, so the same project may appear once per crawl in which a sample was recorded.

Why do I see that some projects don't get updated as often as others in Project Information data?

If a project has had no scans executed since the previous crawl, there is effectively no change to the project and no new sample is needed. If one or more scans have been executed since the previous crawl, a Project Information sample will be recorded.