Large Language Models
Large Language Models (LLMs) are a hot research area in the Natural Language Processing (NLP) community. With the release of ChatGPT, LLMs have brought generative AI into the mainstream more rapidly than any previous technical approach. Starting with TRAM 1.3, we have developed and included an ATT&CK labeling engine based on the LLM known as BERT. LLMs are pre-trained on large amounts of text, and in the process they learn to perform useful tasks such as filling in missing words or finishing sentences given a prompt. The successful completion of these tasks indicates that LLMs learn important semantic features of language, such as distinguishing parts of speech or identifying words with similar meanings. Previous versions of TRAM used linear models or decision trees based on simple text features (e.g., n-grams) that cannot learn rich representations of language, especially with small datasets. (And, since annotated data is very expensive to produce, we must accept that we will be working with a limited amount of data.)
Our research goal for using LLMs in TRAM is to leverage an LLM's understanding of language and then "fine-tune" the model on our specific problem, i.e. labeling ATT&CK techniques in text. Our hypothesis is that the model will be able to quickly identify synonymous usage of words. For example, "forking a process" and "spawning a process" are two ways of saying the same thing, and both are relevant to understanding how malware might behave. An ideal LLM will be able to quickly learn that forking and spawning are synonyms in this context.
To focus our research efforts, we selected a subset of 50 ATT&CK techniques from the complete set of over 600 techniques and subtechniques. The criteria for choosing these 50 techniques are:
- The most commonly found techniques in the TRAM 1.0 data
- The most commonly discovered techniques from the Sightings Report
- The most common techniques as defined by Actionability, Choke Point, and Prevalence from Top ATT&CK Techniques
Table of 50 ATT&CK Techniques | | | | |
---|---|---|---|---|
T1548.002 Abuse Elevation Control Mechanism: Bypass User Account Control | T1484.001 Domain Policy Modification: Group Policy Modification | T1070.004 Indicator Removal: File Deletion | T1566.001 Phishing: Spearphishing Attachment | T1518.001 Software Discovery: Security Software Discovery |
T1557.001 Adversary-in-the-Middle: LLMNR/NBT-NS Poisoning and SMB Relay | T1573.001 Encrypted Channel: Symmetric Cryptography | T1105 Ingress Tool Transfer | T1057 Process Discovery | T1218.011 System Binary Proxy Execution: Rundll32 |
T1071.001 Application Layer Protocol: Web Protocols | T1041 Exfiltration Over C2 Channel | T1056.001 Input Capture: Keylogging | T1055 Process Injection | T1082 System Information Discovery |
T1547.001 Boot or Logon Autostart Execution: Registry Run Keys / Startup Folder | T1190 Exploit Public-Facing Application | T1570 Lateral Tool Transfer | T1090 Proxy | T1016 System Network Configuration Discovery |
T1110 Brute Force | T1068 Exploitation for Privilege Escalation | T1036.005 Masquerading: Match Legitimate Name or Location | T1012 Query Registry | T1033 System Owner/User Discovery |
T1059.003 Command and Scripting Interpreter: Windows Command Shell | T1210 Exploitation of Remote Services | T1112 Modify Registry | T1219 Remote Access Software | T1569.002 System Services: Service Execution |
T1543.003 Create or Modify System Process: Windows Service | T1083 File and Directory Discovery | T1106 Native API | T1021.001 Remote Services: Remote Desktop Protocol | T1552.001 Unsecured Credentials: Credentials In Files |
T1074.001 Data Staged: Local Data Staging | T1564.001 Hide Artifacts: Hidden Files and Directories | T1095 Non-Application Layer Protocol | T1053.005 Scheduled Task/Job: Scheduled Task | T1204.002 User Execution: Malicious File |
T1005 Data from Local System | T1574.002 Hijack Execution Flow: DLL Side-Loading | T1003.001 OS Credential Dumping: LSASS Memory | T1113 Screen Capture | T1078 Valid Accounts |
T1140 Deobfuscate/Decode Files or Information | T1562.001 Impair Defenses: Disable or Modify Tools | T1027 Obfuscated Files or Information | T1072 Software Deployment Tools | T1047 Windows Management Instrumentation |
Large language models benefit from being pre-trained on vast amounts of data. These models start off with a rich understanding of human language, and therefore require a much smaller amount of domain-specific training data to learn domain-specific tasks. A main benefit of using LLMs is their ability to generalize to text not included in the training data. This means that LLM-based models are more robust to unseen words and are capable of perceiving subtle relationships between words that are indicative of an ATT&CK technique.
We considered three different LLMs across two architectures. While other models and LLM architectures exist, these three:
- are open access
- have appropriate licenses for our use case
- are associated with reputable labs
The two architectures considered were BERT and GPT-2. In both cases, the LLMs are intended for different use cases than text classification but can be adapted during fine-tuning. We considered two BERT models, namely the original BERT model (Devlin et al.) as well as SciBERT, a variation trained on scientific literature. BERT is designed to predict masked words in text, while GPT-2 is designed for generating text and produces sequences of words by considering which word would make sense next given the words it has already produced.
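The difference between the two pre-training objectives can be seen with a small, self-contained example using the Hugging Face transformers pipelines; this is purely illustrative and is not part of TRAM itself:

```python
# Illustration of the two pre-training objectives (not part of TRAM):
# BERT fills in a masked word; GPT-2 continues a prompt word by word.
from transformers import pipeline

# Masked-word prediction with the original BERT model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The malware [MASK] a new process."))

# Next-word generation with GPT-2.
generate = pipeline("text-generation", model="gpt2")
print(generate("The malware spawns a new process and", max_new_tokens=20))
```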
To confirm our hypothesis that LLMs could perform better, we needed a way to analyze and compare results. Precision, recall, and F1 score are common metrics we can use to compare the performance of models. Precision is the metric that penalizes false positives (a score of 1 indicates no false positives), and recall is the metric that penalizes false negatives (a score of 1 indicates no false negatives). F1 is the harmonic mean of precision and recall, which means that instead of being halfway between precision and recall (as you would get from adding them and dividing by two), the F1 score is skewed towards the lower of the two scores.
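As a small worked example of that skew (the numbers here are arbitrary and chosen only for illustration):

```python
# Toy numbers chosen only to show how F1 skews toward the lower score.
precision = 0.9
recall = 0.5

arithmetic_mean = (precision + recall) / 2            # 0.70
f1 = 2 * precision * recall / (precision + recall)    # ~0.64, closer to recall

print(arithmetic_mean, f1)
```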
Each of these three metrics is calculated for each individual ATT&CK technique. When talking about the aggregate performance of the whole model, we can take the micro or macro average. The micro average treats each instance the same, and calculates precision, recall, and F1 based on true positives, false positives, and false negatives pooled across every technique. The macro average treats each technique the same (even if it appears more or less often than other techniques) and averages the per-technique precision, recall, and F1 scores that have already been calculated.
Typical metrics for Machine Learning performance - source: "The Role of Machine Learning in Cybersecurity", https://doi.org/10.1145/3545574
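As a hypothetical illustration of the difference between micro and macro averaging (the labels below are made-up technique IDs, not real TRAM evaluation data), scikit-learn computes both directly:

```python
# Hypothetical labels illustrating micro vs. macro averaging; not real TRAM data.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["T1105", "T1105", "T1105", "T1057", "T1082"]
y_pred = ["T1105", "T1105", "T1057", "T1057", "T1057"]

# Micro: pool true/false positives and false negatives across every technique.
micro = precision_recall_fscore_support(y_true, y_pred, average="micro", zero_division=0)

# Macro: score each technique separately, then average the per-technique scores.
macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)

print("micro (P, R, F1):", micro[:3])
print("macro (P, R, F1):", macro[:3])
```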
To compare the performance of each model, all three (SciBERT, BERT, GPT-2) were fine-tuned for ten epochs to perform single-label classification on a dataset that combined the TRAM tool’s embedded training data with the annotated CTI reports.
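A minimal sketch of this kind of fine-tuning, using the Hugging Face transformers and datasets libraries, is shown below; the toy training examples, batch size, and sequence length are assumptions for illustration only, not the exact TRAM training code (which lives in the project’s notebooks):

```python
# Illustrative sketch of fine-tuning SciBERT for single-label ATT&CK technique
# classification; the toy data and most settings here are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "allenai/scibert_scivocab_uncased"

# Stand-in for the combined TRAM training data and annotated CTI reports.
train_ds = Dataset.from_dict({
    "text": ["The implant was launched via rundll32 to proxy execution.",
             "Credentials were dumped from LSASS process memory."],
    "label": [0, 1],  # integer-encoded technique classes (e.g. T1218.011, T1003.001)
})

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=50)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_ds = train_ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="scibert-tram",
                         num_train_epochs=10,  # matches the ten epochs used above
                         per_device_train_batch_size=16)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```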
The results show SciBERT performs best during the first epoch and reaches peak performance more quickly than the other two. This is likely because its vocabulary is more aligned with the vocabulary of our data, and by extension, the kinds of documents on which the final model will be applied. As a result, we selected SciBERT as the best performing LLM architecture to integrate into TRAM.
The fine-tuned SciBERT model shows improvement over the logistic regression model in all but one of the areas where we measured precision, recall, and F1 score. For TRAM users, this means that of every 100 techniques the new LLM identifies, 88 are correct, and it misses about 12 techniques out of every 100 samples. The F1 score indicates the balance between the precision and recall scores.
The LLM functionality has been built into a Jupyter notebook that you can run locally or host online through Google Colab. With Colab, you can import your own data and run our LLM training code on Google’s GPU-enabled systems using either the paid or free tiers. This alternative approach offers advanced users a step-by-step process for executing the code behind the text classifier. You can use this to create your own model weights, train on additional data, or even set up training for ATT&CK techniques not included in our subset of 50 described above.
To use the notebook, follow the comment sections in each of the cells to download the model, set up the analysis parameters, and then upload a report. Machine learning engineers can customize the configuration to further refine the results.
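Once the fine-tuned weights are available, running the classifier over a text segment looks roughly like the sketch below; the "scibert-tram" path is a placeholder for wherever the notebook saves or downloads the model, and the notebook’s actual cells may differ:

```python
# Rough inference sketch; "scibert-tram" is a placeholder path for the
# fine-tuned model weights produced or downloaded by the notebook.
from transformers import pipeline

classifier = pipeline("text-classification", model="scibert-tram", tokenizer="scibert-tram")

segment = "The dropper copies itself to the Startup folder and adds a Run key."
print(classifier(segment))  # -> [{'label': <technique id>, 'score': ...}]
```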
The TRAM notebook divides uploaded reports into partially overlapping n-grams. An n-gram is a sequence of n adjacent words. By extracting n-grams from each document, we can produce segments that may be closer in construction to the segments the model was trained on than complete sentences are. The notebook allows you to specify the value of n; because the model was trained on segments of varying length, adjusting n may allow the model to make predictions that it wouldn’t make on longer or shorter segments.
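A rough sketch of this segmentation step is below; the n and stride values are arbitrary choices for illustration, and the notebook’s own windowing logic and defaults may differ:

```python
# Split a report into partially overlapping n-word segments.
# The n and stride values below are arbitrary illustrations.
def word_ngrams(text, n=13, stride=5):
    """Yield n-word segments, advancing `stride` words at a time."""
    words = text.split()
    for start in range(0, max(len(words) - n + 1, 1), stride):
        yield " ".join(words[start:start + n])

report = ("The actor sent a spearphishing attachment that dropped a loader, "
          "which then downloaded additional tools over an encrypted channel.")
for segment in word_ngrams(report, n=8, stride=4):
    print(segment)
```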