Novel learning framework for building vulnerability detection models
Using graph neural networks and open-source repositories to detect code vulnerabilities. This is an implementation of the model described in: "Combining Graph-based Learning with Automated Data Collection for Code Vulnerability Detection"
FUNDED is a novel learning framework for building vulnerability detection models. It uses graph neural networks (GNNs) in a graph-based learning method that captures and reasons about a program's control, data, and call dependencies.
November 2020 - The paper was accepted to IEEE TIFS!
The datasets are available here and cover C, Java and PHP. As shown in Lili's work, our dataset had the highest complexity, the largest sample size, and the most subroutine calls compared to other public vulnerability datasets.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Install the necessary dependencies before running the project. The Software part relates to data preprocessing, while the Python Libraries part lists the environment we have tested.
For more details, please refer to requirements.txt:
This section gives the steps, explanations and examples for getting the project running.
$ git clone git@github.com:HuantWang/FUNDED_NISL.git
$ pip install -r requirements.txt
$ cd NISL_TIFS2021/FUNDED/cli
$ CUDA_VISIBLE_DEVICES=2 python train.py GGNN GraphBinaryClassification ../data/data/CWE-77
$ cd NISL_TIFS2021/FUNDED/cli
$ CUDA_VISIBLE_DEVICES=2 python test.py GGNN GraphBinaryClassification ../data/data/data/cve/badall --storedModel_path "./trained_model/GGNN_GraphBinaryClassification__2023-02-01_05-36-00_f1 = 0.800_best.pkl"
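The two invocations above can also be assembled from a script, which helps when looping over several CWE folders. The following is a minimal sketch: the helper names `build_cli_command` and `gpu_env` are hypothetical and not part of this repository, and only the positional arguments and the `--storedModel_path` flag shown above are used.

```python
import os
import shlex


def build_cli_command(script, model, task, data_path, stored_model_path=None):
    """Return the argument list for train.py or test.py as invoked above."""
    cmd = ["python", script, model, task, data_path]
    if stored_model_path is not None:
        # test.py receives the checkpoint through --storedModel_path.
        cmd += ["--storedModel_path", stored_model_path]
    return cmd


def gpu_env(gpu_index):
    """Copy the current environment and pin the visible GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


train_cmd = build_cli_command(
    "train.py", "GGNN", "GraphBinaryClassification", "../data/data/CWE-77"
)
print(shlex.join(train_cmd))
# python train.py GGNN GraphBinaryClassification ../data/data/CWE-77
```

To launch it from the `FUNDED/cli` directory, pass the list to `subprocess.run(train_cmd, env=gpu_env(2), check=True)`.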
This part shows the source code structure of the GNN detection model and the partial sample dataset.
├── LICENSE
├── README.md <- The top-level README for developers using this project.
├── requirements.txt <- The python environment for developers using this project.
├── FUNDED
│ ├── cli
│ │ ├── train.py <- the entrance of training models.
│ │ ├── test.py <- testing the specified model using data.
│ │ ├── __init__.py
│ ├── cli_utils
│ │ ├── default_hypers
│ │ │ ├── GraphBinaryClassification_GGNN.json
│ │ ├── dataset_utils.py
│ │ ├── model_utils.py
│ │ ├── param_helpers.py
│ │ ├── task_utils.py
│ │ ├── training_utils.py
│ │ ├── __init__.py
│ ├── data
│ │ ├── data
│ │ │ ├── data_preprocess.py
│ │ │ ├── our_map_all.txt
│ │ │ ├── __init__.py
│ │ ├── graph_dataset.py
│ │ ├── jsonl_graph_dataset.py
│ │ ├── jsonl_graph_property_dataset.py
│ │ ├── __init__.py
│ ├── layers
│ │ ├── message_passing
│ │ │ ├── ggnn.py
│ │ │ ├── gnn_edge_mlp.py
│ │ │ ├── gnn_film.py
│ │ │ ├── message_passing.py
│ │ │ ├── __init__.py
│ │ ├── gnn.py
│ │ ├── graph_global_exchange.py
│ │ ├── nodes_to_graph_representation.py
│ │ ├── __init__.py
│ ├── models
│ │ ├── graph_binary_classification_task.py
│ │ ├── graph_regression_task.py
│ │ ├── graph_task_model.py
│ │ ├── node_multiclass_task.py
│ │ ├── __init__.py
│ ├── utils
│ │ ├── activation.py
│ │ ├── constants.py
│ │ ├── gather_dense_gradient.py
│ │ ├── param_helpers.py
│ └── └── __init__.py
└────── __init__.py
To construct the AST, we use Soot for Java, ANTLR for Swift and PHP, and joern for C/C++.
For C/C++, we download datasets of different CWE types from SARD, CVE and GitHub.
The specific steps of data preprocessing are as follows:
Warning: Modify the path with your own data in code.
- Slicing data
$ cd FUNDED_NISL/Edge_processing/slicec_7edges_funcblock/src/main/java/slice
- Run ClassifyFileOfProject.java to extract all the C files.
- Run Main.java to slice the data at function level.
- Extracting different edge relationships
Then we traverse the AST nodes of all source files, which have been parsed by CDT. While traversing, all nodes are numbered in sequence, and the edges of each relationship type are obtained according to specific rules.
$ cd FUNDED_NISL/Edge_processing/slicec_7edges_funcblock/src/main/java/sevenEdges
- Use joern to get all the control flows and data flows in the source code. For details, see joern.
- Run Main.java to extract the other edges.
- Run concateJoern.java to concatenate all edges.
We provide a demo dataset for data preprocessing.
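The numbering and concatenation steps above can be sketched in a few lines. This is an illustration only, not the Java implementation in `sevenEdges`: the `Node` class, the pre-order visiting order, the 1-based ids, and the edge type names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def number_nodes(root):
    """Assign 1-based ids in pre-order and collect parent->child edges."""
    ids, edges = {}, []
    stack = [(root, None)]
    while stack:
        node, parent_id = stack.pop()
        node_id = len(ids) + 1
        ids[id(node)] = node_id
        if parent_id is not None:
            edges.append((parent_id, node_id))
        # Push children reversed so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append((child, node_id))
    return ids, edges


def concat_edges(edge_sets):
    """Merge per-type edge lists into (src, dst, type) triples, dropping duplicates."""
    merged, seen = [], set()
    for type_name, edges in edge_sets.items():
        for src, dst in edges:
            key = (src, dst, type_name)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Toy tree for `if (x) y = 1;` (labels are made up for this example).
tree = Node("IfStatement", [
    Node("Condition", [Node("x")]),
    Node("Assignment", [Node("y"), Node("1")]),
])
_, ast_edges = number_nodes(tree)
print(ast_edges)  # [(1, 2), (2, 3), (1, 4), (4, 5), (4, 6)]

# The control-flow edge below is a hypothetical stand-in for joern output.
all_edges = concat_edges({"ast": ast_edges, "cfg": [(2, 4)]})
print(len(all_edges))  # 6
```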
For Java, we download data from SARD, CVE and GitHub.
Following the same idea as for C/C++ above, we construct the relationships for the different edge types using Soot and JDT.
Warning: Modify the path with your own data
$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/Java_jdt_AST_CDFG/src/main/java/yoshikihigo/tinypdg/
$ java Main.java sourceFilePath saveFilePath
For PHP and Swift, we collect datasets from SARD, CVE and GitHub.
We then extract edges and nodes from the AST constructed with ANTLR.
$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/php_swift/src/php/main
$ java TestPhp.java sourceFilePath saveFilePath
$ cd NISL_TIFS2021/EdgesGenerationAndDataPreprocess/php_swift/src/swift3/main
$ java TestSwift3.java sourceFilePath saveFilePath
The datasets can be downloaded here.
The edges dataset covers 44 different CWE types of C code. After script processing, we get the final model inputs.
For example, the test datasets under data/data/CWE-399 and data/data/CWE-400 contain graphs consisting of AST, CFG and PDG.
cwe | file_id | target | contents |
---|---|---|---|
399 | 0a2a9a6f-779e-47b4-823e-43eccd125b4f.c$$$0 | 0 | 1,2 1,3 2,7,9 (1,9,0)(2,8,1)(3,7,2) ... |
399 | 1b733c0b-30d5-4cc2-9431-8695795abfed.c$$$1 | 1 | 6,7 4,5 1,4,9 (2,7,0)(3,5,1)(4,2,2) ... |
399 | 3e9bebda-cef3-4988-9543-a5e5473849c2.c$$$0 | 0 | 1,2 3,5 3,5,8 (1,2,0)(2,6,1)(4,8,2) ... |
399 | 8bcbb6c4-3f3f-471c-b2dc-ab9151bb22f8.c$$$2 | 1 | 2,7 2,9 2,3,7 (6,7,0)(1,5,1)(6,9,2) ... |
399 | 53ee12a1-ba49-41f2-a163-c2b662a4db27.c$$$0 | 0 | 4,5 7,8 3,6,8 (5,8,0)(3,6,1)(7,8,2) ... |
... | ... | ... | ... |
400 | 8388fdcf-40cf-4e59-9f11-17d9e320efd8.c$$$4 | 0 | 1,7 2,5 3,4,8 (4,7,0)(5,8,1)(2,9,2) ... |
400 | 91978dee-4ee4-428b-8576-ffb49e8dc12a.c$$$6 | 1 | 2,3 3,8 3,7,9 (3,6,0)(4,6,1)(2,8,2) ... |
400 | 113353a8-f804-4aff-a81a-15f20e638d4b.c$$$1 | 1 | 4,6 4,7 5,6,7 (3,7,0)(4,5,1)(8,9,2) ... |
400 | b7b5ae35-d478-4c51-96c2-8f107fc08fde.c$$$3 | 1 | 2,5 7,8 1,7,8 (5,8,0)(3,6,1)(2,8,2) ... |
400 | e831aff3-bd88-4ef7-a5b0-2d87e1b20fbe.c$$$0 | 0 | 6,8 2,8 4,6,9 (6,9,0)(1,5,1)(1,4,2) ... |
... | ... | ... | ... |
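A row of the table above can be split into its fields with the standard library. The sketch below handles syntax only: this README does not state what the `$$$` suffix or the numbers in `contents` mean, so the code makes no claim about them. The helper names `split_file_id` and `parse_triples` are hypothetical.

```python
import re

# Matches a parenthesised integer triple such as (1,9,0).
TRIPLE = re.compile(r"\((\d+),(\d+),(\d+)\)")


def split_file_id(file_id):
    """Split 'name.c$$$k' into the file name and the integer suffix k."""
    name, _, suffix = file_id.rpartition("$$$")
    return name, int(suffix)


def parse_triples(contents):
    """Return every parenthesised integer triple found in a contents string."""
    return [tuple(int(x) for x in m.groups()) for m in TRIPLE.finditer(contents)]


row = ("399 | 0a2a9a6f-779e-47b4-823e-43eccd125b4f.c$$$0 | 0 | "
       "1,2 1,3 2,7,9 (1,9,0)(2,8,1)(3,7,2) ... |")
cwe, file_id, target, contents = [
    cell.strip() for cell in row.strip().strip("|").split("|")
]

print(split_file_id(file_id))   # ('0a2a9a6f-779e-47b4-823e-43eccd125b4f.c', 0)
print(parse_triples(contents))  # [(1, 9, 0), (2, 8, 1), (3, 7, 2)]
```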
Example results of training on the sample dataset CWE-400. Saved Model checkpoint at 60 epochs.
Dataset parameters: {
"max_nodes_per_batch": 128,
"num_fwd_edge_types": 7,
"add_self_loop_edges": true,
"tie_fwd_bkwd_edges": true,
"threshold_for_classification": 0.5
}
Model parameters: {
"gnn_aggregation_function": "sum",
"gnn_message_activation_function": "ReLU",
"gnn_hidden_dim": 256,
"gnn_use_target_state_as_input": false,
"gnn_normalize_by_num_incoming": true,
"gnn_num_edge_MLP_hidden_layers": 1,
"gnn_num_aggr_MLP_hidden_layers": null,
"gnn_message_calculation_class": "RGIN",
"gnn_initial_node_representation_activation": "tanh",
"gnn_dense_intermediate_layer_activation": "tanh",
"gnn_num_layers": 5, "gnn_dense_every_num_layers": 10000,
"gnn_residual_every_num_layers": 2,
"gnn_use_inter_layer_layernorm": true,
"gnn_layer_input_dropout_rate": 0.2,
"gnn_global_exchange_mode": "gru",
"gnn_global_exchange_every_num_layers": 10000,
"gnn_global_exchange_weighting_fun": "softmax",
"gnn_global_exchange_num_heads": 4,
"gnn_global_exchange_dropout_rate": 0.2,
"optimizer": "Adam", "learning_rate": 0.001,
"learning_rate_decay": 0.98, "momentum": 0.85,
"gradient_clip_value": 1.0,
"use_intermediate_gnn_results": false,
"graph_aggregation_num_heads": 16,
"graph_aggregation_hidden_layers": [128],
"graph_aggregation_dropout_rate": 0.2
}
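To give a feel for what `"gnn_aggregation_function": "sum"` and `"gnn_normalize_by_num_incoming": true` refer to, here is a toy message-passing round over typed edges. It is a simplified sketch with scalar node states and one weight per edge type. The parameter meanings are inferred from their names, and this is not the code in `FUNDED/layers`.

```python
from collections import defaultdict


def message_passing_step(states, edges, type_weights, normalize_by_num_incoming=True):
    """One round of sum aggregation over (src, dst, edge_type) edges."""
    incoming = defaultdict(float)
    counts = defaultdict(int)
    for src, dst, edge_type in edges:
        # A message is the source state scaled by a per-edge-type weight.
        incoming[dst] += type_weights[edge_type] * states[src]
        counts[dst] += 1
    new_states = {}
    for node in states:
        message = incoming[node]
        if normalize_by_num_incoming and counts[node]:
            message /= counts[node]
        # ReLU applied to the aggregated message (a simplification).
        new_states[node] = max(0.0, message)
    return new_states


states = {1: 1.0, 2: 2.0, 3: -1.0}
edges = [(1, 3, 0), (2, 3, 1), (3, 1, 0)]
weights = {0: 1.0, 1: 0.5}

print(message_passing_step(states, edges, weights))
# {1: 0.0, 2: 0.0, 3: 1.0}
print(message_passing_step(states, edges, weights, normalize_by_num_incoming=False))
# {1: 0.0, 2: 0.0, 3: 2.0}
```

Node 3 receives two messages (1.0 and 1.0), so the normalized variant halves their sum, while node 1 receives a negative message that the ReLU clips to zero.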
== Running on test dataset
Loading data from ../data/data/tem_CWE-77/ast.
Loading data from ../data/data/tem_CWE-77/cdfg.
Restoring best model state from trained_model/GGNN_GraphBinaryClassification__2020-11-30_10-41-23_best.pkl.
NoneCP_test Accuracy = 0.915|precision = 0.846 | recall = 1.000 | f1 = 0.917
== Running on test dataset
Loading data from ../data/data/tem_CWE-77/new/ast.
Loading data from ../data/data/tem_CWE-77/new/cdfg.
Restoring best model state from trained_model/GGNN_GraphBinaryClassification__2020-11-30_10-44-23_best.pkl.
CP_test Accuracy = 0.942|precision = 0.893 | recall = 1.000 | f1 = 0.943
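The reported metrics follow the usual binary-classification definitions. The sketch below computes them from predicted probabilities and labels, using the 0.5 threshold from the dataset parameters, and checks that the reported F1 scores agree with the reported precision and recall. Whether the project applies `>=` or `>` at the threshold is an assumption here, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(probabilities, labels, threshold=0.5):
    """Accuracy, precision, recall and F1 for binary labels in {0, 1}."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Consistency check against the two result lines above.
print(round(f1_from(0.846, 1.000), 3))  # 0.917
print(round(f1_from(0.893, 1.000), 3))  # 0.943
```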
We use NNI (Neural Network Intelligence) for hyperparameter tuning in this project.
$ pip install nni
Add a search_space.json file to the working directory and list the parameters to tune. The project already includes a configured one.
search_space.json
{
"max_nodes_per_batch":{ "_type": "choice", "_value": [32,64,128]},
"gnn_hidden_dim":{ "_type": "choice", "_value": [4,8,16,...]},
"gnn_num_layers": { "_type": "choice", "_value": [2,4,8,...] },
"graph_aggregation_num_heads":{ "_type": "choice", "_value": [4,8,16,32,...] },
"graph_aggregation_hidden_layers":{ "_type": "choice", "_value": [32,64,128,256,...] },
"graph_aggregation_dropout_rate":{ "_type": "choice", "_value": [0.1,0.2,0.5,...] },
"learning_rate": { "_type": "choice", "_value": [0.01,0.001,0.0001,...] }
}
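Each `"choice"` entry in the search space lists candidate values, and every trial receives one value per parameter. The sketch below draws such a trial configuration at random and merges it into a set of defaults. Random sampling stands in for the TPE tuner here, the candidate lists contain only the values visible above (the elided ones are left out), and the helper names are hypothetical.

```python
import json
import random

SEARCH_SPACE = json.loads("""
{
  "max_nodes_per_batch": {"_type": "choice", "_value": [32, 64, 128]},
  "gnn_num_layers": {"_type": "choice", "_value": [2, 4, 8]},
  "learning_rate": {"_type": "choice", "_value": [0.01, 0.001, 0.0001]}
}
""")


def sample_choice_space(space, rng):
    """Draw one value per parameter; only the "choice" type is handled."""
    sample = {}
    for name, spec in space.items():
        if spec["_type"] != "choice":
            raise ValueError(f"unsupported type for {name}: {spec['_type']}")
        sample[name] = rng.choice(spec["_value"])
    return sample


def apply_overrides(defaults, overrides):
    """Return a copy of the defaults with the sampled values applied."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


defaults = {"max_nodes_per_batch": 128, "gnn_num_layers": 5, "learning_rate": 0.001}
trial_params = sample_choice_space(SEARCH_SPACE, random.Random(0))
print(apply_overrides(defaults, trial_params))
```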
Define the configuration file in YAML format, which declares the search space and the path of the trial file. It also provides other information, such as the parameters of the whole algorithm, the maximum number of trials and the maximum duration.
config.yml
authorName: NNI Example
experimentName: CWE-77
trialConcurrency: 1
maxExecDuration: 110h # max executable time
maxTrialNum: 500 # max trial num
trainingServicePlatform: local
searchSpacePath: search_space.json # path of search space
useAnnotation: false
tuner:
  builtinTunerName: TPE
  classArgs:
    optimize_mode: maximize # choices: maximize, minimize
  gpuIndices: "1" # specify GPU of optimizer
trial:
  command: python3 train.py GGNN GraphBinaryClassification ../data/data/CWE-77 --patience 100 # execute commands
  codeDir: .
  gpuNum: 0
logDir: ~/nni # log directory
localConfig:
  gpuIndices: "0" # specify GPU number
  useActiveGpu: true
Run NNI
nnictl create --config config.yml --port 8080
Wait for the message "INFO: Successfully started experiment!" on the command line. It indicates that the experiment has started successfully.
For more details, see https://github.com/Microsoft/nni
├── EnsembleLearning.py
├── InputData_New.py
├── stopwords.txt
├── sample.zip
- Download our pretrained w2v model here
- We also provide a sample dataset, sample.zip; unzip it before use
- You can extract features from commits, or just use our sample.zip
- Use EnsembleLearning.py to train your own ensemble model
Warning: Replace the path with your own data path.
python EnsembleLearning.py
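This README does not describe how EnsembleLearning.py combines its models. As a generic illustration of the ensemble idea, the sketch below uses plain majority voting over per-model predictions. This is a swapped-in technique for the example, not necessarily what the script does, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by majority vote, sample by sample."""
    if not predictions_per_model:
        raise ValueError("need at least one model")
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Three hypothetical classifiers voting on three samples.
model_outputs = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
print(majority_vote(model_outputs))  # [1, 0, 0]
```

Using an odd number of models avoids ties for binary labels.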
Distributed under the NISL License. See LICENSE for more information.
Huanting Wang - [email protected]
@ARTICLE{Wang2020FUNDED,
author = {H. {Wang} and G. {Ye} and Z. {Tang} and S. H. {Tan} and S. {Huang} and D. {Fang} and Y. {Feng} and L. {Bian} and Z. {Wang}},
journal = {IEEE Transactions on Information Forensics and Security},
title = {Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection},
year = {2021},
volume = {16},
pages = {1943-1958},
doi = {10.1109/TIFS.2020.3044773},
ieeeid = {9293321},
publisher = {IEEE},
keywords = {Software Vulnerability, Code Vulnerability Detection, Deep Learning, Deep Graph Neural Networks},}