This repository contains the code and documents developed for the master's thesis "Building and Evaluating Software Vulnerability Datasets" (2020/2021).
Software vulnerabilities can have serious consequences when exploited, such as unauthorized access, data loss, and financial loss. Techniques exist for detecting these vulnerabilities by analyzing the source code or executing the software, but they suffer from both false positives (misidentified vulnerabilities) and false negatives (undetected vulnerabilities). Another way of identifying vulnerabilities is to combine certain source code properties (software metrics) with machine learning techniques. A previous study has shown this to be feasible, although the data it collected is now out of date. Similarly, security alerts (i.e. potential vulnerabilities) can be found directly by using Static Analysis Tools (SATs), though these also produce a high number of false positives.
- Implemented an automated process capable of collecting vulnerability metadata from the CVE Details website, retrieving any affected code units (files, functions, classes) from a project's version control system, generating software metrics and security alerts for each one, storing the collected information in a MySQL database, and building robust datasets capable of being fed to machine learning algorithms.
- Built datasets of vulnerable code units for five large open-source C/C++ projects: Mozilla, Linux Kernel, Xen Hypervisor, Apache HTTP Server, and GNU C Library (Glibc).
- Validated the function samples by exploring various machine learning configurations and investigating whether it is possible to detect vulnerable function code in current versions using static data from previous commits.
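
The validation idea in the last item, training on data from earlier commits and testing on later ones, can be illustrated with a minimal sketch. It is not the code used in the thesis: the `FunctionSample` record, its field names, the `temporal_split` helper, and the sample values below are all hypothetical and only show the general shape of a time-based split over function-level samples.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FunctionSample:
    """One function-level dataset row (hypothetical field names)."""
    project: str
    name: str
    commit_date: date
    metrics: dict[str, float]  # software metrics, e.g. lines of code
    alert_count: int           # security alerts reported by a SAT
    vulnerable: bool           # label derived from vulnerability metadata


def temporal_split(samples, cutoff):
    """Split samples so training data strictly precedes the test data."""
    train = [s for s in samples if s.commit_date < cutoff]
    test = [s for s in samples if s.commit_date >= cutoff]
    return train, test


# Invented toy rows, for illustration only.
samples = [
    FunctionSample("glibc", "func_a", date(2015, 3, 1), {"loc": 120.0}, 2, True),
    FunctionSample("glibc", "func_b", date(2016, 7, 9), {"loc": 35.0}, 0, False),
    FunctionSample("glibc", "func_c", date(2019, 1, 20), {"loc": 80.0}, 1, True),
]

train, test = temporal_split(samples, cutoff=date(2018, 1, 1))
```

Splitting by commit date rather than at random keeps information from newer versions out of the training set, which is what makes the "previous commits predict current versions" question meaningful.
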
- João Henggeler Antunes - Student
- José Alexandre D'Abruzzo Pereira - Supervisor
- Marco Vieira - Supervisor