Feature/sarif output #2036
…mpatible sarif output
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Hey @ReversingWithMe, thanks! Can you share a few sentences about SARIF and how you use it? I've seen it referenced a few times recently but haven't tried it myself.
I wonder if it's best to add SARIF directly to capa output, or add a script (found in |
Sure! The Static Analysis Results Interchange Format (SARIF) is a standardized format for the output of static analysis tools, which are used to evaluate source or binaries for things like vulnerabilities or dataflow. SARIF lets different analysis tools produce results in a common format that software development tools and systems can understand, integrate, and act on. E.g. VS Code, Ghidra, radare2, and GitHub all adopt a common standard for representing these types of information.

SARIF describes the analysis being run and the results of that analysis on an artifact. Results include a description of the artifacts related to a run of the tool, where an artifact is source code, a binary file, or an auxiliary data file. Results also include the invocation, i.e. how the tool was run, including version, command line, and any knobs/parameters. The idea is that you can reconstruct where the output data came from for things that depend on parameters or on a specific input.

The results themselves are captured via "rules", where each rule is some type of analysis. One could imagine a single rule identifier for all of capa, but that wouldn't be very useful. For each rule/type of information there is a single message for the finding, as well as a property bag which you can shove anything into. So, given a SARIF file, all you need to know how to handle is the property bag for each ruleId found in the output; the rest is reusable. You can see in the Python code of this PR the 3-4 major chunks and how they relate to capa's JSON.

The primary reason someone would use SARIF is to facilitate the aggregation, comparison, and management of analysis results from multiple tools, which makes it more efficient to identify, understand, and address potential software issues. In other words, capa adopting SARIF means that any tool that understands SARIF only needs special logic for the types of results, and can skip parsing and trying to understand the capa schema.
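To make the structure described above concrete, here is a minimal sketch of a SARIF 2.1.0 log built with the Python standard library. It is not the code from this PR: `build_sarif`, `property_bags_by_rule`, the rule ids, the property keys, and the command line are all hypothetical names chosen for illustration. It only shows the tool/invocation/artifact/results chunks and how a consumer could collect the property bags per ruleId.

```python
import json


def build_sarif(tool_name, tool_version, command_line, artifact_uri, findings):
    """Build a minimal SARIF 2.1.0 log as a plain dict.

    `findings` is a list of (rule_id, message, properties) tuples.
    That tuple shape is an assumption made for this sketch.
    """
    rule_ids = sorted({rule_id for rule_id, _, _ in findings})
    run = {
        # The tool that produced the results, with one entry per rule id.
        "tool": {
            "driver": {
                "name": tool_name,
                "version": tool_version,
                "rules": [{"id": rule_id} for rule_id in rule_ids],
            }
        },
        # How the tool was run.
        "invocations": [
            {"commandLine": command_line, "executionSuccessful": True}
        ],
        # What was analyzed.
        "artifacts": [{"location": {"uri": artifact_uri}}],
        # One result per finding: a message plus a free-form property bag.
        "results": [
            {
                "ruleId": rule_id,
                "message": {"text": message},
                "properties": properties,
            }
            for rule_id, message, properties in findings
        ],
    }
    return {
        "$schema": "https://json.schemastore.org/sarif-2.1.0.json",
        "version": "2.1.0",
        "runs": [run],
    }


def property_bags_by_rule(sarif):
    """Group the property bags of all results by their ruleId."""
    bags = {}
    for run in sarif["runs"]:
        for result in run["results"]:
            bags.setdefault(result["ruleId"], []).append(
                result.get("properties", {})
            )
    return bags


if __name__ == "__main__":
    # Illustrative values only; not real capa output.
    log = build_sarif(
        "capa",
        "0.0.0",
        "capa --sarif sample.exe",
        "sample.exe",
        [("example-rule", "matched example-rule", {"address": "0x401000"})],
    )
    print(json.dumps(log, indent=2))
```

A consumer that already understands SARIF would only need tool-specific logic in the place where `property_bags_by_rule` hands back the bags; everything else is generic.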
The approach here was trying to get as close as possible to direct capa output, but pydantic serialization to json got in the way. The way I am json decoding a few times isn't great. |
The issue trailofbits/vscode-sarif-explorer#12 includes an example output file from this code. I can also upload it here. The invocation part of the JSON says which flag was used, but I think it's just the --sarif flag.
I'm also more in favor of this approach. |
cleaning up branch to open a new PR going the script route |
Add SARIF rendering, which adapts the existing JSON rendering logic. Additional code brings the output closer to being compatible with Ghidra's built-in SARIF module.
The output of this code passes the compliance checks from Microsoft, but will fail other parsers such as Trail of Bits' SARIF Explorer.
There are several things that could be done better in this code style-wise, but I'm testing the waters on whether this is even of interest, or whether the idea is worth keeping and re-implementing from scratch.