-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Symbolic tracing #157
base: master
Are you sure you want to change the base?
Symbolic tracing #157
Conversation
I still have to wok on it before it is ready. What I plan to do if ok for you is :
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you planning to add any tests for the Tracer? Without them it will be hard to prevent regressions.
#include "Runtime.h" | ||
#include <cstdint> | ||
#include <fstream> | ||
#include <nlohmann/json.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Json library has not been added to SymCC dependencies. Please update the README, dockerfile and github workflows.
nlohmann::json Tracer::currentTrace; | ||
nlohmann::json Tracer::expressions; | ||
nlohmann::json Tracer::pathConstraints; | ||
const std::string Tracer::BACKEND_TRACE_FILE = "/tmp/backend_trace.json"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be either configurable or created in more generic way (e.g. mktemp
).
for (auto byteAddress = pageAddress; byteAddress < pageAddress + kPageSize; | ||
byteAddress++) { | ||
auto byteExpr = _sym_read_memory((u_int8_t *)byteAddress, 1, true); | ||
if (byteExpr != nullptr && !byteExpr->isConcrete()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can compare the current memory state with previous one to don't duplicate information in each step and only print diffs? This will reduce number of IO operations and needed memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would certainly be a useful improvement but unfortunately I will not have time to work on this.
#include "Shadow.h" | ||
#include "llvm/Support/raw_ostream.h" | ||
|
||
nlohmann::json Tracer::currentTrace; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In RuntimeCommon.cpp
there is a goal to do the runtime thread local. You use here the global state. Could you refactor your code so that it can be run in multiple threads in future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure to understand. Are we talking about threads of the tested program ? In my understanding, the shadow memory pages, the set of accumulated path constraints, etc. should be shared across threads of the target program.
As this tracing feature simply takes snapshots of this state, I do not see how it could be thread local.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my understanding, the shadow memory pages, the set of accumulated path constraints, etc. should be shared across threads of the target program.
I think that in case of multi-thread tested programs the set of accumulated path constraints should be thread local, because each thread will execute potentially different path, so the set should be different. According to the shadow memory, you are right.
My idea was to have the possibility to execute the multiple runtimes in different threads to implement #14 without forking, but I don't know if that has a much sense.
Nevertheless if there is a possibility to doesn't have global state we should work in that direction, because programs with such state doesn't scale well.
target_link_libraries(SymRuntime ${Z3_LIBRARIES} ${QSYM_LLVM_DEPS}) | ||
include(FetchContent) | ||
|
||
FetchContent_Declare(json URL https://github.com/nlohmann/json/releases/download/v3.11.2/json.tar.xz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? Why this is in qsym_backend specific directory? Why you fetch the code instead of require nlohmann-json to be available in operating system? This is very popular package and is available on major distrubiutions, e.g:
Ubuntu https://packages.ubuntu.com/jammy/nlohmann-json3-dev
Debian https://packages.debian.org/bookworm/nlohmann-json3-dev
Arch Linux https://archlinux.org/packages/extra/any/nlohmann-json/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no experience working with CMake, I simply used a solution that was suggested in the README of nlohmann-json. If this is a bad solution maybe you could change it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to prepare a PR till the end of week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
I created a PR: damienmaier#1
@@ -449,7 +469,7 @@ bool _sym_feasible(SymExpr expr) { | |||
// | |||
|
|||
void _sym_collect_garbage() { | |||
if (allocatedExpressions.size() < g_config.garbageCollectionThreshold) | |||
// if (allocatedExpressions.size() < g_config.garbageCollectionThreshold) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you left that comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tracer identifies each expression with a unique ID, that is used in the JSON file. Currently, I used the memory address of the expression as ID and thus to avoid ID collision we need to disable the garbage collection.
As discussed above, I will put the code of this PR inside #ifdef
s such that garbage collection is disabled only when performing tracing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe changing way of creating IDs will be a good idea? I am afraid that after turning off garbage collection the memory usage can explode.
@@ -321,6 +340,7 @@ void _sym_push_path_constraint(SymExpr constraint, int taken, | |||
if (constraint == nullptr) | |||
return; | |||
|
|||
Tracer::tracePathConstraint(constraint, taken != 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't be calls to tracer put into compiler plugin code? So that it will be runtime independent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it. What do you mean by compiler plugin code ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
compiler
directory in symcc repository root (https://github.com/eurecom-s3/symcc/tree/master/compiler). There is a code, which is executed on compilation phase as a plugin to clang.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Imho it should also not be put in compiler
, otherwise, SymQEMU will not be able to use tracing.
I guess the best option would be to have _sym_push_path_constraint
wrapped into a runtime-agnostic function calling to the tracer and the runtime-specific corresponding function.
The extra call indirection should be optimized out at compile-time.
|
||
Args | ||
trace_file: The path to the trace file generated by SymCC. | ||
memory_region_names_file: The path to a file containing a mapping from memory region names to their addresses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How to generate such file using SymCC? What should be its format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know how to generate the memory region file with SymCC. I added code in SymQEMU such that it outputs this file, and I plan to submit a PR to SymQEMU with this feature.
I will document the format of the file, and I will also edit this python function such that the memory region files is not mandatory.
For a bit of context, the motivation for using this memory region information is that, because of ASLR (and because I was comparing the behavior of different versions of SymQEMU), equivalent memory areas were put at different locations in memory across different runs. To be able to compare different traces, I needed to express the addresses relatively to their area.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, you are adding SymQEMU specific code to the SymCC I don't think that this is a good idea, because not everyone who will use SymCC has to use SymQEMU. So I think that the solution should be made more general to work without SymQEMU. In you paper I see:
As a solution, we added some code in both versions of SymQEMU
that makes them output, in another json file, the addresses of the different memory regions,
including the addresses of the guest memory segments (stack, data, etc.)
Here I some my ideas how this can be improved:
- Maybe will be enough to use variables provided by linker? It provides you end of text, end of data and end of initialized memory symbols: https://stackoverflow.com/a/4309620.
- If you need more data you can add a code which reads
/proc/<pid>/maps
file to get addresses of interesting sections in runtime. - You can start tested program using
setarch -R
to disable adress randomization - Instead of addresses you can generate ID's in other more generic way (e.g. hashes similar to qsym).
util/symbolic_trace/README.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You added information how to install it, but not how to use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will improve the README
util/symbolic_trace/pyproject.toml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add tests. Also type and format checking will be useful (e.g. pyright, black).
In a previous commit I got allocatedExpressions out of the anonymous namespace, but this is not necessary anymore as it can be accessed with getAllocatedExpressions
Those are lines of code that I did not intend to include in the PR
The goal of this PR is to add the tracing feature described in section 5.3 here