Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enables ARM Thumb/Thumb2 and interworking #1178

Merged
merged 57 commits into from
Nov 3, 2020

Conversation

Phosphorus15
Copy link
Contributor

@Phosphorus15 Phosphorus15 commented Jul 15, 2020

This PR provides a new Thumb lifter that is implemented using the new Core Theory abstraction and upgrades the disassemler engine to support interworking, i.e., switching between mutiple encodings in the same piece of memory.

This PR also includes a few optimizations around the code that we need to implement as this PR (the interworking part) heavily impacted performace. We were 100% slower than the 2.1 version. The good news is that now we're a even a bit faster than 2.1 .

The PR also provides minimal support for Thumb v2 instructions (via recoding them to ARM instructions) and extends our disassembler backend interface with an option to pass attributes.

There are much more details in the commit messages.

@ivg
Copy link
Member

ivg commented Jul 15, 2020

So we're closing #1122, right?

plugins/arm_thumb/thumb.ml Outdated Show resolved Hide resolved
@Phosphorus15
Copy link
Contributor Author

So we're closing #1122, right?

Yes, looks like there's no other choices.

@ivg
Copy link
Member

ivg commented Jul 21, 2020

Ok, the good news is that I was able to disassemble an interworking binary using your lifter and enable-interworking branch (see interwork+thumb branch. There is still a lot of work to be done, for example, we are currently missing PLT entries since we can't obtain information about their architecture (I hope to make it work soon), but you can already start working with that branch. Feel free to merge it in this branch (or rebase your branch over enable-interworking).

Copy link
Member

@ivg ivg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Please change the register names to capitalized (as it is done in our ARM lifter)
  2. Remove the PC register from the environment

The PC register is an implementation detail of the specific architecture and is not represented in CT or BIL abstractions. Whenever you see a PC register, you shall substitute it with the current instruction offset plus some architecture-specific offset. And assingments to PC are represented as jumps to the current offset plus ... you got it.

let run_lifter _label addr insn mem
(lifter : Bitvec.t -> Defs.insn -> Defs.op array -> unit Theory.eff) =
match fix_cmnz (Insns.of_basic insn) mem with
| None -> raise (Defs.Lift_Error "unknown instruction")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, no exceptions, a lifter shall never fail - any input sequence of bytes is valid input.

Suggested change
| None -> raise (Defs.Lift_Error "unknown instruction")
| None -> KB.return Insn.empty

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still needs to be addressed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do have a try_with at the place which run_lifter is called, I think this should be enough to address the underlying problem

    (match Or_error.try_with (
         fun () -> run_lifter label addr insn mem Lifter.lift_with
       ) with

@ivg ivg mentioned this pull request Jul 23, 2020
@Phosphorus15
Copy link
Contributor Author

@ivg the Thumb lifter should be fine with pc and branch instructions now.

ivg added a commit to ivg/bap that referenced this pull request Sep 30, 2020
In the second patch of this series (BinaryAnalysisPlatform#1225) we completely got rid of
Arch.t dependency in the disassembler engine that finally opens the
path for seamless integration of targets that are not representable
with Arch.t.

To achieve this, we introduced a proper dependency injection into the
disassembler driver so that it is no longer responsible for creating
the llvm MC disassembler. Instead a plugin that implements a target,
aka the target support package, has to create a disassembler and is
now in full control of all parameters and can choose backend, specify
the CPU and other details of encoding. The encoding is a new
abstraction in our knowledge base that breaks the tight connection
between the target and the way how the program for that target is
encoded. Unlike the target, which is a property of a unit of code, the
encoding is associated with a program itself, i.e., it is a property
of each instruction. That enables targets with context-dependent
encodings such ARM's thumb mode and MIPS16e for binary encodings as
well as paves the road for non-binary encodings for the same
architecture, e.g., text assembly (which also may have several
encodings on its own, cf. att vs intel syntax). We base this branch on
the enable-interworking (BinaryAnalysisPlatform#1188) and this branch fully superseeds and
includes it, since encodings made it much more natural. It is still
highlty untested how it will work with real thumb binaries but we will
get back to it when we will merge BinaryAnalysisPlatform#1178.

Another big update, is that the disassembler backend (which is
responsible for translating bits into machine instructions) is no
longer required to be implemented in C++ and it is now possible to
write your own backends/disassemblers in pure OCaml, e.g., to support
PIC microcontrollers. The Backend interface is pretty low-level and we
might provide higher-level interfaces later, see
`Disasm_expert.Backend` for the interface and detailed comments.

Finally, we rectify the interface introduced in the previous PR and
flatten the hierarchy of newly introduced to the Core Theory
abtractions, i.e., instead of `Theory.Target.Endiannes` we now have
`Theory.Endianness` and so on. We also made the `Enum` module public
which introduced enumerated types built on to of `Knowledge.Value`s.

In the next episodes of this series we will gradually remove Arch.t
from other bap components and further clean up the newly introduced
interfaces.
ivg added a commit to ivg/bap that referenced this pull request Sep 30, 2020
In the second patch of this series (BinaryAnalysisPlatform#1225) we completely got rid of
Arch.t dependency in the disassembler engine that finally opens the
path for seamless integration of targets that are not representable
with Arch.t.

To achieve this, we introduced a proper dependency injection into the
disassembler driver so that it is no longer responsible for creating
the llvm MC disassembler. Instead a plugin that implements a target,
aka the target support package, has to create a disassembler and is
now in full control of all parameters and can choose backend, specify
the CPU and other details of encoding. The encoding is a new
abstraction in our knowledge base that breaks the tight connection
between the target and the way how the program for that target is
encoded. Unlike the target, which is a property of a unit of code, the
encoding is associated with a program itself, i.e., it is a property
of each instruction. That enables targets with context-dependent
encodings such ARM's thumb mode and MIPS16e for binary encodings as
well as paves the road for non-binary encodings for the same
architecture, e.g., text assembly (which also may have several
encodings on its own, cf. att vs intel syntax). We base this branch on
the enable-interworking (BinaryAnalysisPlatform#1188) and this branch fully superseeds and
includes it, since encodings made it much more natural. It is still
highlty untested how it will work with real thumb binaries but we will
get back to it when we will merge BinaryAnalysisPlatform#1178.

Another big update, is that the disassembler backend (which is
responsible for translating bits into machine instructions) is no
longer required to be implemented in C++ and it is now possible to
write your own backends/disassemblers in pure OCaml, e.g., to support
PIC microcontrollers. The Backend interface is pretty low-level and we
might provide higher-level interfaces later, see
`Disasm_expert.Backend` for the interface and detailed comments.

Finally, we rectify the interface introduced in the previous PR and
flatten the hierarchy of newly introduced to the Core Theory
abtractions, i.e., instead of `Theory.Target.Endiannes` we now have
`Theory.Endianness` and so on. We also made the `Enum` module public
which introduced enumerated types built on to of `Knowledge.Value`s.

In the next episodes of this series we will gradually remove Arch.t
from other bap components and further clean up the newly introduced
interfaces.
ivg added a commit to ivg/bap that referenced this pull request Oct 1, 2020
In the second patch of this series (BinaryAnalysisPlatform#1225) we completely got rid of
Arch.t dependency in the disassembler engine that finally opens the
path for seamless integration of targets that are not representable
with Arch.t.

To achieve this, we introduced a proper dependency injection into the
disassembler driver so that it is no longer responsible for creating
the llvm MC disassembler. Instead a plugin that implements a target,
aka the target support package, has to create a disassembler and is
now in full control of all parameters and can choose backend, specify
the CPU and other details of encoding. The encoding is a new
abstraction in our knowledge base that breaks the tight connection
between the target and the way how the program for that target is
encoded. Unlike the target, which is a property of a unit of code, the
encoding is associated with a program itself, i.e., it is a property
of each instruction. That enables targets with context-dependent
encodings such ARM's thumb mode and MIPS16e for binary encodings as
well as paves the road for non-binary encodings for the same
architecture, e.g., text assembly (which also may have several
encodings on its own, cf. att vs intel syntax). We base this branch on
the enable-interworking (BinaryAnalysisPlatform#1188) and this branch fully superseeds and
includes it, since encodings made it much more natural. It is still
highlty untested how it will work with real thumb binaries but we will
get back to it when we will merge BinaryAnalysisPlatform#1178.

Another big update, is that the disassembler backend (which is
responsible for translating bits into machine instructions) is no
longer required to be implemented in C++ and it is now possible to
write your own backends/disassemblers in pure OCaml, e.g., to support
PIC microcontrollers. The Backend interface is pretty low-level and we
might provide higher-level interfaces later, see
`Disasm_expert.Backend` for the interface and detailed comments.

Finally, we rectify the interface introduced in the previous PR and
flatten the hierarchy of newly introduced to the Core Theory
abtractions, i.e., instead of `Theory.Target.Endiannes` we now have
`Theory.Endianness` and so on. We also made the `Enum` module public
which introduced enumerated types built on to of `Knowledge.Value`s.

In the next episodes of this series we will gradually remove Arch.t
from other bap components and further clean up the newly introduced
interfaces.
ivg added a commit to ivg/bap that referenced this pull request Oct 1, 2020
In the second patch of this series (BinaryAnalysisPlatform#1225) we completely got rid of
Arch.t dependency in the disassembler engine that finally opens the
path for seamless integration of targets that are not representable
with Arch.t.

To achieve this, we introduced a proper dependency injection into the
disassembler driver so that it is no longer responsible for creating
the llvm MC disassembler. Instead a plugin that implements a target,
aka the target support package, has to create a disassembler and is
now in full control of all parameters and can choose backend, specify
the CPU and other details of encoding. The encoding is a new
abstraction in our knowledge base that breaks the tight connection
between the target and the way how the program for that target is
encoded. Unlike the target, which is a property of a unit of code, the
encoding is associated with a program itself, i.e., it is a property
of each instruction. That enables targets with context-dependent
encodings such ARM's thumb mode and MIPS16e for binary encodings as
well as paves the road for non-binary encodings for the same
architecture, e.g., text assembly (which also may have several
encodings on its own, cf. att vs intel syntax). We base this branch on
the enable-interworking (BinaryAnalysisPlatform#1188) and this branch fully superseeds and
includes it, since encodings made it much more natural. It is still
highlty untested how it will work with real thumb binaries but we will
get back to it when we will merge BinaryAnalysisPlatform#1178.

Another big update, is that the disassembler backend (which is
responsible for translating bits into machine instructions) is no
longer required to be implemented in C++ and it is now possible to
write your own backends/disassemblers in pure OCaml, e.g., to support
PIC microcontrollers. The Backend interface is pretty low-level and we
might provide higher-level interfaces later, see
`Disasm_expert.Backend` for the interface and detailed comments.

Finally, we rectify the interface introduced in the previous PR and
flatten the hierarchy of newly introduced to the Core Theory
abtractions, i.e., instead of `Theory.Target.Endiannes` we now have
`Theory.Endianness` and so on. We also made the `Enum` module public
which introduced enumerated types built on to of `Knowledge.Value`s.

In the next episodes of this series we will gradually remove Arch.t
from other bap components and further clean up the newly introduced
interfaces.
ivg added a commit that referenced this pull request Oct 1, 2020
* enables interworking in the disassembler driver

What is interworking
--------------------

Interworking is a feature of some architectures that enables mixing
several instruction sets in the same compilation unit. Example, arm
and thumb interworking that this branch is trying to add.

What is done
-------------

1. We add the switch primitive to the basic interface that changes the
dissassembler in the current disassembling state. It is a bold move
and can have conseqeuences, should be carefully reviewed

2. Attributes each destination in the disassembler driver state with
the architecture and calls switch every time we are going to
disassemble the next chunk of memory.

3. The default rule that extends the unit architecture to all
instructions in that unit is disabled for ARM/Thumb and is overriden
in the arm plugin with the following behavior, if an arm unit has a file
and that file has a symbol table then we provide information based on
the last bit of that symbol table (todo: we should also check for
abi), otherwise we propagate the unit arch to instructions.

What is to be done
------------------

Next, the arm lifter shall provide a promise to compute
destinations (which itself will require destinations, because we don't
really want to compute them) and provide the destination architecture,
based on the source encoding. We can safely examine any representation
of the instruction since it is already will be lifted by that moment.

* flattens the target interface, publishes the Enum module

also makes Enum more strict by checking that the element is indeed a
member of the set of elements and by preventing double declarations.

* adds an llvm decode for x86

* drops the dependency on arch from the disassembler driver

* overhauls the target/architecture abstraction (2/n)

In the second patch of this series (#1225) we completely got rid of
Arch.t dependency in the disassembler engine that finally opens the
path for seamless integration of targets that are not representable
with Arch.t.

To achieve this, we introduced a proper dependency injection into the
disassembler driver so that it is no longer responsible for creating
the llvm MC disassembler. Instead a plugin that implements a target,
aka the target support package, has to create a disassembler and is
now in full control of all parameters and can choose backend, specify
the CPU and other details of encoding. The encoding is a new
abstraction in our knowledge base that breaks the tight connection
between the target and the way how the program for that target is
encoded. Unlike the target, which is a property of a unit of code, the
encoding is associated with a program itself, i.e., it is a property
of each instruction. That enables targets with context-dependent
encodings such ARM's thumb mode and MIPS16e for binary encodings as
well as paves the road for non-binary encodings for the same
architecture, e.g., text assembly (which also may have several
encodings on its own, cf. att vs intel syntax). We base this branch on
the enable-interworking (#1188) and this branch fully superseeds and
includes it, since encodings made it much more natural. It is still
highlty untested how it will work with real thumb binaries but we will
get back to it when we will merge #1178.

Another big update, is that the disassembler backend (which is
responsible for translating bits into machine instructions) is no
longer required to be implemented in C++ and it is now possible to
write your own backends/disassemblers in pure OCaml, e.g., to support
PIC microcontrollers. The Backend interface is pretty low-level and we
might provide higher-level interfaces later, see
`Disasm_expert.Backend` for the interface and detailed comments.

Finally, we rectify the interface introduced in the previous PR and
flatten the hierarchy of newly introduced to the Core Theory
abtractions, i.e., instead of `Theory.Target.Endiannes` we now have
`Theory.Endianness` and so on. We also made the `Enum` module public
which introduced enumerated types built on to of `Knowledge.Value`s.

In the next episodes of this series we will gradually remove Arch.t
from other bap components and further clean up the newly introduced
interfaces.
@ivg
Copy link
Member

ivg commented Oct 7, 2020

with #1225 we're now finally ready to integrate this PR. I will start working on it right now.

@XVilka
Copy link
Contributor

XVilka commented Oct 16, 2020

Looks like CI failure is unrelated:

*** omake: done (2.93 sec, 0/0 scans, 237/237 rules, 1/1634 digests)
*** omake: reading OMakefiles
*** omake error:
   File setup.log.om: line 52, characters 59-65
   unbound variable: private.key
Makefile:29: recipe for target 'reinstall-libs' failed
make: *** [reinstall-libs] Error 1
Error: Process completed with exit code 2.

@ivg
Copy link
Member

ivg commented Oct 19, 2020

Looks like CI failure is unrelated:

*** omake: done (2.93 sec, 0/0 scans, 237/237 rules, 1/1634 digests)
*** omake: reading OMakefiles
*** omake error:
   File setup.log.om: line 52, characters 59-65
   unbound variable: private.key
Makefile:29: recipe for target 'reinstall-libs' failed
make: *** [reinstall-libs] Error 1
Error: Process completed with exit code 2.

Yep, it is a common omake-related issue, just needs a restart.

@ivg
Copy link
Member

ivg commented Oct 19, 2020

Looks like we're passing the tests! I will add more tests tomorrow (the armv5t-linux-gnueabi-echo file that we're now available to run), polish the code a little bit and I think we can merge it. This PR also enables partial support for thumb2 and the latest armv7 code, but with some limitations (probably something with abi, the lifted code looks correct and I have compared it with ghidra's p-code line-by-line, but we still getting a segmentation fault at _start).

@Phosphorus15, you may have noticed that I removed a lot of code and removed many instructions. It is not because they are wrong, it is because due to the time limitation I had to focus on implementing the minimal subset of the instruction set that is sufficient for runnng a non-trivial application. The removed instructions are still in git so I encourage you to restore them and re-implement using the framework that I have set up in this lifter. Anyway, you did a great job on this lifter! Thanks a lot!

@Phosphorus15
Copy link
Contributor Author

Looks like we're passing the tests! I will add more tests tomorrow (the armv5t-linux-gnueabi-echo file that we're now available to run), polish the code a little bit and I think we can merge it. This PR also enables partial support for thumb2 and the latest armv7 code, but with some limitations (probably something with abi, the lifted code looks correct and I have compared it with ghidra's p-code line-by-line, but we still getting a segmentation fault at _start).

@Phosphorus15, you may have noticed that I removed a lot of code and removed many instructions. It is not because they are wrong, it is because due to the time limitation I had to focus on implementing the minimal subset of the instruction set that is sufficient for runnng a non-trivial application. The removed instructions are still in git so I encourage you to restore them and re-implement using the framework that I have set up in this lifter. Anyway, you did a great job on this lifter! Thanks a lot!

Thanks for all the works for making the lifter being able to get integrated! I'll start to work on restoring the full instructions set in respect of the current framework, then.

@ivg
Copy link
Member

ivg commented Oct 28, 2020

A small status update. The PR works, the PR works correctly (we are able to execute thumb interworking binaries without the help of any other external tools and even lift thumb2 binaries). But it is slow. We have a 33% performance loss wrt the master branch and a 100% performance loss wrt the 2.1 version. I am currently trying to reduce this performance regression (some regression is assumed as 2.2 is doing much more that 2.1 but nothing close to x2 is acceptable).

ivg added 12 commits October 29, 2020 08:43
The debts (cross-section destinations) may reference sections that
were already scanned, therefore in order to disassemble such kind
of destinations we have to revisit and disassemble again previosly
disassembled sections. It comes with a performance price, so we may
try to find an more optimal solution.
The branches are now handling the mode-switching (bx, blx) in case if
the destination could be resolved at static time. There are also quite
a few fixes as well as a few new instructions that are handled.
fixes a bug when we were accidentally narrowing memory while searching
for tasks that have decodings.
1. searches for all references in the plt
2. follows the jumps in the plt
This tweak changes the semantics of the knowledge base by preventing
KB from triggering for properties that are already computed. In
certain cituations it may signficantly improve performance, i.e., when
the value of the property is already known and we do not want to
trigger expensive computations to recompute it.

P.S. to be honest, I was long time under an impression that this is
how KB works and was a bit surprised that it was only in my
imagination. We will use this feature to reduce the number of
unecessary calls for scoped objects (in fact it enables us to hoist
loop invariants from the loops).
A small microoptimization that improves the `Memory.view` function and
introduces the `Memory.view_exn` function.

The optimized version doesn't use Or_error and validation functions
but instead relies on exceptions and if/then/else for bounds
checking. Ideally, the optimized function will have only one
allocation, the returned value of type `Memory.t`.
Several micro-optimizations. First, we will represent assembler
strings as bytes to minimize bytes to string transformations. Second,
we collect predicates using fold instead of filter map, that reduces
the number of allocations to 1/3 of what we had. Finally, we use the
new `Memory.view_exn` function to reduce the number of allocations
even further.
Finally found the culprit! The problem was that we were building each
graph for each subroutine from a clean slate, because we were
forgetting the previously built graphs.

To solve the problem we had to extend the interface of the explore
function so that it can explore from several starts while preserving
the state. Now bap is back on 2.1 speeds (maybe even faster).
We can safely assume that the a chunk of memory belongs to the same unit.
The same justification - a single piece of memory always belong to the
same unit.
@ivg ivg changed the title ARM Thumb lifter enables ARM Thumb/Thumb2 and interworking Oct 29, 2020
@ivg ivg force-pushed the arm-thumb branch 2 times, most recently from 8cd7e25 to 3677778 Compare November 2, 2020 17:26
ivg added 7 commits November 2, 2020 15:44
Introduces memory (Memmap.t) to the knowledge base accessible via the
`Project.memory_slot` and uses it for computing stubs in the
stub-resolver plugin.

Also fixes BinaryAnalysisPlatform#1125.
It is probably a bad idea to deduce a simple property from a complex
one so that every time we want to know if a sequence of bytes is a
valid instruction we compute if it is a function start.
Not a big performance impact but at least the code is cleaner and
easier to debug (since now the references are in the knowledge base).
Under an assumption that there are more cross-section references that
point backwards than those that point forwards. Not a big difference,
but just as final optimization step.
@ivg ivg merged commit 32391a7 into BinaryAnalysisPlatform:master Nov 3, 2020
ivg added a commit to ivg/bap that referenced this pull request Feb 22, 2021
The first bug was breaking the raw loader at least. It bug was
introduced in BinaryAnalysisPlatform#1178 during the `Memory.view` optimization. A check was
missed that allowed for creation of invalid views. The invalid views
later were caught with an assertion check, (easily reproduced with
`bap /bin/ls --loader=raw`), e.g.,

```
("Assert_failure bap_memory.ml:72:2"
  "Raised at file \"bap_memory.ml\", line 72, characters 2-31\
 \nCalled from file \"bap_memory.ml\", line 196, characters 2-18\
 \nCalled from file \"bap_memory.ml\", line 410, characters 26-52\
 \nCalled from file \"bap_trie.ml\", line 46, characters 34-53\
 \nCalled from file \"bap_trie.ml\", line 62, characters 4-130\
 \nCalled from file \"bap_byteweight.ml\", line 78, characters 12-39\
 \nCalled from file \"bap_byteweight.ml\", line 129, characters 12-39\
 \nCalled from file \"byteweight_main.ml\", line 50, characters 44-56\
 \nCalled from file \"src/sequence.ml\", line 123, characters 29-36\
 \nCalled from file \"byteweight_main.ml\" (inlined), line 49, characters 4-116\
 \nCalled from file \"byteweight_main.ml\", line 48, characters 4-146\
...
```

The second bug was probably all the time in the library and concerns
`Memory.find_map` and `Memory.find_if`, which are talking an optional
`word_size` parameter that was ignored and functions were always
iterating over bytes.
ivg added a commit that referenced this pull request Feb 22, 2021
The first bug was breaking the raw loader at least. It bug was
introduced in #1178 during the `Memory.view` optimization. A check was
missed that allowed for creation of invalid views. The invalid views
later were caught with an assertion check, (easily reproduced with
`bap /bin/ls --loader=raw`), e.g.,

```
("Assert_failure bap_memory.ml:72:2"
  "Raised at file \"bap_memory.ml\", line 72, characters 2-31\
 \nCalled from file \"bap_memory.ml\", line 196, characters 2-18\
 \nCalled from file \"bap_memory.ml\", line 410, characters 26-52\
 \nCalled from file \"bap_trie.ml\", line 46, characters 34-53\
 \nCalled from file \"bap_trie.ml\", line 62, characters 4-130\
 \nCalled from file \"bap_byteweight.ml\", line 78, characters 12-39\
 \nCalled from file \"bap_byteweight.ml\", line 129, characters 12-39\
 \nCalled from file \"byteweight_main.ml\", line 50, characters 44-56\
 \nCalled from file \"src/sequence.ml\", line 123, characters 29-36\
 \nCalled from file \"byteweight_main.ml\" (inlined), line 49, characters 4-116\
 \nCalled from file \"byteweight_main.ml\", line 48, characters 4-146\
...
```

The second bug was probably all the time in the library and concerns
`Memory.find_map` and `Memory.find_if`, which are talking an optional
`word_size` parameter that was ignored and functions were always
iterating over bytes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants