BAP does not support well for thumb instruction set #951

Closed
valour01 opened this issue May 20, 2019 · 8 comments


@valour01

Hi, I noticed that BAP does not have very good support for binaries that use the Thumb instruction set.
You can try this test file: test.zip
We noticed that BAP disassembles the binary with the ARM instruction set, which is completely wrong. Because of this disassembly mistake, BAP performs very badly at function detection and at CFG and call graph construction for Thumb binaries. I can provide more test cases if you want. Many thanks.

@gitoleg
Contributor

gitoleg commented May 20, 2019

@valour01 Well, it's a known issue. BAP just doesn't support ARM Thumb mode. Once we have time, we will add support for it, but I can't promise it will happen in the near future. So PRs are welcome!

@fib1d

fib1d commented May 24, 2019

Hi,
I had a look at how ARM binaries are loaded and lifted, but I have difficulty understanding why Thumb is decoded incorrectly, i.e., where I would need to fix it (or try to). :)

Can you explain the processing chain from the ELF file to the Arm_lifter? Where is it "programmed"?
My current assumption is that the Arm_lifter itself is correct, but the serializer (internal or objdump) is not handling the special Thumb encoding correctly. Is this correct?

Did you consider using the elf_loader and the asm-to-binary mapping from the sail project?

Is the --lifter=bap-elf deprecated?

@ivg
Member

ivg commented May 24, 2019

This is a manifold problem and the only thing that works reliably here is actually the decoding.
For example, we can easily decode a thumb instruction, e.g.,

$ bap-mc '2c 60' --arch=thumb --show-insn=asm
str r4, [r5]

However, our lifter will not understand it. We're using the LLVM MC decoder, which, for some reason (most likely a valid one), distinguishes between ARM and Thumb instructions of the same name, e.g.,

$ bap-mc '2c 60' --arch=thumb --show-insn
tSTRi(R4,R5,0x0,0xe,Nil)

The same str r4, [r5] instruction is encoded in arm as 00 40 85 e5 and is decoded as

$ bap-mc '00 40 85 e5' --arch=arm --show-insn
STRi12(R4,R5,0x0,0xe,Nil)

We have to consult the LLVM documentation and source code to really understand why they chose different codes for the same instruction and how this affects the semantics of operands.
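To make the difference between the two encodings concrete, here is a minimal Python sketch that pulls the register and offset fields out of the two byte strings shown above. It is not BAP or LLVM code: the function names are made up for illustration, and the bit layouts are the ones I understand the ARM architecture manual to give for the 16-bit Thumb STR (immediate) form and the 32-bit ARM STR (immediate, offset addressing) form. The 0xe operand in the LLVM output is most likely the condition code (AL, "always"), which the ARM encoding carries in its top four bits; treat that reading as an assumption.

```python
import struct


def decode_thumb_str_imm(data: bytes):
    """Decode a 16-bit Thumb 'STR Rt, [Rn, #imm]' (hypothetical helper).

    Assumed layout: 01100 imm5 Rn Rt, with the offset scaled by 4.
    Returns (rt, rn, byte_offset).
    """
    (hw,) = struct.unpack("<H", data)
    if hw >> 11 != 0b01100:
        raise ValueError("not a Thumb STR (immediate) encoding")
    imm5 = (hw >> 6) & 0x1F
    rn = (hw >> 3) & 0x7
    rt = hw & 0x7
    return rt, rn, imm5 * 4


def decode_arm_str_imm(data: bytes):
    """Decode a 32-bit ARM 'STR Rt, [Rn, #imm]' with offset addressing.

    Assumed layout: cond 010 P U B W L Rn Rt imm12, with
    P=1, W=0 (no writeback), B=0 (word), L=0 (store).
    Returns (rt, rn, byte_offset).
    """
    (word,) = struct.unpack("<I", data)
    is_ldst_imm = (word >> 25) & 0x7 == 0b010
    p = (word >> 24) & 1
    u = (word >> 23) & 1
    b = (word >> 22) & 1
    w = (word >> 21) & 1
    l = (word >> 20) & 1
    if not is_ldst_imm or (p, b, w, l) != (1, 0, 0, 0):
        raise ValueError("not an ARM STR (immediate, offset) encoding")
    rn = (word >> 16) & 0xF
    rt = (word >> 12) & 0xF
    imm12 = word & 0xFFF
    return rt, rn, imm12 if u else -imm12


print(decode_thumb_str_imm(bytes.fromhex("2c60")))
print(decode_arm_str_imm(bytes.fromhex("004085e5")))
```

Both calls describe the same store (r4 to the address in r5, offset 0), even though the field positions and widths differ, which is presumably part of why the decoder keeps separate opcodes for them.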

Long story short, the lifter has to be updated.

$ bap-mc '00 40 85 e5' --arch=arm --show-bil
{
  mem := mem with [R5, el]:u32 <- R4
}
$ bap-mc '2c 60' --arch=thumb --show-bil
{
  special (Lifter: not implemented)
}

Once the lifter is updated, we can move on to solving the main problem. If you look into the file that was kindly provided by @valour01, you will notice that it contains both ARM and Thumb instructions. Moreover, it is actually an ARM binary,

$ arm-linux-gnueabi-readelf -h test
ELF Header:
  Magic:   7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF32
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           ARM
  Version:                           0x1
  Entry point address:               0x10a38
  Start of program headers:          52 (bytes into file)
  Start of section headers:          201884 (bytes into file)
  Flags:                             0x5000200, Version5 EABI, soft-float ABI
  Size of this header:               52 (bytes)
  Size of program headers:           32 (bytes)
  Number of program headers:         9
  Size of section headers:           40 (bytes)
  Number of section headers:         38
  Section header string table index: 35

Notice, Machine ARM and Version5 EABI.

And it starts as an ARM binary: the instructions at 0x10a38 are plain old 4-byte ARM instructions, as are all other instructions from the linked-in runtime.
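The header fields that matter here can be read with a few lines of Python. This is a minimal sketch, not how BAP loads files: the function name is hypothetical, it only handles little-endian ELF32, and it relies on two conventions that come from the ARM ELF ABI rather than from this thread, namely that the EABI version lives in the top byte of e_flags and that bit 0 of the entry address marks a Thumb entry point.

```python
import struct

EM_ARM = 40  # e_machine value for ARM in the ELF specification


def parse_arm_elf_header(data: bytes) -> dict:
    """Read the ELF32 header fields relevant to ARM/Thumb (hypothetical helper)."""
    fields = struct.unpack_from("<16sHHIIIIIHHHHHH", data, 0)
    ident, _e_type, e_machine, _e_version, e_entry = fields[:5]
    e_flags = fields[7]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 1 or ident[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF32")
    return {
        "machine_is_arm": e_machine == EM_ARM,
        # Assumption from the ARM ELF ABI: top byte of e_flags is the EABI version.
        "eabi_version": e_flags >> 24,
        # Assumption from the ARM ELF ABI: bit 0 of the entry selects Thumb.
        "entry_mode": "thumb" if e_entry & 1 else "arm",
        "entry_address": e_entry & ~1,
    }
```

Applied to the header above (flags 0x5000200, entry point 0x10a38), this reports EABI version 5 and an ARM-mode entry, because the entry address is even. Note that the header only describes the entry point; it says nothing about which mode the rest of the code uses.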

Such multi-architecture binaries are not an exception; they are common, as most ARM processors have two (or even more) decoders, one for each instruction set they support. Depending on the state of the CPU, it will interpret bytes differently. The state is usually just a flag, which is set by branching, e.g., BX dst will jump to the destination dst and can switch the state from ARM to Thumb, or vice versa. From the point of view of reverse engineering, it means that the same sequence of bytes may have different interpretations depending on the current state of the CPU, e.g., the same sequence of bytes 00 40 85 e5 is interpreted as

ands r0, r0
b #-0x4f6

in Thumb mode, i.e., an and operation followed by a pc-relative branch, and as

str r4, [r5]

in ARM mode, i.e., as a store operation.
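The two readings of 00 40 85 e5 can be reproduced by hand. The sketch below is illustrative only: it recognizes just the two 16-bit Thumb formats that occur in this example (the register-register data-processing form with opcode 0, i.e. ands, and the unconditional branch with an 11-bit offset), and the helper names are made up. The bit layouts are my reading of the ARM architecture manual, so treat them as assumptions.

```python
import struct


def decode_thumb_halfword(hw: int) -> str:
    """Decode one 16-bit Thumb instruction; only two formats are handled."""
    if hw >> 10 == 0b010000 and (hw >> 6) & 0xF == 0:
        # Data-processing (register) form, opcode 0000 = AND (flag-setting).
        rm = (hw >> 3) & 0x7
        rdn = hw & 0x7
        return f"ands r{rdn}, r{rm}"
    if hw >> 11 == 0b11100:
        # Unconditional branch: signed 11-bit offset, counted in halfwords.
        offset = (hw & 0x7FF) << 1
        if offset & 0x800:
            offset -= 0x1000
        sign = "-" if offset < 0 else ""
        return f"b #{sign}{abs(offset):#x}"
    return f".short {hw:#06x}"


def as_thumb(data: bytes) -> list:
    """Interpret four bytes as two little-endian Thumb halfwords."""
    return [decode_thumb_halfword(hw) for hw in struct.unpack("<HH", data)]


def as_arm_word(data: bytes) -> int:
    """Interpret the same four bytes as one little-endian ARM word."""
    return struct.unpack("<I", data)[0]


raw = bytes.fromhex("004085e5")
print(as_thumb(raw))          # Thumb view: two instructions
print(hex(as_arm_word(raw)))  # ARM view: one 32-bit word
```

The Thumb view splits the bytes into the halfwords 0x4000 and 0xe585, while the ARM view reads the single word 0xe5854000, which is the str r4, [r5] encoding shown earlier.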

While modern compilers are unlikely to generate code that reuses the same location for different interpretations, a malformed or malicious program may well do this. For us, as reverse engineers, it means that both interpretations are valid, depending on the context, and that the same address may hold different instructions depending on the context. It also means that every time we see a bx or blx instruction, we have to fork our disassembler to produce two versions of the program: one for the case when we are in ARM mode and another for the case when we are in Thumb mode. It is easy to see that this leads to exponential growth of the program, i.e., 2^n versions, where n is the number of such instructions. Finally, as soon as we see an unresolved instruction, e.g., bx r0, we basically have to assume that every byte could now be interpreted as either an ARM instruction or a Thumb instruction.

Now we can see that correctly disassembling a Thumb-interworked binary is extremely hard, though doable. There are, however, some roadblocks in the current implementation of the BAP 1.x disassembler. It doesn't allow switching architectures, as the architecture is a property of the whole binary. This is being fixed in BAP 2.0, where the new disassembler engine can ascribe an arbitrary architecture to any program location. Program locations are no longer represented by addresses, so we can now treat the same address as two different locations, depending on the current instruction set. The new framework also enables speculative disassembly, so that we can fully disassemble all possible interpretations of a program and get a sound model of a binary.
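One way to picture "the same address as two different locations" is a worklist that tracks (address, mode) pairs instead of bare addresses. The sketch below is a toy, not BAP's disassembler: Loc, explore, and the hand-written successor table are all hypothetical, and the table stands in for a real decoder. It only shows that, once locations carry the mode, the set of visited locations is at most twice the number of addresses, even when the number of paths through mode-switching branches doubles at each step.

```python
from collections import deque
from typing import Callable, Iterable, NamedTuple, Set


class Loc(NamedTuple):
    """A program location: an address plus the instruction set in effect."""
    addr: int
    mode: str  # "arm" or "thumb"


def explore(entry: Loc, successors: Callable[[Loc], Iterable[Loc]]) -> Set[Loc]:
    """Breadth-first exploration that visits each (addr, mode) pair once."""
    seen = {entry}
    work = deque([entry])
    while work:
        loc = work.popleft()
        for nxt in successors(loc):
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return seen


# Toy successor table (an assumption for illustration): every non-final
# location ends in an unresolved branch, so the target may be in either mode.
TOY_PROGRAM = {
    Loc(0x00, "arm"): [Loc(0x08, "arm"), Loc(0x08, "thumb")],
    Loc(0x08, "arm"): [Loc(0x10, "arm"), Loc(0x10, "thumb")],
    Loc(0x08, "thumb"): [Loc(0x10, "arm"), Loc(0x10, "thumb")],
    Loc(0x10, "arm"): [],
    Loc(0x10, "thumb"): [],
}

locations = explore(Loc(0x00, "arm"), lambda loc: TOY_PROGRAM.get(loc, []))
print(sorted(locations))
```

In this toy there are four distinct paths from the entry to address 0x10, but only five locations to disassemble, and address 0x08 shows up twice, once per mode.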

But that is the problem in general, and as you can see, we're moving in the right direction. A smaller problem is updating the lifter, so that we can at least get the semantics of Thumb instructions.

@fib1d

fib1d commented May 27, 2019

Wow, thank you very much!

I was using bap on a cortex-m ELF file. Using bap-mc on the pure encoding produces entirely different results :)

Thanks a lot again!

@ivg
Member

ivg commented Jul 16, 2020

the work on this issue has moved to #1174

@ivg
Member

ivg commented Nov 10, 2020

fixed in #1178. We now support interworking and Thumb/Thumb2 instruction sets.

@ivg ivg closed this as completed Nov 10, 2020
@valour01
Author

That's great! I was wondering whether there is a stable release of BAP that supports the Thumb/Thumb2 instruction sets, or whether I have to clone the latest git repo to get the support. Many thanks.

@ivg
Member

ivg commented Nov 26, 2020

@valour01, the stable release (2.2.0) will be out soon. You can get the latest testing version (which matches the master branch of BAP) by just adding the testing repository to opam,

opam repo add bap-testing git+https://github.com/BinaryAnalysisPlatform/opam-repository#testing

Alternatively, you can use Debian packages that are automatically released every Saturday, see here.
