Idea - Yara support #395

re-fox · 2021-01-17T05:27:02Z

Description

There exist edge cases where it can be awkward to write a Capa rule to trigger on some binary functionality. Yara helps address these.

Currently Capa does not support:

Pairing a mnemonic with a register or constant (e.g. push 0xf000)
Mnemonic ordering
Inspecting the bytes of a function/basic block

Rather than generically scanning the whole file with Yara, I propose using Capa's knowledge of functions and basic blocks to limit scope and write rules that otherwise would not be possible.

Generic Example

Using the example from Idea - Location information · Discussion #393 · fireeye/capa · GitHub

rule:
  meta:
    scope: basic block
  features:
    - and:
      - yara: 'rule jmp_middle_instruction {
               strings: $jump_middle_instruction = {eb 01 ?? ?? e8 00}
               condition: any of them
               }'

The data passed to Yara would be the raw bytes that comprise the basic block. Yara makes this easy via the option in yara-python to pass a buffer rather than a file handle.

rules = yara.compile(source=rule_text)
matches = rules.match(data=raw_basic_block_bytes)

Taking this one step further and pairing it with an API would result in the following (naive) rule:

rule:
  meta:
    scope: basic block
  features:
    - and:
      - api: foobar 
      - yara: 'rule push_deadbeef_and_call {
               strings: $push_ = {68 ef be ad de (ff 15| e8)}
               condition: any of them
               }'

The above would look for:

push 0xdeadbeef
call foobar

This would solve the issue of pairing a mnemonic with its value(s). The example is not perfect, but when the constant is more generic or the order of arguments is important, Yara would offer a solution.
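To make the byte-level matching concrete without depending on Yara, here is a minimal sketch that translates a small subset of the hex-string syntax used above (hex bytes, `??` wildcards, and `( a | b )` alternation) into a Python bytes regex. The helper name `hex_pattern_to_regex` and the sample byte strings are assumptions made for illustration; this is not how Yara or capa implement matching.

```python
import re


def hex_pattern_to_regex(pattern: str) -> "re.Pattern[bytes]":
    """Translate a subset of Yara hex-string syntax into a bytes regex.

    Supported tokens: hex bytes (68), full-byte wildcards (??),
    and alternation groups such as (ff 15 | e8).
    """
    # Pad the structural characters so split() yields one token each.
    tokens = (
        pattern.replace("(", " ( ").replace(")", " ) ").replace("|", " | ").split()
    )
    parts = []
    for tok in tokens:
        if tok == "(":
            parts.append(b"(?:")  # non-capturing group
        elif tok in (")", "|"):
            parts.append(tok.encode("ascii"))
        elif tok == "??":
            parts.append(b".")  # any single byte (DOTALL is set below)
        else:
            parts.append(re.escape(bytes.fromhex(tok)))
    return re.compile(b"".join(parts), re.DOTALL)


rx = hex_pattern_to_regex("68 ef be ad de (ff 15| e8)")

# Hypothetical basic block bytes.
direct_call = bytes.fromhex("68 ef be ad de e8 10 00 00 00")     # push; call rel32
import_call = bytes.fromhex("68 ef be ad de ff 15 00 20 40 00")  # push; call [mem]
other_const = bytes.fromhex("68 78 56 34 12 e8 10 00 00 00")     # different constant

found_direct = rx.search(direct_call) is not None
found_import = rx.search(import_call) is not None
found_other = rx.search(other_const) is not None
```

Because the buffer handed to the matcher is only the bytes of one basic block, a hit is already scoped to that block, which is the point of the proposal.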

Implementation Considerations

Capa's concept of scope fits well.

file - scan the whole file with Yara
basic block and function - when the search should be restricted.

Strings could also be passed to Yara, but Capa already provides reasonable (full-regex) support for these. Most use cases I've thought of involve scanning for byte sequences where instruction order and operator+operand pairing are important.

The logic to extract bytes for each BB/Function would have to exist down in each engine (more code), but it should be possible with most backend frameworks.

Drawbacks to including Yara

Attempting to lint the contributed rules
Yara rules can be difficult to qualitatively measure
May encourage lazy ports of Yara rules when the logic should exist in Capa logic
Could introduce confusion about Capa's core tenets of identifying capabilities and not being a detection tool
Presenting the data could be difficult (byte sequences may introduce confusion to the average user)
No longer pure python - building, packaging, shipping is not a pip install anymore.

williballenthin · 2021-01-17T18:04:49Z

Maintainer

First, I think this is a well-written proposal. I appreciate the various considerations that you've included here, and it definitely will spark a good discussion.

Second, some related discussions:

In Support Bytes feature in file scope #242 we discuss supporting matching byte sequences at the file scope.
In support wildcards and skips for bytes feature #233 we discuss supporting wildcards and skips in the existing bytes feature (that is limited to references).

Of note, our existing stance has been:

the primary counter argument is to avoid competing with yara (which does a great job at matching binary content).
On the other hand, I think we could experiment with this feature pretty easily.

However, just because we've been doing it one way for a while doesn't mean that we should never change our minds. As noted, I think we could experiment with an implementation fairly easily.

These days, for me, the following concerns are at the front of my mind. They're not necessarily relevant here, but might be:

concern 1: don't want rules to become too difficult for a human to understand
concern 2: don't want the code to become too complex to maintain
concern 3: don't want to compete with other purpose-built tools

I wonder if we could emulate some of yara's functionality but continue to use our own syntax. For example, could we support another bytes-like feature that matches against the current scope (and supports wildcards) that can mix with existing capa statements like and/or/etc?

rule:
  meta:
    scope: basic block
  features:
    - and:
      - api: foobar 
      - contains: {68 ef be ad de (ff 15| e8)} = push 0xDEADBEEF; call XXX


mr-tz · 2021-01-18T10:00:18Z

Maintainer

It would really be awesome to express features such as:

instructions with operands, like push 0xDEADBEEF
code sequences like push 0xDEADBEEF; call X

Using yara is an interesting idea for this, although I'd prefer to extend capa accordingly. I think something like this would be cool:

features:
  - and:
    - sequence: |
        push 0xDEADBEEF
        call X

I know we'd have to change the engine quite a bit for this, so we need to take a closer look at all the benefits.


re-fox · 2021-01-18T20:32:42Z

Author

@mr-tz interesting idea. There are several benefits to your approach.

@williballenthin thank you for the comments. I had not seen the Capa issues.

One disadvantage to byte matching is rule authors may try to account for instruction encoding and register swapping. This can lead to either overly permissive regular expressions or overly restrictive ones (unless done right).

A common example is having a Yara rule look for 33c0, but not 31c0, where both encodings decode to xor eax, eax. An author may just wildcard the nibble (3?c0), but now the pattern is overly permissive and may no longer trigger only on the intended instructions.

@mr-tz's idea would eliminate the need to account for instruction encoding, which could help lower the barrier to entry and prevent unintended byte matches.
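The encoding pitfall can be checked directly. The sketch below is an illustration using Python's `re` module on bytes rather than Yara itself; the variable names are hypothetical. It compares an exact pattern, the nibble wildcard, and an explicit alternation of the two intended encodings.

```python
import re

# Two-byte x86 encodings (ModRM byte c0 selects eax/eax or al/al).
XOR_EAX_EAX_33 = bytes.fromhex("33c0")  # xor eax, eax (opcode 33)
XOR_EAX_EAX_31 = bytes.fromhex("31c0")  # xor eax, eax (opcode 31)
CMP_EAX_EAX = bytes.fromhex("39c0")     # cmp eax, eax
XOR_AL_AL = bytes.fromhex("30c0")       # xor al, al

exact = re.compile(re.escape(XOR_EAX_EAX_33))  # like the hex string { 33 c0 }
nibble = re.compile(rb"[\x30-\x3f]\xc0")       # like the hex string { 3? c0 }
both = re.compile(rb"[\x31\x33]\xc0")          # like { (31 | 33) c0 }


def hits(rx, code: bytes) -> bool:
    """Return True when the pattern matches the whole two-byte instruction."""
    return rx.fullmatch(code) is not None


samples = [XOR_EAX_EAX_33, XOR_EAX_EAX_31, CMP_EAX_EAX, XOR_AL_AL]
exact_hits = [hits(exact, s) for s in samples]
nibble_hits = [hits(nibble, s) for s in samples]
both_hits = [hits(both, s) for s in samples]
```

The exact pattern misses the 31c0 encoding, the nibble wildcard also accepts cmp eax, eax and xor al, al, and only the explicit alternation accepts exactly the two intended encodings. Matching on disassembled instructions would sidestep the question entirely.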

An extension to this idea could be:

features:
  - and:
    - sequence:
      - instruction: push 0xdeadbeef
      - mnemonic: call
      - api: foobar

The instruction feature would contain the full instruction, and other features could be used alongside it. sequence would specify that the features need to occur in order. I would assume that the above yaml would match:

push 0xdeadbeef
call foobar

and

push 0xdeadbeef
nop
nop
call foobar

Since the characteristics only need to match in order, this could eliminate the need to worry about interleaved junk instructions. With Yara rules, by contrast, byte regexes are forced to become more permissive and include optional wildcarded ranges.
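A minimal sketch of the ordering semantics described above, assuming instructions are available as plain text strings. The helpers `instruction`, `mnemonic`, `api`, and `matches_in_order` are hypothetical and not part of capa. Consecutive features may match the same instruction, so that `mnemonic: call` and `api: foobar` can both be satisfied by `call foobar`.

```python
from typing import Callable, List

Feature = Callable[[str], bool]


def instruction(text: str) -> Feature:
    """Match one full instruction verbatim."""
    return lambda insn: insn == text


def mnemonic(name: str) -> Feature:
    """Match on the mnemonic only."""
    return lambda insn: insn.split()[0] == name


def api(name: str) -> Feature:
    """Match a call whose target is the named API (simplified)."""
    return lambda insn: insn.split()[0] == "call" and insn.split()[-1] == name


def matches_in_order(instructions: List[str], features: List[Feature]) -> bool:
    """True if every feature matches at a position at or after the previous match."""
    pos = 0
    for feature in features:
        # Skip instructions (for example junk) until this feature matches.
        while pos < len(instructions) and not feature(instructions[pos]):
            pos += 1
        if pos == len(instructions):
            return False
        # pos is not advanced: the next feature may match the same instruction.
    return True


sequence = [instruction("push 0xdeadbeef"), mnemonic("call"), api("foobar")]

plain = ["push 0xdeadbeef", "call foobar"]
with_junk = ["push 0xdeadbeef", "nop", "nop", "call foobar"]
reversed_order = ["call foobar", "push 0xdeadbeef"]

plain_ok = matches_in_order(plain, sequence)
junk_ok = matches_in_order(with_junk, sequence)
reversed_ok = matches_in_order(reversed_order, sequence)
```

Both the plain and the junk-padded listings match, while the reversed listing does not.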

Byte pattern matching would still be useful in the (edge) cases where control flow is obfuscated. Jumping to the middle of the next instruction (for example) would not be easy to describe when the disassembly may become untrustworthy.

Yara is the tool that comes to mind when referencing byte pattern matching. With that said, Yara itself is not as important as the idea. Capa already supports boolean logic; if adding support for byte patterns (or regular expressions) is a trivial addition, then it would make sense not to include an external dependency.

@mr-tz's approach may supersede the original idea altogether.


williballenthin · 2021-01-19T15:59:41Z

Maintainer

One disadvantage to byte matching is rule authors may try to account for instruction encoding and register swapping. This can lead to either overly permissive regular expressions or overly restrictive ones (unless done right).

A common example is having a Yara rule look for 33c0, but not 31c0, where both encodings decode to xor eax, eax. An author may just wildcard the nibble (3?c0), but now the pattern is overly permissive and may no longer trigger only on the intended instructions.

I like this reasoning a lot. Furthermore, using the raw hex for instructions is difficult for a human to inspect: they need to trust any associated comments, and are unlikely to disassemble a hex string to figure out what instructions match. Therefore, I'd agree that it would be preferable to use assembly language strings to describe what we're looking for (@mr-tz's idea).

I think we can implement verbatim assembly matching fairly easily, using something like keystone. Assemble the string to bytes, then match on byte sequences, and good to go.

However, I think we'd want at least some wildcarding support. That is, being able to say something like push 1; push 2; call ??? or mov {eax, ebx}, 0x12. I've always wanted to be able to recognize, e.g. indirect calls to CreateThread using only the pattern of parameters and constants, and I think this might enable that. Though, we'd need to develop:

a domain specific language that extends x86 to support the wildcards we want (operand classes? value ranges? named constants?)
a library that can generate patterns (recommend binary regexes, probably) from wildcarded x86.
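As a rough sketch of the second item, the snippet below turns a tiny wildcarded assembly listing into a binary regex. Everything here is an assumption for illustration: 32-bit x86 only, a hand-written encoding table instead of a real assembler such as keystone, and placeholder names `{offset}` and `{reg}` chosen for this example. Only `push <integer>`, `push {offset}`, and `call {reg}` are handled.

```python
import re
import struct


def insn_to_regex(line: str) -> bytes:
    """Translate one (possibly wildcarded) instruction into a bytes regex fragment."""
    mnemonic, operand = line.split(None, 1)
    operand = operand.strip()
    if mnemonic == "push" and operand == "{offset}":
        return b"\x68.{4}"  # push imm32, any 4-byte immediate
    if mnemonic == "push":
        value = int(operand, 0)
        # push imm32 is always possible...
        alts = [re.escape(b"\x68" + struct.pack("<I", value & 0xFFFFFFFF))]
        # ...and small values can also use the sign-extended push imm8 form.
        if -128 <= value <= 127:
            alts.append(re.escape(b"\x6a" + struct.pack("<b", value)))
        return b"(?:" + b"|".join(alts) + b")"
    if mnemonic == "call" and operand == "{reg}":
        return b"\xff[\xd0-\xd7]"  # call eax .. call edi
    raise ValueError("unsupported instruction: " + line)


def compile_pattern(listing: str) -> "re.Pattern[bytes]":
    """Compile a ';'-separated listing into one regex over contiguous bytes."""
    lines = [part.strip() for part in listing.split(";") if part.strip()]
    return re.compile(b"".join(insn_to_regex(l) for l in lines), re.DOTALL)


rx = compile_pattern("push 1; push 2; call {reg}")

short_form = bytes.fromhex("6a 01 6a 02 ff d0")                    # push 1; push 2; call eax
long_form = bytes.fromhex("68 01 00 00 00 68 02 00 00 00 ff d1")  # imm32 pushes; call ecx
wrong_arg = bytes.fromhex("6a 01 6a 03 ff d0")                     # push 1; push 3; call eax

short_ok = rx.search(short_form) is not None
long_ok = rx.search(long_form) is not None
wrong_ok = rx.search(wrong_arg) is not None
```

Emitting every valid encoding for each instruction is what keeps rule authors from having to think about encodings themselves. A real implementation would need a much larger table or an assembler backend.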

I think this is probably a feasible project, though non-trivial (sidebar: this would be a neat intern project: practical, challenging, many opportunities to learn). I can also see how this can be integrated into capa while maintaining fair performance: a single up-front scan for pattern matches, index the results by VA, and emit features from this index into the appropriate scopes. The only scary part is the up-front scan, but that can be done in native code-land, so it's probably fast enough.
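The scan-then-index flow can be sketched as follows. This is a hypothetical illustration in pure Python, not capa's architecture: the function names, the flat `image` buffer, and the representation of scopes as sorted, non-overlapping `(start_va, end_va)` ranges are all assumptions.

```python
import bisect
import re
from collections import defaultdict


def scan_patterns(image: bytes, base_va: int, patterns: dict) -> dict:
    """Scan the image once per pattern and index hits as {va: [pattern names]}."""
    hits = defaultdict(list)
    for name, rx in patterns.items():
        # finditer reports non-overlapping matches, left to right.
        for match in rx.finditer(image):
            hits[base_va + match.start()].append(name)
    return dict(hits)


def features_by_scope(hits: dict, scopes: list) -> dict:
    """Assign each hit to the scope (e.g. basic block) whose range contains its VA."""
    starts = [start for start, _ in scopes]
    out = defaultdict(set)
    for va, names in hits.items():
        i = bisect.bisect_right(starts, va) - 1  # last scope starting at or before va
        if i >= 0 and va < scopes[i][1]:
            out[scopes[i][0]].update(names)
    return dict(out)


# Hypothetical image: 16 nops, then push 0xdeadbeef; call rel32, then padding.
image = b"\x90" * 16 + bytes.fromhex("68 ef be ad de e8 00 00 00 00") + b"\x90" * 6
patterns = {
    "push_deadbeef_and_call": re.compile(
        re.escape(bytes.fromhex("68 ef be ad de")) + b"(?:\xff\x15|\xe8)", re.DOTALL
    )
}
basic_blocks = [(0x401000, 0x401010), (0x401010, 0x401020)]

hits = scan_patterns(image, 0x401000, patterns)
block_features = features_by_scope(hits, basic_blocks)
```

The same index could be queried with function ranges instead of basic block ranges, so the expensive scan happens only once per file.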

@re-fox goes further:

sequence would specify that the features need to occur in order.

This would provide ultimate flexibility, though as a counterpoint, would it be too much flexibility? In a scenario of push 1; call A; push 2; call B;, someone matching sequence: push 1; call B; would get a false positive. Then again, that's not worse than how capa currently does things: and: mnemonic: push; integer: 1; api: B; has the same FP.
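The false positive can be reproduced in a few lines. The sketch below is illustrative only: it models instructions and features as plain strings, and the three matchers are hypothetical stand-ins for an ordered sequence, an unordered and over a scope's feature set, and strict contiguous matching.

```python
instructions = ["push 1", "call A", "push 2", "call B"]

# Unordered view of the same scope, as a bag of features.
features = {"mnemonic: push", "mnemonic: call", "integer: 1", "integer: 2", "api: A", "api: B"}


def in_order(needles, haystack) -> bool:
    """Ordered match that tolerates gaps between the matched instructions."""
    it = iter(haystack)
    return all(needle in it for needle in needles)  # 'in' consumes the iterator


def contiguous(needles, haystack) -> bool:
    """Strict match: the instructions must be adjacent."""
    n = len(needles)
    return any(haystack[i:i + n] == needles for i in range(len(haystack) - n + 1))


sequence_fp = in_order(["push 1", "call B"], instructions)
and_fp = {"mnemonic: push", "integer: 1", "api: B"} <= features
strict_match = contiguous(["push 1", "call B"], instructions)
```

Both the gap-tolerant sequence and the unordered and report a match for push 1 paired with call B, while the strict contiguous matcher does not. The price of the strict form is that any interleaved junk instruction defeats it.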

2 replies

williballenthin Jan 19, 2021
Maintainer

pop quiz: what function is called here?

push 0
push 0
push 0
push sub_401000
push 0
push 0
call eax

of course, it's CreateThread!
so that suggests we could create the capa rule:

meta:
  name: calls CreateThread indirectly
features:
  instructions: |
    push 0
    push 0
    push 0
    push {offset}
    push 0
    push 0
    call {reg}

what i like:

1. it's readable
2. it probably works?

what i don't like:
3. it doesn't work for 64-bit, we'd need a separate rule
4. we might be hacking our way around not having real function argument extraction (but probably impossible to do right 100% of the time)

if we had (4), then we could alternatively write a rule like:

meta:
  name: calls CreateThread indirectly
features:
  call:
    target: {indirect}
    args:
      - 0
      - 0
      - {offset}
      - 0
      - 0
      - 0

now, this could be generalized across 32- and 64-bit platforms, and probably also across interspersed unrelated logic (or junk instructions). however, we'd need a hardcore code analysis layer here, approaching a decompiler. that's probably at odds with us supporting multiple analysis backends. so, what set of features do we want and how to prioritize them?

personally, this last feature sounds like a ton of work and complexity for moderate gain, while the other ideas in this thread are less work for approximately the same gain. i just bring it up to help imagine what a best possible solution could look like.

williballenthin Jan 19, 2021
Maintainer

hm, not sure if i should have used a "reply" here versus creating the next "answer" entry. recommend that we continue using "answers" (that everyone else has been doing but me 👼 ) unless the response is specifically related enough for a "reply".
