Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support matching integer values or ranges > 8 bits #21

Open
nishihatapalmer opened this issue Nov 2, 2019 · 5 comments
Open

Support matching integer values or ranges > 8 bits #21

nishihatapalmer opened this issue Nov 2, 2019 · 5 comments

Comments

@nishihatapalmer
Copy link
Owner

nishihatapalmer commented Nov 2, 2019

Matching integer values (or ranges of them) for more than a byte, e.g. a 16 bit or 32 bit value.

SequenceMatchers
It is trivial to create a SequenceMatcher which matches 2, 4 or 8 bytes. The unit of matching is then the length of the sequence. To match a single value, it's not really necessary to have such a matcher, as the specific byte sequence could just be inserted as a fixed set of bytes. To match a range of integer values would require the logic in the matcher to accurately assess whether the number observed fell in the integer range.

Byte Dependencies
When matching an integer range, different bytes may match or not depending on the values of others. Deconstructing an IntRange SequenceMatcher into a sequence of ByteMatchers no longer gives identical results.

The four ByteMatchers which could be returned for an Int32RangeMatcher can only give the superset of all possible bytes that might match, which means they will match things the int32rangematcher will not (as they aren't embodying the int range logic).

It is possible that some optimisation algorithms might split up SequenceMatchers differently, operating on the assumption that it's always possible to deconstruct a SequenceMatcher into a list of ByteMatchers, and that running each of them will give identical results.

Search Algorithm Issues
If these are plugged into a search algorithm, some algorithms will have false positive matches. This is because they all assume that the bytes which can match at a particular position are fixed and always the same - but if we have > 8 bit numbers, a byte which matches in one position becomes dependant on the value of a byte in another position.

Any search algorithm which always validates a match by running the Matchers will be accurate. However, other algorithms like ShiftOR "compile" the pattern down to a lookup table of byte matches, so they don't actually run the SequenceMatcher to validate a match at all (and that is one reason they are fast).

@nishihatapalmer
Copy link
Owner Author

nishihatapalmer commented Nov 2, 2019

No Value in Integer Matchers

There is clearly no need for matchers for single integer values, as these can just be directly expressed as a sequence of hex bytes naturally in byteseek. Supporting endianness might be marginally useful, but adding integer syntax and endianness just to avoid writing the bytes out in the other order seems excessive and unnecessary complexity.

Range Integer Matchers
There is possibly value in looking at matchers which see if a > 8 bit number falls into a range. This would need to support endianness. This introduces the issues of byte dependencies, violating the assumption that a SequenceMatcher can always be deconstructed into a list of ByteMatchers.

@nishihatapalmer
Copy link
Owner Author

nishihatapalmer commented Nov 2, 2019

Are They SequenceMatchers?

A key question is whether it is worth it to make these matchers normal SequenceMatchers. They violate an assumption that SequenceMatchers can be decomposed equivalently into individual ByteMatchers.

Search Algorithm Assumptions
Most search algorithms assume that bytes to match in a pattern are fixed, and don't change depending on the values of earlier bytes read. If that assumption no longer holds, then all search algorithms will have to treat some SequenceMatchers differently.

Decomposability Flag
The SequenceMatchers could have some kind of flag on them to indicate they always require running (they are not equivalently decomposable to ByteMatchers), even though they provide the means to do so (for a superset of all the bytes they could match). Search algorithms which do not always run the original matchers would have to provide a post-match check for any SequenceMatchers which are not decomposable, to avoid false positives.

Fragility
It feels fragile to force search algorithms to treat some SequenceMatchers as different from others. If they neglect to do this, then their results will be wrong. What has to be done varies from algorithm to algorithm, adding complexity and slowing performance a little for the vast majority of cases.

@nishihatapalmer
Copy link
Owner Author

nishihatapalmer commented Nov 2, 2019

Performance Implications
An IntegerRangeMatcher would massively degrade the performance of most search algorithms it's part of. This is because many byte positions will have all possible bytes as a possible value.

ByteMatchers which match all bytes cause huge slowdowns for almost all search algorithms (except ShiftOR), depending on where in the pattern they fall.

These sorts of byte matchers should be generally avoided in almost all search algorithms to obtain good performance. It is usually better to search for a slightly shorter sequence without them, then validate that they also match once the search for the slightly shorter sequence has matched.

It is true that we already have ByteMatchers which do this (the Any matcher), so adding IntegerRanges doesn't add a completely new problem here - but even they should be avoided in the common search algorithms where possible.

@nishihatapalmer
Copy link
Owner Author

nishihatapalmer commented Nov 2, 2019

Search Algorithm Assessment

There are only a few search algorithms which will work with integer ranges (or other matchers where bytes are dependant on other byte values).

Linear Algorithms
Linear algorithms examine each position in the data. These should mostly work OK with these types of matcher.

Naive Search
This always works, as it just looks in each position for the matcher.

ShiftOR Search
This would possibly be faster than the naive search, but would have to add post-match validation of the actual matcher. Since one of the advantages of ShiftOR is that it doesn't have to do this normally, it's unclear what the impact on performance would be. Would need testing.

Sub-Linear algorithms
Lots of algorithms are sub-linear, in that they can jump over parts of data which can't possibly match. However, these massively degrade performance when faced with bytes that can match all possible values. Horpsool, Qgram-Filtering, SignedHash and others use these techniques, and would not work well with these sorts of matchers. They may also have to add additional validation steps to make sure they always work correctly.

@nishihatapalmer
Copy link
Owner Author

nishihatapalmer commented Nov 2, 2019

DynamicMatchers

Another option is to make these sorts of Matcher entirely different from SequenceMatchers. The bytes that they match depend on the values of other bytes or logic. This then excludes them from being put into the existing search algorithms.

A naive searcher and shiftOR searcher could be provided specifically for these types. It does not appear worth it to provide any sub-linear algorithms, as they tend not to work well with this kind of matcher.

This has the advantage that the type system enforces what can be searched for, and we don't build in unnecessary complexity and fragility into the search algorithms.

The disadvantage is that it is now much more complicated to search for an overall pattern that contains both normal SequenceMatchers and DynamicMatchers. They both share a lot in common - they are fixed length matchers that can be assembled into longer sequences of other sequences.

DynamicMatchers would still need to decompose themselves into possible ByteMatchers so that the ShiftOR algorithm could build its matching table. So they would still need the same machinery.

Making them a different type protects existing search algorithms from having to deal with them, but they are essentially doing exactly the same things as the other SequenceMatchers.

So this boils down to:

  1. If they are sequence matchers, then we have to modify existing search algorithms to ensure they always work, even when they won't work especially well. Everything then plays nicely with everything else, even though it adds complexity which is mostly not needed except for these special types of matcher.

  2. If they are dynamic matchers, then we have to add new search algorithms for them, and we have to come up with a way to split up matching and searching across different types of fixed length sequences, even though they share almost all of the same essential machinery between them.

Summary

Is it worth making some existing search algorithms more fragile and complex to avoid the overhead of creating entirely new ones and then finding a way to make it all work together for compound patterns?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant