-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support matching integer values or ranges > 8 bits #21
Comments
No Value in Integer Matchers There is clearly no need for matchers for single integer values, as these can just be directly expressed as a sequence of hex bytes naturally in byteseek. Supporting endianness might be marginally useful, but adding integer syntax and endianness just to avoid writing the bytes out in the other order seems excessive and unnecessary complexity. Range Integer Matchers |
Are They SequenceMatchers? A key question is whether it is worth it to make these matchers normal SequenceMatchers. They violate an assumption that SequenceMatchers can be decomposed equivalently into individual ByteMatchers. Search Algorithm Assumptions Decomposability Flag Fragility |
Performance Implications ByteMatchers which match all bytes cause huge slowdowns for almost all search algorithms (except ShiftOR), depending on where in the pattern they fall. These sorts of byte matchers should be generally avoided in almost all search algorithms to obtain good performance. It is usually better to search for a slightly shorter sequence without them, then validate that they also match once the search for the slightly shorter sequence has matched. It is true that we already have ByteMatchers which do this (the Any matcher), so adding IntegerRanges doesn't add a completely new problem here - but even they should be avoided in the common search algorithms where possible. |
Search Algorithm Assessment There are only a few search algorithms which will work with integer ranges (or other matchers where bytes are dependant on other byte values). Linear Algorithms Naive Search ShiftOR Search Sub-Linear algorithms |
DynamicMatchers Another option is to make these sorts of Matcher entirely different from SequenceMatchers. The bytes that they match depend on the values of other bytes or logic. This then excludes them from being put into the existing search algorithms. A naive searcher and shiftOR searcher could be provided specifically for these types. It does not appear worth it to provide any sub-linear algorithms, as they tend not to work well with this kind of matcher. This has the advantage that the type system enforces what can be searched for, and we don't build in unnecessary complexity and fragility into the search algorithms. The disadvantage is that it is now much more complicated to search for an overall pattern that contains both normal SequenceMatchers and DynamicMatchers. They both share a lot in common - they are fixed length matchers that can be assembled into longer sequences of other sequences. DynamicMatchers would still need to decompose themselves into possible ByteMatchers so that the ShiftOR algorithm could build its matching table. So they would still need the same machinery. Making them a different type protects existing search algorithms from having to deal with them, but they are essentially doing exactly the same things as the other SequenceMatchers. So this boils down to:
Summary Is it worth making some existing search algorithms more fragile and complex to avoid the overhead of creating entirely new ones and then finding a way to make it all work together for compound patterns? |
Matching integer values (or ranges of them) for more than a byte, e.g. a 16 bit or 32 bit value.
SequenceMatchers
It is trivial to create a SequenceMatcher which matches 2, 4 or 8 bytes. The unit of matching is then the length of the sequence. To match a single value, it's not really necessary to have such a matcher, as the specific byte sequence could just be inserted as a fixed set of bytes. To match a range of integer values would require the logic in the matcher to accurately assess whether the number observed fell in the integer range.
Byte Dependencies
When matching an integer range, different bytes may match or not depending on the values of others. Deconstructing an IntRange SequenceMatcher into a sequence of ByteMatchers no longer gives identical results.
The four ByteMatchers which could be returned for an Int32RangeMatcher can only give the superset of all possible bytes that might match, which means they will match things the int32rangematcher will not (as they aren't embodying the int range logic).
It is possible that some optimisation algorithms might split up SequenceMatchers differently, operating on the assumption that it's always possible to deconstruct a SequenceMatcher into a list of ByteMatchers, and that running each of them will give identical results.
Search Algorithm Issues
If these are plugged into a search algorithm, some algorithms will have false positive matches. This is because they all assume that the bytes which can match at a particular position are fixed and always the same - but if we have > 8 bit numbers, a byte which matches in one position becomes dependant on the value of a byte in another position.
Any search algorithm which always validates a match by running the Matchers will be accurate. However, other algorithms like ShiftOR "compile" the pattern down to a lookup table of byte matches, so they don't actually run the SequenceMatcher to validate a match at all (and that is one reason they are fast).
The text was updated successfully, but these errors were encountered: