How to handle input with characters having more than one byte in UTF-8 #154

mar4th3 · 2024-08-29T11:05:30Z

Hi,

First of all, thank you for this amazing library.

While playing around with it I stumbled upon this issue.

When matching on strings containing characters that UTF-8 encodes as more than one byte, the end offset does not line up with the character index in the original string.

See for instance this example:

import hyperscan

matches = []


def match_event_handler(dbid, start, end, flags, context) -> bool | None:
    matches.append(end)


expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
)


text = "test®"
db.scan(text.encode("utf-8"), match_event_handler=match_event_handler)

print(matches)
# [5, 6]

The highest end offset is 6, but len("test®") is 5.

Is there any workaround to this? Am I misunderstanding something?

Thank you!

betterlch · 2024-09-02T02:33:18Z

Because the offsets are counted in bytes of the encoded input, and len(text.encode()) is 6:
text.encode() == b'test\xc2\xae'
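
To make the byte/character difference concrete, here is a small standalone sketch using only built-in Python (no hyperscan involved). The variable names are just for illustration.

```python
text = "test®"
data = text.encode("utf-8")

# Number of Unicode code points vs. number of UTF-8 bytes.
char_count = len(text)
byte_count = len(data)

# How many bytes each character occupies once encoded.
widths = [len(ch.encode("utf-8")) for ch in text]

print(char_count)  # 5
print(byte_count)  # 6
print(widths)      # [1, 1, 1, 1, 2]
```

The "®" character takes two bytes, so an end offset of 6 points just past the last byte of the encoded input.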

mar4th3 · 2024-09-02T09:52:33Z

Thank you for the reply! I understand that that's the reason, but is there any workaround?

Or is this a limitation of hyperscan? In the sense that you cannot get exact offsets with UTF-8.

betterlch · 2024-09-04T07:16:17Z

Try adding the flag HS_FLAG_UTF8:

expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
    flags=[hyperscan.HS_FLAG_UTF8],
)
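
One caveat, as far as I understand it: Hyperscan reports offsets in bytes of the scanned buffer regardless of flags. HS_FLAG_UTF8 changes how the pattern is interpreted (for example "." consumes a whole character instead of a single byte), so matches should no longer end in the middle of a character, but the end offset for "test®" would still be 6. If character indices into the original str are needed, the byte offsets can be converted afterwards. The helper below is a hypothetical sketch, not part of hyperscan, and it assumes the scanned bytes came from text.encode("utf-8").

```python
def build_offset_map(text: str) -> dict[int, int]:
    """Map UTF-8 byte offsets that fall on character boundaries to str indices.

    Hypothetical helper for illustration; not part of the hyperscan package.
    """
    mapping = {0: 0}
    byte_pos = 0
    for char_idx, ch in enumerate(text, start=1):
        # Advance by the encoded width of this character (1 to 4 bytes).
        byte_pos += len(ch.encode("utf-8"))
        mapping[byte_pos] = char_idx
    return mapping


text = "test®"
offset_map = build_offset_map(text)

# A byte end offset of 6 corresponds to character index 5, i.e. len(text).
char_end = offset_map[6]

# Byte offset 5 falls inside "®", so it has no character index.
mid_char = offset_map.get(5)
```

Inside the match handler, the reported end could then be looked up in offset_map before being stored. Offsets missing from the map (such as the 5 seen without the UTF-8 flag) land in the middle of a multi-byte character.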
