How to handle input with characters having more than one byte in UTF-8 #154

mar4th3 opened this issue Aug 29, 2024 · 3 comments

mar4th3 commented Aug 29, 2024

Hi,

First of all, thank you for this amazing library.

While playing around with it I stumbled upon this issue.

When matching on strings containing characters that UTF-8 encodes as more than one byte, the end offset does not match the character position I expect.

See for instance this example:

import hyperscan

matches = []


def match_event_handler(dbid, start, end, flags, context) -> bool | None:
    matches.append(end)


expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
)


text = "test®"
db.scan(text.encode("utf-8"), match_event_handler=match_event_handler)

print(matches)
# [5, 6]

The highest end offset is 6, but len("test®") is 5.

Is there any workaround to this? Am I misunderstanding something?

Thank you!

@betterlch

Because len(text.encode()) is 6: the offsets count bytes of the encoded input, and ® takes two bytes in UTF-8.
text.encode() == b'test\xc2\xae'
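
To make the difference concrete, here is a small standard-library sketch (no hyperscan needed) that compares the length of the string in code points with the length of its UTF-8 encoding. The variable names are only illustrative.

```python
text = "test®"
encoded = text.encode("utf-8")

# len() on a str counts code points; len() on bytes counts bytes.
char_len = len(text)
byte_len = len(encoded)

# "®" (U+00AE) is encoded as the two bytes 0xC2 0xAE.
registered_bytes = "®".encode("utf-8")

print(char_len, byte_len, registered_bytes)
```

The scanner is handed `encoded`, not `text`, so an end offset of 6 is consistent with the end of the byte string.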


mar4th3 commented Sep 2, 2024

Thank you for the reply! I understand that that's the reason, but is there any workaround?

Or is this a limitation of hyperscan, in the sense that you cannot get exact character offsets with UTF-8 input?

@betterlch

Try adding the HS_FLAG_UTF8 flag:

expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
    flags=[hyperscan.HS_FLAG_UTF8],
)
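
The thread does not confirm whether this flag changes the reported offsets. If they are still byte offsets into the encoded input, as observed earlier in the thread, one possible workaround is to convert each byte offset to a character offset after the scan. The sketch below uses only the standard library. `byte_to_char_offset` is a hypothetical helper, not part of hyperscan, and it assumes the offset falls on a character boundary.

```python
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    """Convert a byte offset into UTF-8 `data` to a code-point offset.

    Hypothetical helper, not a hyperscan API. Raises UnicodeDecodeError
    if `byte_offset` lands in the middle of a multi-byte character.
    """
    # Decoding the prefix and counting its code points gives the
    # character position that corresponds to the byte position.
    return len(data[:byte_offset].decode("utf-8"))


text = "test®"
data = text.encode("utf-8")

# 6 is the highest end offset reported in the original example.
char_end = byte_to_char_offset(data, 6)
matched = text[:char_end]
print(char_end, matched)
```

Decoding a prefix for every match is linear in the offset, so for large inputs with many matches it may be worth precomputing a byte-to-character index table once instead.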
