Vectorscan does not support backreferences with indices >= 8 (HS_FLAG_PREFILTER is on) #209

Open
apismensky opened this issue Dec 14, 2023 · 3 comments

apismensky commented Dec 14, 2023

The regex a([ -]?)a\\1a|b([ .-]?)b\\2b|c([ -]?)c\\3c|d([ -]?)d\\4d|e([ -]?)e\\5e|f([ -]?)f\\6f|g([ -]?)g\\7g|h([ -]?)h\\8h
should match all of the following strings: "a a a", "b b b", "c c c", "d d d", "e e e", "f f f", "g g g" and "h h h".
However, it matches every one of them except "h h h".

Test to reproduce:

TEST(order, alexey1) {
    vector<pattern> patterns;
    patterns.push_back(pattern("a([ -]?)a\\1a|b([ .-]?)b\\2b|c([ -]?)c\\3c|d([ -]?)d\\4d|e([ -]?)e\\5e|f([ -]?)f\\6f|g([ -]?)g\\7g|h([ -]?)h\\8h", HS_FLAG_DOTALL | HS_FLAG_PREFILTER | HS_FLAG_MULTILINE | HS_FLAG_CASELESS | HS_FLAG_UCP | HS_FLAG_UTF8, 1));
    const char *data = "h h h";

    hs_database_t *db = buildDB(patterns, HS_MODE_NOSTREAM);
    ASSERT_NE(nullptr, db);

    hs_scratch_t *scratch = nullptr;
    hs_error_t err = hs_alloc_scratch(db, &scratch);
    ASSERT_EQ(HS_SUCCESS, err);

    CallBackContext c;
    err = hs_scan(db, data, strlen(data), 0, scratch, record_cb,
                  (void *)&c);
    ASSERT_EQ(HS_SUCCESS, err);

    EXPECT_EQ(1, countMatchesById(c.matches, 1));
    err = hs_free_scratch(scratch);
    ASSERT_EQ(HS_SUCCESS, err);
    hs_free_database(db);
}
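
For comparison, the expected semantics can be checked against a backtracking engine. The sketch below uses std::regex (ECMAScript grammar) from the C++ standard library, not Vectorscan, and the helper name matches_expected is hypothetical. It assumes the standard library treats \8 as a backreference once eight capture groups exist, so it only illustrates what a backreference-aware matcher would do with this pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of Vectorscan): searches `input` with the
// pattern from this issue using std::regex instead of hs_scan().
bool matches_expected(const std::string &input) {
    // Same pattern as in the test above; "\\8" in the C++ literal is the
    // regex escape \8, which refers to the eighth capture group.
    static const std::regex re(
        "a([ -]?)a\\1a|b([ .-]?)b\\2b|c([ -]?)c\\3c|d([ -]?)d\\4d|"
        "e([ -]?)e\\5e|f([ -]?)f\\6f|g([ -]?)g\\7g|h([ -]?)h\\8h",
        std::regex::ECMAScript);
    // regex_search mirrors an unanchored scan: a match anywhere counts.
    return std::regex_search(input, re);
}
```

With a conforming standard library, matches_expected("h h h") should return true, which is the result the test above expects from Vectorscan, while a string such as "h h-h" should not match because the separator captured by group 8 differs.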

There is a comment about 8 and 9 in https://github.com/VectorCamp/vectorscan/blob/master/src/parser/Parser.rl#L1503 , but I'm not sure why 8 and 9 are special cases. Are we supposed to pass them as octal numbers?

markos self-assigned this Dec 15, 2023

markos commented Dec 15, 2023

It might be that it expects octal. I will run some local tests on this, but it could just as well be a bug. I admit I'm not very familiar with this part of the code.

@seanrohead

@markos It looks like Perl supports octal escapes, but it is supposed to interpret \N as a backreference if enough capture groups have already been seen to make it a valid backreference.

See https://perldoc.perl.org/perlrebackslash#Disambiguation-rules-between-old-style-octal-escapes-and-backreferences
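
The disambiguation described on that page can be sketched roughly as follows. This is a paraphrase of my reading of the linked perlrebackslash section, not Perl's or Vectorscan's actual parser code. The names EscapeKind and classify_escape are hypothetical, and checking a leading 0 before the single-digit rule is my own assumption, made so that \0 stays an octal escape.

```cpp
#include <cassert>
#include <string>

enum class EscapeKind { Backreference, OctalEscape };

// Hypothetical classifier for "\<digits>" outside a character class.
// `digits` is the run of decimal digits after the backslash and
// `groups_seen` is the number of capture groups opened so far.
EscapeKind classify_escape(const std::string &digits, unsigned groups_seen) {
    assert(!digits.empty());
    // Assumption: a leading 0 always means an octal escape (e.g. \0, \012).
    if (digits[0] == '0') return EscapeKind::OctalEscape;
    // A single non-zero digit (\1 .. \9) is treated as a backreference.
    if (digits.size() == 1) return EscapeKind::Backreference;
    // Otherwise \N is a backreference only if N groups have been seen.
    unsigned long long n = 0;
    for (char ch : digits) {
        assert(ch >= '0' && ch <= '9');
        n = n * 10 + static_cast<unsigned>(ch - '0');
        if (n > groups_seen) return EscapeKind::OctalEscape;
    }
    return EscapeKind::Backreference;
}
```

Under these rules classify_escape("8", 8) yields a backreference, which is consistent with the expectation in this issue that \8 refers to the eighth group, whereas a multi-digit escape such as \10 is a backreference only once ten groups exist.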


markos commented Jan 11, 2024

@seanrohead @apismensky This is most likely related to PCRE, specifically one of the limitations/differences between PCRE and PCRE2:

https://stackoverflow.com/questions/70273084/regex-differences-between-pcre-and-pcre2/73767663#73767663

It will probably be fixed when #83 is fixed and we migrate to pcre2.
