Keywords match if they are prefixes of terminals using unicode flag #1732

Open
c-classen opened this issue Oct 29, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@c-classen

Langium version: 3.2.0

Steps To Reproduce

Use the following grammar:

grammar HelloWorld

entry Model:
    'keyword' value=ID;

hidden terminal WS: /\s+/;
terminal ID: /[A-Za-z]+/;

and apply it to the following text:

keyword keywordSomeSuffix

This should show no errors. Now add the Unicode flag ("u") after the closing slash of the regular expression in the terminal ID line. An error now appears, complaining that an ID is expected after the keyword at the start of the text, but another keyword was found.

Link to code example:
https://langium.org/playground?grammar=OYJwhgthYgBAEgUwDbIPYHU0mQEwFD6IB2ALiAJ6wCyauKAXPrC7AOQDWiFA7trm1gA3MMgCuiALwBJACIBuQgAsAlrnrFYpRCAgrio2BgDKDWAHoAOgGcA1OcXbd%2Bw3LPmA2gEEAtAC0wHwAvAF17MXkgA&content=NYUwng7g9gTgJgAlJWcDKUC2I0FcBm%2BAlgB5A

The current behavior

An error is shown, complaining that an ID is expected after the keyword, but another keyword was found.

The expected behavior

The keyword should not be matched here: it is not expected by the current rule, and it does not stand alone but is only a prefix of a longer ID.

c-classen added the bug label on Oct 29, 2024
@msujew
Member

msujew commented Oct 29, 2024

Hm, that one is pretty annoying. For our tokenizer, we need to explicitly supply the LONGER_ALT property, see:

LONGER_ALT: this.findLongerAlt(keyword, terminalTokens)

That property can be computed for RegExp patterns (and we do so). However, adding the u flag transforms the RegExp into a function matcher pattern, which is no longer subject to the LONGER_ALT evaluation.

Fixing this in your own language is just a small matter of overriding the TokenBuilder and adding the token to the LONGER_ALT list manually. Fixing this on a generic level will be a bit more difficult.
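
To illustrate why the longer alternative matters, here is a minimal, dependency-free sketch of a toy lexer. It is not Chevrotain's implementation: ToyToken, longerAlt, and tokenize are hypothetical names made up for this example, and the only assumption is that tokens are tried in order and the first match wins unless a listed alternative matches a longer span.

```typescript
// Toy lexer for illustration only; this is NOT how Chevrotain is implemented.
interface ToyToken {
  name: string;
  pattern: RegExp;        // must be sticky ("y") so it matches exactly at an offset
  longerAlt?: ToyToken[]; // hypothetical stand-in for the LONGER_ALT property
}

interface LexedToken {
  type: string;
  image: string;
}

// Try to match a token at the given offset; returns the matched text, if any.
function matchAt(token: ToyToken, text: string, offset: number): string | undefined {
  token.pattern.lastIndex = offset;
  const m = token.pattern.exec(text);
  return m ? m[0] : undefined;
}

function tokenize(text: string, tokens: ToyToken[]): LexedToken[] {
  const result: LexedToken[] = [];
  let offset = 0;
  while (offset < text.length) {
    let matched = false;
    for (const token of tokens) {
      let image = matchAt(token, text, offset);
      if (image === undefined) continue;
      let type = token.name;
      // Prefer an alternative that matches a longer span at the same offset.
      for (const alt of token.longerAlt ?? []) {
        const altImage = matchAt(alt, text, offset);
        if (altImage !== undefined && altImage.length > image.length) {
          image = altImage;
          type = alt.name;
          break;
        }
      }
      if (type !== "WS") result.push({ type, image }); // WS is hidden
      offset += image.length;
      matched = true;
      break;
    }
    if (!matched) throw new Error(`Unexpected character at offset ${offset}`);
  }
  return result;
}

const WS: ToyToken = { name: "WS", pattern: /\s+/y };
const ID: ToyToken = { name: "ID", pattern: /[A-Za-z]+/y };
const keywordWithoutAlt: ToyToken = { name: "keyword", pattern: /keyword/y };
const keywordWithAlt: ToyToken = { name: "keyword", pattern: /keyword/y, longerAlt: [ID] };

const input = "keyword keywordSomeSuffix";
// Without a longer alternative, the second word is split into keyword + ID.
const broken = tokenize(input, [WS, keywordWithoutAlt, ID]).map(t => t.type);
// With ID registered as a longer alternative, the second word is a single ID.
const fixed = tokenize(input, [WS, keywordWithAlt, ID]).map(t => t.type);
```

In this sketch, the lexer without the alternative reports a second keyword where the grammar expects an ID, which mirrors the error described in the report. Registering ID as the longer alternative is what the manual workaround restores.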

@c-classen
Author

Thank you for the help!

For anyone else encountering this issue in the meantime, I solved it by following the idea and ended up with a new TokenBuilder class like this:

import { TokenType } from "chevrotain";
import { DefaultTokenBuilder } from "langium";

interface Keyword {
  value: string;
}

export class TokenBuilder extends DefaultTokenBuilder {

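  override findLongerAlt(keyword: Keyword, terminalTokens: TokenType[]): TokenType[] {
    // Start from the alternatives the default implementation can compute.
    const result = super.findLongerAlt(keyword as any, terminalTokens);
    // Anchored test: only keywords made up entirely of ASCII letters qualify.
    if (terminalTokens && /^[A-Za-z]+$/.test(keyword.value)) {
      const idToken = terminalTokens.find(it => it.name === "ID");
      if (!idToken) {
        throw new Error("ID token not found");
      }
      // Avoid listing ID twice if the default implementation already found it.
      if (!result.includes(idToken)) {
        result.push(idToken);
      }
    }
    return result;
  }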
}

The Keyword interface is a workaround, since I could not figure out how to access the real Keyword type that the library uses in findLongerAlt. In my case, the terminal that uses the Unicode flag is named ID. I add it to the return value of super.findLongerAlt whenever the keyword consists only of letters. Strictly speaking, the check should accept all Unicode letters rather than just [A-Za-z], but since my keywords contain only ASCII letters, this should be fine.
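
If keywords may contain non-ASCII letters, the letter check could be generalized with a Unicode property escape. This is only a sketch: isLetterOnlyKeyword is a hypothetical helper name, and it assumes the ID terminal accepts any Unicode letter, which may not hold for every grammar.

```typescript
// Hypothetical helper: true when the keyword consists solely of Unicode letters.
// \p{L} (any letter) is only available in regular expressions with the "u" flag.
function isLetterOnlyKeyword(value: string): boolean {
  return /^\p{L}+$/u.test(value);
}

// Example inputs: an ASCII keyword, a keyword with an umlaut, and two non-letter cases.
const checks = {
  ascii: isLetterOnlyKeyword("keyword"),
  umlaut: isLetterOnlyKeyword("schlüssel"),
  hyphenated: isLetterOnlyKeyword("key-word"),
  symbol: isLetterOnlyKeyword("=>"),
};
```

Such a helper would take the place of the [A-Za-z] test in the findLongerAlt override; the character class should match whatever the ID terminal itself accepts.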

@aabounegm
Member

@c-classen The real Keyword type can be found on the GrammarAST namespace (importable from langium):

import type { TokenType } from "chevrotain";
import type { GrammarAST } from "langium";
import { DefaultTokenBuilder } from "langium";

export class TokenBuilder extends DefaultTokenBuilder {
    protected override findLongerAlt(keyword: GrammarAST.Keyword, terminalTokens: TokenType[]): TokenType[] {
        // ...
    }
}
