Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex entities with optional words #157

Open
corus87 opened this issue May 15, 2022 · 6 comments
Open

Regex entities with optional words #157

corus87 opened this issue May 15, 2022 · 6 comments
Labels

Comments

@corus87
Copy link

corus87 commented May 15, 2022

Not sure if this is a bug, a missing feature or that I just don't know how it works...

Is there a possibility to have an optional word what can or cannot be there, for example:

  1. search for moon
  2. search for the moon

both sentences should match only moon.

I can match both sentence with:
engine.register_regex_entity("(for|about) (?P<search>.*)")

but with the second phrase, I will match the moon

I played around with (|the), (|the)?, (the)? and a lot of other regex patterns...I also looked into other mycroft-skills to find a hint, but no luck...

Isn't there some kind of wildcard for this case or a regex pattern I missed?

Thanks in advance.

@corus87 corus87 added the bug label May 15, 2022
@forslund
Copy link
Collaborator

forslund commented May 16, 2022

Not sure if I've understood it correctly but it may be the way regexes work. They match as much as possible. (atleast from what I've gathered messing around on https://pythex.org/

A workaround may be to do something like:
(for|about) ((the (?P<search>.+))|(?P<search2>.+))
But then you need to make your intent use .one_of('search', 'search2') so it's not terribly elegant. It feels like the intent should be expressible using regexes but I'm not enough of a guru to figure it out either :/

Edit: (literally 5 min later) (for|about) (the|)?(?P<search>.+$) should work as you expected (not sure what I did wrong yesterday) so something in adapt is causing this if it's not working for you. I'll do some more experimentation in adapt and see what I can see...

@corus87
Copy link
Author

corus87 commented May 17, 2022

Thanks @forslund for looking into my issue.
Unfortunately it does not work on my end with register_regex_entity("(for|about) (|the)?(?P<search>.+$)")

search for the moon still matches the the article, "search": "the moon", but it should only match moon.

It is possible to do an or statement in the regex with ( for| a| an) but I can't believe there is no way to match just nothing. If no article matches, it should just match the whole rest of the sentence.

I played a bit around on regex101 and with this pattern (for|about)( (an|a|the)|\b) (?P<search>.*) it does match with both sentences only moon, but unfortunately not with adapt :(

I do have some workarounds in my project, like making a new intent for this case but it would be nice to have a cleaner solution.

@forslund
Copy link
Collaborator

Did some digging tonight, the issue is (I think) the iterative approach adapt uses with regexes...
The sentence "tell me about the moon" will be tried in stages
"tell"
"tell me"
"tell me about"
"tell me about the" <- match found
"tell me about the moon" <- match found

and adapt chooses to report back "the" as the match...

However using negative lookahead:

(for|about) (the|)?(?P<search>(?!the).+$)

should not match against "tell me about the", but should work with "tell me about the moon", however for some reason this doesn't result in any match...I see that "moon" is matched but somehow not used as a result. will need to dig some more...

@forslund
Copy link
Collaborator

"(for|about) (the |)?(?P<search>(?!the)\S+$)" Seems to work in my test case...there is an issue where an extra space is matched: " moon" causing a later match to fail.

@corus87
Copy link
Author

corus87 commented May 18, 2022

Indeed this works on my end too.

It also works with several words "(for|about) (the |a |an |)?(?P<search>(?!the|a|an).*)". But it does look a bit ugly though... Is it because of the design of adapt we have to set the optional words twice? Or is it because of regex works?

Something like (an |a |the |.* ) would be much cleaner, but I got the feeling its not possible because of regex...

With this solution I can hardcode something in my code to have an easier matching.

Anyway, thank you very much for your help!

@forslund
Copy link
Collaborator

Partially it's caused by how adapts runs multiple passes over subsets of the utterance but partially something seems slightly wrong (or I don't quite understand it). I will look into it some more and see if I can understand the internals better but I'm not sure I'll be able to improve things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants