adding auto langdetect and cleaning up keywords extracted by yake #107
Conversation
Please write unit tests for keep_largest_overlapped_keywords.
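A minimal sketch of what those tests could look like, assuming the function takes a list of keyword strings and returns the filtered list (the import path is a placeholder, not the repo's actual module):

```python
import unittest

# The import path is a placeholder; adjust it to wherever
# keep_largest_overlapped_keywords actually lives in this repo.
from keyword_cleanup import keep_largest_overlapped_keywords


class TestKeepLargestOverlappedKeywords(unittest.TestCase):
    def test_shorter_overlapping_keyword_is_dropped(self):
        # "good" is contained in "good morning", so only the longer one survives.
        result = keep_largest_overlapped_keywords(["good", "good morning"])
        self.assertEqual(sorted(result), ["good morning"])

    def test_non_overlapping_keywords_are_all_kept(self):
        result = keep_largest_overlapped_keywords(["election", "climate change"])
        self.assertEqual(sorted(result), ["climate change", "election"])

    def test_input_order_does_not_matter(self):
        result = keep_largest_overlapped_keywords(["Joe Biden", "Biden"])
        self.assertEqual(sorted(result), ["Joe Biden"])


if __name__ == "__main__":
    unittest.main()
```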
@ahmednasserswe Per the Jira ticket, let's use CLD - https://pypi.org/project/gcld3/ , which is a newer/better library than
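For reference, a minimal usage sketch of gcld3; the fallback to a caller-specified default language is an assumption about how it would be wired in, not something specified in this PR:

```python
import gcld3

# gcld3 ships a small neural detector; the byte limits bound how much
# of the input text it inspects.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)


def detect_language(text: str, default: str = "en") -> str:
    """Return the detected language code for text, falling back to
    `default` when gcld3 does not consider the detection reliable."""
    result = detector.FindLanguage(text=text)
    return result.language if result.is_reliable else default


print(detect_language("Guten Morgen, wie geht es dir?"))  # likely "de"
```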
I think this would be easier to implement if tests were added that cover the substitutions. That will also save lots of time later as we add more cases to substitute.
At a higher level, do we know whether the "sub keyword" examples we are seeing come from the same text (extracting both "Biden" and "Joe Biden"), which this approach should handle well, or from multiple texts (one article referring to "Biden" and another to "President Biden"), which might need some other post-processing?
Description
Applying the following to improve YAKE keyword extraction (a rough sketch of steps 1 and 2 is shown below):
1- Replace ’ with ' and similar substitutions for other characters.
2- If two keywords overlap, keep the longest one (given "good" and "good morning", keep "good morning").
3- If the language is not specified or is "auto", perform language detection with CLD.
Reference: CV2-4909 (to provide additional context)
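A rough sketch of steps 1 and 2; the helper names and the exact substitution table are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical helpers illustrating the character substitutions (step 1)
# and the overlap filtering (step 2) described above.
CHAR_SUBSTITUTIONS = {
    "\u2019": "'",   # ’ -> '
    "\u2018": "'",   # ‘ -> '
    "\u201c": '"',   # “ -> "
    "\u201d": '"',   # ” -> "
}


def normalize_keyword(keyword: str) -> str:
    for fancy, plain in CHAR_SUBSTITUTIONS.items():
        keyword = keyword.replace(fancy, plain)
    return keyword


def keep_largest_overlapped_keywords(keywords: list[str]) -> list[str]:
    # Drop any keyword that is contained in a longer keyword,
    # e.g. "good" is dropped when "good morning" is also present.
    kept = []
    for kw in sorted(keywords, key=len, reverse=True):
        if not any(kw in longer for longer in kept):
            kept.append(kw)
    return kept


keywords = [normalize_keyword(k) for k in ["good", "good morning", "Biden’s plan"]]
print(keep_largest_overlapped_keywords(keywords))  # ["good morning", "Biden's plan"]
```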
How has this been tested?
Locally, plus the already existing tests.