Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding auto langdetect and cleaning up keywords extracted by yake #107

Merged

Conversation

ahmednasserswe
Copy link
Contributor

Description

applying the following to improve yake keywords extraction

1- Replace ’ with ' and similar for other characters

2- Keep the longest keyword of if there is an overlap between two keywords. (“good” and “good morning” keep “good morning”)

3- If language is not specified or is “auto” perform language detection with CLD

Reference: CV2-4909 (to provide additional context)

How has this been tested?

locally + already existing tests

Copy link
Collaborator

@DGaffney DGaffney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write unit tests for keep_largest_overlapped_keywords.

@computermacgyver
Copy link
Collaborator

@ahmednasserswe Per the Jira ticket, let's use CLD - https://pypi.org/project/gcld3/ , which is a newer/better library than langdetect

lib/model/yake_keywords.py Outdated Show resolved Hide resolved
lib/model/yake_keywords.py Show resolved Hide resolved
lib/model/yake_keywords.py Outdated Show resolved Hide resolved
Copy link
Contributor

@skyemeedan skyemeedan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be easier to implement if there are tests added that cover the substitutions. And this will save lots of time later as we add more cases to substitute

At a higher level, do we know if the "sub keyword" examples we are seeing are coming from the same text (extracting both "Biden" and "Joe Biden") which hopefully will be well handled by this approach vs multiple texts (one article referring to "Biden" and another to "President Biden") which might need some other post-processing

@skyemeedan skyemeedan self-requested a review August 22, 2024 16:50
lib/model/yake_keywords.py Outdated Show resolved Hide resolved
lib/model/yake_keywords.py Show resolved Hide resolved
lib/model/yake_keywords.py Show resolved Hide resolved
@ahmednasserswe ahmednasserswe merged commit 1517edf into master Aug 26, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants