Frequencies tk #2

Open
wants to merge 3 commits into
base: main


TuomasKetola

The first commits make a small change I needed in order to create the frequency files. Because of the previous ordering, line 198 was trying to call .save on a document that was a dictionary.

@TuomasKetola
Author

There are a few other things in word_frequency.py that I am not quite sure about. Most likely this is just me not understanding count-min sketches well enough, though:

I am not 100% sure how token_frequency should be used for estimating the importance or specificity of a term. Document frequency and inverse document frequency make sense to me for that purpose. However, I am not sure what the best way to define N in the IDF formula would be. Maybe another counter on the WordFrequency class? Something like n_items, but counting the number of documents rather than the number of terms.

Let me know people's thoughts on whether this makes sense.
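
To make the counter idea above concrete, here is a minimal sketch. It is not the code in word_frequency.py: the class name `DocFrequencySketch`, the `n_documents` attribute, the `add_document` method, and the table sizes are all hypothetical, and the smoothed IDF variant `log((1 + N) / (1 + df)) + 1` is just one common choice. The point is only that N can be an exact counter kept next to the sketch, while the per-term document frequency stays an approximate count-min estimate.

```python
import hashlib
import math


class DocFrequencySketch:
    """Hypothetical stand-in for WordFrequency: a count-min sketch of
    document frequencies plus an exact document counter."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        self.n_documents = 0  # assumed counter: the N in the IDF formula

    def _buckets(self, term):
        # One salted hash per row, so the rows behave like independent hash functions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                term.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(2, "little"),
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add_document(self, tokens):
        self.n_documents += 1
        # Use set() so each term is counted at most once per document.
        for term in set(tokens):
            for row, col in self._buckets(term):
                self.table[row][col] += 1

    def document_frequency(self, term):
        # Count-min estimate: never below the true count, but hash
        # collisions can make it an overestimate.
        return min(self.table[row][col] for row, col in self._buckets(term))

    def idf(self, term):
        # Cap the estimate at N so an overestimated df cannot exceed the
        # number of documents seen.
        df = min(self.document_frequency(term), self.n_documents)
        return math.log((1 + self.n_documents) / (1 + df)) + 1.0


sketch = DocFrequencySketch()
sketch.add_document(["the", "count", "min", "sketch"])
sketch.add_document(["the", "inverse", "document", "frequency"])
sketch.add_document(["the", "the", "term"])
print(sketch.n_documents, sketch.idf("the"), sketch.idf("sketch"))
```

With this layout, a term that appears in every document gets the lowest possible IDF, and rarer terms score higher. Because the sketch can only overestimate df, the IDF values it produces are, if anything, slightly too low for rare terms.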

@TuomasKetola
Author

The "some idf changes" commit is just me experimenting with how the IDF could be calculated. Sadly, I can't really run a proper test, as my data is a bit limited.

Status: 👀 In review
1 participant