
What is W Precisely? #27

Open
rtrad89 opened this issue Jul 7, 2020 · 3 comments


rtrad89 commented Jul 7, 2020

During topic learning, one needs to supply W: int, the size of the vocabulary.

I tried to work out the meaning of W by reading Algorithm 1 (the Gibbs sampling algorithm for BTM) in the paper "BTM: Topic Modeling over Short Texts", but W is not an input there. It does seem data-dependent to me, so am I correct in assuming that W means the number of unique terms in the cleaned and preprocessed corpus? If so, is there a reason W is not calculated automatically from the corpus docs_pt? I'm afraid I'm missing something, hence my question.

Thank you.

@zhongpeixiang


W denotes the vocab size.

```bash
W=`wc -l < $voca_pt` # vocabulary size
```
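For illustration, here is a minimal sketch of how W could be computed directly from the preprocessed corpus, assuming docs_pt holds one whitespace-tokenized document per line (the file name, the format assumption, and the `vocab_size` helper are taken from this thread or invented for the example; they are not part of the repo):

```python
# Count the number of unique terms (W) in a preprocessed corpus,
# assuming one whitespace-tokenized document per line, as in docs_pt.
def vocab_size(path):
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            vocab.update(line.split())
    return len(vocab)

if __name__ == "__main__":
    W = vocab_size("docs_pt")  # hypothetical path to the preprocessed corpus
    print(W)
```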


rtrad89 commented Aug 20, 2020


Can you clarify this part then, please?

> If so, is there a reason W is not calculated automatically from the corpus docs_pt?

@zhongpeixiang

$voca_pt is the vocabulary file, computed automatically from $doc_pt. See:

```bash
python indexDocs.py $doc_pt $dwid_pt $voca_pt
```
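As a rough sketch of what such an indexing step does (the actual indexDocs.py may differ in its details; the file paths and the one-entry-per-line vocabulary format are assumptions, while the invariant that W equals the line count of voca_pt comes from the shell snippet above):

```python
# Sketch of a doc-indexing step: build a word-to-id vocabulary from doc_pt,
# rewrite each document as a line of word ids (dwid_pt), and dump the
# vocabulary (voca_pt) with one word per line, so that
# W = number of lines in voca_pt.
def index_docs(doc_pt, dwid_pt, voca_pt):
    word2id = {}
    with open(doc_pt, encoding="utf-8") as fin, \
         open(dwid_pt, "w", encoding="utf-8") as fout:
        for line in fin:
            # setdefault assigns the next free id on first sight of a word
            ids = [word2id.setdefault(w, len(word2id)) for w in line.split()]
            fout.write(" ".join(map(str, ids)) + "\n")
    with open(voca_pt, "w", encoding="utf-8") as f:
        for w, i in sorted(word2id.items(), key=lambda kv: kv[1]):
            f.write(f"{i}\t{w}\n")

index_docs("doc_pt", "dwid_pt", "voca_pt")  # hypothetical file paths
```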
