IndexError #9

Open
franck-nkolongo opened this issue Sep 20, 2024 · 2 comments

Comments

@franck-nkolongo

franck-nkolongo commented Sep 20, 2024

Hello, I have a problem. When I run `reviews = list(review_data[2])` followed by `reviews = reviews[:5000]  # only consider the first 5k reviews`, I get:

IndexError: boolean index did not match indexed array along dimension 0; dimension is 5000 but corresponding boolean dimension is 1000.

It works with `reviews = reviews[:1000]`.

@deepbot86

deepbot86 commented Sep 22, 2024

Same here:
` File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/topicgpt/TopicRepresentation.py:310, in extract_topics_no_new_vocab_computation(corpus, vocab, document_embeddings, clusterer, vocab_embeddings, n_topwords, topword_extraction_methods, consider_outliers)
306 dim_red_centroids = umap_mapper.transform(np.array(list(centroid_dict.values()))) # map the centroids to low dimensional space
308 dim_red_centroid_dict = {label: centroid for label, centroid in zip(centroid_dict.keys(), dim_red_centroids)}
--> 310 word_topic_mat = extractor.compute_word_topic_mat(corpus, vocab, labels, consider_outliers = consider_outliers) # compute the word-topic matrix of the corpus
311 if "tfidf" in topword_extraction_methods:
312 tfidf_topwords, tfidf_dict = extractor.extract_topwords_tfidf(word_topic_mat = word_topic_mat, vocab = vocab, labels = labels, top_n_words = n_topwords) # extract the top-words according to tfidf

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/topicgpt/ExtractTopWords.py:308, in ExtractTopWords.compute_word_topic_mat(self, corpus, vocab, labels, consider_outliers)
305 word_topic_mat = np.zeros((len(vocab), len((np.unique(labels)))))
307 for i, label in tqdm(enumerate(np.unique(labels)), desc="Computing word-topic matrix", total=len(np.unique(labels))):
--> 308 topic_docs = corpus_arr[labels == label]
309 topic_doc_string = " ".join(topic_docs)
310 topic_doc_words = word_tokenize(topic_doc_string)

IndexError: boolean index did not match indexed array along dimension 0; dimension is 6969 but corresponding boolean dimension is 4999
`
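The failing line in the traceback, `corpus_arr[labels == label]`, indexes the corpus with a boolean mask built from `labels`. NumPy requires that mask to have the same length as the array it indexes. Below is a minimal sketch that reproduces the same kind of error outside the library, assuming `labels` has one entry per previously embedded document (4999) while the corpus now holds 6969 documents. The array contents are made up for illustration, and the exact wording of the message may differ between NumPy versions.

```python
import numpy as np

# Hypothetical stand-ins for the arrays named in the traceback.
corpus_arr = np.array([f"doc {i}" for i in range(6969)])  # current corpus
labels = np.zeros(4999, dtype=int)  # cluster labels from an older, smaller run

try:
    # Boolean mask has 4999 entries, but corpus_arr has 6969.
    topic_docs = corpus_arr[labels == 0]
    message = ""
except IndexError as exc:
    message = str(exc)

print(message)
```

Comparing `len(corpus_arr)` with `len(labels)` before indexing is a quick way to confirm that the two come from different runs.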

@franck-nkolongo
Author

> 4999

I've found the solution: first, delete the SaveEmeddings directory, which contains the embeddings.pkl file. In my case, this file was originally created from 1000 documents; in your case, you must have first run it on a dataset of 4999 documents.
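The same cleanup can be scripted so that a stale cache is removed whenever the corpus size changes. This is only a sketch under assumptions: the helper `clear_stale_embeddings` is hypothetical (it is not part of the library), and it assumes the pickle file holds a sequence with one entry per document, which may not match the actual file layout. Check what your embeddings.pkl contains before relying on it.

```python
import pickle
import shutil
from pathlib import Path


def clear_stale_embeddings(cache_dir, n_documents, filename="embeddings.pkl"):
    """Delete cache_dir if its cached embeddings do not cover n_documents.

    Hypothetical helper. Assumes the pickle stores a sequence with one
    entry per document. Returns True if the directory was deleted.
    """
    cache_file = Path(cache_dir) / filename
    if not cache_file.exists():
        return False  # nothing cached yet, nothing to clear

    # Only unpickle files you created yourself; pickle can run arbitrary code.
    with cache_file.open("rb") as fh:
        cached = pickle.load(fh)

    if len(cached) == n_documents:
        return False  # cache matches the current corpus, keep it

    shutil.rmtree(cache_dir)  # stale cache: remove the whole directory
    return True
```

For example, calling `clear_stale_embeddings("SaveEmeddings", len(reviews))` before fitting would remove the directory only when the cached count differs from the number of reviews; the directory name here is taken from the comment above and should be adjusted to whatever path your run actually uses.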
