nan in datamatrix #4

Closed
matarhaller opened this issue Nov 23, 2015 · 17 comments
Comments

@matarhaller
Collaborator

The data matrix has NaNs in it, which breaks PCA. I'm not completely sure why they are there, but do you think it's reasonable to just replace the NaNs with 0?

@juanshishido
Owner

I think I know what might be going on. Concatenating NaNs with non-NaNs results in NaNs (somewhat related (at about 3:40)). I'm working on a fix now.
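
A minimal sketch of the behavior described above, assuming the essays are stored as pandas string columns. The column names essay0 and essay1 and the sample text are hypothetical, not taken from the project. It shows that concatenating a NaN with a string yields NaN, and that filling missing values with empty strings first avoids this.

```python
import numpy as np
import pandas as pd

# Hypothetical essay columns; the second respondent skipped essay1.
essays = pd.DataFrame({
    "essay0": ["I like hiking", "Hello"],
    "essay1": ["and cooking", np.nan],
})

# Element-wise string concatenation: a NaN operand makes the whole result NaN.
total = essays["essay0"] + " " + essays["essay1"]
print(total.isna().tolist())

# Filling missing essays with empty strings first keeps the text that exists.
filled = essays.fillna("")
total_filled = (filled["essay0"] + " " + filled["essay1"]).str.strip()
print(total_filled.tolist())
```

Whether filling with empty strings or dropping the incomplete rows is preferable depends on how partial responses should be treated.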

@juanshishido
Owner

Maybe.

@juanshishido
Owner

@matarhaller What was the code you had for checking NaNs? I have the data_matrix object and want to check it.

@juanshishido
Owner

Got it 😅

>>> np.isnan(data_matrix.todense()).sum()
0

👍

@matarhaller
Collaborator Author

just to check if anything is nan: np.isnan(datamatrix).any()

or you can do np.where(np.isnan(datamatrix)) to figure out exactly where the nans are
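
As a rough illustration of the two checks above, here is a small self-contained example on a dense NumPy array. The array contents are made up for the example. For a sparse matrix, the thread converts with .todense() first, which is what the earlier snippet does.

```python
import numpy as np

# Toy stand-in for the real data matrix, with one NaN planted at row 1, column 1.
datamatrix = np.array([
    [1.0, 0.0, 2.0],
    [0.0, np.nan, 3.0],
])

# Is anything NaN at all?
has_nan = np.isnan(datamatrix).any()

# How many NaNs are there?
nan_count = np.isnan(datamatrix).sum()

# Where exactly are they? np.where returns one index array per axis.
rows, cols = np.where(np.isnan(datamatrix))
nan_positions = list(zip(rows.tolist(), cols.tolist()))

print(has_nan, nan_count, nan_positions)
```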

@matarhaller
Collaborator Author

@juanshishido You're too speedy!

@juanshishido
Owner

Thanks!

The shape of the matrix is now: (57822, 3429). I don't remember the original dimensions, but it's good now.

I created a new notebook for this in a new branch. I think it might be better to just modify the original. What do you all think?

@juanshishido
Owner

Some people did not fill out any essays, so their TotalEssays values were blank. I am now returning this instead: return df[df.TotalEssays.str.len() > 0]. I also found that some of those "empty" TotalEssays values contained only whitespace and so had a length greater than 0. So I also added this: .apply(lambda x: re.sub(r'\s+', ' ', x).strip()).
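
The two steps described above could be combined roughly as follows. This is a sketch only: the function name drop_blank_essays and the sample data are hypothetical, and it assumes every TotalEssays value is already a string (no NaN). The actual fix is in the commit referenced later in this thread.

```python
import re

import pandas as pd


def drop_blank_essays(df):
    """Collapse whitespace in TotalEssays, then drop rows left empty."""
    cleaned = df.copy()
    # Collapse runs of whitespace to one space and trim the ends, so that
    # whitespace-only entries become truly empty strings.
    cleaned["TotalEssays"] = cleaned["TotalEssays"].apply(
        lambda x: re.sub(r"\s+", " ", x).strip()
    )
    # Keep only rows that still contain some text.
    return cleaned[cleaned["TotalEssays"].str.len() > 0]


# Made-up sample: one real essay, one whitespace-only entry, one empty entry.
sample = pd.DataFrame({"TotalEssays": ["hello   world", "  \n ", ""]})
result = drop_blank_essays(sample)
print(result)
```

The normalization has to run before the length filter. Otherwise the whitespace-only rows pass the len() > 0 check.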

@juanshishido
Owner

A question that stems (NLP joke) from this is, do we want to only use individuals who filled something out for all essays or are partial responses okay (of course, no responses aren't useful)?

@matarhaller
Collaborator Author

Good point. Since we have so much data, I'm okay with dropping people that didn't answer all the essays.
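
One possible way to drop people who didn't answer all the essays is sketched below. The helper name keep_complete_responses, the column names, and the sample data are all assumptions for illustration. The thread does not show how this was implemented in the project.

```python
import numpy as np
import pandas as pd


def keep_complete_responses(df, essay_cols):
    """Keep only rows where every essay column has non-blank text."""
    # Strip surrounding whitespace; missing values stay missing.
    stripped = df[essay_cols].apply(lambda col: col.str.strip())
    # An essay counts as answered if it is present and not empty.
    answered = stripped.notna() & (stripped != "")
    return df[answered.all(axis=1)]


# Made-up sample: row 0 is complete, row 1 is missing essay1,
# row 2 has a whitespace-only essay0.
profiles = pd.DataFrame({
    "essay0": ["a", "b", "  "],
    "essay1": ["c", np.nan, "d"],
})
complete = keep_complete_responses(profiles, ["essay0", "essay1"])
print(complete)
```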

@jnaras
Collaborator

jnaras commented Nov 23, 2015

Oh, okay! Sounds good. Happy to drop people who didn't answer and happy to convert to .py files.

@juanshishido
Owner

Great! We'll have to make sure to add that in.

@juanshishido
Owner

4b9355f fixes this.

@juanshishido
Owner

Decided to move the conversation about NaNs we were having in #7 here.

@jnaras Everything ran and confirmed that np.isnan(data_matrix.todense()).sum() == 0.

With 5fd38b0, I rearranged the imports slightly (and removed the ones we were not using), removed the print statements in filter_vocab and create_data_matrix, added whitespace to the list comprehensions in generate_freqdists and filter_vocab, and changed the formatting for the "Calculating PMI Features" cell.

Thank you!

@juanshishido
Owner

Also, the pickled data is good 👍

@matarhaller
Collaborator Author

So is master fully updated?

@juanshishido
Owner

@matarhaller Yeah. It says jaya is 3 commits ahead of master, but that's because of how I updated master: I fetched the jaya branch to get Calculate PMI features.ipynb, updated it, and pushed to master.
