Troubleshooting #1

Open
ragesoss opened this issue Sep 15, 2017 · 31 comments

Comments

@ragesoss
Member

@kjschiroo I'm working on this for the fall_2016 and spring_2017 terms, but I don't know what to do on wmflabs in terms of the user_database instructions. I have a CSV file of student user names for a term, and I assume I can use that to make a user database on labs?

Any pointers for how to proceed would be much appreciated.

@kjschiroo
Collaborator

Let me restate your problem to confirm I understand it. You are trying to create a user database, and aren't sure how to do that on wmflabs, correct? I think this page should offer some helpful advice. Let me know if that wasn't your problem or if you need further help.

@ragesoss
Member Author

@kjschiroo aha! that's helpful. I inferred that user_database was a database of users, rather than an arbitrary database owned by my account.

@ragesoss
Member Author

Okay... so I've logged in to a tool account, connected to the enwiki.labsdb database, created a new user database, saved the SQL file with that user database, and then ran `source page_project_map.sql` from the mysql command line.

It appears to be running, but it's been about 20 minutes so far and it hasn't returned. Is that expected, and am I going in the right direction?

@ragesoss
Member Author

Okay, progress! Documenting this mainly for my own sake... Doing it from within mysql wasn't the right approach because it just wrote the output to the terminal, but it did complete after quite a while. Trying it again as `mysql --defaults-file=$HOME/replica.my.cnf -h enwiki.labsdb < page_project_map.sql > page_project_map.csv`, although the output will be tab-separated rather than comma-separated... but I can fix that easily afterwards.
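
(A minimal sketch of that tab-to-comma fix, assuming the query output was saved as a tab-separated file; the filenames here are just illustrative:)

```python
# Convert the tab-separated mysql output into a proper CSV.
# Filenames are illustrative, not the actual paths used above.
import csv

with open("page_project_map.tsv", newline="") as tsv_in, \
     open("page_project_map.csv", "w", newline="") as csv_out:
    reader = csv.reader(tsv_in, delimiter="\t")
    writer = csv.writer(csv_out)
    for row in reader:
        writer.writerow(row)
```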

@ragesoss
Member Author

I ran it yesterday; it fired up a bunch of threads and eventually said 'No more items to process' for most of them, but it seemed to hang at that point after getting back down to one process that stopped using CPU. Even overnight it never exited or produced output. I killed the process and I'm trying it again.

@ragesoss
Member Author

@kjschiroo I've tried it twice now, and it hangs after all the Mapper threads finish with 'no more items to process'. It's been running overnight after reaching that state yesterday, but it still hasn't exited or produced any output. Any ideas for what's wrong?

@kjschiroo
Collaborator

What are the details of the machine you are running it on? My first guess would be that it ran out of memory and Linux silently killed one of the processes (I know this is one of its nasty habits). Then the main process just waits forever for a job that will never finish.

We're currently using mwxml.map for processing the dumps; IIRC it will use all available CPU cores by default. If there are a lot of cores but not much memory, there can be a problem. When I was running this I believe the machine had somewhere around 100 GB of RAM.

Also, my apologies for the late response. If it happens again just hit me with a @kjschiroo to grab my attention.
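
(One quick way to test the out-of-memory guess is to look for OOM-killer messages in the kernel log. A sketch, assuming dmesg is readable without root on the machine in question:)

```python
# Scan the kernel log for OOM-killer messages; assumes `dmesg` is readable.
import subprocess

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
oom_lines = [line for line in log.splitlines()
             if "Out of memory" in line or "Killed process" in line]
print("\n".join(oom_lines) or "No OOM-killer messages found")
```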

@ragesoss
Member Author

@kjschiroo cool. My machine has a meager 16GB of RAM (and 8 threads). Maybe I should do this on wmflabs instead of locally.

@kjschiroo
Collaborator

Yeah, 16GB across 8 threads is going to have at least one of them die. wmflabs might be an option; otherwise you could spin up a beefy machine on AWS or Google Cloud Platform for a day for a reasonable price. Alternatively, take a look at pull request #2. It should let you set the number of threads being used. Set it down to something like 2, keep an eye on your memory, and let it run for longer. I haven't been able to test it yet, though, since I don't have any of the files on hand.

It is honestly one of the things that I'm most upset about with Unix-based systems: that they think you can just kill a process without making it die loudly.
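
(For reference, a rough sketch of what capping the worker count might look like when calling mwxml.map. The `threads` keyword and the trivial `process_dump` below are assumptions for illustration, not the actual code in this repository or in pull request #2:)

```python
# Illustrative only: limit the number of dump-processing workers to reduce
# peak memory use. `threads=2` is an assumed keyword argument, and
# process_dump is a stand-in for the real per-dump mapper.
import glob
import mwxml

def process_dump(dump, path):
    for page in dump:
        yield page.id  # stand-in: emit one item per page seen

paths = glob.glob("enwiki-*-stub-meta-history*.xml*.gz")
for page_id in mwxml.map(process_dump, paths, threads=2):
    pass  # consume results as they arrive
```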

@ragesoss
Member Author

Sweet! Giving it a try with 2 threads.

@ragesoss changed the title from "How do I make my <user_database>?" to "Troubleshooting" on Sep 21, 2017
@ragesoss
Member Author

@kjschiroo that worked! How do I interpret the results? Is this bytes added by everyone to the topics (general) and bytes added by the input cohort? So I'd just combine those into one dataset to graph students vs. all of Wikipedia?
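
(A hypothetical sketch of that combination, with made-up file and column names since the actual results format may differ:)

```python
# Hypothetical: join the cohort results with the overall results and compute
# the students' share per month. File and column names are guesses.
import pandas as pd

cohort = pd.read_csv("cohort_results.csv")    # e.g. columns: month, bytes_added
overall = pd.read_csv("overall_results.csv")  # e.g. columns: month, bytes_added

merged = cohort.merge(overall, on="month", suffixes=("_cohort", "_all"))
merged["portion"] = merged["bytes_added_cohort"] / merged["bytes_added_all"]
print(merged[["month", "portion"]])
```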

@ragesoss
Member Author

Without any further adjustments beyond running the script with the fall 2016 users, it looks like the numbers are a lot lower than the ~6% during peak period that you found at https://meta.wikimedia.org/wiki/Research:Wiki_Ed_student_editor_contributions_to_sciences_on_Wikipedia

[attached plot: portion fall 2016]

It's more like 1.5% for the most active period.

I'll run it for spring 2016 to make sure I'm getting similar results for that one.

@kjschiroo
Collaborator

Those look like the figures that I would have expected for overall contribution rate. What topics did you narrow it down to?

@ragesoss
Member Author

@kjschiroo I used the same science_projects.csv from the sample inputs.

@kjschiroo
Collaborator

Hmm... that's interesting. Wiki Ed hasn't reduced its focus on the sciences that much since the Year of Science, has it? Although a 4-fold increase when you were really pushing towards that wouldn't be that weird. What does the plot of general contributions to the area look like? Let me know what the 2016 results are; if they are consistent then I'd guess that it is real, if not, we have some investigating to do.

@kjschiroo
Collaborator

Wait, that should have still had the Year of Science going on... that is weird.

@kjschiroo
Collaborator

I remember there was a push towards labeling all of the articles with a project, which is how they are identified. Did that happen for the fall?

@ragesoss
Member Author

No, it didn't happen for the fall. I wonder how many were labeled in that way for spring 2016. I'll ask the team.

@kjschiroo
Collaborator

I remember Ian and Adam making a concerted effort to get them labeled, although I don't know what portion they needed to label. I could see how that would bring down the figures significantly, though, since it would omit many new articles that would score really well by this metric.

@ragesoss
Member Author

@kjschiroo I must have something wrong with the filtering by project, because I get something very similar for spring 2016.

[attached plot: spring_2016]

@ragesoss
Member Author

@kjschiroo With some print debugging, I note that it's putting about a million pages (1024116) into the 'pages of interest' set. That would mean about 1 out of 5 articles are in one of these science WikiProjects... seems a bit high, but maybe that's right?

@kjschiroo
Collaborator

What is the total bytes added summing to for spring 2016? One potential issue I'm seeing here is that it only takes a couple of people getting aggressive with their project labels to really change things, and these changes end up being applied retroactively since there is no timestamp associated with them. What is the distribution of articles by project? Are there a couple that decided to go on a labeling spree?

Also, could you attach the results file?

I'm curious now. This could be an interesting thing about Wikipedia. It could be that when we were analyzing the new articles immediately after they were written, we ended up getting a biased result because of our labeling efforts. If most new content is being added to new articles, and those new articles take a while to get labeled, then there could have been a bunch of work happening that we couldn't count as relevant to our goal because it hadn't gotten a label yet. However, after a few months go by and those articles slowly get project labels applied to them, they end up counting. That's just a theory.

Let me go take a look at my labs account. I might be able to find the old project-page list that I used.
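
(A small sketch of the distribution check being asked for here, assuming page_project_map.csv holds one page-project pair per row with the project name in the second column; adjust for the real column layout:)

```python
# Count pages of interest per project to spot any single project that went on
# a labeling spree. Assumes the project name is in the second CSV column.
import csv
from collections import Counter

project_counts = Counter()
with open("page_project_map.csv", newline="") as f:
    for row in csv.reader(f):
        if len(row) >= 2:
            project_counts[row[1]] += 1

for project, count in project_counts.most_common(20):
    print(f"{count:>8}  {project}")
```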

@ragesoss
Member Author

@kjschiroo My results files...

fall and spring 2016 results.zip

@kjschiroo
Collaborator

Something is going on here and I'm not sure what. I'm doing some basic sanity checks right now. Validating against your dashboard: in Spring 2016 there were 3.73 million words added total. According to this data set, in Spring 2016 in science alone 207,886,463 bytes were added; IIRC there are about 5 bytes per word, so that's roughly 41 million words.

You don't happen to have multiple copies of dump files sitting around do you?

I've got to run right now, but I'll upload the files I've been referencing from spring 2016 later.
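
(Spelling out that sanity check, using the rough 5-bytes-per-word rule mentioned above:)

```python
# Compare the dashboard's Spring 2016 word count with the byte total from
# this data set, using ~5 bytes per word as a rough conversion.
bytes_added = 207_886_463      # Spring 2016 science bytes from the results
dashboard_words = 3_730_000    # Spring 2016 words added per the dashboard

approx_words = bytes_added / 5
print(f"~{approx_words / 1e6:.1f} million words from the dump data vs "
      f"{dashboard_words / 1e6:.2f} million on the dashboard "
      f"(~{approx_words / dashboard_words:.0f}x difference)")
```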

@kjschiroo
Collaborator

Here are my files.
wikied.zip

I've also included the pages that were labeled as science at the time. It is around 670,000. So 1,000,000 is higher than I'd expect, but not totally unbelievable.

@ragesoss
Member Author

@kjschiroo I have all the gz files, including both stub-meta-history and stub-meta-current. Maybe that is a problem? Will try without the -current ones.

@ragesoss
Member Author

Extra dumps don't appear to be the problem. I got the same output when I tried after deleting the -current dumps.

I'm now trying to use a modified version of this program to get the overall portion of content contributed by students... which I think I can do just by handling the case of no page maps by setting pages to None, in which case it should process all mainspace pages... if I am understanding it correctly.
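
(For anyone hitting the same worry: one way to make sure only the history stubs get picked up when collecting dump paths; the glob pattern is illustrative:)

```python
# Guard against mixing stub-meta-current files in with the stub-meta-history
# ones when collecting dump paths. The glob pattern is illustrative.
import glob

all_dumps = glob.glob("enwiki-*-stub-meta-*.xml*.gz")
history_only = [p for p in all_dumps if "stub-meta-history" in p]
print(f"Using {len(history_only)} of {len(all_dumps)} dump files")
```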

@kjschiroo
Collaborator

I'd be concerned that there is a deeper issue going on here. The total counts should reflect what we observe on the dashboard.

@kjschiroo
Collaborator

@ragesoss I'm looking into this and am having trouble with the mysql connection timing out. Would you be able to save me a bit of trouble and post your page_project_map.csv?

@ragesoss
Member Author

ragesoss commented Dec 3, 2017

@kjschiroo I can get that to you on Monday. Don't have access to the file this weekend.

@ragesoss
Member Author

ragesoss commented Dec 4, 2017

@kjschiroo I shared a dropbox folder that now has the file.
