Troubleshooting #1
Let me restate your problem to confirm I understand it. You are trying to create a user database, and aren't sure how to do that on …
@kjschiroo aha! that's helpful. I inferred that …
Okay... so I've logged in to a tool account, connected to the enwiki.labsdb database, created a new user database, saved the sql file with that user database, and then did … It appears to be running, but it's been about 20 minutes so far and it hasn't returned. Is that expected, and am I going in the right direction?
Okay, progress! Documenting this mainly for my own sake... Doing it from within mysql wasn't the right approach because it just wrote the output to the terminal, but it did complete after quite a while. Trying it again as …
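For reference, one way to run a long query non-interactively and capture the rows in a file rather than the terminal is a small streaming script. This is a minimal sketch, assuming the tool account's ~/replica.my.cnf credentials, a single-statement .sql file, and placeholder file names; it is not the command that was actually run in this thread.

```python
# Sketch: run a long query against enwiki.labsdb and stream the rows into
# a TSV file instead of printing them to the terminal. The credentials
# file, .sql file name, and output path are assumptions.
import csv
import os
import pymysql

conn = pymysql.connect(
    host="enwiki.labsdb",
    db="enwiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    cursorclass=pymysql.cursors.SSCursor,   # stream rows instead of buffering them all
    charset="utf8",
)
try:
    with open("cohort_query.sql") as sql_file, \
         open("cohort_results.tsv", "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        cur = conn.cursor()
        cur.execute(sql_file.read())        # assumes the file holds a single SELECT
        for row in cur:
            writer.writerow(row)
finally:
    conn.close()
```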
I ran it yesterday; it fired up a bunch of threads and eventually said …
@kjschiroo I've tried it twice now, and it hangs after all the Mapper threads finish with 'no more items to process'. It's been running overnight after reaching that state yesterday, but it still hasn't exited or produced any output. Any ideas for what's wrong?
What are the details of the machine you are running it on? My first guess would be that it ran out of memory and the kernel then silently killed one of the processes (I know this is one of Linux's nasty habits). Then it just waits forever because there is still a job that it is waiting for. We're currently using … Also, my apologies for the late response. If it happens again just hit me with a @kjschiroo to grab my attention.
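As an illustration of the failure mode described above: a child process killed by a signal reports a negative exit code (-9 for SIGKILL, which is what the OOM killer sends), so checking exit codes makes the otherwise silent death visible. This is a generic sketch with a stand-in worker function, not the project's actual mapper.

```python
# Sketch: join the workers and check exit codes afterwards. A worker the
# kernel killed shows up with a negative exitcode equal to the signal number.
import multiprocessing as mp

def mapper(chunk):
    # placeholder for the real per-dump-file work
    return sum(len(item) for item in chunk)

if __name__ == "__main__":
    chunks = [["fake", "work", "items"]] * 8
    workers = [mp.Process(target=mapper, args=(chunk,)) for chunk in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
        if w.exitcode < 0:
            print(f"worker {w.pid} was killed by signal {-w.exitcode} (OOM killer?)")
        elif w.exitcode != 0:
            print(f"worker {w.pid} exited abnormally with code {w.exitcode}")
```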
@kjschiroo cool. My machine has a meager 16GB of RAM (and 8 threads). Maybe I should do this on wmflabs instead of locally.
Yeah, 16GB on 8 threads is going to have at least one of them die. wmflabs might be an option; otherwise you could spin up a beefy machine on AWS or Google Cloud Platform for a day for a reasonable price. Alternatively, take a look at pull request #2. It should let you set the number of threads being used. Set it down to something like 2, keep an eye on your memory, and let it run for longer. I haven't been able to test it yet though, since I don't have any of the files on hand. It is honestly one of the things that most bothers me about Unix-based systems: they think you can just kill a process without making it die loudly.
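To make the suggestion concrete, this is the general shape of a worker-count option: a pool capped at N processes so memory stays bounded. The --workers flag and do_mapping() function here are hypothetical; PR #2's actual interface may differ.

```python
# Sketch: cap the number of mapper processes with a --workers flag.
# The flag name and do_mapping() are stand-ins, not the project's interface.
import argparse
from multiprocessing import Pool

def do_mapping(dump_path):
    # placeholder for processing one stub-meta-history dump file
    return dump_path, 0

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("dumps", nargs="+")
    parser.add_argument("--workers", type=int, default=2,
                        help="number of mapper processes (keep low on a 16GB machine)")
    args = parser.parse_args()

    with Pool(processes=args.workers) as pool:
        for path, byte_count in pool.imap_unordered(do_mapping, args.dumps):
            print(path, byte_count)
```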
Sweet! Giving it a try with 2 threads.
@kjschiroo that worked! How do I interpret the results? Is this bytes added by everyone to the topics (general) and bytes added by the input cohort? So I'd just combine those into one dataset to graph students vs. all of Wikipedia?
Without any further adjustments beyond running the script with the fall 2016 users, it looks like the numbers are a lot lower than the ~6% during the peak period that you found at https://meta.wikimedia.org/wiki/Research:Wiki_Ed_student_editor_contributions_to_sciences_on_Wikipedia. It's more like 1.5% for the most active period. I'll run it for spring 2016 to make sure I'm getting similar results for that one.
Those look like the figures that I would have expected for overall contribution rate. What topics did you narrow it down to?
@kjschiroo I used the same science_projects.csv from the sample inputs.
Hmm... that's interesting. Wiki Ed hasn't reduced its focus on the sciences that much since the Year of Science, has it? Although a 4-fold increase when you were really pushing towards that wouldn't be that weird. What does the plot of general contributions to the area look like? Let me know what the 2016 results are; if they are consistent then I'd guess that it is real, and if not we have some investigating to do.
Wait, that should have still had the Year of Science going on... that is weird.
I remember there was a push towards labeling all of the articles with a project, which is how they are identified. Did that happen for the fall?
No, it didn't happen for the fall. I wonder how many were labeled in that way for spring 2016. I'll ask the team.
I remember Ian and Adam making a concerted effort to get them labeled, although I don't know what portion they needed to label. I could see how that would bring down the figures significantly, though, since it would omit many new articles that would score really well by this metric.
@kjschiroo I must have something wrong with the filtering by project, because I get something very similar for spring 2016.
@kjschiroo With some print debugging, I note that it's putting about a million pages (1,024,116) into the 'pages of interest' set. That would mean about 1 out of 5 articles is in one of these science WikiProjects... seems a bit high, but maybe that's right?
What does the total bytes added sum to for spring 2016? One potential issue I'm seeing here is that it only takes a couple of people getting aggressive with their project labels to really change things, and these changes end up being applied backward since there is no timestamp associated with them. What is the distribution of articles by project? Are there a couple that decided to go on a labeling spree? Also, could you attach the results file? I'm curious now. This could be an interesting thing about Wikipedia. It could be that when we were analyzing the new articles immediately after they were written, we ended up getting a biased result because of our labeling efforts. If most new content is being added to new articles and those new articles take a while to get labeled, then there could have been a bunch of work happening that we couldn't count as relevant to our goal because it hadn't gotten a label yet. However, after a few months go by, those articles slowly get project labels applied to them, and then they end up counting. That's just a theory. Let me go take a look at my labs account. I might be able to find the old project-page list that I used.
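One quick way to answer the distribution question is to count pages per project from whatever project-to-page listing is on hand (the page map the tool builds, or the old project-page list mentioned below). A minimal sketch, assuming a two-column CSV of project name and page title, which is an assumed layout rather than a documented format:

```python
# Sketch for the "labeling spree" question: count how many pages each
# WikiProject contributes to the pages-of-interest set. The file name and
# column order are assumptions.
import csv
from collections import Counter

pages_per_project = Counter()
with open("project_pages.csv", newline="") as f:
    for row in csv.reader(f):
        project, page = row[0], row[1]   # assumed column order: project, page
        pages_per_project[project] += 1

for project, count in pages_per_project.most_common(20):
    print(f"{count:>8}  {project}")
```

A project whose count dwarfs the rest would be a candidate for that kind of labeling spree.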
@kjschiroo My results files...
Something is going on here and I'm not sure what. I'm doing some basic sanity checks right now. Validating against your dashboard: in Spring 2016 there were 3.73 million words added total. According to this data set, in Spring 2016 in science alone 207,886,463 bytes were added; IIRC there are about 5 bytes per word, so roughly 41 million words. You don't happen to have multiple copies of dump files sitting around, do you? I've got to run right now, but I'll upload the files I've been referencing from spring 2016 later.
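The arithmetic behind that sanity check, for anyone following along (5 bytes per word is the rough rule of thumb quoted above, not a measured constant):

```python
bytes_added = 207_886_463          # science bytes reported for Spring 2016
words_estimate = bytes_added / 5   # ~5 bytes per word, per the rule of thumb above
dashboard_words = 3.73e6           # total words added per the dashboard
print(f"~{words_estimate / 1e6:.1f} million words")          # ~41.6 million
print(f"~{words_estimate / dashboard_words:.0f}x dashboard")  # ~11x the dashboard total
```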
Here are my files. I've also included the pages that were labeled as science at the time. It is around 670,000. So 1,000,000 is higher than I'd expect, but not totally unbelievable.
@kjschiroo I have all the gz files, including both stub-meta-history and stub-meta-current. Maybe that is a problem? Will try without the -current ones.
Extra dumps don't appear to be the problem. I got the same output when I tried after deleting the -current dumps. I'm now trying to use a modified version of this program to get the overall portion of content contributed by students... which I think I can do just by handling the case of no page maps by setting …
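The idea described there can be sketched as: when no page map is supplied, treat every page as in scope, so the totals cover all of Wikipedia rather than just the labeled science pages. The function and variable names below are hypothetical, since the actual change wasn't quoted in the thread.

```python
# Sketch: with no page map, count everything; with one, count only its pages.
# Names here are illustrative, not the project's actual variables.
def is_page_of_interest(page_id, pages_of_interest=None):
    if pages_of_interest is None:      # no page map given: count everything
        return True
    return page_id in pages_of_interest

# usage: filtering revisions with and without a page map
revisions = [{"page_id": 1, "bytes_added": 120}, {"page_id": 2, "bytes_added": 40}]
science_pages = {1}
print(sum(r["bytes_added"] for r in revisions
          if is_page_of_interest(r["page_id"], science_pages)))   # 120
print(sum(r["bytes_added"] for r in revisions
          if is_page_of_interest(r["page_id"])))                  # 160
```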
I'd be concerned that there is a deeper issue going on here. The total counts should reflect what we observe on the dashboard.
@ragesoss I'm looking into this and am having trouble with the mysql connection timing out. Would you be able to save me a bit of trouble and post your …
@kjschiroo I can get that to you on Monday. Don't have access to the file this weekend.
@kjschiroo I shared a dropbox folder that now has the file.
@kjschiroo I'm working on this for the fall_2016 and spring_2017 terms, but I don't know what to do on wmflabs in terms of the user_database instructions. I have a CSV file of student user names for a term, and I assume I can use that to make a user database on labs?
Any pointers for how to proceed would be much appreciated.
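For what it's worth, here is a minimal sketch of turning a CSV of student user names into a user database on labs. It assumes the replica.my.cnf credentials, that the database name must be prefixed with your credential user (the u1234__ / s12345__ convention), and a one-user-name-per-row CSV; the table and file names are made up, and the exact host and naming rules may have changed since this thread.

```python
# Sketch: load a cohort CSV into a user database on labs. Database name,
# table layout, and CSV file name are assumptions for illustration.
import csv
import os
import pymysql

DB_NAME = "u1234__cohort"   # must start with your credential user plus "__"

conn = pymysql.connect(
    host="enwiki.labsdb",   # the host used earlier in this thread
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
    charset="utf8mb4",
)
try:
    with conn.cursor() as cur:
        cur.execute(f"CREATE DATABASE IF NOT EXISTS {DB_NAME}")
        cur.execute(f"CREATE TABLE IF NOT EXISTS {DB_NAME}.users "
                    "(user_name VARBINARY(255) PRIMARY KEY)")
        with open("fall_2016_students.csv", newline="") as f:
            rows = [(r[0],) for r in csv.reader(f) if r]   # one user name per row assumed
        cur.executemany(
            f"INSERT IGNORE INTO {DB_NAME}.users (user_name) VALUES (%s)", rows)
    conn.commit()
finally:
    conn.close()
```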