Lintangsutawika news #68
Conversation
@lintangsutawika, if you can look over this PR, that would be highly valuable. It should incorporate the changes requested in your original PR.
Looks reasonable to me - any example output we can look at for each site? Some of the utils are not necessarily news-specific, but we can worry about pulling out shared functionality later.
I added some examples. I think there is still room for someone to iterate on the HTML parsing. For example, the …
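For context, a minimal sketch of the kind of per-page HTML parsing being iterated on here, assuming BeautifulSoup is the parser in use. The tag handling (`<a>`, `<em>`, `<strong>`) mirrors the commit messages later in this PR, but the function name and selectors are hypothetical, not the project's actual code.

```python
# Minimal sketch of body-text extraction, assuming BeautifulSoup.
# The function name and selectors are hypothetical.
from bs4 import BeautifulSoup


def extract_body_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-content elements before collecting text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    paragraphs = []
    for p in soup.find_all("p"):
        # get_text() flattens inline tags like <a>, <em>, and <strong>
        # into plain text while keeping their visible content.
        text = p.get_text(" ", strip=True)
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```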
The body text looks okay-ish in the examples I see. I don't see the freedom.press examples, but if we're removing content that we shouldn't, that should be fixed. And yes, we should fix author parsing, since the author field seems to contain text that isn't the author's name in all the examples. @lintangsutawika, is this something you can work on, or should we find someone else?
I dug a little into it. Basically, the sitemap parsing we use grabs all the links for a website. As an example, for freedom.press it includes links like https://freedom.press/foia/obama-admin-secret-opposition-foia-reform/ or people profiles like https://freedom.press/people/kelly-caine/, which aren't formatted the same way a news story like https://freedom.press/news/why-arent-more-journalism-schools-teaching-security-hygiene/ is. So it looks like we either want to expand the parsing to handle these other pages (the hardest part is probably configuration, i.e. which parser a particular page needs), or we can filter to just the news story pages (this could lose some data, as these pages aren't totally empty). My vote would be to merge the processing pipeline code (this PR) and then people can hack on the parser (the …
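A rough illustration of the two options described above, assuming URL path prefixes are a reliable signal on freedom.press. The prefix list, mapping, and function names are hypothetical guesses, not anything this PR implements.

```python
# Sketch of the two options discussed above. The prefix lists and names
# are hypothetical; real page-type detection may need more than the path.
from typing import Optional
from urllib.parse import urlparse

# Option 1: route each page to a parser based on its path prefix.
PARSER_BY_PREFIX = {
    "/news/": "news_article",
    "/training/blog/": "news_article",
    "/people/": "profile",
    "/foia/": "document_page",
}

# Option 2: keep only the news-story pages.
NEWS_PREFIXES = ("/news/", "/training/blog/")


def pick_parser(url: str) -> Optional[str]:
    path = urlparse(url).path
    for prefix, parser in PARSER_BY_PREFIX.items():
        if path.startswith(prefix):
            return parser
    return None  # unknown page type; skip or log for later


def is_news_story(url: str) -> bool:
    return urlparse(url).path.startswith(NEWS_PREFIXES)
```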
I can make some fixes for freedom.press. But if we want to merge this first, it's also possible for me to make a new PR to adjust the freedom.press parsing.
I guess it doesn't hurt to try to get the non-news-article pages, but I don't think we should sink a lot of time into it given that there aren't many of them and they don't have much content. @lintangsutawika I don't think @blester125 was saying it was just on freedom.press that there were issues, right?
I think other sites have a similar case, but the news articles outnumber those pages (and the news articles are really the main point anyway).
We should filter out the non-article pages then, right?
I think some are worth keeping; we just need to figure out what works and what doesn't. For example, https://freedom.press/training/blog/story-inside-your-software-updates/ is a "training" page that seems to be parsed basically correctly and has 1200 words in it. In contrast, https://freedom.press/training/secondary-signal-account/ only has its title after parsing, but looking at the page it has ~1500 words. I think some pages, like anything under /donate/, can definitely be filtered out. This filtering seems to be needed on multiple sites (for example 360info has …). It also looks like the author parsing is currently overzealous and you get things like …
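A hedged sketch of the kind of filtering and byline cleanup being suggested here. The skip list, word-count threshold, and byline regex are illustrative guesses at the heuristics under discussion, not the PR's actual implementation.

```python
# Illustrative only: the skip list, threshold, and regex are guesses at the
# heuristics discussed above, not what the PR actually implements.
import re

SKIP_PATH_PARTS = ("/donate/", "/people/")
MIN_WORDS = 100


def should_keep(url: str, body_text: str) -> bool:
    # Drop known non-article pages and pages with almost no extracted text.
    if any(part in url for part in SKIP_PATH_PARTS):
        return False
    return len(body_text.split()) >= MIN_WORDS


BYLINE_RE = re.compile(r"^\s*by\s+", re.IGNORECASE)


def clean_author(raw_byline: str) -> str:
    # Strip a leading "By " and trailing extras that tend to ride along when
    # the byline element contains more than the author's name.
    author = BYLINE_RE.sub("", raw_byline)
    author = author.split("|")[0]  # drop trailing "| June 1, 2020"-style suffixes
    return author.strip()
```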
added propublica
add process_map
add multiprocess
add differentiation between clean and raw directory
renamed directory
add os
added new source
update unified process for all news
add way to save index
run individual news in process.py
remove better parsing
added notes for each site
processes ahref em and strong tags
allow both html and url choice to be used
add byline and fix html_path
update how pages are saved
add dependancies
failed pages are saved to a new file
process to get_page.py, and added news sites
add script for processing html
set new arguments
add filename to jsonfile
add args and capture exceptions
update get_record
removed comments
not use wget
add list of sites
update to split page download and page processing
remove duplicates in page_list
removed arg
fix args
moved limit
fix typo
add process italic
fix script to process text
tqdm move to inside map
moved process to get-text.sh
simplify multiprocess
simplify multiprocess
add empty line
add license arg
alphabetical order
better process to capture bylines and time
add header
add paramter of searching for date and bylines
attrs are searched as regex string
update attribute search
update method name
change header name
updated parameters for CC BY sites
author then date
update search and attrs
add readme
Create a shared scraping function. This PR adds a shared scraping function to the licensed pile shared library. It has a default user-agent string and smart retries. We should use it when we need to `HTTP GET` a resource from within python.
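For reference, a minimal sketch of what a shared GET helper with a default user-agent string and smart retries might look like, using `requests` with `urllib3`'s `Retry`. The function name, user-agent string, and retry settings are assumptions; the actual helper in the licensed pile shared library may differ.

```python
# Sketch of a shared GET helper with a default user-agent and retries, in the
# spirit of the commit above. The name, UA string, and settings are assumed.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

DEFAULT_USER_AGENT = "licensed-pile-scraper"  # placeholder UA string


def get_page(url: str, timeout: float = 30.0, retries: int = 5) -> requests.Response:
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=1.0,  # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    resp = session.get(url, headers={"User-Agent": DEFAULT_USER_AGENT}, timeout=timeout)
    resp.raise_for_status()
    return resp
```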
…tilities and splits it into steps better. It also updates the information extraction steps to have cleaner authors and filters out some pages with little content.
394ca40 to 55ad138
Updated the code to clean up the author and date extraction a lot. I also filter out some pointless pages like … There are still a few small issues, but they need a lot of work to fix, i.e. code that runs for specific sources and whatnot. Those can be addressed in v2. I'm going to merge this in a bit unless someone has objections.