Lintangsutawika news #68

Merged: 2 commits merged into main from lintangsutawika-news on May 27, 2024

Conversation

blester125 (Collaborator)

No description provided.

@blester125 mentioned this pull request May 7, 2024 (Closed)
@blester125 requested a review from craffel May 7, 2024 13:23
@StellaAthena (Collaborator)

@lintangsutawika if you can look over this PR, that would be highly valuable. It should incorporate the changes requested on your original PR.

@craffel (Collaborator) left a comment

Looks reasonable to me - any example output we can look at for each site? Some of the utils are not necessarily news-specific, but we can worry about pulling out shared functionality later.

news/download-pages.sh (review thread on an outdated diff, resolved)
@blester125 (Collaborator, Author)

I added some examples. I think there is still room for someone to iterate on the HTML parsing. For example, the freedom.press examples seem to be missing some content, and the author section includes way too much stuff.

@craffel (Collaborator) commented May 7, 2024

The body text looks okayish on the examples I see. I don't see the freedom.press examples, but if we're removing content that we shouldn't, that should be fixed. And yes, we should fix the author parsing, since the author field seems to contain stuff that isn't the author's name in all examples. @lintangsutawika is this something you can work on, or should we find someone else?

@blester125 (Collaborator, Author) commented May 7, 2024

I dug into it a little: the sitemap parsing we use grabs all the links for a website. For example, for freedom.press it includes links like https://freedom.press/foia/obama-admin-secret-opposition-foia-reform/ or people profiles like https://freedom.press/people/kelly-caine/, which aren't formatted the same way a news story like https://freedom.press/news/why-arent-more-journalism-schools-teaching-security-hygiene/ is.

So it looks like we either want to expand the parsing to handle these other pages (the hardest part is probably configuration, i.e. which parser a particular page needs) or filter down to just the news story pages (which could lose some data, since these pages aren't totally empty).
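Purely as an illustration of that configuration idea (nothing here is from the PR; `parse_news`, `parse_profile`, and the prefix table are hypothetical stand-ins):

```python
from urllib.parse import urlparse

def parse_news(html): ...     # stand-in for the existing news-article parser (utils.parse_page)
def parse_profile(html): ...  # hypothetical parser for pages like /people/<name>/

# Hypothetical per-site table mapping URL path prefixes to the parser that
# understands that page layout; anything unmatched falls back to the news parser.
PARSERS_BY_PREFIX = {
    "/people/": parse_profile,
    "/news/": parse_news,
}

def pick_parser(url: str):
    path = urlparse(url).path
    for prefix, parser in PARSERS_BY_PREFIX.items():
        if path.startswith(prefix):
            return parser
    return parse_news
```

Filtering to only news pages would be the simpler variant of this, where unmatched prefixes are dropped instead of falling back to the default parser.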

My vote would be to merge the processing pipeline code (this PR) and then people can hack on the parser (the utils.parse_page function) later.

@lintangsutawika (Collaborator)

I can make some fixes for freedom.press. But if we want to merge this first, it's also possible for me to make a new PR to adjust the freedom.press parsing.

@craffel (Collaborator) commented May 8, 2024

I guess it doesn't hurt to try to get the non-news-article pages, but I don't think we should sink a lot of time into it given that there aren't many of them and they don't have much content. @lintangsutawika I don't think @blester125 was saying it was just on freedom.press that there were issues, right?

@lintangsutawika (Collaborator)

I think other sites have similar cases, but the news articles outnumber them (and the news articles are really the main point anyway).

@craffel (Collaborator) commented May 8, 2024

We should filter out the non-article pages then, right?

@blester125 (Collaborator, Author)

I think some are worth keeping; we just need to figure out what works and what doesn't. For example, https://freedom.press/training/blog/story-inside-your-software-updates/ is a "training" page that seems to be parsed basically correctly and has 1200 words in it. In contrast, https://freedom.press/training/secondary-signal-account/ only has its title after parsing, but looking at the page it has ~1500 words. I think some pages, like anything under /donate/, can definitely be filtered out. This filtering seems to be needed on multiple sites: for example, 360info has /visual_tags and tag pages that don't seem to have any info on them / get parsed to nothing, and there are similar /tag URLs in libertytvradio. I assume there are things like this for all the sites.
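For what it's worth, a minimal sketch of the kind of URL denylist being described; the patterns are just the examples above (/donate/, /tag, /visual_tags) and would need per-site tuning:

```python
import re

# Path patterns that tend to parse to little or no content; illustrative only.
SKIP_PATTERNS = re.compile(r"/(donate|tags?|visual_tags)(/|$)")

def keep_url(url: str) -> bool:
    """Keep URLs that don't match any low-content pattern."""
    return SKIP_PATTERNS.search(url) is None

assert keep_url("https://freedom.press/training/blog/story-inside-your-software-updates/")
assert not keep_url("https://freedom.press/donate/")
```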

It looks like currently the author parsing is overzealous and you get things like "author": "Trevor Timm\n\n\n\n\nExecutive Director\n\nMay 28, 2019", which leads to double dates at the start of the article: "Unanswered questions on the San Francisco police raid of a journalist’s home\nTrevor Timm\n\n\n\n\nExecutive Director\n\nMay 28, 2019\nMay 28, 2019\n...". This happens in multiple sources (like 360info: "author": "Authors\nGilda TachedjianBurnet Institute and Monash University").
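A rough sketch of the kind of author cleanup this would need (the heuristics are guesses, not code from this PR): keep the first line that isn't empty, a date, or a label like "Authors":

```python
import re

DATE_RE = re.compile(r"^[A-Z][a-z]+ \d{1,2}, \d{4}$")  # e.g. "May 28, 2019"
LABEL_WORDS = {"author", "authors", "by"}               # boilerplate labels, not names

def clean_author(raw: str) -> str:
    """Return the first line of the captured author field that looks like a name."""
    for line in raw.splitlines():
        line = line.strip()
        if not line or DATE_RE.match(line) or line.lower() in LABEL_WORDS:
            continue
        return line
    return ""

print(clean_author("Trevor Timm\n\n\n\n\nExecutive Director\n\nMay 28, 2019"))  # -> "Trevor Timm"
```

Per-source tweaks would still be needed (e.g. dropping role lines like "Executive Director" or splitting fused author/affiliation strings).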

lintangsutawika and others added 2 commits May 27, 2024 16:30
added propublica
add process_map
add multiprocess
add differentiation between clean and raw directory
renamed directory
add os
added new source
update
unified process for all news
add way to save index
run individual news in process.py
remove
better parsing
added notes for each site
processes ahref em and strong tags
allow both html and url choice to be used
add byline and fix html_path
update how pages are saved
add dependancies
failed pages are saved to a new file
process to get_page.py, and added news sites
add script for processing html
set new arguments
add filename to jsonfile
add args and capture exceptions
update get_record
removed comments
not use wget
add list of sites
update to split page download and page processing
remove duplicates in page_list
removed arg
fix args
moved limit
fix typo
add process italic
fix script to process text
tqdm move to inside map
moved process to get-text.sh
simplify multiprocess
simplify multiprocess
add empty line
add license arg
alphabetical order
better process to capture bylines and time
add header
add paramter of searching for date and bylines
attrs are searched as regex string
update attribute search
update method name
change header name
updated parameters for CC BY sites
author then date
update search and attrs
add readme

Create a shared scraping function.

This PR adds a shared scraping function to the licensed pile shared
library. It has a default user-agent string and smart retries. We should
use it when we need to `HTTP GET` a resource from within python.

…tilities and splits it into steps better.

It also updates the information extraction steps to have cleaner
authors and filters out some pages with little content.
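For context, a minimal sketch of what such a helper might look like, assuming requests plus urllib3's Retry; the function name and defaults here are illustrative, not the actual licensed pile API:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(user_agent: str = "licensed-pile/0.1 (research scraper)") -> requests.Session:
    """Build a Session that sends a fixed User-Agent and retries transient failures."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    retry = Retry(
        total=5,
        backoff_factor=1.0,
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Usage: resp = make_session().get("https://freedom.press/news/", timeout=30)
```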
@blester125 force-pushed the lintangsutawika-news branch from 394ca40 to 55ad138 on May 27, 2024 20:32
@blester125 (Collaborator, Author) commented May 27, 2024

Updated the code to clean up the author and date extraction a lot; I also filter out some pointless pages like /tag/.... The fully scraped and processed dataset is at https://huggingface.co/datasets/blester125/news-dolma

There are still a few small issues, but they'd need a lot of work to fix, i.e. code that runs for specific sources and whatnot. Those can be addressed in v2.

I'm going to merge this in a bit unless someone has objections.

@blester125 merged commit e2b8675 into main May 27, 2024
2 checks passed
@blester125 deleted the lintangsutawika-news branch May 27, 2024 22:21