Lintangsutawika news #68

Merged: 2 commits merged into main from lintangsutawika-news on May 27, 2024

Conversation

blester125 (Collaborator)

No description provided.

@blester125 mentioned this pull request May 7, 2024 (Closed)
@blester125 requested a review from craffel May 7, 2024 13:23
@StellaAthena (Collaborator)

@lintangsutawika if you can look over this PR, that would be highly valuable. It should incorporate the changes requested on your original PR.

@craffel (Collaborator) left a comment

Looks reasonable to me - any example output we can look at for each site? Some of the utils are not necessarily news-specific, but we can worry about pulling out shared functionality later.

news/download-pages.sh (review thread on an outdated diff, resolved)
@blester125 (Collaborator, Author)

I added some examples. I think there is still room for someone to iterate on the HTML parsing. For example, the freedom.press examples seem to be missing some content, and the author section includes way too much stuff.

@craffel (Collaborator) commented May 7, 2024

The body text looks okayish on the examples I see. I don't see the freedom.press examples, but if we're removing content that we shouldn't, that should be fixed. And yes, we should fix the author parsing, since the author field seems to contain stuff that isn't the author's name in all examples. @lintangsutawika is this something you can work on, or should we find someone else?

@blester125 (Collaborator, Author) commented May 7, 2024

I dug into it a little: the sitemap parsing we use grabs all the links for a website. For example, for freedom.press it includes links like https://freedom.press/foia/obama-admin-secret-opposition-foia-reform/ or people profiles like https://freedom.press/people/kelly-caine/, which aren't formatted the same way a news story like https://freedom.press/news/why-arent-more-journalism-schools-teaching-security-hygiene/ is.

So it looks like we either want to expand the parsing to handle these other pages (the hardest part is probably configuration, i.e. which parser a particular page needs) or filter down to just the news story pages (which could lose some data, since these pages aren't totally empty).
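Purely as an illustration of that configuration idea (nothing here is from the PR; `parse_news`, `parse_profile`, and the prefix table are hypothetical stand-ins):

```python
from urllib.parse import urlparse

def parse_news(html): ...     # stand-in for the existing news-article parser (utils.parse_page)
def parse_profile(html): ...  # hypothetical parser for pages like /people/<name>/

# Hypothetical per-site table mapping URL path prefixes to the parser that
# understands that page layout; anything unmatched falls back to the news parser.
PARSERS_BY_PREFIX = {
    "/people/": parse_profile,
    "/news/": parse_news,
}

def pick_parser(url: str):
    path = urlparse(url).path
    for prefix, parser in PARSERS_BY_PREFIX.items():
        if path.startswith(prefix):
            return parser
    return parse_news
```

Filtering to only news pages would be the simpler variant of this, where unmatched prefixes are dropped instead of falling back to the default parser.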

My vote would be to merge the processing pipeline code (this PR) and then people can hack on the parser (the utils.parse_page function) later.

@lintangsutawika (Collaborator)

I can make some fixes for freedom.press. But if we want to merge this first, it's also possible for me to make a new PR to adjust the freedom.press parsing.

@craffel (Collaborator) commented May 8, 2024

I guess it doesn't hurt to try to get the non-news-article pages, but I don't think we should sink a lot of time into it given that there aren't many of them and they don't have much content. @lintangsutawika I don't think @blester125 was saying it was just on freedom.press that there were issues, right?

@lintangsutawika (Collaborator)

I think other sites have similar cases, but the news articles outnumber them (and the news articles are really the main point anyway).

@craffel (Collaborator) commented May 8, 2024

We should filter out the non-article pages then, right?

@blester125 (Collaborator, Author)

I think some are worth keeping; we just need to figure out what works and what doesn't. For example, https://freedom.press/training/blog/story-inside-your-software-updates/ is a "training" page that seems to be parsed basically correctly and has 1200 words in it. In contrast, https://freedom.press/training/secondary-signal-account/ only has its title after parsing, but looking at the page it has ~1500 words. I think some pages, like anything under /donate/, can definitely be filtered out. This filtering seems to be needed on multiple sites: for example, 360info has /visual_tags and tag pages that don't seem to have any info on them / get parsed to nothing, and there are similar /tag URLs in libertytvradio. I assume there are things like this for all the sites.
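For what it's worth, a minimal sketch of the kind of URL denylist being described; the patterns are just the examples above (/donate/, /tag, /visual_tags) and would need per-site tuning:

```python
import re

# Path patterns that tend to parse to little or no content; illustrative only.
SKIP_PATTERNS = re.compile(r"/(donate|tags?|visual_tags)(/|$)")

def keep_url(url: str) -> bool:
    """Keep URLs that don't match any low-content pattern."""
    return SKIP_PATTERNS.search(url) is None

assert keep_url("https://freedom.press/training/blog/story-inside-your-software-updates/")
assert not keep_url("https://freedom.press/donate/")
```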

It looks like currently the author parsing is overzealous and you get things like "author": "Trevor Timm\n\n\n\n\nExecutive Director\n\nMay 28, 2019", which leads to double dates at the start of the article: "Unanswered questions on the San Francisco police raid of a journalist’s home\nTrevor Timm\n\n\n\n\nExecutive Director\n\nMay 28, 2019\nMay 28, 2019\n...". This happens in multiple sources (like 360info: "author": "Authors\nGilda TachedjianBurnet Institute and Monash University").
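A rough sketch of the kind of author cleanup this would need (the heuristics are guesses, not code from this PR): keep the first line that isn't empty, a date, or a label like "Authors":

```python
import re

DATE_RE = re.compile(r"^[A-Z][a-z]+ \d{1,2}, \d{4}$")  # e.g. "May 28, 2019"
LABEL_WORDS = {"author", "authors", "by"}               # boilerplate labels, not names

def clean_author(raw: str) -> str:
    """Return the first line of the captured author field that looks like a name."""
    for line in raw.splitlines():
        line = line.strip()
        if not line or DATE_RE.match(line) or line.lower() in LABEL_WORDS:
            continue
        return line
    return ""

print(clean_author("Trevor Timm\n\n\n\n\nExecutive Director\n\nMay 28, 2019"))  # -> "Trevor Timm"
```

Per-source tweaks would still be needed (e.g. dropping role lines like "Executive Director" or splitting fused author/affiliation strings).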

lintangsutawika and others added 2 commits May 27, 2024 16:30
added propublica
add process_map
add multiprocess
add differentiation between clean and raw directory
renamed directory
add os
added new source
update
unified process for all news
add way to save index
run individual news in process.py
remove
better parsing
added notes for each site
processes ahref em and strong tags
allow both html and url choice to be used
add byline and fix html_path
update how pages are saved
add dependancies
failed pages are saved to a new file
process to get_page.py, and added news sites
add script for processing html
set new arguments
add filename to jsonfile
add args and capture exceptions
update get_record
removed comments
not use wget
add list of sites
update to split page download and page processing
remove duplicates in page_list
removed arg
fix args
moved limit
fix typo
add process italic
fix script to process text
tqdm move to inside map
moved process to get-text.sh
simplify multiprocess
simplify multiprocess
add empty line
add license arg
alphabetical order
better process to capture bylines and time
add header
add paramter of searching for date and bylines
attrs are searched as regex string
update attribute search
update method name
change header name
updated parameters for CC BY sites
author then date
update search and attrs
add readme

Create a shared scraping function.

This PR adds a shared scraping function to the licensed pile shared
library. It has a default user-agent string and smart retries. We should
use it when we need to `HTTP GET` a resource from within python.

…tilities and splits it into steps better.

It also updates the information extraction steps to have cleaner
authors and filters out some pages with little content.
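For context, a minimal sketch of what such a helper might look like, assuming requests plus urllib3's Retry; the function name and defaults here are illustrative, not the actual licensed pile API:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(user_agent: str = "licensed-pile/0.1 (research scraper)") -> requests.Session:
    """Build a Session that sends a fixed User-Agent and retries transient failures."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    retry = Retry(
        total=5,
        backoff_factor=1.0,
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

# Usage: resp = make_session().get("https://freedom.press/news/", timeout=30)
```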
@blester125 force-pushed the lintangsutawika-news branch from 394ca40 to 55ad138 on May 27, 2024 20:32
@blester125 (Collaborator, Author) commented May 27, 2024

Updated the code to clean up the author and date extraction a lot; I also filter out some pointless pages like /tag/.... The fully scraped and processed dataset is at https://huggingface.co/datasets/blester125/news-dolma

There are still a few small issues, but they'd need a lot of work to fix, i.e. code that runs for specific sources and whatnot. Those can be addressed in v2.

I'm going to merge this in a bit unless someone has objections.

@blester125 merged commit e2b8675 into main May 27, 2024
2 checks passed
@blester125 deleted the lintangsutawika-news branch May 27, 2024 22:21