Full text can't be fetched from certain news sites (example included) #128

Open
BryanWall opened this issue May 23, 2024 · 1 comment

BryanWall commented May 23, 2024

I tried to create a feed using the complex CSS selector bridge in RSS-bridge but could not get it to fetch the full text. The full text of each article is viewable in a browser, but this is one of those sites that lets you view only a certain number of articles before putting up a paywall. The limit seems to be tracked with cookies only, so a private browsing session gets around it.

The way the site is formatted prevents RSS-bridge from getting the full text, though. The "hidden" content is stored in divs with class="subscriber-only", which I assume are hidden with CSS once you exceed the article limit. The text itself is not removed from the page.
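
To illustrate the markup described above, here is a minimal sketch that pulls the text out of divs whose class list contains subscriber-only, using only the Python standard library. The sample HTML is made up for illustration, and `SubscriberOnlyText` and `extract_hidden_text` are hypothetical names; this is not how RSS-bridge or morss extract content, and it assumes the hidden text really is present in the HTML the fetcher receives.

```python
from html.parser import HTMLParser


class SubscriberOnlyText(HTMLParser):
    """Collect text found inside <div class="subscriber-only"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside a matching div (0 = outside)
        self.chunks = []  # stripped text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter a matching div, or track a div nested inside one.
        if self.depth or "subscriber-only" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_hidden_text(html):
    parser = SubscriberOnlyText()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical markup, modelled on the description above (not copied from the site).
sample = """
<article>
  <p>Free teaser paragraph.</p>
  <div class="subscriber-only"><p>First hidden paragraph.</p></div>
  <div class="subscriber-only hide"><p>Second hidden paragraph.</p></div>
</article>
"""

print(extract_hidden_text(sample))
```

If the real pages deliver the subscriber-only divs only after JavaScript runs, or strip them server-side for some clients, a sketch like this would find nothing, which may be worth checking against the raw HTML.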

When I couldn't figure it out in RSS-bridge, I searched for solutions and found morss. I tried creating a feed that only fetches the article links and then passing that feed to morss, but morss runs into the same problem getting the full text. Here is a sample RSS-bridge feed I made that you can try with morss to see the issue:

https://rss-bridge.org/bridge01/?action=display&bridge=XPathBridge&url=https%3A%2F%2Fwww.paducahsun.com%2Fsearch%2F%3Fk%3D%2522mccracken%2520county%2520public%2520schools%2522%23tncms-source%3Dkeyword&item=%2F%2Farticle&title=.%2F%2Fa%2F%40aria-label&content=.%2F%2Fp%2Ftext%28%29&uri=.%2F%2Fa%5B%40class%3D%22tnt-asset-link%22%5D%2F%40href&author=&timestamp=.%2F%2Ftime%2F%40datetime&enclosures=.%2F%2Fimg%2F%40data-srcset%5B1%5D&categories=&format=Json
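
For anyone who wants to read the XPath expressions without decoding the URL by hand, this short sketch splits the feed URL above into its query parameters with the standard library. Only the URL itself comes from this issue; the variable names are illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# The RSS-bridge feed URL from this issue, split across lines for readability.
feed_url = (
    "https://rss-bridge.org/bridge01/?action=display&bridge=XPathBridge"
    "&url=https%3A%2F%2Fwww.paducahsun.com%2Fsearch%2F%3Fk%3D%2522mccracken"
    "%2520county%2520public%2520schools%2522%23tncms-source%3Dkeyword"
    "&item=%2F%2Farticle"
    "&title=.%2F%2Fa%2F%40aria-label"
    "&content=.%2F%2Fp%2Ftext%28%29"
    "&uri=.%2F%2Fa%5B%40class%3D%22tnt-asset-link%22%5D%2F%40href"
    "&author="
    "&timestamp=.%2F%2Ftime%2F%40datetime"
    "&enclosures=.%2F%2Fimg%2F%40data-srcset%5B1%5D"
    "&categories="
    "&format=Json"
)

# keep_blank_values=True keeps the empty author/categories fields visible.
query = parse_qs(urlsplit(feed_url).query, keep_blank_values=True)
params = {name: values[0] for name, values in query.items()}

for name, value in params.items():
    print(f"{name:12} {value}")
```

Decoded, the item expression is //article and the content expression is .//p/text(), both applied to the search results page. As far as I can tell that means the bridge itself only ever sees the teaser paragraphs on that page, so the full text would have to come from following each uri link.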

The CMS in use is Blox CMS (https://www.help.bloxdigital.com/blox_cms/community/access_control/). It is used by hundreds of newspaper and TV station websites, so a way to fetch the full text from one of them would probably work for every site running the same CMS.

@BryanWall

This app has code for bypassing the paywall on this site and loading the full text:

https://github.com/bpc-clone?tab=repositories
