This tutorial introduces you to the article class and how to access the parsed data.
The Article class is the base data container Fundus uses to store information about an article.
It contains the parsed attributes as well as the article's origin.
As an example, let's print some titles.
from fundus import Crawler, PublisherCollection
crawler = Crawler(PublisherCollection.us)
for article in crawler.crawl(max_articles=2):
    print(article.title)  # <- you can use any kind of attribute access Python supports on objects here
This should print something like:
Shutterstock shares pop as company expands partnership with OpenAI
Donald Trump asks judge to delay classified documents trial
Now have a look at the attribute guidelines.
All attributes listed here can be safely accessed through the Article class.
NOTE: The listed attributes represent fields of the Article dataclass, all of which have default values.
Some parsers may support additional attributes not listed in the guidelines.
You can find those attributes in the supported publisher tables under Additional Attributes.
NOTE: Keep in mind that these additional attributes are specific to a parser and cannot be accessed safely for every article.
Sometimes an attribute listed in the attribute guidelines isn't supported at all by a specific parser.
You can find this information under the Missing Attributes tab within the supported publisher tables.
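Because parser-specific attributes may be absent on a given article, it can help to read them defensively. The sketch below uses Python's built-in getattr with a default value. It runs against a plain stand-in object rather than a real Fundus article, the attribute name hypothetical_extra is invented for illustration, and it assumes that a missing attribute raises AttributeError as it does on ordinary Python objects.

```python
from types import SimpleNamespace

# Stand-in for an article; this is NOT a real Fundus Article instance.
article = SimpleNamespace(title="Example title")


def read_optional(obj, name, default=None):
    """Return obj.<name> if the attribute exists, otherwise the given default."""
    return getattr(obj, name, default)


title = read_optional(article, "title")
# "hypothetical_extra" is an invented name standing in for a parser-specific attribute.
extra = read_optional(article, "hypothetical_extra", default="n/a")

print(title)
print(extra)
```

The same pattern should carry over to any attribute that only some parsers provide, as long as the fallback value makes sense for your use case.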
There is also a built-in search mechanism you can learn about here.
Fundus supports two methods to access the body of the article:
- Accessing the plaintext property of Article with article.plaintext. This returns a cleaned and formatted version of the article body as a single string object and should be suitable for most use cases.
  NOTE: The different DOM elements are joined with two new lines and cleaned with split() and ' '.join().
- Accessing the body attribute of Article. This returns an ArticleBody instance, granting more fine-grained access to the DOM structure of the article body.
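To make the note on plaintext cleaning concrete, here is a minimal sketch of the described behavior: each text element is normalized with split() and ' '.join(), then the elements are joined with two new lines. The function names clean and join_paragraphs are made up for this illustration and this is not Fundus's actual implementation.

```python
from typing import Iterable


def clean(text: str) -> str:
    # split() without arguments splits on any run of whitespace and
    # drops leading/trailing whitespace; ' '.join() puts single spaces back.
    return " ".join(text.split())


def join_paragraphs(paragraphs: Iterable[str]) -> str:
    # Join the cleaned elements with two new lines, as described in the note.
    return "\n\n".join(clean(paragraph) for paragraph in paragraphs)


plaintext = join_paragraphs(
    ["  First   paragraph\twith odd spacing. ", "Second\nparagraph."]
)
print(plaintext)
```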
The ArticleBody consists of
- a summary giving a brief introduction of the article
- an attribute sections containing multiple ArticleSection instances

Each ArticleSection includes
- a headline, separating the section from other sections
- multiple paragraphs following the headline
ArticleBody
|-- summary: TextSequence
|-- sections: List[ArticleSection]
    |-- headline: TextSequence
    |-- paragraphs: TextSequence
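The nesting described above can be mirrored with two small dataclasses. This is only a simplified model for experimenting offline: Body and Section are hypothetical stand-ins, not the Fundus classes, and text sequences are approximated as plain lists of strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Simplified stand-in for ArticleSection."""
    headline: List[str] = field(default_factory=list)
    paragraphs: List[str] = field(default_factory=list)


@dataclass
class Body:
    """Simplified stand-in for ArticleBody."""
    summary: List[str] = field(default_factory=list)
    sections: List[Section] = field(default_factory=list)


def outline(body: Body) -> List[str]:
    """Describe the body layout, one line per structural element."""
    lines = [f"summary: {len(body.summary)} paragraph(s)"]
    for index, section in enumerate(body.sections):
        # An empty headline is possible, so fall back to a placeholder.
        headline = " ".join(section.headline) or "<no headline>"
        lines.append(
            f"section {index}: {headline} ({len(section.paragraphs)} paragraph(s))"
        )
    return lines


body = Body(
    summary=["A short introduction."],
    sections=[
        Section(headline=[], paragraphs=["Opening paragraph."]),
        Section(headline=["Background"], paragraphs=["First.", "Second."]),
    ],
)
for line in outline(body):
    print(line)
```

The placeholder for empty headlines reflects that a section may come without one, so code walking the structure should not assume a headline is always present.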
Let's print the headline and paragraphs for the last section of the article body.
from fundus import Crawler, PublisherCollection
from textwrap import TextWrapper
crawler = Crawler(PublisherCollection.us.CNBC)
wrapper = TextWrapper(width=80, max_lines=1)
for article in crawler.crawl(max_articles=1):
    last_section = article.body.sections[-1]
    if last_section.headline:
        print(wrapper.fill(f"This is a headline: {last_section.headline}"))
    for paragraph in last_section.paragraphs:
        print(wrapper.fill(f"This is a paragraph: {paragraph}"))
This will print something like:
This is a headline: Even a proper will is superseded in some cases
This is a paragraph: A will is superseded in some cases, such as with [...]
This is a paragraph: That may also happen if a decedent owns property in [...]
This is a paragraph: "You have to also look at how your assets are [...]
This is a paragraph: When someone dies, the executor presents their will [...]
This is a paragraph: People who would like to keep the details of their [...]
NOTE: Not all publishers support the layout format shown above. Sometimes headlines are missing, or the entire summary is. You can always check the specific parser to see what to expect, but even within a single publisher, the layout differs from article to article.
Fundus keeps track of the origin of each article.
You can access this information using the html field of Article.
Here you have access to the following information:
- requested_url: str: The original URL used to request the HTML.
- responded_url: str: The URL attached to the server response. Often the same as requested_url; can change with redirects.
- content: str: The HTML content.
- crawl_date: datetime: The exact timestamp the article was crawled.
- source_info: SourceInfo: Some information about the HTML's origins, mostly for debugging purposes.
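One practical use of these fields is spotting redirects by comparing the requested and responded URLs. The sketch below works on a hypothetical stand-in dataclass, OriginInfo, that only mirrors some of the field names listed above; it is not the Fundus class, and the URLs are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class OriginInfo:
    """Hypothetical stand-in mirroring some of the fields listed above."""
    requested_url: str
    responded_url: str
    content: str
    crawl_date: datetime


def was_redirected(origin: OriginInfo) -> bool:
    # A differing response URL indicates the request was redirected.
    return origin.requested_url != origin.responded_url


origin = OriginInfo(
    requested_url="https://example.com/a",
    responded_url="https://example.com/b",
    content="<html></html>",
    crawl_date=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
print(was_redirected(origin))
```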
Sometimes publishers support articles in different languages.
To address this, Fundus includes native support for language detection.
You can access the detected language with Article.lang.
Let's print some languages for our articles.
from fundus import Crawler, PublisherCollection
crawler = Crawler(PublisherCollection.us)
for article in crawler.crawl(max_articles=1):
    print(article.lang)
This should print:
en
If you want to save an article in JSON format, the Article class provides a to_json method, which returns a JSON-serializable dictionary.
The method accepts string values to specify which attributes should be serialized.
By default, all extracted attributes and the plaintext attribute of Article are included in the serialization.
from fundus import Crawler, PublisherCollection
crawler = Crawler(PublisherCollection.us)
for article in crawler.crawl(max_articles=10):
    # use the default serialization
    article_json = article.to_json()
    # or only serialize specific attributes
    article_json = article.to_json("title", "plaintext", "lang")
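Since to_json returns a JSON-serializable dictionary, writing a single article to disk can be done with the standard json module. The sketch below uses a hand-written stand-in dictionary instead of a real to_json result (the exact keys and values of a real result may differ); the helper save_json and the file name are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the dictionary returned by article.to_json("title", "plaintext", "lang").
article_json = {
    "title": "Example title",
    "plaintext": "First paragraph.\n\nSecond paragraph.",
    "lang": "en",
}


def save_json(data: dict, path: Path) -> None:
    """Write the dictionary as UTF-8 JSON, keeping non-ASCII characters readable."""
    with path.open("w", encoding="utf-8") as file:
        json.dump(data, file, ensure_ascii=False, indent=2)


output_path = Path(tempfile.gettempdir()) / "example_article.json"
save_json(article_json, output_path)

# Read the file back to confirm the round trip preserves the data.
restored = json.loads(output_path.read_text(encoding="utf-8"))
print(restored["title"])
```

Passing ensure_ascii=False keeps non-English article text human-readable in the file, which is a choice made here for illustration rather than a requirement.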
To save all articles at once, using the default serialization and only specifying a location, refer to this section.
In the next section we will show you how to filter articles.