Merge pull request #63 from DocNow/missing-entities-check
Missing entities check
igorbrigadir authored Jan 8, 2023
2 parents 767535b + 3a16273 commit cb47453
Showing 5 changed files with 17 additions and 6 deletions.
11 changes: 8 additions & 3 deletions README.md
@@ -39,10 +39,10 @@ Usage: twarc2 csv [OPTIONS] [INFILE] [OUTFILE]
   Convert tweets to CSV.
 
 Options:
-  --input-data-type [tweets|users|counts|compliance]
+  --input-data-type [tweets|users|counts|compliance|lists]
                                   Input data type - you can turn "tweets",
-                                  "users", "counts" or "compliance" data into
-                                  CSV.
+                                  "users", "counts" or "compliance" or "lists"
+                                  data into CSV.
   --inline-referenced-tweets / --no-inline-referenced-tweets
                                   Output referenced tweets inline as separate
                                   rows. Default: no.
@@ -51,6 +51,11 @@ Options:
                                   The Retweet Text, metrics and entities are
                                   merged from the original tweet. Default:
                                   Yes.
+  --process-entities / --no-process-entities
+                                  Preprocess entities like URLs, mentions and
+                                  hashtags, providing expanded urls and lists
+                                  only instead of full json objects. Default:
+                                  Yes.
   --json-encode-all / --no-json-encode-all
                                   JSON encode / escape all fields. Default: no
   --json-encode-text / --no-json-encode-text
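
To make the `--process-entities` help text more concrete, here is a minimal sketch of what "expanded urls and lists only instead of full json objects" could look like. The function `simplify_entities` is hypothetical and is not twarc-csv's implementation; the field names `expanded_url`, `username` and `tag` follow the Twitter API v2 entity objects, and the sample record is made up for illustration.

```python
from typing import Any, Dict, List


def simplify_entities(entities: Dict[str, Any]) -> Dict[str, List[str]]:
    """Hypothetical helper: reduce full entity objects to plain lists of values."""
    return {
        # Prefer the expanded URL, fall back to the shortened one.
        "urls": [u.get("expanded_url", u.get("url", "")) for u in entities.get("urls", [])],
        "mentions": [m["username"] for m in entities.get("mentions", []) if "username" in m],
        "hashtags": [h["tag"] for h in entities.get("hashtags", []) if "tag" in h],
    }


# Made-up sample shaped like a Twitter API v2 "entities" object.
example = {
    "urls": [{"url": "https://t.co/abc", "expanded_url": "https://example.com/page"}],
    "mentions": [{"username": "docnow", "id": "1"}],
    "hashtags": [{"tag": "archives"}],
}

print(simplify_entities(example))
# {'urls': ['https://example.com/page'], 'mentions': ['docnow'], 'hashtags': ['archives']}
```

Flat lists like these are easier to fit into a single CSV cell than nested JSON objects, which is presumably why the option defaults to Yes.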
5 changes: 3 additions & 2 deletions dataframe_converter.py
@@ -380,14 +380,15 @@ def _format_tweet(self, tweet):
             tweet["referenced_tweets"] = {}
 
         # Process entities in the tweets:
-        if self.process_entities and "entities" in tweet:
+        if self.process_entities and "entities" in tweet and tweet["entities"]:
             tweet["entities"] = self._process_entities(tweet["entities"])
 
         # Process entities in the tweet authors of tweets:
         if (
             self.process_entities
             and "author" in tweet
             and "entities" in tweet["author"]
+            and tweet["author"]["entities"]
         ):
             if "url" in tweet["author"]["entities"]:
                 urls = [
@@ -418,7 +419,7 @@ def _format_tweet(self, tweet):
             tweet["pinned_tweet_id"] if "pinned_tweet_id" in tweet else None
         )
         # Process entities
-        if self.process_entities and "entities" in tweet:
+        if self.process_entities and "entities" in tweet and tweet["entities"]:
             if "description" in tweet["entities"]:
                 tweet["entities"]["description"] = self._process_entities(
                     tweet["entities"]["description"]
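
The guard added in the hunks above checks that the `entities` value is truthy, not only that the key exists. The sketch below compares the two conditions on a few made-up records. The functions `old_guard` and `new_guard` are illustrative stand-ins, not code from twarc-csv, and the null and empty cases are assumptions about the kind of input the fix targets, since the contents of `test-data/entities_test.jsonl` are not shown in this diff.

```python
def old_guard(tweet, process_entities=True):
    """Condition before the change: only the key's presence is checked."""
    return bool(process_entities and "entities" in tweet)


def new_guard(tweet, process_entities=True):
    """Condition after the change: the key must also hold a non-empty value."""
    return bool(process_entities and "entities" in tweet and tweet["entities"])


# Made-up records; the null and empty variants are assumed problem cases.
samples = {
    "missing key": {"id": "1"},
    "null value": {"id": "2", "entities": None},
    "empty object": {"id": "3", "entities": {}},
    "populated": {"id": "4", "entities": {"hashtags": [{"tag": "x"}]}},
}

for label, record in samples.items():
    print(f"{label}: old={old_guard(record)} new={new_guard(record)}")

# With only the old guard, a null value would reach a membership test such as
# `"description" in tweet["entities"]`, which raises TypeError in Python.
try:
    "description" in samples["null value"]["entities"]
except TypeError as exc:
    print("TypeError:", exc)
```

The two guards differ only for the null and empty records: the old condition lets both through to entity processing, while the new one skips them.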
2 changes: 1 addition & 1 deletion setup.py
@@ -5,7 +5,7 @@
 
 setuptools.setup(
     name="twarc-csv",
-    version="0.7.0",
+    version="0.7.1",
     url="https://github.com/docnow/twarc-csv",
     author="Igor Brigadir",
     author_email="[email protected]",
1 change: 1 addition & 0 deletions test-data/entities_test.jsonl

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions test_twarc_csv.py
@@ -139,3 +139,7 @@ def test_many_urls():
 
 def test_verified_type():
     _process_file("verified_type")
+
+
+def test_missing_entities():
+    _process_file("entities_test")
