Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add doc on querying scraper output using DuckDB #76

Merged
merged 5 commits into from
Sep 26, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,13 +11,15 @@ jobs:
build:
steps:
# Python & dependency installation
- uses: actions/checkout@v2
- uses: actions/checkout@v4
with:
submodules: true
- name: setup Python
uses: actions/setup-python@v2
with:
python-version: '3.9'
- name: install Poetry
uses: snok/install-poetry@v1.2.1
uses: snok/install-poetry@v1.4.1
- name: set poetry config path
run: poetry config virtualenvs.path ~/.virtualenvs
- name: install dependencies
Expand Down
1 change: 1 addition & 0 deletions .python-version
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
3.9
4 changes: 4 additions & 0 deletions docs/contributing/scrapers.md
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,10 @@ directory and inspect the JSON files. If you're trying to make a small
fix this is often sufficient, you can confirm that the scraped data
looks correct and move on.

Please see our document on [Querying Scraper Output Data](../data/query-scraper-output-data.md)
for tools you can use to investigate data issues across a set of many
scraped data output files.


!!! note
It is of course possible that the scrape fails. If so, there's a good
Expand Down
233 changes: 233 additions & 0 deletions docs/data/query-scraper-output-data.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,233 @@
# How to query and examine data file output created by scrapers

This document is aimed to help with scraper development and data debugging.

Open States scrapers typically produce a set of output data files in the JSON format. Each file represents a primary
entity that the scraper has produced, such as a Bill, an Event or a Vote Event. When working on scraper code, it can
be really helpful to examine this output for data quality issues.

## Simple examination

The JSON format is relatively easy to view in any text editor or Integrated Development Environment (IDE). This makes
it easy to examine a specific output file.

And you can take it a step further by using a tool like [jq](https://jqlang.github.io/jq/) to query the JSON document
so that you only see the attributes that are relevant to you. For example:

```shell
cat vote_event_ff4ae3ef-656f-11ef-8b4c-732c295f582c.json | jq .counts
```

## Querying a full set of scraper output files

But what if you want to check data quality issues that might be distributed across the output, which may consist of
thousands of data points? You can put together shell scripts/commands that iterate through and find certian files,
then parse them with `jq`. But this can be tricky, especially if you are not familiar with advanced shell scripting.

The good news is that, if you are familiar with SQL, there is a way to treat a set of scraper output files like a
database and use SQL to query that database. A tool called [DuckDB](https://duckdb.org/) makes this possible.

Let's assume I've run one or more scrapers for Delaware. I open my terminal and look at the scraper output directory,
`_data/de`. There are over 3000 JSON files in that folder, so it would be painful to spot check lots of files:

```
jesse@work:~/repo/openstates/openstates-scrapers/scrapers/_data/de$ ls -alh | grep ".json" | wc -l
3760
```

However, installing the DuckDB tool allows me to use it inside that folder to query the Bills data:

```
jesse@work:~/repo/openstates/openstates-scrapers/scrapers/_data/de$ duckdb
v0.10.2 1601d94f94
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D SELECT * FROM read_json('bill_*.json');
```

Running `duckdb` in this way opens a "transient" database, meaning that anything you do will be gone once you close
the `duckdb` terminal (by using the `ctrl+d` keystroke). If you want to save anything you create in your session,
instead open it with a file that will save those Views/Tables: `duckdb scraper_output.db`

You can see the data schema that DuckDB has identified from the JSON files by using `DESCRIBE`:

```
D DESCRIBE SELECT * FROM read_json('bill_*.json');
┌─────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ legislative_session │ VARCHAR │ YES │ │ │ │
│ identifier │ VARCHAR │ YES │ │ │ │
│ title │ VARCHAR │ YES │ │ │ │
│ from_organization │ VARCHAR │ YES │ │ │ │
│ classification │ VARCHAR[] │ YES │ │ │ │
│ subject │ JSON[] │ YES │ │ │ │
│ abstracts │ STRUCT(note VARCHAR, abstract VARCHAR)[] │ YES │ │ │ │
│ other_titles │ STRUCT(note VARCHAR, title VARCHAR)[] │ YES │ │ │ │
│ other_identifiers │ JSON[] │ YES │ │ │ │
│ actions │ STRUCT(description VARCHAR, date DATE, organization_id VARCHAR, classification VARCHAR[], … │ YES │ │ │ │
│ sponsorships │ STRUCT("name" VARCHAR, classification VARCHAR, entity_type VARCHAR, "primary" BOOLEAN, per… │ YES │ │ │ │
│ related_bills │ JSON[] │ YES │ │ │ │
│ versions │ STRUCT(note VARCHAR, links STRUCT(url VARCHAR, media_type VARCHAR)[], date VARCHAR, classi… │ YES │ │ │ │
│ documents │ STRUCT(note VARCHAR, links STRUCT(url VARCHAR, media_type VARCHAR)[], date VARCHAR, classi… │ YES │ │ │ │
│ citations │ STRUCT("publication" VARCHAR, citation VARCHAR, citation_type VARCHAR, effective JSON, exp… │ YES │ │ │ │
│ sources │ STRUCT(url VARCHAR, note VARCHAR)[] │ YES │ │ │ │
│ extras │ JSON │ YES │ │ │ │
│ _id │ UUID │ YES │ │ │ │
├─────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┤
│ 18 rows 6 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
```

You'll notice that related entities like Bill Actions, Bill Sponsorships, etc... are embedded in the data as `STRUCT`
alexobaseki marked this conversation as resolved.
Show resolved Hide resolved
data: `actions`, `sponsorships`, etc.. This is typically structured as a list of `STRUCT` data, which is like a dict
or object (`STRUCT(property_a VARCHAR, property_b VARCHAR)[]`). We can drill down into those.

### Use views to drill into the data

To make this data structure simpler to reason about, we can use Views. Here's a set of recommended views you can try.
Please note that these views are used in many of the suggested queries below. In order to execute a query that uses
a view, you need to create the view first!

```sql
-- Bills
CREATE VIEW bills AS
SELECT *
FROM read_json('bill_*.json');
CREATE VIEW bill_actions AS
SELECT legislative_session, identifier AS bill_identifier, _id AS bill_id, unnest(actions, recursive := true)
FROM bills;
CREATE VIEW bill_action_classifications AS
SELECT legislative_session, bill_identifier, bill_id, unnest(classification) AS classification
FROM bill_actions;
CREATE VIEW bill_sponsorships AS
SELECT legislative_session, identifier AS bill_identifier, _id AS bill_id, unnest(sponsorships, recursive := true)
FROM bills;
CREATE VIEW bill_versions AS
SELECT legislative_session, identifier AS bill_identifier, _id AS bill_id, unnest(versions, recursive := true)
FROM bills;
CREATE VIEW bill_version_links AS
SELECT legislative_session, bill_identifier, bill_id, unnest(links, recursive := true)
FROM bill_versions;

-- Vote Events
CREATE VIEW vote_events AS
SELECT *
FROM read_json('vote_event_*.json');
CREATE VIEW vote_event_votes AS
SELECT legislative_session,
bill_identifier,
identifier AS vote_identifier,
_id AS vote_id,
unnest(votes, recursive := true)
FROM vote_events;
CREATE VIEW vote_event_counts AS
SELECT legislative_session,
bill_identifier,
identifier AS vote_identifier,
_id AS vote_id,
unnest(counts, recursive := true)
FROM vote_events;
```

To view Views in your DuckDB database: `SHOW TABLES;`

### Useful queries

These queries will assume you've created the Views listed above! Execute the Views queries above before trying these!

General utility: describe the shape of a given dataset:

```sql
DESCRIBE SELECT * FROM bills;
DESCRIBE SELECT * FROM vote_events;
```

### Bills

#### Bills that are missing a property that is a list (abstracts, actions, sponsorships, etc.)

```sql
SELECT (len(abstracts) > 0) AS has_abstract, COUNT(*)
FROM bills
GROUP BY 1;
SELECT (len(sponsorships) > 0) AS has_sponsorship, COUNT(*)
FROM bills
GROUP BY 1;
SELECT (len(related_bills) > 0) AS has_related_bills, COUNT(*)
FROM bills
GROUP BY 1;
```

#### Distribution of bill classifications

```sql
SELECT classification, COUNT(*)
FROM bills
GROUP BY classification;
```

#### Distribution of bill action classifications

```sql
SELECT classification, COUNT(*)
FROM bill_action_classifications

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we access bill_action_classifications? The same question for bill_actions, bill_version_links.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are all Views that can be created with the queries listed in the "Use views to drill down..." section of the document. I just added a commit that tries to emphasize the need to create those views before using various queries.

Do you think that helps enough to explain? or is there something else that should be added to make this more clear?

GROUP BY 1;
```

#### Bill actions that are unclassified

```sql
SELECT (classification = []) AS action_is_unclassified, COUNT(*)
FROM bill_actions
GROUP BY 1;
```

#### Bill version media type distribution

```sql
SELECT media_type, COUNT(*)
FROM bill_version_links
GROUP BY 1;
```

### Vote Events

#### Vote Events over time (by month)

```sql
SELECT date_trunc('month', start_date), COUNT(*)
FROM vote_events
GROUP BY 1,
ORDER BY 1;
```

#### Vote events with zero or unexpected number of votes

```sql
SELECT len(votes) AS num_votes, COUNT(*) AS events_with_this_number
FROM vote_events
GROUP BY 1
ORDER BY 1 DESC;
```

or check via count values for vote events where all counts are zero

```sql
SELECT vote_id, SUM(value)
FROM vote_event_counts
GROUP BY 1
HAVING SUM(value) = 0;
```

#### Vote event counts check distribution for anomalies

```sql
SELECT
option, value, COUNT (*)
FROM vote_event_counts
GROUP BY 1, 2
ORDER BY 1, 2 DESC;
```
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ nav:
- 'contributing/documentation.md'
- Data:
- 'data/index.md'
- 'data/query-scraper-output-data.md'
- 'data/categorization.md'
- 'data/session-naming.md'
- Enhancement Proposals:
Expand Down
Loading
Loading