Retire Common Crawl module & DAGs (#870)
* Retired the commoncrawl module and the commoncrawl_utils test

* Updated DAGs.md and test_dag_parsing.py as suggested in #861

* Remove ETL test module, additional documentation cleanup

* Delete more unused test files

* Remove unused testing buckets

* Update README.md

Co-authored-by: Olga Bulat <[email protected]>
Co-authored-by: Meet Parekh <[email protected]>
4 people authored Nov 22, 2022
1 parent a6f4eab commit dad3cb4
Showing 17 changed files with 26 additions and 89 deletions.
10 changes: 0 additions & 10 deletions DAGs.md
@@ -14,23 +14,13 @@ The DAGs are shown in two forms:

The following are DAGs grouped by their primary tag:

1. [Commoncrawl](#commoncrawl)
1. [Data Refresh](#data_refresh)
1. [Database](#database)
1. [Maintenance](#maintenance)
1. [Oauth](#oauth)
1. [Provider](#provider)
1. [Provider Reingestion](#provider-reingestion)

## Commoncrawl

| DAG ID | Schedule Interval |
| --- | --- |
| `commoncrawl_etl_workflow` | `0 0 * * 1` |
| `sync_commoncrawl_workflow` | `0 16 15 * *` |



## Data Refresh

| DAG ID | Schedule Interval |
49 changes: 25 additions & 24 deletions README.md
@@ -10,12 +10,25 @@ This repository contains the methods used to identify over 1.4 billion Creative
Commons licensed works. The challenge is that these works are dispersed
throughout the web and identifying them requires a combination of techniques.

Two approaches are currently in use:
Currently, we only pull data from APIs which serve Creative Commons licensed media.
In the past, we have also used web crawl data as a source.

1. Web crawl data
2. Application Programming Interfaces (API Data)
## API Data

[Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
various API ETL jobs which pull and process data from a number of open APIs on
the internet.

## Web Crawl Data
### API Workflows

To view more information about all the available workflows (DAGs) within the project,
see [DAGs.md](DAGs.md).

See each provider API script's notes in their respective [handbook][ov-handbook] entry.

[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
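
To give a flavour of what such a workflow looks like, here is a minimal, hypothetical DAG sketch; the DAG ID, schedule, and callable are illustrative placeholders, not any of the real provider workflows under `openverse_catalog/dags/providers/`.

```python
# Minimal sketch of a provider-style ingestion DAG (hypothetical IDs and
# callable; see openverse_catalog/dags/providers/ for the real workflows).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_provider_data():
    # Placeholder for the pull-and-process step a provider API script performs.
    print("Pulling records from the provider API...")


with DAG(
    dag_id="example_provider_workflow",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["provider"],
):
    PythonOperator(
        task_id="pull_provider_data",
        python_callable=pull_provider_data,
    )
```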

## Web Crawl Data (retired)

The Common Crawl Foundation provides an open repository of petabyte-scale web
crawl data. A new dataset is published at the end of each month comprising over
@@ -31,10 +31,10 @@ The data is available in three file formats:
For more information about these formats, please see the
[Common Crawl documentation][ccrawl_doc].

Openverse Catalog uses AWS Data Pipeline service to automatically create an Amazon EMR
cluster of 100 c4.8xlarge instances that will parse the WAT archives to identify
Openverse Catalog used AWS Data Pipeline service to automatically create an Amazon EMR
cluster of 100 c4.8xlarge instances that parsed the WAT archives to identify
all domains that link to creativecommons.org. Due to the volume of data, Apache
Spark is used to streamline the processing. The output of this methodology is a
Spark was also used to streamline the processing. The output of this methodology was a
series of parquet files that contain:

- the domains and their respective content path and query string (i.e. the exact
@@ -45,26 +58,13 @@ series of parquet files that contain:
- the location of the webpage in the WARC file so that the page contents can be
found.

The steps above are performed in [`ExtractCCLinks.py`][ex_cc_links].
The steps above were performed in [`ExtractCCLinks.py`][ex_cc_links].

This method was retired in 2021.

[ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
[ex_cc_links]: archive/ExtractCCLinks.py
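
For illustration only, a PySpark job of this shape might look like the sketch below. This is not the retired `ExtractCCLinks.py` implementation; the S3 paths are placeholders and the WAT field handling is an assumption based on the published WAT JSON layout.

```python
# Illustrative sketch (not the retired ExtractCCLinks.py): scan WAT metadata
# records for outbound links to creativecommons.org and save matches as parquet.
import json
from urllib.parse import urlparse

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("cc_links_sketch").getOrCreate()

# Placeholder input path; real runs read the monthly WAT archives from the
# Common Crawl S3 bucket.
wat_lines = spark.sparkContext.textFile("s3://example-bucket/wat/*.warc.wat.gz")


def cc_links(line):
    """Yield (page, cc_link) rows for WAT records linking to creativecommons.org."""
    try:
        record = json.loads(line)
    except ValueError:
        # WAT files interleave WARC headers with JSON payloads; skip non-JSON lines.
        return
    envelope = record.get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    for link in links:
        href = link.get("url", "")
        if page_url and urlparse(href).netloc.endswith("creativecommons.org"):
            yield Row(page=page_url, cc_link=href)


spark.createDataFrame(wat_lines.flatMap(cc_links)).write.parquet(
    "s3://example-bucket/output/cc_links/"
)
```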

## API Data

[Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
various API ETL jobs which pull and process data from a number of open APIs on
the internet.

### API Workflows

To view more information about all the available workflows (DAGs) within the project,
see [DAGs.md](DAGs.md).

See each provider API script's notes in their respective [handbook][ov-handbook] entry.

[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/

## Development setup for Airflow and API puller scripts

There are a number of scripts in the directory
@@ -224,12 +224,13 @@ openverse-catalog
├── openverse_catalog/ # Primary code directory
│ ├── dags/ # DAGs & DAG support code
│ │ ├── common/ # - Shared modules used across DAGs
│ │ ├── commoncrawl/ # - DAGs & scripts for commoncrawl parsing
│ │ ├── data_refresh/ # - DAGs & code related to the data refresh process
│ │ ├── database/ # - DAGs related to database actions (matview refresh, cleaning, etc.)
│ │ ├── maintenance/ # - DAGs related to airflow/infrastructure maintenance
│ │ ├── oauth2/ # - DAGs & code for Oauth2 key management
│ │ ├── providers/ # - DAGs & code for provider ingestion
│ │ │ ├── provider_api_scripts/ # - API access code specific to providers
│ │ │ ├── provider_csv_load_scripts/ # - Schema initialization SQL definitions for SQL-based providers
│ │ │ └── *.py # - DAG definition files for providers
│ │ └── retired/ # - DAGs & code that is no longer needed but might be a useful guide for the future
│ └── templates/ # Templates for generating new provider code
2 changes: 1 addition & 1 deletion docker-compose.override.yml
@@ -27,7 +27,7 @@ services:
MINIO_ROOT_USER: ${AWS_ACCESS_KEY}
MINIO_ROOT_PASSWORD: ${AWS_SECRET_KEY}
# Comma separated list of buckets to create on startup
BUCKETS_TO_CREATE: ${OPENVERSE_BUCKET},openverse-airflow-logs,commonsmapper-v2,commonsmapper
BUCKETS_TO_CREATE: ${OPENVERSE_BUCKET},openverse-airflow-logs
# Create empty buckets on every container startup
# Note: $0 is included in the exec because "/bin/bash -c" swallows the first
# argument, so it must be re-added at the beginning of the exec call
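
As a side note on the comment above, the `$0`-swallowing behaviour of `/bin/bash -c` is easy to demonstrate in isolation (a generic example, unrelated to the actual MinIO entrypoint):

```python
# With `bash -c`, the first argument after the command string becomes $0,
# not $1 -- hence the entrypoint re-adds $0 at the start of the exec call.
import subprocess

result = subprocess.run(
    ["bash", "-c", 'echo "0=$0 1=$1"', "first", "second"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # -> 0=first 1=second
```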
3 changes: 0 additions & 3 deletions env.template
@@ -100,9 +100,6 @@ AWS_ACCESS_KEY=test_key
AWS_SECRET_KEY=test_secret
# General bucket used for TSV->DB ingestion and logging
OPENVERSE_BUCKET=openverse-storage
# Used only for commoncrawl parsing
S3_BUCKET=not_set
COMMONCRAWL_BUCKET=not_set
# Seconds to wait before poking for availability of the data refresh pool when running a data_refresh
# DAG. Used to shorten the time for testing purposes.
DATA_REFRESH_POKE_INTERVAL=5
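
Purely as an illustration of how a poke-interval override like this can be consumed, the hypothetical sketch below wires the environment variable into an Airflow sensor; the DAG and task names and the wiring are assumptions, not the actual data refresh DAG.

```python
# Hypothetical sketch: read DATA_REFRESH_POKE_INTERVAL and use it as the
# poke interval of a sensor waiting on the data refresh pool.
import os
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor


def pool_has_capacity() -> bool:
    # Placeholder availability check for the data refresh pool.
    return True


with DAG(
    dag_id="example_data_refresh_wait",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
):
    PythonSensor(
        task_id="wait_for_data_refresh_pool",
        python_callable=pool_has_capacity,
        poke_interval=int(os.getenv("DATA_REFRESH_POKE_INTERVAL", "60")),
    )
```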
1 change: 0 additions & 1 deletion openverse_catalog/dags/.airflowignore
@@ -1,5 +1,4 @@
# Ignore all non-DAG files
common/
commoncrawl/commoncrawl_scripts
providers/provider_api_scripts
retired
Empty file removed tests/dags/common/etl/__init__.py
40 changes: 0 additions & 40 deletions tests/dags/common/etl/test_commoncrawl_utils.py

This file was deleted.

2 changes: 0 additions & 2 deletions tests/dags/common/loader/test_resources/new_columns_crawl.tsv

This file was deleted.

2 changes: 0 additions & 2 deletions tests/dags/common/loader/test_resources/new_columns_papis.tsv

This file was deleted.

2 changes: 0 additions & 2 deletions tests/dags/common/loader/test_resources/old_columns_crawl.tsv

This file was deleted.

