Retire Common Crawl module & DAGs (#870)
* Retired the commoncrawl module and the commoncrawl_utils test

* Updated DAGs.md and test_dag_parsing.py as suggested in #861

* Remove ETL test module, additional documentation cleanup

* Delete more unused test files

* Remove unused testing buckets

* Update README.md

Co-authored-by: Olga Bulat <[email protected]>
Co-authored-by: Meet Parekh <[email protected]>
4 people authored Nov 22, 2022
1 parent a6f4eab commit dad3cb4
Showing 17 changed files with 26 additions and 89 deletions.
10 changes: 0 additions & 10 deletions DAGs.md
@@ -14,23 +14,13 @@ The DAGs are shown in two forms:

The following are DAGs grouped by their primary tag:

1. [Commoncrawl](#commoncrawl)
1. [Data Refresh](#data_refresh)
1. [Database](#database)
1. [Maintenance](#maintenance)
1. [Oauth](#oauth)
1. [Provider](#provider)
1. [Provider Reingestion](#provider-reingestion)

## Commoncrawl

| DAG ID | Schedule Interval |
| --- | --- |
| `commoncrawl_etl_workflow` | `0 0 * * 1` |
| `sync_commoncrawl_workflow` | `0 16 15 * *` |



## Data Refresh

| DAG ID | Schedule Interval |
49 changes: 25 additions & 24 deletions README.md
@@ -10,12 +10,25 @@ This repository contains the methods used to identify over 1.4 billion Creative
Commons licensed works. The challenge is that these works are dispersed
throughout the web and identifying them requires a combination of techniques.

Two approaches are currently in use:
Currently, we only pull data from APIs which serve Creative Commons licensed media.
In the past, we have also used web crawl data as a source.

1. Web crawl data
2. Application Programming Interfaces (API Data)
## API Data

[Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
various API ETL jobs which pull and process data from a number of open APIs on
the internet.

## Web Crawl Data
### API Workflows

To view more information about all the available workflows (DAGs) within the project,
see [DAGs.md](DAGs.md).

See each provider API script's notes in their respective [handbook][ov-handbook] entry.

[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/
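
To give a flavour of what such a workflow looks like, here is a minimal, hypothetical DAG sketch; the DAG ID, schedule, and callable are illustrative placeholders, not any of the real provider workflows under `openverse_catalog/dags/providers/`.

```python
# Minimal sketch of a provider-style ingestion DAG (hypothetical IDs and
# callable; see openverse_catalog/dags/providers/ for the real workflows).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def pull_provider_data():
    # Placeholder for the pull-and-process step a provider API script performs.
    print("Pulling records from the provider API...")


with DAG(
    dag_id="example_provider_workflow",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    tags=["provider"],
):
    PythonOperator(
        task_id="pull_provider_data",
        python_callable=pull_provider_data,
    )
```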

## Web Crawl Data (retired)

The Common Crawl Foundation provides an open repository of petabyte-scale web
crawl data. A new dataset is published at the end of each month comprising over
@@ -31,10 +31,10 @@ The data is available in three file formats:
For more information about these formats, please see the
[Common Crawl documentation][ccrawl_doc].

Openverse Catalog uses AWS Data Pipeline service to automatically create an Amazon EMR
cluster of 100 c4.8xlarge instances that will parse the WAT archives to identify
Openverse Catalog used AWS Data Pipeline service to automatically create an Amazon EMR
cluster of 100 c4.8xlarge instances that parsed the WAT archives to identify
all domains that link to creativecommons.org. Due to the volume of data, Apache
Spark is used to streamline the processing. The output of this methodology is a
Spark was also used to streamline the processing. The output of this methodology was a
series of parquet files that contain:

- the domains and their respective content path and query string (i.e. the exact
@@ -45,26 +58,13 @@ series of parquet files that contain:
- the location of the webpage in the WARC file so that the page contents can be
found.

The steps above are performed in [`ExtractCCLinks.py`][ex_cc_links].
The steps above were performed in [`ExtractCCLinks.py`][ex_cc_links].

This method was retired in 2021.

[ccrawl_doc]: https://commoncrawl.org/the-data/get-started/
[ex_cc_links]: archive/ExtractCCLinks.py
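
For illustration only, a PySpark job of this shape might look like the sketch below. This is not the retired `ExtractCCLinks.py` implementation; the S3 paths are placeholders and the WAT field handling is an assumption based on the published WAT JSON layout.

```python
# Illustrative sketch (not the retired ExtractCCLinks.py): scan WAT metadata
# records for outbound links to creativecommons.org and save matches as parquet.
import json
from urllib.parse import urlparse

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("cc_links_sketch").getOrCreate()

# Placeholder input path; real runs read the monthly WAT archives from the
# Common Crawl S3 bucket.
wat_lines = spark.sparkContext.textFile("s3://example-bucket/wat/*.warc.wat.gz")


def cc_links(line):
    """Yield (page, cc_link) rows for WAT records linking to creativecommons.org."""
    try:
        record = json.loads(line)
    except ValueError:
        # WAT files interleave WARC headers with JSON payloads; skip non-JSON lines.
        return
    envelope = record.get("Envelope", {})
    page_url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    for link in links:
        href = link.get("url", "")
        if page_url and urlparse(href).netloc.endswith("creativecommons.org"):
            yield Row(page=page_url, cc_link=href)


spark.createDataFrame(wat_lines.flatMap(cc_links)).write.parquet(
    "s3://example-bucket/output/cc_links/"
)
```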

## API Data

[Apache Airflow](https://airflow.apache.org/) is used to manage the workflow for
various API ETL jobs which pull and process data from a number of open APIs on
the internet.

### API Workflows

To view more information about all the available workflows (DAGs) within the project,
see [DAGs.md](DAGs.md).

See each provider API script's notes in their respective [handbook][ov-handbook] entry.

[ov-handbook]: https://make.wordpress.org/openverse/handbook/openverse-handbook/

## Development setup for Airflow and API puller scripts

There are a number of scripts in the directory
@@ -224,12 +224,13 @@ openverse-catalog
├── openverse_catalog/ # Primary code directory
│ ├── dags/ # DAGs & DAG support code
│ │ ├── common/ # - Shared modules used across DAGs
│ │ ├── commoncrawl/ # - DAGs & scripts for commoncrawl parsing
│ │ ├── data_refresh/ # - DAGs & code related to the data refresh process
│ │ ├── database/ # - DAGs related to database actions (matview refresh, cleaning, etc.)
│ │ ├── maintenance/ # - DAGs related to airflow/infrastructure maintenance
│ │ ├── oauth2/ # - DAGs & code for Oauth2 key management
│ │ ├── providers/ # - DAGs & code for provider ingestion
│ │ │ ├── provider_api_scripts/ # - API access code specific to providers
│ │ │ ├── provider_csv_load_scripts/ # - Schema initialization SQL definitions for SQL-based providers
│ │ │ └── *.py # - DAG definition files for providers
│ │ └── retired/ # - DAGs & code that is no longer needed but might be a useful guide for the future
│ └── templates/ # Templates for generating new provider code
2 changes: 1 addition & 1 deletion docker-compose.override.yml
@@ -27,7 +27,7 @@ services:
MINIO_ROOT_USER: ${AWS_ACCESS_KEY}
MINIO_ROOT_PASSWORD: ${AWS_SECRET_KEY}
# Comma separated list of buckets to create on startup
BUCKETS_TO_CREATE: ${OPENVERSE_BUCKET},openverse-airflow-logs,commonsmapper-v2,commonsmapper
BUCKETS_TO_CREATE: ${OPENVERSE_BUCKET},openverse-airflow-logs
# Create empty buckets on every container startup
# Note: $0 is included in the exec because "/bin/bash -c" swallows the first
# argument, so it must be re-added at the beginning of the exec call
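
As a side note on the comment above, the `$0`-swallowing behaviour of `/bin/bash -c` is easy to demonstrate in isolation (a generic example, unrelated to the actual MinIO entrypoint):

```python
# With `bash -c`, the first argument after the command string becomes $0,
# not $1 -- hence the entrypoint re-adds $0 at the start of the exec call.
import subprocess

result = subprocess.run(
    ["bash", "-c", 'echo "0=$0 1=$1"', "first", "second"],
    capture_output=True,
    text=True,
)
print(result.stdout.strip())  # -> 0=first 1=second
```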
3 changes: 0 additions & 3 deletions env.template
@@ -100,9 +100,6 @@ AWS_ACCESS_KEY=test_key
AWS_SECRET_KEY=test_secret
# General bucket used for TSV->DB ingestion and logging
OPENVERSE_BUCKET=openverse-storage
# Used only for commoncrawl parsing
S3_BUCKET=not_set
COMMONCRAWL_BUCKET=not_set
# Seconds to wait before poking for availability of the data refresh pool when running a data_refresh
# DAG. Used to shorten the time for testing purposes.
DATA_REFRESH_POKE_INTERVAL=5
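
Purely as an illustration of how a poke-interval override like this can be consumed, the hypothetical sketch below wires the environment variable into an Airflow sensor; the DAG and task names and the wiring are assumptions, not the actual data refresh DAG.

```python
# Hypothetical sketch: read DATA_REFRESH_POKE_INTERVAL and use it as the
# poke interval of a sensor waiting on the data refresh pool.
import os
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor


def pool_has_capacity() -> bool:
    # Placeholder availability check for the data refresh pool.
    return True


with DAG(
    dag_id="example_data_refresh_wait",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
):
    PythonSensor(
        task_id="wait_for_data_refresh_pool",
        python_callable=pool_has_capacity,
        poke_interval=int(os.getenv("DATA_REFRESH_POKE_INTERVAL", "60")),
    )
```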
1 change: 0 additions & 1 deletion openverse_catalog/dags/.airflowignore
@@ -1,5 +1,4 @@
# Ignore all non-DAG files
common/
commoncrawl/commoncrawl_scripts
providers/provider_api_scripts
retired
Empty file removed tests/dags/common/etl/__init__.py
40 changes: 0 additions & 40 deletions tests/dags/common/etl/test_commoncrawl_utils.py

This file was deleted.

2 changes: 0 additions & 2 deletions tests/dags/common/loader/test_resources/new_columns_crawl.tsv

This file was deleted.

2 changes: 0 additions & 2 deletions tests/dags/common/loader/test_resources/new_columns_papis.tsv

This file was deleted.

2 changes: 0 additions & 2 deletions tests/dags/common/loader/test_resources/old_columns_crawl.tsv

This file was deleted.

