gbif-bird-dataset

Converting eBird dataset to partitioned Parquet for Pedram Navid, not that he asked.

Usage

Downloading the source data

Download the data from here.

Converting the data

The data is a large zip archive containing a CSV file. We'll extract the zip on the fly, convert the rows to JSON records, and pipe them to Skippr via a little Python script, courtesy of ChatGPT.
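
The repo's parse_to_json.py isn't reproduced here, but a minimal sketch of the idea might look like the following. The CSV member name, the comma delimiter, and the date handling are all assumptions, not the script's confirmed behavior:

# Hypothetical sketch of parse_to_json.py -- not the repo's actual script.
import csv
import io
import json
import sys
import zipfile

def main(zip_path: str) -> None:
    with zipfile.ZipFile(zip_path) as zf:
        # Assume the archive holds a single CSV member (e.g. eod_2021.csv).
        member = next(name for name in zf.namelist() if name.endswith(".csv"))
        with zf.open(member) as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"))
            for row in reader:
                # Derive the date field Skippr will partition on downstream;
                # assumes year/month/day are always present and numeric.
                row["date"] = f"{row['year']}-{int(row['month']):02d}-{int(row['day']):02d}"
                sys.stdout.write(json.dumps(row) + "\n")

if __name__ == "__main__":
    main(sys.argv[1])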

Skippr will discover the schema, partition the data according to our config and convert to Parquet on our local machine.

See the Skippr Docs for more info, including how to sync these files to AWS S3 and Athena and how the schemas are propagated to the Glue Data Catalog.

Run the following command to convert the data to partitioned Parquet files:

NOTES:

  • I've extracted the zip file to ~/Downloads/2021-eBird-dwca-1.0/eod_2021.csv on my machine.
  • You'll need to replace INSERT_API_TOKEN_HERE with your Skippr Metadata API token.
  • You may need to edit YOUR_AWS_PROFILE_NAME to match your AWS CLI profile name. If you use the default profile, you can remove this line.
python3 parse_to_json.py ~/Downloads/2021-eBird-dwca-1.0.zip | docker run -i \
-e AWS_PROFILE=YOUR_AWS_PROFILE_NAME \
-e DATA_SOURCE_PLUGIN_NAME=stdin \
-e DATA_SOURCE_BATCH_SIZE_BYTES=5000000 \
-e DATA_SOURCE_BATCH_SIZE_SECONDS=30 \
-e TRANSFORM_BATCH_PARTITION_FIELDS=country \
-e TRANSFORM_BATCH_TIME_FIELDS=date \
-e TRANSFORM_BATCH_TIME_UNIT=year \
-e PIPELINE_NAME=ebirds \
-e WORKSPACE_NAME=dev \
-e SKIPPR_API_TOKEN=INSERT_API_TOKEN_HERE \
-e DATA_DIR=./data \
-v `pwd`/data:/data \
skippr/skipprd:stable
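
Once the pipeline has run, a quick sanity check of the local output with pyarrow might look like this. It assumes Skippr writes Hive-style partition directories under ./data, which is an assumption about the output layout:

# Inspect the partitioned Parquet output (layout is an assumption).
import pyarrow.dataset as ds

dataset = ds.dataset("data", format="parquet", partitioning="hive")
print(dataset.schema)   # the schema Skippr discovered
print(dataset.head(5))  # a few rows drawn from the partitions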

Configuration Notes

Running head -n 2 on the CSV file reveals the columns listed below.

It's apparent that the data contains year, month, and day columns. So we'll combine those values into an additional field called date (as in the parsing script sketched above) and configure TRANSFORM_BATCH_TIME_FIELDS=date and TRANSFORM_BATCH_TIME_UNIT=year to partition the data by year.

I also imagine it will be useful to partition the data by country, so we'll use TRANSFORM_BATCH_PARTITION_FIELDS=country to do that (see the example query after the column list).

basisofrecord
institutioncode
collectioncode
catalognumber
occurrenceid
recordedby
year
month
day
publishingcountry
country
stateprovince
county
decimallatitude
decimallongitude
locality
kingdom
phylum
class
order
family
genus
specificepithet
scientificname
vernacularname
taxonremarks
taxonconceptid
individualcount
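
With the country and year partitions in place, a filtered read only scans the matching files. A sketch with pyarrow, where the Hive-style layout and the sample country value are assumptions:

# Partition pruning: only the country=Australia files are read.
import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset("data", format="parquet", partitioning="hive")
table = dataset.to_table(filter=pc.field("country") == "Australia")
print(table.num_rows)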
