- GDELT public dataset - A geo-spatial-political event dataset. Ingest the first 4 million lines (1979-1984), which come to a CSV file of about 1.1GB. The entire dataset is more than 250 million rows / 250GB. See the discussion in the README about different modelling options.
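- To ingest:
    // Assumptions: sqlContext is an existing SQLContext, pathToGdeltCsv is a
    // placeholder for the CSV location, and "filodb.spark" is the sink format.
    val csvDF = sqlContext.read.format("com.databricks.spark.csv").
      option("header", "true").option("inferSchema", "true").
      load(pathToGdeltCsv)
    import org.apache.spark.sql.SaveMode
    csvDF.write.format("filodb.spark").
      option("dataset", "gdelt").
      option("row_keys", "GLOBALEVENTID").
      option("partition_keys", "MonthYear").
      mode(SaveMode.Overwrite).save()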
- NYC Taxi Trip and Fare Info - really interesting geospatial-temporal public time series data. Contains New York City taxi transactions, and is an example of how to handle time series / IoT with many entities. Each part of the trip data is about 2.4GB (~15 million rows), and there are 12 parts.
- Counting the ingested rows should result in 14776615 records for trip_data_1.csv (Part 1).
- Partitioning by string prefix of medallion gives a pretty even distribution, into 676 shards, of all taxi transactions. Note that even with this level of sharding, reading data for one taxi/medallion for a given time range is still pretty fast.
- Putting pickup_datetime first in the row key allows range queries by time.
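    // Assumptions: sqlContext is an existing SQLContext, pathToTaxiCsv is a
    // placeholder for the CSV location, and "filodb.spark" is the sink format.
    val taxiDF = sqlContext.read.format("com.databricks.spark.csv").
      option("header", "true").option("inferSchema", "true").
      load(pathToTaxiCsv)
    import org.apache.spark.sql.SaveMode
    taxiDF.write.format("filodb.spark").
      option("dataset", "nyc_taxi").
      option("row_keys", "pickup_datetime,medallion,hack_license").
      option("partition_keys", ":stringPrefix medallion 2").
      mode(SaveMode.Overwrite).save()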
There is a Spark Notebook to analyze the NYC Taxi dataset.
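
To make the NYC Taxi key choices above more concrete, here is a minimal, plain-Scala sketch of the two ideas: sharding by a two-character medallion prefix, and sorting rows within a shard by a key that starts with pickup_datetime so a time range becomes a contiguous slice. It is only an illustration under stated assumptions, not how the storage engine implements these options. The names `TaxiKeySketch`, `Trip`, `shardOf`, `rowKey`, `index`, and `timeRange` are hypothetical, and timestamps are assumed to be strings in `yyyy-MM-dd HH:mm:ss` form, which sort lexicographically in time order.

```scala
import scala.collection.immutable.TreeMap

object TaxiKeySketch {
  // Hypothetical, simplified trip record; the real CSV has many more columns.
  final case class Trip(pickupDatetime: String, medallion: String, hackLicense: String)

  // Partition key: the first `len` characters of the medallion, mirroring the
  // idea behind ":stringPrefix medallion 2" (an illustration only).
  def shardOf(medallion: String, len: Int = 2): String = medallion.take(len)

  // Row key with pickup_datetime first, then medallion, then hack_license.
  type RowKey = (String, String, String)
  def rowKey(t: Trip): RowKey = (t.pickupDatetime, t.medallion, t.hackLicense)

  // Group trips by shard and keep each shard sorted by row key.
  def index(trips: Seq[Trip]): Map[String, TreeMap[RowKey, Trip]] =
    trips.groupBy(t => shardOf(t.medallion)).map { case (shard, ts) =>
      shard -> TreeMap(ts.map(t => rowKey(t) -> t): _*)
    }

  // Time-range query inside one shard: from inclusive, until exclusive.
  // Because the key starts with the timestamp, this is a contiguous range.
  def timeRange(shard: TreeMap[RowKey, Trip], from: String, until: String): Seq[Trip] =
    shard.range((from, "", ""), (until, "", "")).values.toSeq
}
```

With this layout, `TaxiKeySketch.timeRange(idx("AB"), "2013-01-01 00:00:00", "2013-01-02 00:00:00")` returns every trip in shard `AB` for that day in pickup order, where `idx` is the result of `TaxiKeySketch.index`. Restricting the result to a single medallion would still need a filter, since several medallions share a prefix.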
- Weather Datasets and APIs
- Also see the KillrWeather data sets
- Airline on-time performance data - nearly 120 million records, 16GB uncompressed
- PCoE NASA Datasets - a series of time series datasets used for modeling and prediction of failure scenarios
- Design and Implementation of Modern Column-Oriented DBs
- Positional Update Handling in Column Stores
- SnappyData Architecture - a hybrid in-memory, Spark-integrated row/column store