danajsalk/Data-Modeling-with-Postgres


Summary


This project creates a Postgres database with tables designed to optimize queries for song play analysis. It builds the database schema and an ETL pipeline for that analysis. You can test the database and ETL pipeline by running the queries provided in the etl.py script and comparing your results with the expected results.

Source Data


The first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and that song's artist. The files are partitioned by the first three letters of each song's track ID.

The second dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from an imaginary music streaming app, driven by configuration settings. The log files are partitioned by year and month.

In short, song_data contains the details of individual songs, while the log files record users' songplay events.
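For illustration, here is a minimal sketch of how these partitioned JSON files might be collected and read. The data/song_data and data/log_data directory names and the use of pandas are assumptions based on the description above, not details confirmed by this repository:

```python
import glob
import os

import pandas as pd


def get_files(filepath):
    """Recursively collect every .json file under a directory."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files


# Song files are nested by the first three letters of the track ID,
# e.g. data/song_data/A/B/C/<track_id>.json; log files are nested by
# year and month, e.g. data/log_data/<year>/<month>/<date>-events.json.
song_files = get_files("data/song_data")
log_files = get_files("data/log_data")

# Files are assumed to be newline-delimited JSON, readable with lines=True.
song_df = pd.read_json(song_files[0], lines=True)
log_df = pd.read_json(log_files[0], lines=True)
```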

Database Schema


Using the song and event datasets, a star schema is created and optimized for queries on song play analysis. The fact and dimension tables for this project are:

Fact Table:

  • songplays - records in event data associated with song plays i.e. records with page NextSong
    • columns: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables:

  • users - users in the app
    • columns: user_id, first_name, last_name, gender, level
  • songs - songs in music database
    • columns: song_id, title, artist_id, year, duration
  • artists - artists in music database
    • columns: artist_id, name, location, latitude, longitude
  • time - timestamps of records in songplays broken down into specific units
    • columns: start_time, hour, day, week, month, year, weekday
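
To illustrate, the fact table might be declared in sql_queries.py along the following lines. The column types and constraints shown here are assumptions for the sketch, not the repository's actual definitions:

```python
# Possible songplays definition for sql_queries.py (types are illustrative).
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

songplay_table_drop = "DROP TABLE IF EXISTS songplays;"
```

The dimension tables would follow the same pattern, each keyed on the id column listed above (user_id, song_id, artist_id, start_time).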

How to Run

Below are steps you can follow to run the project:
Create Tables

  • Write CREATE statements in sql_queries.py to create each table.
  • Write DROP statements in sql_queries.py to drop each table if it exists.
  • Run create_tables.py to create your database and tables.
  • Run test.ipynb to confirm the creation of your tables with the correct columns. Make sure to click "Restart kernel" to close the connection to the database after running this notebook.
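
For reference, here is a rough sketch of what create_tables.py typically does, given the steps above. The connection parameters and the names imported from sql_queries.py are assumptions, not taken from this repository:

```python
import psycopg2

# Assumed to exist in sql_queries.py as lists of SQL strings.
from sql_queries import create_table_queries, drop_table_queries


def main():
    # Placeholder connection settings; adjust to your local Postgres setup.
    conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=student password=student")
    conn.set_session(autocommit=True)
    cur = conn.cursor()

    # Recreate the sparkifydb database, then reconnect to it.
    cur.execute("DROP DATABASE IF EXISTS sparkifydb")
    cur.execute("CREATE DATABASE sparkifydb WITH ENCODING 'utf8' TEMPLATE template0")
    conn.close()

    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    cur = conn.cursor()

    # Drop any existing tables, then create them fresh.
    for query in drop_table_queries:
        cur.execute(query)
    for query in create_table_queries:
        cur.execute(query)
    conn.commit()
    conn.close()


if __name__ == "__main__":
    main()
```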

The general procedure of this ETL is to copy data from one or more sources: here, individual raw JSON files are read and processed from song_data and log_data and loaded into the Sparkify database, where they are represented as organized, filtered tables. The analytical goal of this ETL is to make it easy to pull data from different sources and efficiently process it into useful information, such as statistics on the overall data size and shape. The schema stores the data in separate PostgreSQL tables to make clear what is stored, following a star schema optimized for queries on song play analysis.
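As a concrete illustration of one ETL step, etl.py might load a single song record roughly as follows. The insert statement, column selection, and file path are assumptions based on the schema described above, not the project's actual code:

```python
import pandas as pd
import psycopg2

# Hypothetical insert statement; the real one would live in sql_queries.py.
song_table_insert = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s);
"""


def process_song_file(cur, filepath):
    """Read one song JSON file and insert its record into the songs table."""
    df = pd.read_json(filepath, lines=True)
    song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(song_table_insert, song_data)


conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()
process_song_file(cur, "data/song_data/A/A/A/<track_id>.json")  # placeholder path
conn.commit()
conn.close()
```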

NOTE:

You will not be able to run test.ipynb, etl.ipynb, or etl.py until you have run create_tables.py at least once to create the sparkifydb database, which these other files connect to.

