Pipeline Breakdown:
- ETL Job
- Twitter API Python Library
- Steam API
- Postgres Data Warehouse
- Data Dashboard
Rust cheater profiles are collected every hour from @rusthackreport using an Airflow PythonOperator and the Twitter API Python library. Cheater Steam profiles are then collected from the Steam Web API using a custom Airflow operator. Collected data is stored in a raw S3 bucket, then transformed and stored in a staging S3 bucket. Lastly, the dim and fact data in the staging S3 bucket is loaded with the custom Airflow operators LoadDimOperator and LoadFactOperator.
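
As a rough illustration of how Steam profile URLs might be pulled out of collected tweet text, here is a minimal sketch. It assumes the tweets contain full `steamcommunity.com` links (either `/profiles/<17-digit id>` or `/id/<vanity name>`); the function name `extract_profile_urls` and the sample tweets are made up for this example and are not taken from the project code.

```python
import re

# Matches numeric profile links and vanity links on steamcommunity.com.
STEAM_PROFILE_RE = re.compile(
    r"https?://steamcommunity\.com/(?:profiles/\d{17}|id/[A-Za-z0-9_-]+)"
)


def extract_profile_urls(tweet_texts):
    """Return unique Steam profile URLs in first-seen order (hypothetical helper)."""
    seen = set()
    urls = []
    for text in tweet_texts:
        for url in STEAM_PROFILE_RE.findall(text):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Made-up sample tweets; the real tweet format may differ.
sample = [
    "Banned: https://steamcommunity.com/profiles/76561198000000001",
    "Banned: https://steamcommunity.com/id/example_user/ again",
    "Banned: https://steamcommunity.com/profiles/76561198000000001",
]
profile_urls = extract_profile_urls(sample)
```

De-duplicating at this stage keeps the later Steam Web API calls from requesting the same profile twice within one hourly run.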
- Data collected from the Twitter API is moved to a raw S3 bucket.
- Twitter data is read from the raw S3 bucket, then profile URLs are extracted and stored in a temp S3 bucket.
- Steam data collected from the Steam Web API is moved to a raw S3 bucket.
- Raw S3 Steam data undergoes transformations and data checks, then is stored in a staging S3 bucket.
- Data is transferred from the staging S3 buckets into temp tables, then into the data warehouse.
- The Data Studio dashboard can be used to gain insights about cheaters.
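
The "transformations and data checks" step between the raw and staging buckets could look something like the sketch below. The actual checks used in this pipeline are not documented here, so the two shown (required fields present, one record per `steamid`) are assumptions. `steamid` and `personaname` are fields returned by the Steam Web API player summaries endpoint, but whether the raw files use those exact names is also an assumption, and `clean_profiles` is a hypothetical helper.

```python
def clean_profiles(raw_profiles, required=("steamid", "personaname")):
    """Split raw records into staged (valid, de-duplicated) and rejected ones."""
    staged = {}
    rejected = []
    for record in raw_profiles:
        # Reject records with a missing or empty required field.
        if any(not record.get(field) for field in required):
            rejected.append(record)
            continue
        # A later record with the same steamid overwrites the earlier one.
        staged[record["steamid"]] = record
    return list(staged.values()), rejected


# Made-up sample records for illustration only.
raw = [
    {"steamid": "76561198000000001", "personaname": "123"},
    {"steamid": "76561198000000001", "personaname": "xd"},
    {"steamid": "", "personaname": "NeOn"},
]
staged, rejected = clean_profiles(raw)
```

Returning the rejected records instead of silently dropping them makes it possible to log or inspect them when a run produces fewer staged rows than expected.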
- The US has the most accounts banned for cheating, with Russia trailing behind.
- Most cheaters have a level 1 Steam account.
- The top 3 cheater names:
  - 123
  - NeOn
  - xd
- The most common profile picture is the default Steam profile picture.
- The majority of cheaters get banned between 0 and 10 hours.
- The top 3 games that cheaters own:
  - Counter-Strike: Global Offensive
  - PUBG: BATTLEGROUNDS
  - Apex Legends
- Top 3 Steam groups:
  - Rustoria
  - Andysolam
  - Payday
- Cheaters use Archi's SC Farm to boost their accounts. It's a cheater's attempt to make their account look more legitimate to normal players.
- Profile visibility: a lot of people believe that a private profile means the account belongs to a cheater, but more cheaters have public profiles than private profiles.
  - Friends of Friends: 2,565
  - Private: 824
  - Friends Only: 133
1.) Why not use Spark? The data processed every hour is only 1-5 MB, which is small enough to handle without a distributed engine.
2.) Why stage the fact and dim tables before loading? It makes the pipeline easier to debug in the event that it fails.
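
To make the staged-load idea concrete, here is a minimal sketch of loading rows into a temp table and then inserting only new keys into a dim table. It is not the project's LoadDimOperator: `sqlite3` stands in for Postgres so the example runs without a server, and the table and column names (`dim_profile`, `tmp_dim_profile`, `steamid`, `personaname`) are hypothetical.

```python
import sqlite3


def load_dim_via_temp(conn, rows):
    """Load rows into a temp table, then insert only unseen keys into the dim table."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE tmp_dim_profile (steamid TEXT, personaname TEXT)")
    cur.executemany("INSERT INTO tmp_dim_profile VALUES (?, ?)", rows)
    # Only rows whose key is not already in the dim table are inserted.
    cur.execute(
        """
        INSERT INTO dim_profile (steamid, personaname)
        SELECT t.steamid, t.personaname
        FROM tmp_dim_profile AS t
        WHERE t.steamid NOT IN (SELECT steamid FROM dim_profile)
        """
    )
    cur.execute("DROP TABLE tmp_dim_profile")
    conn.commit()


# In-memory database as a stand-in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_profile (steamid TEXT PRIMARY KEY, personaname TEXT)")
conn.execute("INSERT INTO dim_profile VALUES ('76561198000000001', 'xd')")
load_dim_via_temp(
    conn,
    [("76561198000000001", "xd"), ("76561198000000002", "NeOn")],
)
row_count = conn.execute("SELECT COUNT(*) FROM dim_profile").fetchone()[0]
```

If a load fails, the temp table can be inspected on its own before anything has touched the warehouse tables, which is the debugging benefit described in the second answer above.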
Emily (mod#1073) from the Data Engineering Discord answered questions I had about my initial data warehouse architecture. Emily was very helpful in my adventure of building a data warehouse!