- Helps create Hadoop clusters (big data) to analyze and process vast amounts of data.
- Map Reduce
- Massively parallel processing technique
- Steps
- Map with user-defined mappers: gets data as key/value pairs
- Sort: handled by the framework
- Reduce with user-defined reducers: aggregates data
- Example
- Documents contain words; the mappers count animal words per document, the framework sorts the pairs by key, and the reducers sum the total occurrences of each animal word (a runnable sketch follows the example).
- Map => cat:10, dog:3 and cat:24, dog:5 (outputs of two mappers)
- Reduce => cat:34, dog:8
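A minimal sketch of the same map → sort → reduce flow in plain Python (the documents and animal words are made up for illustration):

```python
from itertools import groupby
from operator import itemgetter

ANIMALS = {"cat", "dog"}

def mapper(document):
    """Map: emit a (word, 1) pair for every animal word in one document."""
    for word in document.split():
        if word in ANIMALS:
            yield (word, 1)

def reducer(word, counts):
    """Reduce: sum all occurrences of one word."""
    return (word, sum(counts))

documents = ["cat dog cat", "dog cat"]

# Map phase: run the user-defined mapper over every document.
mapped = [pair for doc in documents for pair in mapper(doc)]

# Sort/shuffle phase: the framework groups the pairs by key.
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate each key group with the user-defined reducer.
results = [reducer(word, (c for _, c in group))
           for word, group in groupby(mapped, key=itemgetter(0))]

print(results)  # [('cat', 3), ('dog', 2)]
```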
- The clusters can be made of hundreds of EC2 instances.
- EMR takes care of all the provisioning and configuration
- Also supports Apache Spark, HBase, Presto, Flink...
- Auto-scales and integrates with Spot instances for lower cost.
- Nodes
- Master node: a single node that manages the cluster, e.g. tracks task status & monitors cluster health
- 💡 You can SSH into the master node and from there into the task nodes.
- Core nodes: Data (HDFS), compute.
- Input splits: logical chunks into which Hadoop splits the input data
- Control the number of mappers in MapReduce
- Blocks: Physical splitting of data
- Task nodes: Optional, only compute
- Usually run on Spot instances
- AWS can tolerate their loss (e.g. Spot interruptions) because data and master operations run on the core and master nodes.
- How it works:
- Upload your data and processing application to S3
- Configure and create your cluster by specifying data inputs, outputs, cluster size, security settings, etc.
- Launch modes:
- Cluster: Long running
- Step execution: run the analysis steps, then shut down the cluster
- Instance size & number of instances
- Select an EC2 key pair & IAM roles for the EC2 instances
- Monitor the health and progress of your cluster, then retrieve the output from S3 (a cluster-creation sketch follows these steps).
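A hedged sketch of creating such a cluster with boto3; the bucket, key pair and script names are hypothetical placeholders, and `EMR_DefaultRole`/`EMR_EC2_DefaultRole` are the usual default roles:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Bucket, key pair and script names below are hypothetical placeholders.
response = emr.run_job_flow(
    Name="animal-word-count",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 2},
            # Task nodes on Spot instances for cheaper, interruptible compute.
            {"InstanceRole": "TASK", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},
        ],
        "Ec2KeyName": "my-key-pair",           # for SSH into the master node
        "KeepJobFlowAliveWhenNoSteps": False,  # step execution: shut down when done
    },
    Steps=[{
        "Name": "word-count",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/word_count.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",  # IAM role for the EC2 instances
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",
)
print(response["JobFlowId"])
```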
- 💡 Use cases: data processing, machine learning, web indexing, big data
Comparison

| Attribute | S3 / Glacier Select | Athena | Redshift Spectrum |
| --- | --- | --- | --- |
| Functionality | SQL SELECT on S3 objects | Query & extract data from S3 (ETL) | SQL queries on S3 using your Redshift cluster |
| S3 Glacier support | ✓ | ✖ | ✓ |
| Read compressed files | ✓ | ✓ | |
| Integrations | S3, Glacier | AWS Glue, QuickSight, VPC Flow Logs, ELB logs, CloudFront & CloudTrail logs | S3, Redshift cluster |
| Infrastructure | Serverless | Serverless | Serverless |
| Pricing | Per query: data scanned + data returned + S3 (data transfer + requests) | Per query: data scanned + S3 (data transfer + requests) | Data scanned + S3 transfer & requests + Redshift cluster |
| When to use | Simple SELECT | ETL, logs, complex queries (ANSI SQL compliant, e.g. GROUP BY, HAVING, window and geo functions) | When using a Redshift cluster |
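For the simple-SELECT case, a minimal S3 Select sketch with boto3 (bucket, key and column names are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Bucket, key and column names are hypothetical placeholders.
resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="animals.csv",
    ExpressionType="SQL",
    Expression="SELECT s.animal, s.cnt FROM S3Object s WHERE s.animal = 'cat'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; Records events carry the matching rows.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```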
Athena vs Spectrum
- Athena for ad hoc data discovery and SQL querying
- Redshift Spectrum for
- More complex queries and scenarios
- A large number of data lake users running concurrent BI and reporting workloads
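A minimal Athena sketch with boto3 for such an ad hoc SQL query (database, table and output bucket names are hypothetical):

```python
import boto3

athena = boto3.client("athena")

# Database, table and output bucket names are hypothetical placeholders.
query = athena.start_query_execution(
    QueryString=(
        "SELECT animal, SUM(cnt) AS total "
        "FROM animal_counts "
        "GROUP BY animal "
        "HAVING SUM(cnt) > 10"
    ),
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
print(query["QueryExecutionId"])
```

Athena runs asynchronously: poll `get_query_execution` until the state is `SUCCEEDED`, then read the results with `get_query_results` or from the S3 output location.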
AWS Glue
- Fully managed 📝 ETL (Extract, Transform & Load) service
- Automates the time-consuming steps of data preparation for analytics.
- E.g. sorting the data by client timestamp, then compressing and storing it in a read-optimized format.
- Serverless, pay as you go, fully managed, provisions Apache Spark
- Crawls data sources and identifies data formats (schema inference)
- Discover, categorize, save in a catalog from connected sources
- Sources: Amazon Aurora, RDS, Redshift & S3
- Automated Code Generation
- The generated code is customizable Apache Spark code
- Sinks: S3, Redshift, etc.
- Glue Data Catalog
- Metadata (definition & schema) of the Source Tables
- Can be used by EMR
- Crawlers: Programs that run through data to infer schemas and partitions from sources such as S3, Redshift, and RDS.
- Example scenario =>
- Kinesis (ingest & process) -> S3 (through Kinesis Data Firehose) -> Spark job on AWS Glue (sort, optimize, crawl & catalog) -> batch process the catalog using Amazon EMR (Elastic MapReduce), query with Athena (a crawler sketch follows).
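A minimal sketch of the crawl-and-catalog step with boto3 (crawler, role, database and bucket names are hypothetical placeholders):

```python
import boto3

glue = boto3.client("glue")

# Role, database, bucket and crawler names are hypothetical placeholders.
glue.create_crawler(
    Name="s3-animals-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="my_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/"}]},
)

# Crawl the source: infer schemas/partitions and save them to the Data Catalog.
glue.start_crawler(Name="s3-animals-crawler")

# The catalog tables can then be queried by Athena, EMR or Redshift Spectrum.
```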
AWS Data Pipeline
- Moves data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
- Outputs to Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.
- A pipeline consists of
- Data nodes for source or destination storage (S3, MySQL, DynamoDB, ...)
- E.g. import, copy, export.
- Differences from ETL
- ETL pipelines are a subset of data pipelines.
- ETL may modify the data; a data pipeline moves it as-is.
- ETL pipelines end in data warehouses; they are built for warehousing.
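A rough skeleton of defining a pipeline with boto3 (names are hypothetical, and a real definition needs data nodes and activities on top of this minimal default object):

```python
import boto3

dp = boto3.client("datapipeline")

# Pipeline name and definition below are a hypothetical minimal skeleton.
pipeline = dp.create_pipeline(name="s3-copy-pipeline", uniqueId="s3-copy-1")
pipeline_id = pipeline["pipelineId"]

dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        # Default object: run on demand with the default IAM roles.
        {"id": "Default", "name": "Default", "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ]},
        # Data nodes (e.g. S3DataNode, SqlDataNode) and a CopyActivity
        # would be added here to describe the actual data movement.
    ],
)
dp.activate_pipeline(pipelineId=pipeline_id)
```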
AWS Batch
- Managed HPC (high-performance computing) service.
- Plans, schedules, and executes your batch computing workloads using Amazon EC2 and Spot Instances.
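A minimal job-submission sketch with boto3 (queue and job definition names are hypothetical and must already exist):

```python
import boto3

batch = boto3.client("batch")

# Queue and job definition names are hypothetical placeholders.
job = batch.submit_job(
    jobName="nightly-etl",
    jobQueue="spot-queue",       # a queue backed by Spot instances
    jobDefinition="etl-job:1",   # container image + vCPU/memory defaults
    containerOverrides={"command": ["python", "etl.py", "--date", "2024-01-01"]},
)
print(job["jobId"])
```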