Big Data & Data Analytics

Amazon EMR - Elastic MapReduce

  • Helps create Hadoop clusters (big data) to analyze and process vast amounts of data.
  • Map Reduce
    • Massive Parallel Processing technique
    • Steps
      1. Map with user-defined mappers: emit data as key-value pairs
      2. Sort: handled by the framework
      3. Reduce with user-defined reducers: aggregate the data
    • Example
      • Documents contain words; map counts the animal words per document, the framework sorts by key, and reduce sums the total occurrences of each animal word (see the word-count sketch after this list).
      • Map => cat:10, dog:3 & cat:24, dog:5
      • Reduce => cat: 34, dog: 8
  • The clusters can be made of hundreds of EC2 instances.
  • EMR takes care of all the provisioning and configuration
  • Also supports Apache Spark, HBase, Presto, Flink...
  • Supports auto-scaling and is integrated with Spot Instances for a lower price.
  • Nodes
    • Master node: One node that manages the cluster, e.g. tracks status & monitors health
      • 💡 You can SSH into the master node and from there to the task nodes.
    • Core nodes: Data (HDFS) & compute.
      • Input splits: Smaller chunks of the input data, split by Hadoop
        • Control the number of mappers in MapReduce
      • Blocks: Physical splitting of the data
    • Task nodes: Optional, compute only
      • Usually run on Spot Instances
      • If they are interrupted the cluster keeps running, since data and critical operations stay on the master & core nodes.
  • How it works (a boto3 cluster-creation sketch follows this list):
    1. Upload your data and processing application to S3
    2. Configure and create your cluster by specifying data inputs, outputs, cluster size, security settings, etc.
      • Launch modes:
        • Cluster: Long running
        • Step execution: Do the analysis & shut down the cluster
      • Instance size & number of instances
      • Select an EC2 key pair & IAM roles for the EC2 instances
    3. Monitor the health and progress of your cluster. Retrieve the output in S3
  • 💡 Use cases: data processing, machine learning, web indexing, big data
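
The MapReduce example above can be sketched locally in plain Python. This is only an illustration of the map → sort → reduce steps; the animal word list and documents are made up, and on EMR the framework distributes these phases across the cluster nodes.

```python
# A minimal, local sketch of the Map -> Sort -> Reduce steps described above.
from itertools import groupby
from operator import itemgetter

ANIMALS = {"cat", "dog"}  # made-up word list for the example

def mapper(document: str):
    """Map: emit (word, 1) for every animal word in a document."""
    for word in document.lower().split():
        if word in ANIMALS:
            yield word, 1

def reducer(word: str, counts):
    """Reduce: sum all counts for one key."""
    return word, sum(counts)

documents = ["cat dog cat", "dog cat", "cat"]  # made-up input

# 1. Map phase (user-defined mappers)
mapped = [pair for doc in documents for pair in mapper(doc)]
# 2. Sort/shuffle phase (handled by the framework on EMR)
mapped.sort(key=itemgetter(0))
# 3. Reduce phase (user-defined reducers)
results = [reducer(word, (count for _, count in group))
           for word, group in groupby(mapped, key=itemgetter(0))]
print(dict(results))  # {'cat': 4, 'dog': 2}
```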
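
For the "How it works" steps, a hedged sketch of creating a step-execution cluster with boto3's `run_job_flow` might look as follows. The bucket names, script path, instance types and EMR release label are assumptions; `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are the default roles EMR can create for you.

```python
# A minimal sketch of the "upload to S3, create cluster, run step" flow via boto3.
import boto3

emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="word-count-cluster",
    ReleaseLabel="emr-6.10.0",                 # assumed EMR release
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-logs-bucket/emr/",         # hypothetical bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",     # assumed instance types/counts
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                    # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": False,  # "step execution" launch mode
        "TerminationProtected": False,
    },
    Steps=[{
        "Name": "process-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-code-bucket/job.py"],  # hypothetical script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR instance profile
    ServiceRole="EMR_DefaultRole",      # default EMR service role
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])
```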

Query Engines

  • Comparison

    | Attribute | S3 / Glacier Select | Athena | Redshift Spectrum |
    | --- | --- | --- | --- |
    | Functionality | SQL SELECT on S3 objects | Query & extract data from S3 (ETL) | SQL queries on S3 using your Redshift cluster |
    | S3 Glacier support | | | |
    | Read compressed files | | | |
    | Integrations | S3, Glacier | AWS Glue, QuickSight, VPC Flow Logs, ELB logs, CloudFront & CloudTrail logs | S3, Redshift cluster |
    | Infrastructure | Serverless | Serverless | Serverless |
    | Pricing | Per query: data scanned + data returned + S3 (data transfer + requests) | Per query: data scanned + (data transfer + requests) | Data scanned + S3 transfer & requests + Redshift cluster |
    | When to use | Simple SELECT | ETL, logs, complex queries (ANSI SQL compliant, e.g. group by, having, window and geo functions) | When using a Redshift cluster |
  • Athena vs Spectrum

    • Athena for ad hoc data discovery and SQL querying
    • Redshift Spectrum for
      • More complex queries and scenarios
      • Scenarios where a large number of data lake users run concurrent BI and reporting workloads
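
As a rough idea of the ad hoc Athena flow from the comparison above, a minimal boto3 sketch could look like this; the database, table, query and results bucket are hypothetical placeholders.

```python
# A minimal sketch of an ad hoc SQL query on S3 data through Athena.
import time
import boto3

athena = boto3.client("athena", region_name="eu-west-1")

execution = athena.start_query_execution(
    QueryString="SELECT elb_status_code, COUNT(*) FROM elb_logs GROUP BY elb_status_code",
    QueryExecutionContext={"Database": "my_logs_db"},                   # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```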

AWS Glue

  • Fully managed 📝ETL (Extract, Transform & Load) service
    • Automates the time-consuming steps of data preparation for analytics.
    • E.g. sorting the data by client timestamp, compressing it and storing it in a read-optimized format.
  • Serverless, pay as you go, fully managed, provisions Apache Spark
  • Crawls data sources and identifies data formats (schema inference)
  • Discovers, categorizes & saves data in a catalog from connected sources
    • Sources: Amazon Aurora, RDS, Redshift & S3
  • Automated code generation
    • Generates customizable Apache Spark code
  • Sinks: S3, Redshift, etc.
  • Glue Data Catalog
    • Metadata (definition & schema) of the Source Tables
    • Can be used by EMR
    • Crawlers: Programs that run through data to infer schemas and partitions from e.g. S3, Redshift, RDS.
  • E.g. scenario =>
    • Kinesis (ingest & process) -> S3 (through Kinesis Data Firehose) -> Spark job on AWS Glue (sort, optimize, crawl and catalog) -> Batch-process the catalog using Amazon EMR (Elastic MapReduce), query with Athena.
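
A minimal sketch of the crawl-then-transform flow with boto3 might look as follows; the crawler, role, database, bucket and job names are hypothetical, and the Glue job's Spark script is assumed to exist already.

```python
# A minimal sketch of cataloging S3 data with a Glue crawler, then running a Glue ETL job.
import boto3

glue = boto3.client("glue", region_name="eu-west-1")

# Crawler: infers schema & partitions from S3 and writes them to the Glue Data Catalog.
glue.create_crawler(
    Name="sales-crawler",                                          # hypothetical crawler
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",         # hypothetical IAM role
    DatabaseName="sales_db",                                       # hypothetical catalog database
    Targets={"S3Targets": [{"Path": "s3://my-raw-data/sales/"}]},  # hypothetical bucket
)
glue.start_crawler(Name="sales-crawler")

# ETL job: runs the (previously created) Spark script that sorts, compresses
# and writes the data to a read-optimized sink such as S3 or Redshift.
run = glue.start_job_run(JobName="sales-etl-job")                  # hypothetical job
print(run["JobRunId"])
```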

AWS Data Pipeline

  • Move data between different AWS compute and storage services, as well as on-premises data sources, at specified intervals.
  • Output to Amazon S3, Amazon RDS, Amazon DynamoDB, and Amazon EMR.
  • A pipeline consists of
    • Data nodes for source or destination storages (S3, MySQL, DynamoDB...)
    • Activities that operate on them, e.g. import, copy, export.
  • Differences from ETL
    • ETL pipelines are a subset of data pipelines.
    • ETL may modify data, data pipeline won't.
    • ETL pipelines end in data warehouses; they are built for warehousing.
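
A hedged sketch of defining and activating a pipeline with boto3 is below; the pipeline name and id are hypothetical, and only the required Default and Schedule objects are shown. Real data nodes and activities (e.g. an S3 data node plus a copy activity) would be added as further pipeline objects.

```python
# A minimal sketch of creating, defining and activating a Data Pipeline via boto3.
import boto3

dp = boto3.client("datapipeline", region_name="eu-west-1")

# Create an (empty) pipeline; names/ids are hypothetical.
pipeline_id = dp.create_pipeline(name="daily-copy", uniqueId="daily-copy-001")["pipelineId"]

# Upload the pipeline definition: a Default object plus a daily schedule.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "cron"},
                {"key": "schedule", "refValue": "DailySchedule"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            ],
        },
        {
            "id": "DailySchedule",
            "name": "DailySchedule",
            "fields": [
                {"key": "type", "stringValue": "Schedule"},
                {"key": "period", "stringValue": "1 day"},
                {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
            ],
        },
        # ... data nodes (e.g. S3DataNode) and activities (e.g. CopyActivity) would go here
    ],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```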

AWS Batch

  • Managed HPC (high performance computing)
  • Batch plans, schedules, and executes your batch computing workloads using Amazon EC2 and Spot Instances.
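
A minimal sketch of submitting a job with boto3; the job queue and job definition names are hypothetical and assumed to be registered already, after which Batch places the job on EC2 or Spot capacity.

```python
# A minimal sketch of submitting a batch job to an existing queue/job definition.
import boto3

batch = boto3.client("batch", region_name="eu-west-1")

job = batch.submit_job(
    jobName="nightly-report",            # hypothetical job name
    jobQueue="spot-queue",               # hypothetical job queue
    jobDefinition="report-generator:1",  # hypothetical job definition (name:revision)
    containerOverrides={
        "command": ["python", "generate_report.py", "--date", "2024-01-01"],
        "environment": [{"name": "STAGE", "value": "prod"}],
    },
)
print(job["jobId"])
```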