Amazon Athena Query Federation

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own code. This enables you to integrate with new data sources, proprietary data formats, or build in new user defined functions. Initially these customizations will be limited to the parts of a query that occur during a TableScan operation but will eventually be expanded to include other parts of the query lifecycle using the same easy to understand interface.

Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.

tldr; Get Started:

Ensure you have the proper permissions/policies to deploy/use Athena Federated Queries
Navigate to Servless Application Repository and search for "athena-federation". Be sure to check the box to show entries that require custom IAM roles.
Look for entries published by the "Amazon Athena Federation" author.
Deploy the application
To use Federated Queries, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.
Run a query "show databases in `lambda:<func_name>`" where <func_name> is the name of the Lambda function you deployed in the previous steps.

For more information please consult:

We've written integrations with more than 20 databases, storage formats, and live APIs in order to refine this interface and balance flexibility with ease of use. We hope that making this SDK and initial set of connectors Open Source will allow us to continue to improve the experience and performance of Athena Query Federation.

Serverless Big Data Using AWS Lambda

Queries That Span Data Stores

Imagine a hypothetical e-commerce company who's architecture uses:

Payment processing in a secure VPC with transaction records stored in HBase on EMR
Redis is used to store active orders so that the processing engine can get fast access to them.
DocumentDB (e.g. a mongodb compatible store) for Customer account data like email address, shipping addresses, etc..
Their e-commerce site using auto-scaling on Fargate with their product catalog in Amazon Aurora.
Cloudwatch Logs to house the Order Processor's log events.
A write-once-read-many datawarehouse on Redshift.
Shipment tracking data in DynamoDB.
A fleet of Drivers performing last-mile delivery while utilizing IoT enabled tablets.
Advertising conversion data from a 3rd party source.

Customer service agents begin receiving calls about orders 'stuck' in a weird state. Some show as pending even though they have delivered, others show as delivered but haven't actually shipped. It would be great if we could quickly run a query across this diverse architecture to understand which orders might be affected and what they have in common.

Using Amazon Athena Query Federation and many of the connectors found in this repository, our hypothetical e-commerce company would be able to run a query that:

Grabs all active orders from Redis. (see athena-redis)
Joins against any orders with 'WARN' or 'ERROR' events in Cloudwatch logs by using regex matching and extraction. (see athena-cloudwatch)
Joins against our EC2 inventory to get the hostname(s) and status of the Order Processor(s) that logged the 'WARN' or 'ERROR'. (see athena-cmdb)
Joins against DocumentDB to obtain customer contact details for the affected orders. (see athena-docdb)
Joins against DynamoDB to get shipping status and tracking details. (see athena-dynamodb)
Joins against HBase to get payment status for the affected orders. (see athena-hbase)

WITH logs 
     AS (SELECT log_stream, 
                message                                          AS 
                order_processor_log, 
                Regexp_extract(message, '.*orderId=(\d+) .*', 1) AS orderId, 
                Regexp_extract(message, '(.*):.*', 1)            AS log_level 
         FROM 
     "lambda:cloudwatch"."/var/ecommerce-engine/order-processor".all_log_streams 
         WHERE  Regexp_extract(message, '(.*):.*', 1) != 'WARN'), 
     active_orders 
     AS (SELECT * 
         FROM   redis.redis_db.redis_customer_orders), 
     order_processors 
     AS (SELECT instanceid, 
                publicipaddress, 
                state.NAME 
         FROM   awscmdb.ec2.ec2_instances), 
     customer 
     AS (SELECT id, 
                email 
         FROM   docdb.customers.customer_info), 
     addresses 
     AS (SELECT id, 
                is_residential, 
                address.street AS street 
         FROM   docdb.customers.customer_addresses),
     shipments 
     AS ( SELECT order_id, 
                 shipment_id, 
                 from_unixtime(cast(shipped_date as double)) as shipment_time,
                 carrier
        FROM lambda_ddb.default.order_shipments),
     payments
     AS ( SELECT "summary:order_id", 
                 "summary:status", 
                 "summary:cc_id", 
                 "details:network" 
        FROM "hbase".hbase_payments.transactions)
         
SELECT _key_            AS redis_order_id, 
       customer_id, 
       customer.email   AS cust_email, 
       "summary:cc_id"  AS credit_card,
       "details:network" AS CC_type,
       "summary:status" AS payment_status,
       status           AS redis_status, 
       addresses.street AS street_address, 
       shipments.shipment_time as shipment_time,
       shipments.carrier as shipment_carrier,
       publicipaddress  AS ec2_order_processor, 
       NAME             AS ec2_state, 
       log_level, 
       order_processor_log 
FROM   active_orders 
       LEFT JOIN logs 
              ON logs.orderid = active_orders._key_ 
       LEFT JOIN order_processors 
              ON logs.log_stream = order_processors.instanceid 
       LEFT JOIN customer 
              ON customer.id = customer_id 
       LEFT JOIN addresses 
              ON addresses.id = address_id 
       LEFT JOIN shipments
              ON shipments.order_id = active_orders._key_
       LEFT JOIN payments
              ON payments."summary:order_id" = active_orders._key_

License

This project is licensed under the Apache-2.0 License.

Name		Name	Last commit message	Last commit date
Latest commit History 257 Commits
.github		.github
athena-aws-cmdb		athena-aws-cmdb
athena-cloudwatch-metrics		athena-cloudwatch-metrics
athena-cloudwatch		athena-cloudwatch
athena-docdb		athena-docdb
athena-dynamodb		athena-dynamodb
athena-elasticsearch		athena-elasticsearch
athena-example		athena-example
athena-federation-integ-test		athena-federation-integ-test
athena-federation-sdk-tools		athena-federation-sdk-tools
athena-federation-sdk		athena-federation-sdk
athena-hbase		athena-hbase
athena-jdbc		athena-jdbc
athena-neptune		athena-neptune
athena-redis		athena-redis
athena-timestream		athena-timestream
athena-tpcds		athena-tpcds
athena-udfs		athena-udfs
athena-vertica		athena-vertica
docs		docs
tools		tools
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
NOTICE		NOTICE
README.md		README.md
checkstyle.xml		checkstyle.xml
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Amazon Athena Query Federation

Serverless Big Data Using AWS Lambda

Queries That Span Data Stores

License

About

Releases

Packages

Languages

License

zipmex/aws-athena-query-federation

Folders and files

Latest commit

History

Repository files navigation

Amazon Athena Query Federation

Serverless Big Data Using AWS Lambda

Queries That Span Data Stores

License

About

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages