Federated Querying across Relational, Non-relational, Object, and Custom Data Sources using Amazon Athena

Querying Data from DynamoDB in Amazon Athena

Amazon Athena now enables users to run SQL queries across data stored in relational, non-relational, object, and custom data sources. With federated querying, customers can submit a single SQL query that scans data from multiple sources running on-premises or hosted in the cloud.

Athena executes federated queries using Athena Data Source Connectors that run on AWS Lambda. Athena federated query is available in Preview in the us-east-1 (N. Virginia) region.

Preparing to create federated queries is a two-part process:

  1. Deploying a Lambda function data source connector.
  2. Connecting the Lambda function to a data source. 

I assume that you have at least one DynamoDB table in us-east-1 region.

Deploy a Data Source Connector

  • Open the Amazon Athena console and choose the Connect data source. This feature is available in the region us-east-1 only.
  • On the Connect data source console, choose Query a data source feature. And choose Amazon DynamoDB as a data source.
  • Choose Next
  • For the Lambda function, choose to Configure new function. It opens in the Lambda console in a new tab with information about the connector.
  • Under ApplicationSettings, provide the required information.
    1. AthenaCatalogName – A name for the Lambda function.
    2. SpillBucket – An Amazon S3 bucket in your account to store data that exceeds Lambda function response size limits.
    3. SpillPrefix – Data that exceeds Lambda function response size limits stores under the Spillbucket/Spillprefix.
  • Choose I acknowledge that this app creates custom IAM roles and choose Deploy.

Connect to a data source using a connector that deployed in the earlier step

  • Open the Amazon Athena console and choose the Connect data source. This feature is available in the region us-east-1 only.
  • On the Connect data source console, choose Query a data source feature. And choose Amazon DynamoDB as a data source and choose Next.
  • Configure the Lambda function, choose the name of the lambda that you created in the earlier step.
  • Configure Catalog name, enter a unique name to use for the data source in your SQL queries, such as dynamo_athena.
  • Choose Connect. Now the data source is available under the Data Sources section in Amazon Athena.

Querying Data using Federated Queries

To use this feature in preview, you must create an Athena workgroup named AmazonAthenaPreviewFunctionality and join that workgroup.

Create an Athena workgroup

  • Open the Amazon Athena console and choose Workgroup, and choose Create workgroup.
  • After creating a Workgroup, under Workgroup section select the created workgroup and choose Switch workgroup.
  • Select the Data source that was created in the earlier step in Athena. After choosing the data source, the DynamoDb tables are available in Athena in the default database.

Querying Data in Athena using SQL Queries

The following query is used to retrieve data from DynamoDB in Athena.

SELECT * FROM "data_source_connector"."database_name"."table_name";

Creating Athena table using CTAS with results of querying DynamoDB

The CTAS query looks like the following. Using the CTAS query, the format of data can be changed into the required format be it parquet, JSON, and CSV, etc.

CREATE TABLE database.table_name
WITH (
      external_location = 's3://bucket-name/data/',
      format = 'parquet')
AS 
SELECT * FROM "data_source_connector"."database_name"."table_name";

I hope this was helpful and look forward to your comments.

This story is authored by PV Subbareddy. Subbareddy is a Big Data Engineer specializing on AWS Big Data Services and Apache Spark Ecosystem.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.