Building a data lake on AWS using Redshift Spectrum

In one of our earlier posts, we talked about setting up a data lake using AWS Lake Formation. Once the data lake is set up, we can use Amazon Athena to query data. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and there is no need for complex ETL jobs to prepare data for analysis. Today, we will explore querying data from a data lake in S3 using Redshift Spectrum. This use case makes sense for organizations that already use Redshift as their primary data warehouse.

Amazon Redshift Spectrum

Amazon Redshift Spectrum is used to efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer.

How is Amazon Athena different from Amazon Redshift Spectrum?

  1. Redshift Spectrum needs an Amazon Redshift cluster and a SQL client that is connected to the cluster so that we can execute SQL commands, whereas Athena is serverless.
  2. In Redshift Spectrum, external tables are read-only; it does not support INSERT queries. Athena supports INSERT queries, which write records to S3.

Amazon Redshift cluster

To use Redshift Spectrum, you need an Amazon Redshift cluster and a SQL client that’s connected to your cluster so that you can execute SQL commands. The cluster and the data files in Amazon S3 must be in the same AWS Region.

The Redshift cluster needs authorization to access the external data catalog in AWS Glue or Amazon Athena and the data files in Amazon S3. Let’s kick off the steps required to get the Redshift cluster going.

Create an IAM Role for Amazon Redshift

  1. Open the IAM console and choose Roles.
  2. Choose Create role.
  3. Choose AWS service, and then select Redshift.
  4. Under Select your use case, select Redshift – Customizable and then choose Next: Permissions.
  5. On the Attach permissions policies page, attach the following policies: AmazonS3FullAccess, AWSGlueConsoleFullAccess and AmazonAthenaFullAccess.
  6. For Role name, enter a name for your role, in this case redshift-spectrum-role.
  7. Choose Create role.
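If you prefer to script these console steps, a minimal boto3 sketch might look like the following. The role name matches the one used in this post; actually running it requires AWS credentials with IAM permissions.

```python
import json

# Trust policy that lets Amazon Redshift assume the role
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

# The three managed policies attached in the console steps above
SPECTRUM_POLICIES = [
    'arn:aws:iam::aws:policy/AmazonS3FullAccess',
    'arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess',
    'arn:aws:iam::aws:policy/AmazonAthenaFullAccess',
]

def create_spectrum_role(role_name='redshift-spectrum-role'):
    import boto3  # imported lazily; running this needs AWS credentials
    iam = boto3.client('iam')
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY))
    for policy_arn in SPECTRUM_POLICIES:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return role['Role']['Arn']
```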

Create a Sample Amazon Redshift Cluster

  • Open the Amazon Redshift console.
  • Choose the AWS Region. The cluster and the data files in Amazon S3 must be in the same AWS Region.
  • Select CLUSTERS and choose Create cluster.
    Cluster Configuration:
    • Based on the size and type of data (compressed/uncompressed), select the nodes. Amazon Redshift also provides an option to calculate the best cluster configuration based on your requirements.
    • In this case, use dc2.large with 2 nodes.
  • Specify Cluster details.
    • Cluster identifier: Name-of-the-cluster.
    • Database port: 5439, which is the default.
    • Master user name: Master user of the DB instance.
    • Master user password: Specify the password.
  • In the Cluster permissions section, select Available IAM roles and choose the IAM role that was created earlier, redshift-spectrum-role. Then choose Add IAM role.
  • Select Create cluster and wait till the status is Available.
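The same cluster can also be created programmatically; below is a hedged sketch using boto3's create_cluster, mirroring the console choices above. The cluster identifier and master credentials here are placeholders, not values from this post.

```python
def cluster_params(iam_role_arn,
                   cluster_id='spectrum-cluster',
                   master_username='awsuser',
                   master_password='ChangeMe123'):
    # Mirrors the console choices above: 2 x dc2.large nodes,
    # default port 5439, dev database, spectrum IAM role attached
    return {
        'ClusterIdentifier': cluster_id,
        'NodeType': 'dc2.large',
        'NumberOfNodes': 2,
        'DBName': 'dev',
        'Port': 5439,
        'MasterUsername': master_username,
        'MasterUserPassword': master_password,
        'IamRoles': [iam_role_arn],
    }

def create_cluster(iam_role_arn):
    import boto3  # imported lazily; running this needs AWS credentials
    redshift = boto3.client('redshift')
    return redshift.create_cluster(**cluster_params(iam_role_arn))
```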

Connect to Database

  1. Open the Amazon Redshift console and choose EDITOR.
  2. Database name is dev.

Create an External Schema and an External Table

External tables must be created in an external schema.

  • To create an external schema, run the following command. Replace the iam_role value with the ARN of the role created earlier.
create external schema spectrum
from data catalog
database 'spectrumdb'
iam_role 'arn:aws:iam::xxxxxxxxxxxx:role/redshift-spectrum-role'
create external database if not exists;
  • Copy the sample data (provided by AWS) using the following command. Configure the AWS CLI on your machine and run this command.
aws s3 cp s3://awssampledbuswest2/tickit/spectrum/sales/ s3://bucket-name/data/source/ --recursive
  • To create an external table, run the following command. The table is created in the spectrum schema.
create external table spectrum.table_name(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
saletime timestamp)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://bucket-name/copied-prefix/';
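These statements can also be submitted without a connected SQL client, via the Redshift Data API. Below is a rough sketch; the cluster identifier and database user are assumptions, and execution is asynchronous (you would poll describe_statement for the result).

```python
def preview_sql(schema, table, limit=10):
    # Builds the simple preview query used later in this post
    return 'SELECT * FROM {0}.{1} LIMIT {2};'.format(schema, table, limit)

def run_on_cluster(sql, cluster_id='spectrum-cluster',
                   database='dev', db_user='awsuser'):
    import boto3  # imported lazily; running this needs an existing cluster
    client = boto3.client('redshift-data')
    # Submits the statement asynchronously and returns a statement Id
    return client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database=database,
        DbUser=db_user,
        Sql=sql)
```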

Now the table is available in Redshift Spectrum. We can analyze the data using SQL queries like so:

SELECT
 *
FROM
spectrum.rs_table
LIMIT 10;

Create a Table in Athena using Glue Crawler

In case you are just starting out on the AWS Glue crawler, I have explained how to create one from scratch in one of my earlier articles. In this case, I created the rs_table table in the spectrumdb database.

Comparison between Amazon Redshift Spectrum and Amazon Athena

I ran some basic queries on both Athena and Redshift Spectrum. The query elapsed time comparison is as follows: a query took about 3 seconds on Athena compared to about 16 seconds on Redshift Spectrum.

The idea behind this post was to get you up and running with a basic data lake on S3 that is queryable on Redshift Spectrum. I hope it was useful.

This story is authored by PV Subbareddy. He is a Big Data Engineer specializing in AWS Big Data Services and the Apache Spark ecosystem.

Optimizing QuickSight using Athena Queries and SPICE: Operating cost analysis

In this post, I will discuss, as an example, how an automobile manufacturing company could use QuickSight to analyze its sales data and make better decisions. We will also learn how to optimize the QuickSight operating cost structure by using the SPICE engine to ingest source data from Athena queries at recurring intervals. This has two major advantages: dashboards and analyses load quickly because the data source is within SPICE, and the cost of data ingestion is brought down because Athena is queried only to refresh the data in SPICE.

We will look at a sales dashboard created using datasets prepared from data in the refined zone of a data lake built with Lake Formation. A data engineering pipeline writes data to this refined zone with year and month partitions every hour.

In case you wish to build something similar and follow along, below is the link to the raw datasets:

https://github.com/koushik-bitzop/data-sets/tree/master/sales2016-2018

Creating a SPICE based Athena Data-set:

Select Athena as the data set source:

Select Use custom query.

Select Edit/Preview data and then choose data source as SPICE and click on Finish.

Once the query has run successfully and you can see the data, click on Save & visualize.

In case you want to add any calculated fields or change data types, you can do that in the highlighted section shown above.

I have discussed these in detail in my previous articles, Visualizing Multiple Datasets in AWS QuickSight and Adding User-Interactivity to AWS QuickSight Dashboards.

Refresh Schedule for Data-sets:

Depending on how frequently new data arrives, you can schedule the refresh. For every refresh, an Athena query is executed and the results are imported into SPICE.

Note: 

  1. In this example, the QuickSight SPICE refresh pulls the whole dataset, not an incremental load.
  2. It is not possible to pass pushdown predicates (variables) from dashboard filters through QuickSight to Athena. So if you want to look at a rolling window of data, such as the past 24 hours, past month, or past 6 months, use a WHERE clause in the Athena source query to fetch just those records. Also, if the data is partitioned by year and month, only the required data is scanned, further saving on costs.
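As a sketch of that second note, the rolling-window WHERE clause can be generated when the source query is built. The table name and the year/month partition columns below are illustrative, not from a specific dataset in this post.

```python
from datetime import datetime, timedelta

def rolling_window_query(table, months=6, now=None):
    # Builds an Athena source query that fetches only the last `months`
    # of data, relying on year/month partition columns to limit the scan
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=30 * months)
    return (
        "SELECT * FROM {table} "
        "WHERE (year > '{y}') OR (year = '{y}' AND month >= '{m}')"
    ).format(table=table, y=cutoff.strftime('%Y'), m=cutoff.strftime('%m'))
```

Because the filter compares partition columns, Athena prunes partitions and scans only the rolling window's data on each SPICE refresh.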

A lowdown on QuickSight Operating cost with this architecture:

We are looking at two main cost components:

  1. Athena – S3 data scan costs
  2. QuickSight Infrastructure costs

Athena – S3 data scan costs:

Athena pricing for successful queries: $5 per 1 TB of data scanned (S3 storage cost not included).

No. of queries | Data scanned in S3 | Scheduled refresh | Total data scanned (monthly) | Bill estimate (monthly) | Bill estimate (annual)
1 | 150 to 210 KB | Hourly | 1 * 24 * 30 * 210 KB = 0.0001512 TB | $0.000756 | $0.009072

The above numbers are a bit too low to make an inference. Let us say you have 4 such queries (each scanning around 150 to 200 MB) powering the dashboard, and SPICE ingests this data once every hour.

No. of queries | Data scanned in S3 | Scheduled refresh | Total data scanned (monthly) | Bill estimate (monthly) | Bill estimate (annual)
4 | 150 to 200 MB | Hourly | 4 * 24 * 30 * 200 MB = 576 GB or 0.576 TB | $2.88 | $34.56

If we do not use SPICE to load this data from Athena hourly and instead use the Athena query as the direct source, the cost of the dashboards increases proportionately with every view. As an example, if the dashboards are viewed 1,000 times per hour (and each dashboard has 4 source queries), the cost above is multiplied by a staggering 1,000 times, and the annual bill becomes an eye-popping $34,560.
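The arithmetic behind these estimates can be captured in a small helper (decimal units, $5 per TB as above):

```python
def athena_monthly_cost(num_queries, mb_per_query, refreshes_per_day=24,
                        days=30, usd_per_tb=5.0):
    # Athena bills per TB of S3 data scanned (decimal units: 1 TB = 1e6 MB)
    scanned_tb = num_queries * refreshes_per_day * days * mb_per_query / 1e6
    return scanned_tb * usd_per_tb

# SPICE refresh scenario from the table above: 4 queries x ~200 MB, hourly
spice_monthly = athena_monthly_cost(4, 200)
# Direct-query scenario: every dashboard view re-runs the 4 source queries,
# so at 1,000 views per hour the scan volume is 1,000x larger
direct_monthly = athena_monthly_cost(4, 200) * 1000
```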

QuickSight infrastructure cost (Standard Edition):

There is no charge for readers. Authors cost $9 per month with an annual subscription.

User type | No. of users | Bill estimate (monthly) | Bill estimate (annual)
Author | 1 | $9 | $108
Reader | 3 | $0 | $0
Total |  | $9 pm | $108 pa

Note: For the Enterprise edition, readers are billed $0.30 per 30-minute session, up to a maximum of $5/reader/month for unlimited use. Authors are billed $18/month with an annual subscription. Additional SPICE capacity is $0.25/GB/month on Standard and $0.38/GB/month on Enterprise.

So overall, we can see that using SPICE with a periodic data refresh optimizes costs in a smart way. That’s it folks, I hope it was helpful. For any queries, drop them in the comments section.

This story is authored by Koushik. Koushik is a software engineer and a keen data science and machine learning enthusiast.

AWS Machine Learning Data Engineering Pipeline for Batch Data

This post walks you through all the steps required to build a data engineering pipeline for batch data using AWS Step Functions. The sequence of steps works like so: the ingested data arrives as a CSV file in the landing zone of an S3 based data lake, which automatically triggers a Lambda function to invoke the Step Function. I have assumed that data is ingested daily in a .csv file with a filename_date.csv naming convention, for example customers_20190821.csv. As its first step, the Step Function starts a landing to raw zone file transfer operation via a Lambda function. Then an AWS Glue crawler crawls the raw data into an Athena table, which is used as the source for an AWS Glue based PySpark transformation script. The transformed data is written to the refined zone in Parquet format. Another AWS Glue crawler then runs to “reflect” this refined data into a second Athena table. Finally, the data science team can consume this refined data from the Athena table using an AWS SageMaker based Jupyter notebook instance. Note that the data science team does not need to pull data manually; the data engineering pipeline automatically pulls in the delta data as per the data refresh schedule that writes new data to the landing zone.

Let’s go through the steps

How to make daily data available to Amazon SageMaker?

What is Amazon SageMaker?

Amazon SageMaker is an end-to-end machine learning (ML) platform that can be leveraged to build, train, and deploy machine learning models in AWS. Using the Amazon SageMaker Notebook module improves the efficiency of interacting with the data without the latency of bringing it locally.
For a deep dive into Amazon SageMaker, please go through the official docs.

In this blog post, I will be using dummy customer data. The data consists of retailer information and units purchased.

Updating Table Definitions with AWS Glue

The data catalog feature of AWS Glue and its built-in integration with Amazon S3 simplify the process of identifying data and deriving the schema definition from the source data. Glue crawlers within the Data Catalog are used to build out metadata tables for data stored in Amazon S3.

I created a crawler named raw for the data in the raw zone (s3://bucketname/data/raw/customers/). In case you are just starting out on the AWS Glue crawler, I have explained how to create one from scratch in one of my earlier articles. If you run this crawler, it creates a customers table in the specified database (raw).

Create an invocation Lambda Function

In case you are just starting out on Lambda functions, I have explained how to create one from scratch with an IAM role to access the StepFunctions, Amazon S3, Lambda and CloudWatch in my earlier article.

Add trigger to the created Lambda function named invoke-step-functions. Configure Bucket, Prefix and  Suffix accordingly.

Once a file arrives in the landing zone, it triggers the invoke Lambda function, which extracts the year, month and day from the file name in the event. It passes the year, month and day, along with two characters from a UUID, as input to AWS Step Functions. Please replace the following code in the invoke-step-functions Lambda.

import json
import uuid
import boto3
from datetime import datetime

sfn_client = boto3.client('stepfunctions')

stm_arn = 'arn:aws:states:us-west-2:XXXXXXXXXXXX:stateMachine:Datapipeline-for-SageMaker'

def lambda_handler(event, context):
    
    # Extract bucket name and file path from event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    path = event['Records'][0]['s3']['object']['key']
    
    file_name_date = path.split('/')[2]
    processing_date_str = file_name_date.split('_')[1].replace('.csv', '')
    processing_date = datetime.strptime(processing_date_str, '%Y%m%d')
    
    # Extract year, month, day from date
    year = processing_date.strftime('%Y')
    month = processing_date.strftime('%m')
    day = processing_date.strftime('%d')
    
    uuid_temp = uuid.uuid4().hex[:2]
    execution_name = '{processing_date_str}-{uuid_temp}'.format(processing_date_str=processing_date_str, uuid_temp=uuid_temp)
    
    # Starts the execution of AWS StepFunctions
    response = sfn_client.start_execution(
          stateMachineArn = stm_arn,
          name= str(execution_name),
          input= json.dumps({"year": year, "month": month, "day": day})
      )
    
    return {"year": year, "month": month, "day": day}

Create a Generic FileTransfer Lambda

Create a Lambda function named generic-file-transfer as we created earlier in this article. The file transfer Lambda function transfers files from the landing zone to the raw zone, or from the landing zone to the archive zone, based on the event coming from the Step Function.

  1. If the step is landing-to-raw-file-transfer, the Lambda function copies files from the landing zone to the raw zone.
  2. If the step is landing-to-archive-file-transfer, the Lambda function copies files from the landing zone to the archive zone and deletes them from the landing zone.
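The prefix layout for each step can be sketched as a pure helper, mirroring the destination conventions used by the file transfer Lambda in this post (raw zone partitioned by year/month/day, archive zone by a date folder):

```python
def build_destination_prefix(step, destination_prefix, file_name,
                             year, month, day):
    # Raw zone: data/raw/<file_name>/year=YYYY/month=MM/day=DD/
    if step == 'landing-to-raw-file-transfer':
        return '{0}{1}/year={2}/month={3}/day={4}/'.format(
            destination_prefix, file_name, year, month, day)
    # Archive zone: data/archive/YYYY-MM-DD/
    if step == 'landing-to-archive-file-transfer':
        return '{0}{1}-{2}-{3}/'.format(destination_prefix, year, month, day)
    raise ValueError('unknown step: ' + step)
```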

Please replace the following code in generic-file-transfer Lambda.

import json
import boto3

s3 = boto3.resource('s3')

def lambda_handler(event, context):
    
    # Extract Parameters from Event (invoked by StepFunctions)
    step = event['step']
    year = event['year']
    month = event['month']
    day = event['day']
    
    bucket_name = event['bucket_name']
    source_prefix = event['source_prefix']
    destination_prefix = event['destination_prefix']
    
    bucket = s3.Bucket(bucket_name)
    
    for obj in bucket.objects.filter(Prefix = source_prefix):
        file_path = obj.key
        
        if ('.csv' in file_path) and (step == 'landing-to-raw-file-transfer'):
            
            # Extract filename from file_path
            file_name_date = file_path.split('/')[2]
            file_name = file_name_date.split('_')[0]
            
            # Add filename to the destination prefix (use a new variable so
            # the base destination_prefix is not mutated across iterations)
            dest_prefix = '{destination_prefix}{file_name}/year={year}/month={month}/day={day}/'.format(destination_prefix=destination_prefix, file_name=file_name, year=year, month=month, day=day)
            print(dest_prefix)
            
            source_object = {'Bucket': bucket_name, 'Key': file_path}
            
            # Replace source prefix with destination prefix
            new_path = file_path.replace(source_prefix, dest_prefix)
            
            # Copies file
            new_object = bucket.Object(new_path)
            new_object.copy(source_object)
         
        if ('.csv' in file_path) and (step == 'landing-to-archive-file-transfer'):
            
            # Add date folder to the destination prefix (again in a new variable)
            dest_prefix = '{destination_prefix}{year}-{month}-{day}/'.format(destination_prefix=destination_prefix, year=year, month=month, day=day)
            print(dest_prefix)
            
            source_object = {'Bucket': bucket_name, 'Key': file_path}
            
            # Replace source prefix with destination prefix
            new_path = file_path.replace(source_prefix, dest_prefix)
            
            # Copies file
            new_object = bucket.Object(new_path)
            new_object.copy(source_object)
            
            # Deletes the copied file from the landing zone
            bucket.objects.filter(Prefix = file_path).delete()
            
    return {"year": year, "month": month, "day": day}

The generic file transfer Lambda function setup is now complete. We need to check that all files are copied successfully from one zone to another. If you have large files that need to be copied, you could check out our Lightning fast distributed file transfer architecture.

Create Generic FileTransfer Status Check Lambda Function

Create a Lambda function named generic-file-transfer-status. If the step is landing to raw file transfer, the Lambda function checks whether all files were copied from the landing zone to the raw zone by comparing the number of objects in the two zones. If the counts don’t match, it raises an exception; that exception is handled in AWS Step Functions, which retries after a backoff interval. If the counts match, all files were copied successfully. If the step is landing to archive file transfer, the Lambda function checks whether any files are left in the landing zone. Please replace the following code in the generic-file-transfer-status Lambda function.

import json
import boto3

s3 = boto3.resource('s3')

def lambda_handler(event, context):
    
    # Extract Parameters from Event (invoked by StepFunctions)
    step = event['step']
    year = event['year']
    month = event['month']
    day = event['day']
    
    bucket_name = event['bucket_name']
    source_prefix = event['source_prefix']
    destination_prefix = event['destination_prefix']
    
    bucket = s3.Bucket(bucket_name)
    
    class LandingToRawFileTransferIncompleteException(Exception):
        pass

    class LandingToArchiveFileTransferIncompleteException(Exception):
        pass
    
    if (step == 'landing-to-raw-file-transfer'):
        if file_transfer_status(bucket, source_prefix, destination_prefix):
            print('File Transfer from Landing to Raw Completed Successfully')
        else:
            raise LandingToRawFileTransferIncompleteException('File Transfer from Landing to Raw not completed')
    
    if (step == 'landing-to-archive-file-transfer'):
        if is_empty(bucket, source_prefix):
            print('File Transfer from Landing to Archive Completed Successfully')
        else:
            raise LandingToArchiveFileTransferIncompleteException('File Transfer from Landing to Archive not completed.')
    
    return {"year": year, "month": month, "day": day}

def file_transfer_status(bucket, source_prefix, destination_prefix):
    
    try:
        
        # Checks number of objects at the source prefix (count of objects at source i.e., landing zone)
        source_object_count = 0
        for obj in bucket.objects.filter(Prefix = source_prefix):
            path = obj.key
            if (".csv" in path):
                source_object_count = source_object_count + 1
        print(source_object_count)
        
        # Checks number of objects at the destination prefix (count of objects at destination i.e., raw zone)
        destination_object_count = 0
        for obj in bucket.objects.filter(Prefix = destination_prefix):
            path = obj.key
            
            if (".csv" in path):
                destination_object_count = destination_object_count + 1
        
        print(destination_object_count)
        return (source_object_count == destination_object_count)

    except Exception as e:
        print(e)
        raise e

def is_empty(bucket, prefix):
    
    try:
        # Checks if any files left in the prefix (i.e., files in landing zone)
        object_count = 0
        for obj in bucket.objects.filter(Prefix = prefix):
            path = obj.key

            if ('.csv' in path):
                object_count = object_count + 1
                    
        print(object_count)
        return (object_count == 0)
        
    except Exception as e:
        print(e)
        raise e

Create a Generic Crawler Invocation Lambda

Create a Lambda function named generic-crawler-invoke. The Lambda function invokes a crawler. The crawler name is passed as argument from AWS StepFunctions through event object. Please replace the following code in generic-crawler-invoke Lambda function.

import json
import boto3

glue_client = boto3.client('glue')

def lambda_handler(event, context):
    
    # Extract Parameters from Event (invoked by StepFunctions)
    year = event['year']
    month = event['month']
    day = event['day']
    
    crawler_name = event['crawler_name']
    
    try:
        response = glue_client.start_crawler(Name = crawler_name)
    except Exception as e:
        print('Crawler in progress', e)
        raise e
    
    return {"year": year, "month": month, "day": day}

Create a Generic Crawler Status Lambda

Create a Lambda function named generic-crawler-status. The Lambda function checks whether the crawler ran successfully or not. If the crawler is still running, the Lambda function raises an exception; the exception is handled in the Step Function, which retries after a certain backoff rate. Please replace the following code in the generic-crawler-status Lambda.

import json
import boto3

glue_client = boto3.client('glue')

def lambda_handler(event, context):
    
    class CrawlerInProgressException(Exception):
        pass
    
    # Extract Parameters from Event (invoked by StepFunctions)
    year = event['year']
    month = event['month']
    day = event['day']
    
    crawler_name = event['crawler_name']
    
    response = glue_client.get_crawler_metrics(CrawlerNameList =[crawler_name])
    print(response['CrawlerMetricsList'][0]['CrawlerName']) 
    print(response['CrawlerMetricsList'][0]['TimeLeftSeconds']) 
    print(response['CrawlerMetricsList'][0]['StillEstimating']) 
    
    if (response['CrawlerMetricsList'][0]['StillEstimating']):
        raise CrawlerInProgressException('Crawler In Progress!')
    elif (response['CrawlerMetricsList'][0]['TimeLeftSeconds'] > 0):
        raise CrawlerInProgressException('Crawler In Progress!')
    
    return {"year": year, "month": month, "day": day}

Create an AWS Glue Job

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. For deep dive into AWS Glue, please go through the official docs.

Create an AWS Glue job named raw-refined. In case you are just starting out on AWS Glue jobs, I have explained how to create one from scratch in my earlier article. This Glue job converts the file format from CSV to Parquet and stores the output in the refined zone. A push down predicate is used as a filter condition so that only the processing date’s data is read, using the partitions.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
# args = getResolvedOptions(sys.argv, ['JOB_NAME'])

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'year', 'month', 'day'])

year = args['year']
month = args['month']
day = args['day']

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "raw", table_name = "customers", push_down_predicate ="((year == " + year + ") and (month == " + month + ") and (day == " + day + "))", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("email_id", "string", "email_id", "string"), ("retailer_name", "string", "retailer_name", "string"), ("units_purchased", "long", "units_purchased", "long"), ("purchase_date", "string", "purchase_date", "string"), ("sale_id", "string", "sale_id", "string"), ("year", "string", "year", "string"), ("month", "string", "month", "string"), ("day", "string", "day", "string")], transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://bucketname/data/refined/customers/", "partitionKeys": ["year","month","day"]}, format = "parquet", transformation_ctx = "datasink4")

job.commit()
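The predicate string built inline above can be factored into a small helper to make its shape explicit (a sketch; the Glue script in this post builds the same string with concatenation):

```python
def push_down_predicate(year, month, day):
    # Builds the partition filter passed to create_dynamic_frame.from_catalog,
    # so Glue reads only the processing date's partition from the raw zone
    return "((year == {y}) and (month == {m}) and (day == {d}))".format(
        y=year, m=month, d=day)
```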

Create a refined crawler as we created the raw crawler earlier in this article. Point the crawler path to the refined zone (s3://bucketname/data/refined/customers/) and the database to refined. There is no need to create separate Lambda functions for refined crawler invocation and status, as we pass crawler names from the Step Function.

All resources required to create the Step Function are now in place.

Creating the AWS StepFunction

The Step Function is where we create and orchestrate the steps to process data according to our workflow. Create an AWS Step Function named Datapipeline-for-SageMaker. In case you are just starting out on AWS Step Functions, I have explained how to create one from scratch here.

Data is being ingested into landing zone. It triggers a Lambda function which in turn invokes the execution of the StepFunction. The steps in the StepFunction are as follows:

  1. Transfers files from the landing zone to the raw zone.
  2. Checks whether all files were copied to the raw zone successfully.
  3. Invokes the raw crawler, which crawls the data in the raw zone and creates/updates the table definition in the specified database.
  4. Checks whether the crawler completed successfully.
  5. Invokes the Glue job and waits for it to complete.
  6. Invokes the refined crawler, which crawls the data in the refined zone and creates/updates the table definition in the specified database.
  7. Checks whether the crawler completed successfully.
  8. Transfers files from the landing zone to the archive zone and deletes them from the landing zone.
  9. Checks that all files were copied and deleted from the landing zone successfully.
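The status-check steps rely on the retry mechanism in the state machine definition (IntervalSeconds 30, BackoffRate 2, MaxAttempts 5). The resulting wait times between attempts can be computed as:

```python
def retry_schedule(interval_seconds=30, backoff_rate=2, max_attempts=5):
    # Wait time before each retry attempt under Step Functions' Retry
    # semantics: interval * backoff_rate ** (attempt - 1)
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]
```

So a failing status check is retried after 30, 60, 120, 240 and finally 480 seconds before the Catch branch routes execution to the Fail state.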

Please update the StepFunctions definition with the following code.

{
  "Comment": "Datapipeline For MachineLearning in AWS Sagemaker",
  "StartAt": "LandingToRawFileTransfer",
  "States": {
    "LandingToRawFileTransfer": {
      "Comment": "Transfers files from landing zone to Raw zone.",
      "Type": "Task",
      "Parameters": {
        "step": "landing-to-raw-file-transfer",
        "bucket_name": "bucketname",
        "source_prefix": "data/landing/",
        "destination_prefix": "data/raw/",
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:generic-file-transfer",
      "TimeoutSeconds": 4500,
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "LandingToRawFileTransferFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "LandingToRawFileTransferFailed"
        }
      ],
      "Next": "LandingToRawFileTransferPassed"
    },
    "LandingToRawFileTransferFailed": {
      "Type": "Fail",
      "Cause": "Landing To Raw File Transfer failed"
    },
    "LandingToRawFileTransferPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Next": "LandingToRawFileTransferStatus"
    },
    "LandingToRawFileTransferStatus": {
      "Comment": "Checks whether all files are copied from landing to raw zone successfully.",
      "Type": "Task",
      "Parameters": {
        "step": "landing-to-raw-file-transfer",
        "bucket_name": "bucketname",
        "source_prefix": "data/landing/",
        "destination_prefix": "data/raw/",
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:generic-file-transfer-status",
      "Retry": [
        {
          "ErrorEquals": [
            "LandingToRawFileTransferIncompleteException"
          ],
          "IntervalSeconds": 30,
          "BackoffRate": 2,
          "MaxAttempts": 5
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 30,
          "BackoffRate": 2,
          "MaxAttempts": 5
        }
      ],
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "LandingToRawFileTransferStatusFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "LandingToRawFileTransferStatusFailed"
        }
      ],
      "Next": "LandingToRawFileTransferStatusPassed"
    },
    "LandingToRawFileTransferStatusFailed": {
      "Type": "Fail",
      "Cause": "Landing To Raw File Transfer failed"
    },
    "LandingToRawFileTransferStatusPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Next": "StartRawCrawler"
    },
    "StartRawCrawler": {
      "Comment": "Crawls data from raw zone and adds table definition to the specified Database. IF table definition exists updates the definition.",
      "Type": "Task",
      "Parameters": {
        "crawler_name": "raw",
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:generic-crawler-invoke",
      "TimeoutSeconds": 4500,
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "StartRawCrawlerFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "StartRawCrawlerFailed"
        }
      ],
      "Next": "StartRawCrawlerPassed"
    },
    "StartRawCrawlerFailed": {
      "Type": "Fail",
      "Cause": "Crawler invocation failed"
    },
    "StartRawCrawlerPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Next": "RawCrawlerStatus"
    },
    "RawCrawlerStatus": {
      "Comment": "Checks whether crawler is successfully completed.",
      "Type": "Task",
      "Parameters": {
        "crawler_name": "raw",
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:generic-crawler-status",
      "Retry": [
        {
          "ErrorEquals": [
            "CrawlerInProgressException"
          ],
          "IntervalSeconds": 30,
          "BackoffRate": 2,
          "MaxAttempts": 5
        },
        {
          "ErrorEquals": [
            "States.All"
          ],
          "IntervalSeconds": 30,
          "BackoffRate": 2,
          "MaxAttempts": 5
        }
      ],
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "RawCrawlerStatusFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "RawCrawlerStatusFailed"
        }
      ],
      "Next": "RawCrawlerStatusPassed"
    },
    "RawCrawlerStatusFailed": {
      "Type": "Fail",
      "Cause": "Crawler invocation failed"
    },
    "RawCrawlerStatusPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Next": "GlueJob"
    },
    "GlueJob": {
      "Comment": "Invokes Glue job and waits for Glue job to complete.",
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "retail-raw-refined",
        "Arguments": {
          "--refined_prefix": "data/refined",
          "--year.$": "$.year",
          "--month.$": "$.month",
          "--day.$": "$.day"
        }
      },
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "GlueJobFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "GlueJobFailed"
        }
      ],
      "Next": "GlueJobPassed"
    },
    "GlueJobFailed": {
      "Type": "Fail",
      "Cause": "Crawler invocation failed"
    },
    "GlueJobPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.Arguments.--year",
        "month.$": "$.Arguments.--month",
        "day.$": "$.Arguments.--day"
      },
      "Next": "StartRefinedCrawler"
    },
    "StartRefinedCrawler": {
      "Comment": "Crawls data from refined zone and adds table definition to the specified Database.",
      "Type": "Task",
      "Parameters": {
        "crawler_name": "refined",
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:generic-crawler-invoke",
      "TimeoutSeconds": 4500,
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "StartRefinedCrawlerFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "StartRefinedCrawlerFailed"
        }
      ],
      "Next": "StartRefinedCrawlerPassed"
    },
    "StartRefinedCrawlerFailed": {
      "Type": "Fail",
      "Cause": "Crawler invocation failed"
    },
    "StartRefinedCrawlerPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Next": "RefinedCrawlerStatus"
    },
    "RefinedCrawlerStatus": {
      "Comment": "Checks whether crawler is successfully completed.",
      "Type": "Task",
      "Parameters": {
        "crawler_name": "refined",
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:generic-crawler-status",
      "Retry": [
        {
          "ErrorEquals": [
            "CrawlerInProgressException"
          ],
          "IntervalSeconds": 30,
          "BackoffRate": 2,
          "MaxAttempts": 5
        },
        {
          "ErrorEquals": [
            "States.All"
          ],
          "IntervalSeconds": 30,
          "BackoffRate": 2,
          "MaxAttempts": 5
        }
      ],
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "RefinedCrawlerStatusFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "RefinedCrawlerStatusFailed"
        }
      ],
      "Next": "RefinedCrawlerStatusPassed"
    },
    "RefinedCrawlerStatusFailed": {
      "Type": "Fail",
      "Cause": "Crawler invocation failed"
    },
    "RefinedCrawlerStatusPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Next": "LandingToArchiveFileTransfer"
    },
    "LandingToArchiveFileTransfer": {
      "Comment": "Transfers files from landing zone to archived zone",
      "Type": "Task",
      "Parameters": {
        "step": "landing-to-archive-file-transfer",
        "bucket_name": "bucketname",
        "source_prefix": "data/landing/",
        "destination_prefix": "data/raw/",
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:generic-file-transfer",
      "TimeoutSeconds": 4500,
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "LandingToArchiveFileTransferFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "LandingToArchiveFileTransferFailed"
        }
      ],
      "Next": "LandingToArchiveFileTransferPassed"
    },
    "LandingToArchiveFileTransferFailed": {
      "Type": "Fail",
      "Cause": "Crawler invocation failed"
    },
    "LandingToArchiveFileTransferPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Next": "LandingToArchiveFileTransferStatus"
    },
    "LandingToArchiveFileTransferStatus": {
      "Comment": "Checks whether all files are copied from landing to archived successfully.",
      "Type": "Task",
      "Parameters": {
        "step": "landing-to-archive-file-transfer",
        "bucket_name": "bucketname",
        "source_prefix": "data/landing/",
        "destination_prefix": "data/raw/",
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "Resource": "arn:aws:lambda:us-west-2:XXXXXXXXXXXX:function:generic-file-transfer-status",
      "Retry": [
        {
          "ErrorEquals": [
            "LandingToArchiveFileTransferInCompleteException"
          ],
          "IntervalSeconds": 30,
          "BackoffRate": 2,
          "MaxAttempts": 5
        },
        {
          "ErrorEquals": [
            "States.All"
          ],
          "IntervalSeconds": 30,
          "BackoffRate": 2,
          "MaxAttempts": 5
        }
      ],
      "Catch": [
        {
          "ErrorEquals": [
            "States.TaskFailed"
          ],
          "Next": "LandingToArchiveFileTransferStatusFailed"
        },
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "LandingToArchiveFileTransferStatusFailed"
        }
      ],
      "Next": "LandingToArchiveFileTransferStatusPassed"
    },
    "LandingToArchiveFileTransferStatusFailed": {
      "Type": "Fail",
      "Cause": "LandingToArchiveFileTransfer invocation failed"
    },
    "LandingToArchiveFileTransferStatusPassed": {
      "Type": "Pass",
      "ResultPath": "$",
      "Parameters": {
        "year.$": "$.year",
        "month.$": "$.month",
        "day.$": "$.day"
      },
      "End": true
    }
  }
}

After updating the AWS Step Functions definition, the visual workflow looks like the following.

Now upload a file to the data/landing/ zone of the bucket where the Lambda trigger has been configured. The execution of the Step Function starts, and the visual workflow looks like the following.

In the RawCrawlerStatus step, if the Lambda keeps failing, we retry for a while and then mark the execution as failed. If the Step Function runs successfully, the visual workflow looks like the following.
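The state machine above expects year, month, and day in its execution input. A minimal sketch of the trigger Lambda that derives these from the uploaded object's key and starts the execution; the key layout, state machine ARN, and helper name here are hypothetical and should be adapted to your setup:

```python
import json

def execution_input_from_key(key):
    """Extract the year/month/day partition values the state machine expects,
    assuming keys like data/landing/<year>/<month>/<day>/<file>."""
    parts = key.split("/")
    return {"year": parts[2], "month": parts[3], "day": parts[4]}

def lambda_handler(event, context):
    # The S3 event notification carries the key of the uploaded object.
    key = event["Records"][0]["s3"]["object"]["key"]
    import boto3  # deferred import so the parsing helper stays testable offline
    sfn = boto3.client("stepfunctions")
    return sfn.start_execution(
        stateMachineArn="arn:aws:states:us-west-2:XXXXXXXXXXXX:stateMachine:data-pipeline",
        input=json.dumps(execution_input_from_key(key)),
    )
```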

Machine Learning workflow using Amazon SageMaker

The final step in this data pipeline is to make the processed data available in a Jupyter notebook instance of Amazon SageMaker. Jupyter notebooks are popular among data scientists for exploratory data analysis and for building and training machine learning models.

Create Notebook Instance in Amazon SageMaker

Step 1: In the Amazon SageMaker console, choose Create notebook instance.

Step 2: In the Notebook instance settings, populate the Notebook instance name, choose an instance type depending on your data size, and choose a role for the notebook instance to interact with Amazon S3. The SageMaker execution role needs permissions for Athena, the S3 buckets where the data resides, and KMS if the data is encrypted.

Step 3: Wait for the notebook instance to be created and its status to change to InService.

Step 4: Choose Open Jupyter, which opens the notebook interface in a new browser tab.

Click New to create a new notebook in Jupyter. Amazon SageMaker provides several kernels for Jupyter, including support for Python 2 and 3, MXNet, TensorFlow, and PySpark. Choose a Python kernel for this exercise, as it comes with the Pandas library built in.

Step 5: Within the notebook, execute the following commands to install PyAthena. PyAthena is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena.

import sys
!{sys.executable} -m pip install PyAthena

Step 6: After PyAthena is installed, you can connect to Athena and populate Pandas data frames. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing and modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. Pandas is the ideal tool for all of these tasks.

from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='<ATHENA QUERY RESULTS LOCATION>',
               region_name='REGION, for example, us-east-1')

df = pd.read_sql("SELECT * FROM <DATABASE>.<TABLENAME> limit 10;", conn)
df

As shown above, the dataframe always stays consistent with the latest incoming data because of the data engineering pipeline setup earlier in the ML workflow. This dataframe can be used for downstream ad-hoc model building purposes or for exploratory data analysis.
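As a toy illustration of those munging and analysis stages, here is the same pattern on a hand-built frame standing in for the Athena query result (the column names are made up for the example):

```python
import pandas as pd

# Stand-in for the frame returned by pd.read_sql(...) against Athena.
df = pd.DataFrame({
    "category": ["a", "a", "b", None],
    "amount": [10.0, None, 5.0, 3.0],
})

# Munging: drop rows with no category, fill missing amounts with zero.
clean = df.dropna(subset=["category"]).fillna({"amount": 0.0})

# Analysis: aggregate the cleaned data.
summary = clean.groupby("category")["amount"].sum()
print(summary.to_dict())  # {'a': 10.0, 'b': 5.0}
```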

That’s it folks. Thanks for the read.

This story is authored by PV Subbareddy. Subbareddy is a Big Data Engineer specializing in cloud big data services and the Apache Spark ecosystem.

Advanced Analytics – Presto Functions and Operators Quick Review

This post is a lot different from our earlier entries. Think of it as a reference flag post for people interested in a quick lookup of advanced analytics functions and operators used in modern data lake operations based on Presto. You can of course use it with Presto installations, but also with commercial products such as Amazon Athena, which is widely used these days to facilitate analytic operations on enterprise data lakes built on top of Amazon S3.

Without further ado, let’s dive straight into the nuts and bolts of these queries for advanced analytics:

JSON Functions

is_json_scalar(json) → boolean

  • Determines if json is a scalar (i.e. a JSON number, a JSON string, true, false or null).
  • Example:
SELECT is_json_scalar('1')           -- true
SELECT is_json_scalar('[1, 2, 3]')   -- false

json_array_contains(json, value) → boolean

  • Determines if value exists in json (a string containing a JSON array)
  • Example:
SELECT json_array_contains('[1, 2, 3]', 2)   -- true

json_array_get(json_array, index) → json

  • Example:
SELECT json_array_get('["a", [3, 9], "c"]', 0)    -- a
SELECT json_array_get('["a", [3, 9], "c"]', 10)   -- null
SELECT json_array_get('["c", [3, 9], "a"]', -2)   -- JSON '[3,9]'

json_array_length(json) → bigint

  • Returns the array length of json (a string containing a JSON array)
  • Example:
SELECT json_array_length('[1, 2, 3]')   -- 3

json_extract(json, json_path) → json

  • Evaluates the JSONPath-like expression json_path on json (a string containing JSON) and returns the result as a JSON string
  • Example:
SELECT json_extract(json_parse('{"email": {"abcd": "xyz@yahoo.com"}, "phone_numbers": [5678908, 587578575, 668798]}'), '$.email.abcd')   -- "xyz@yahoo.com"

json_extract_scalar(json, json_path) → varchar

  • Just like json_extract(), but returns the result value as a string (as opposed to being encoded as JSON). The value referenced by json_path must be a scalar (boolean, number or string)
  • Example:
SELECT json_extract_scalar(json_parse('{"email": {"abcd": "xyz@yahoo.com"}, "phone_numbers": [5678908, 587578575, 9999999]}'), '$.phone_numbers[0]')   -- 5678908
SELECT json_extract_scalar(json_parse('{"email": {"abcd": "xyz@yahoo.com"}, "phone_numbers": [{"home": 5678908}, {"mob": 587578575}, {"cell": 9999999}]}'), '$.phone_numbers[1].mob')   -- 587578575
SELECT json_extract_scalar(json_parse('{"email": "xyz@yahoo.com", "phone_numbers": [{"home": 5678908}, {"mob": 587578575}, {"cell": 9999999}]}'), '$.email')   -- xyz@yahoo.com

json_parse(string) → json

  • Returns the JSON value deserialized from the input JSON text. This is an inverse function to json_format()
  • Example:
SELECT json_parse('{"email": {"abcd": "xyz@yahoo.com"}, "phone_numbers": [{"home": 5678908}, {"mob": 587578575}, {"cell": 9999999}]}')
-- {"email": {"abcd": "xyz@yahoo.com"}, "phone_numbers": [{"home": 5678908}, {"mob": 587578575}, {"cell": 9999999}]}

json_size(json, json_path) → bigint

  • Just like json_extract(), but returns the size of the value. For objects or arrays, the size is the number of members, and the size of a scalar value is zero.
  • Example:
SELECT json_size(json_parse('{"email": "xyz@yahoo.com", "phone_numbers": [{"home": 5678908}, {"mob": 587578575}, {"cell": 9999999}]}'), '$.phone_numbers')   -- 3
SELECT json_size('{"x": {"a": 1, "b": 2}}', '$.x.a')   -- 0
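As a sanity check on the JSONPath examples above, the same lookups can be reproduced with Python's standard json module, where each path segment becomes ordinary dict/list indexing:

```python
import json

# The same document used in the json_extract_scalar examples.
doc = json.loads('{"email": {"abcd": "xyz@yahoo.com"}, '
                 '"phone_numbers": [5678908, 587578575, 9999999]}')

print(doc["phone_numbers"][0])    # $.phone_numbers[0] -> 5678908
print(doc["email"]["abcd"])       # $.email.abcd       -> xyz@yahoo.com
print(len(doc["phone_numbers"]))  # json_size on $.phone_numbers -> 3
```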

Date and Time Functions and Operators

current_date → date

  • Returns the current date as of the start of the query.

current_time → time with time zone

  • Returns the current time, with time zone, as of the start of the query.

current_timestamp → timestamp with time zone

  • Returns the current timestamp as of the start of the query.

current_timezone() → varchar

  • Returns the current time zone in the format defined by IANA (e.g., America/Los_Angeles) or as a fixed offset from UTC (e.g., +08:35)

date(x) → date

  • This is an alias for CAST(x AS date).
  • Example:
SELECT DATE('2019-08-07')   -- same as SELECT CAST('2019-08-07' AS DATE)

now() → timestamp with time zone

  • This is an alias for current_timestamp.
  • SELECT now();

date_trunc(unit, x) → [same as input]

  • Returns x truncated to unit.
  • Example:
SELECT date_trunc('second', current_timestamp)   -- 2019-08-16 06:51:29.000 UTC (truncated to the given unit)
SELECT date_trunc('minute', current_timestamp)   -- 2019-08-16 06:54:00.000 UTC

Interval Functions

Unit         Description
millisecond  Milliseconds
second       Seconds
minute       Minutes
hour         Hours
day          Days
week         Weeks
month        Months
quarter      Quarters of a year
year         Years

date_add(unit, value, timestamp) → [same as input]

  • Adds an interval value of type unit to timestamp. Subtraction can be performed by using a negative value.
  • Example:
SELECT date_add('month', 1, current_timestamp)   -- 2019-09-16 06:59:55.425 UTC
SELECT date_add('day', 1, current_timestamp)     -- 2019-08-17 07:01:26.834 UTC

date_diff(unit, timestamp1, timestamp2) → bigint

  • Returns timestamp2 – timestamp1 expressed in terms of unit.
  • Example:
SELECT date_diff('day', current_timestamp, date_add('day', 10, current_timestamp))   -- 10

parse_duration(string) → interval

  • Parses string of format value unit into an interval, where value is a fractional number of unit values
  • Example:
SELECT parse_duration('42.8ms')   -- 0 00:00:00.043
SELECT parse_duration('3.81 d')   -- 3 19:26:24.000
SELECT parse_duration('5m')       -- 0 00:05:00.000

date_format(timestamp, format) → varchar

  • Formats timestamp as a string using format (converts timestamp to string)
  • Example:
SELECT date_format(current_timestamp, '%Y-%m-%d')   -- 2019-08-16

date_parse(string, format) → timestamp

  • Parses string into a timestamp using format. (converts string to timestamp)
  • Example:
SELECT date_parse('2019-08-16', '%Y-%m-%d')   -- 2019-08-16 00:00:00.000

day(x) → bigint

  • Returns the day of the month from x.
  • Example:
SELECT day(date_parse('2019-08-16', '%Y-%m-%d'))   -- 16

day_of_month(x) → bigint

  • This is an alias for day().

day_of_week(x) → bigint

  • Returns the ISO day of the week from x. The value ranges from 1 (Monday) to 7 (Sunday).
  • Example:
SELECT day_of_week(date_parse('2019-08-16', '%Y-%m-%d'))   -- 5

year(x) → bigint

  • Returns the year from x.

Aggregate Functions

array_agg(x) → array<[same as input]>

  • Returns an array created from the input x elements
  • The array_agg() function is an aggregate function that accepts a set of values and returns an array in which each value in the input set is assigned to an element of the array.
  • Syntax:
    array_agg(expression [ORDER BY [sort_expression {ASC | DESC}], […])
  • Example:
SELECT title, array_agg(first_name || ' ' || last_name) AS actors FROM film GROUP BY title
SELECT title, array_agg(first_name || ' ' || last_name ORDER BY first_name) AS actors FROM film GROUP BY title

avg(x) → double

  • Returns the average (arithmetic mean) of all input values

bool_and(boolean) → boolean

  • Returns TRUE if every input value is TRUE, otherwise FALSE.

bool_or(boolean) → boolean

  • Returns TRUE if any input value is TRUE, otherwise FALSE.

count(*) → bigint

  • Returns the number of input rows.

count(x) → bigint

  • Returns the number of non-null input values.

count_if(x) → bigint

  • Returns the number of TRUE input values. This function is equivalent to count(CASE WHEN x THEN 1 END).

arbitrary(x) → [same as input]

  • Returns an arbitrary non-null value of x, if one exists.
  • arbitrary chooses one value out of a set of values. It is useful for silencing warnings about values that are neither grouped by nor aggregated over.

max_by(x, y) → [same as x]

  • Returns the value of x associated with the maximum value of y over all input values.
  • max_by takes two arguments and returns the value of the first argument for which the value of the second argument is maximized.
  • If multiple rows maximize the second value, an arbitrary first value is chosen from among them. max_by can be used with both numeric and non-numeric data.
  • Example:
SELECT max_by(close_date, close_value) AS date_of_max_sale FROM sales_pipeline
-- returns the close_date of the row where close_value is maximum

max_by(x, y, n) → array<[same as x]>

  • Returns n values of x associated with the n largest of all input values of y in descending order of y.

min_by(x, y) → [same as x]

  • Returns the value of x associated with the minimum value of y over all input values.

min_by(x, y, n) → array<[same as x]>

  • Returns n values of x associated with the n smallest of all input values of y in ascending order of y.

max(x, n) → array<[same as x]>

  • Returns n largest values of all input values of x.

min(x) → [same as input]

  • Returns the minimum value of all input values.

min(x, n) → array<[same as x]>

  • Returns n smallest values of all input values of x.

reduce_agg(inputValue T, initialState S, inputFunction(S, T, S), combineFunction(S, S, S)) → S

  • Reduces all input values into a single value. inputFunction will be invoked for each input value. In addition to taking the input value, inputFunction takes the current state, initially initialState, and returns the new state. combineFunction will be invoked to combine two states into a new state. The final state is returned
  • Example:
SELECT id, reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> a + b)
FROM (VALUES (1, 2), (1, 3), (1, 4), (2, 20), (2, 30), (2, 40)) AS t(id, value)
GROUP BY id
-- (1, 9), (2, 90)

Array Functions and Operators

Subscript Operator: [ ]

  • The [ ] operator is used to access an element of an array and is indexed starting from one:
  • If a column has values like [1, 3], [45, 46]
  • Example:
SELECT column[1]   -- 1, 45 (one value per row)

Concatenation Operator: ||

  • The || operator is used to concatenate an array with an array or an element of the same type.
  • Example:
SELECT ARRAY [1] || ARRAY [2]   -- [1, 2]
SELECT ARRAY [1] || 2           -- [1, 2]
SELECT 2 || ARRAY [1]           -- [2, 1]

array_distinct(x) → array

  • Remove duplicate values from the array x.
  • Example:
SELECT array_distinct(ARRAY[1, 1, 1, 2])   -- [1, 2]

array_intersect(x, y) → array

  • Returns an array of the elements in the intersection of x and y, without duplicates.
  • Common elements in two arrays without duplicates.
  • Example:
SELECT array_intersect(ARRAY[1, 1, 1, 2], ARRAY[10, 15, 1, 100])   -- [1]

array_union(x, y) → array

  • Returns an array of the elements in the union of x and y, without duplicates.
  • Example:
SELECT array_union(ARRAY[1, 1, 1, 2], ARRAY[10, 15, 1, 100])   -- [1, 2, 10, 15, 100]

array_except(x, y) → array

  • Returns an array of elements in x but not in y, without duplicates.
  • Example:
SELECT array_except(ARRAY[1, 1, 1, 2], ARRAY[10, 15, 1, 100])   -- [2]

array_join(x, delimiter, null_replacement) → varchar

  • Concatenates the elements of the given array using the delimiter and an optional string to replace nulls.
  • Example:
SELECT array_join(ARRAY[1, 71, 81, 92, null], '/', 'abcd')   -- 1/71/81/92/abcd

array_max(x) → x

  • Returns the maximum value of input array.
  • Example:
SELECT array_max(ARRAY[1, 71, 81, 92, 100])    -- 100
SELECT array_max(ARRAY[1, 71, 81, 92, null])   -- null

array_min(x) → x

  • Returns the minimum value of input array.
  • Example:
SELECT array_min(ARRAY[1, 71, 81, 92, 100])   -- 1

array_position(x, element) → bigint

  • Returns the position of the first occurrence of the element in array x (or 0 if not found).
  • Example:
SELECT array_position(ARRAY[1, 71, 81, 92, 100], 100)   -- 5

array_remove(x, element) → array

  • Remove all elements that equal element from array x.
  • Example:
SELECT array_remove(ARRAY[1, 71, 81, 92, 100, 1], 1)   -- [71, 81, 92, 100]

array_sort(x) → array

  • Sorts and returns the array x. The elements of x must be orderable. Null elements will be placed at the end of the returned array.
  • Example:
SELECT array_sort(ARRAY[1, null, 8, 9, 71, 81, 92, 100, 12, 51, 10, 7, 1, null])   -- [1, 1, 7, 8, 9, 10, 12, 51, 71, 81, 92, 100, null, null]

array_sort(array(T), function(T, T, int)) → array(T)

  • Sorts and returns the array based on the given comparator function. The comparator will take two nullable arguments representing two nullable elements of the array. It returns -1, 0, or 1 as the first nullable element is less than, equal to, or greater than the second nullable element. 
  • If the comparator function returns other values (including NULL), the query will fail and raise an error.
  • Example:
SELECT array_sort(ARRAY [3, 2, 5, 1, 2], (x, y) -> IF(x < y, 1, IF(x = y, 0, -1)))   -- [5, 3, 2, 2, 1]

cardinality(x) → bigint

  • Returns the cardinality (size) of the array x.
  • Example:
SELECT cardinality(ARRAY[1, 81, 92, 100, 12, 51, 10])   -- 7

arrays_overlap(x, y) → boolean

  • Tests if arrays x and y have any non-null elements in common. Returns null if there are no non-null elements in common but either array contains null.
  • Example:
SELECT arrays_overlap(ARRAY[1, 81, 92, 100, 12, 51, 10], ARRAY[101])       -- false
SELECT arrays_overlap(ARRAY[1, 81, 92, 100, 12, 51, 10], ARRAY[101, 51])   -- true

concat(array1, array2, …, arrayN) → array

  • Concatenates the arrays array1, array2, …, arrayN. This function provides the same functionality as the SQL-standard concatenation operator (||).

contains(x, element) → boolean

  • Returns true if the array x contains the element.
  • Example:
SELECT contains(ARRAY[1, 81, 92, 100, 12, 51, 10], 1)   -- true
SELECT contains(ARRAY['abcd', 'test', 'xyz'], 'xyz')    -- true

element_at(array(E), index) → E

  • SQL array indices start at 1
  • Returns element of array at given index. If index > 0, this function provides the same functionality as the SQL-standard subscript operator ([]). If index < 0, element_at accesses elements from the last to the first.
  • Example:
SELECT element_at(ARRAY[1, 81, 92, 100, 12, 51, 10], 2)   -- 81

filter(array(T), function(T, boolean)) → array(T)

  • Constructs an array from those elements of array for which function returns true.
  • Example:
SELECT filter(ARRAY [5, -6, NULL, 7], x -> x > 0)   -- [5, 7]

flatten(x) → array

  • Flattens an array(array(T)) to an array(T) by concatenating the contained arrays.

reduce(array(T), initialState S, inputFunction(S, T, S), outputFunction(S, R)) → R

  • Returns a single value reduced from array. inputFunction will be invoked for each element in array in order. 
  • In addition to taking the element, inputFunction takes the current state, initially initialState, and returns the new state. outputFunction will be invoked to turn the final state into the result value.
  • It may be the identity function (i -> i).
  • Example:
SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + COALESCE(x, 0), s -> s)   -- 75
SELECT reduce(ARRAY [5, 20, NULL, 50], 1, (s, x) -> s * COALESCE(x, 1), s -> s)   -- 5000
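The inputFunction/outputFunction semantics above map directly onto Python's functools.reduce, which can serve as a quick cross-check of the two queries:

```python
from functools import reduce

values = [5, 20, None, 50]

# (s, x) -> s + COALESCE(x, 0), starting state 0
total = reduce(lambda s, x: s + (x if x is not None else 0), values, 0)

# (s, x) -> s * COALESCE(x, 1), starting state 1
product = reduce(lambda s, x: s * (x if x is not None else 1), values, 1)

print(total, product)  # 75 5000
```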

repeat(element, count) → array

  • Repeat element for count times

reverse(x) → array

  • Returns an array which has the reversed order of array x.
  • Example:
SELECT reverse(ARRAY[2, 5])   -- [5, 2]

sequence(start, stop) → array(bigint)

  • Generate a sequence of integers from start to stop, increments by 1 if start is less than or equal to stop, otherwise decrements by 1.
  • Example:
SELECT sequence(2, 7)   -- [2, 3, 4, 5, 6, 7]
SELECT sequence(7, 1)   -- [7, 6, 5, 4, 3, 2, 1]

sequence(start, stop, step) → array(bigint)

  • Generate a sequence of integers from start to stop, incrementing by step.

sequence(start, stop) → array(date)

  • Generate a sequence of dates from start date to stop date, incrementing by 1 day if start date is less than or equal to stop date, otherwise -1 day.

sequence(start, stop, step) → array(date)

  • Generate a sequence of dates from start to stop, incrementing by step. The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.

shuffle(x) → array

  • Generate a random permutation of the given array x.
  • Example:
SELECT shuffle(ARRAY[1, 7, 2])   -- [7, 1, 2] (order is random)

slice(x, start, length) → array

  • Subsets array x starting from index start (or starting from the end if start is negative) with a length of length.
  • Example:
SELECT slice(ARRAY[1, 7, 2, 87, 12, 9], 2, 4)   -- [7, 2, 87, 12]

transform(array(T), function(T, U)) → array(U)

  • Returns an array that is the result of applying function to each element of array
  • Example:
SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1)   -- [6, 1, 7]
SELECT transform(ARRAY ['x', 'abc', 'z'], x -> x || '0')        -- ['x0', 'abc0', 'z0']

zip(array1, array2[, …]) → array(row)

  • Merges the given arrays, element-wise, into a single array of rows. The M-th element of the N-th argument will be the N-th field of the M-th output element. If the arguments have an uneven length, missing values are filled with NULL.
  • Example:
SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b'])   -- [ROW(1, '1b'), ROW(2, null), ROW(null, '3b')]
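Presto's null-padding here behaves like Python's itertools.zip_longest, which makes for an easy cross-check of the example:

```python
from itertools import zip_longest

# Mirrors zip(ARRAY[1, 2], ARRAY['1b', null, '3b']): the shorter
# array is padded with nulls (None) to the length of the longer one.
rows = list(zip_longest([1, 2], ['1b', None, '3b'], fillvalue=None))
print(rows)  # [(1, '1b'), (2, None), (None, '3b')]
```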

zip_with(array(T), array(U), function(T, U, R)) → array(R)

  • Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function:
  • Example:
SELECT zip_with(ARRAY[1, 3, 5], ARRAY['a', 'b', 'c'], (x, y) -> (y, x))   -- [ROW('a', 1), ROW('b', 3), ROW('c', 5)]

Window functions

Window functions perform calculations across rows of the query result. They run after the HAVING clause but before the ORDER BY clause. Invoking a window function requires special syntax using the OVER clause to specify the window. A window has three components:

  1. The partition specification, which separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions.
  2. The ordering specification, which determines the order in which input rows will be processed by the window function.
  3. The window frame, which specifies a sliding window of rows to be processed by the function for a given row. If the frame is not specified, it defaults to RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This frame contains all rows from the start of the partition up to the last peer of the current row.

rank() → bigint

  • Returns the rank of a value in a group of values. The rank is one plus the number of rows preceding the row that are not peers with the row. Thus, tie values in the ordering will produce gaps in the sequence. The ranking is performed for each window partition.
  • As shown below, the rank function produces a numerical rank within the current row’s partition for each distinct ORDER BY value, in the order defined by the ORDER BY clause. rank needs no explicit parameter, because its behavior is entirely determined by the OVER clause.
  • Query:
WITH dataset AS 
    (SELECT 'name' AS rows, ARRAY['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'e'] AS sample )
SELECT rows,
         value,
         rank()
    OVER (ORDER BY value) AS rank
FROM dataset
CROSS JOIN UNNEST(sample) AS t(value)
  • Output:
rows   value   rank
name   a       1
name   a       1
name   a       1
name   b       4
name   b       4
name   c       6
name   c       6
name   d       8
name   e       9

row_number() → bigint

  • Returns a unique, sequential number for each row, starting with one, according to the ordering of rows within the window partition.
  • Query:
WITH dataset AS 
    (SELECT 'name' AS rows, ARRAY['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'e'] AS sample )
SELECT rows,
         value,
         row_number()
    OVER (ORDER BY value) AS row_num
FROM dataset
CROSS JOIN UNNEST(sample) AS t(value)
  • Output:
rows   value   row_num
name   a       1
name   a       2
name   a       3
name   b       4
name   b       5
name   c       6
name   c       7
name   d       8
name   e       9

dense_rank() → bigint

  • Returns the rank of a value in a group of values. This is similar to rank(), except that tie values do not produce gaps in the sequence.
  • Query: 
WITH dataset AS 
    (SELECT 'name' AS rows, ARRAY['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'e'] AS sample )
SELECT rows,
         value,
         dense_rank()
    OVER (ORDER BY value) AS dense_rank
FROM dataset
CROSS JOIN UNNEST(sample) AS t(value)
  • Output: 
rows   value   dense_rank
name   a       1
name   a       1
name   a       1
name   b       2
name   b       2
name   c       3
name   c       3
name   d       4
name   e       5
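As a cross-check, the three ranking functions can be emulated in a few lines of Python on the same sample values:

```python
values = ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'e']
ordered = sorted(values)

# row_number(): a unique sequential number per row.
row_number = list(range(1, len(ordered) + 1))

# rank(): one plus the number of rows strictly before the row's peers.
rank = [1 + sum(1 for w in ordered if w < v) for v in ordered]

# dense_rank(): like rank(), but ties leave no gaps.
distinct = sorted(set(ordered))
dense_rank = [distinct.index(v) + 1 for v in ordered]

print(rank)        # [1, 1, 1, 4, 4, 6, 6, 8, 9]
print(dense_rank)  # [1, 1, 1, 2, 2, 3, 3, 4, 5]
```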

Map Functions and Operators

map(array(K), array(V)) → map(K, V)

  • Returns a map created using the given key/value arrays.
  • Example:
SELECT map(ARRAY[1,3], ARRAY[2,4])                        -- {1=2, 3=4}
SELECT map(ARRAY['k1','k2'], ARRAY['value1', 'value2'])   -- {k1=value1, k2=value2}

Subscript Operator: [ ]

  • The [ ] operator is used to retrieve the value corresponding to a given key from a map.
  • Example:
SELECT map(ARRAY[1,3], ARRAY[2,4]) AS a                      -- {1=2, 3=4}
SELECT a[1] FROM (SELECT map(ARRAY[1,3], ARRAY[2,4]) AS a)   -- 2

element_at(map(K, V), key) → V

  • Returns value for given key, or NULL if the key is not contained in the map.
  • Example:
Query:  SELECT element_at(a, 'k1') FROM (SELECT map(ARRAY['k1','k2'], ARRAY['value1', 'value2']) AS a)
Output: value1

cardinality(x) → bigint

  • Returns the cardinality (size) of the map x.
  • Example:
Query:  SELECT cardinality(map(ARRAY[1,3], ARRAY[2,4]))
Output: 2

map() → map<unknown, unknown>

  • Returns an empty map.
  • Example:
Query:  SELECT map()
Output: {}

map_from_entries(array(row(K, V))) → map(K, V)

  • Returns a map created from the given array of entries.
  • Example:
Query:  SELECT map_from_entries(ARRAY[(1, 'x'), (2, 'y')])
Output: {1 -> 'x', 2 -> 'y'}

map_agg(key, value) → map(K, V)

  • Returns a map created from the input key/value pairs. Note that map_agg is an aggregate function, so it operates over a group of rows.
  • Example:
Query:  SELECT map_agg(k, v) FROM (VALUES (1, 'a'), (2, 'b')) AS t(k, v)
Output: {1=a, 2=b}

multimap_from_entries(array(row(K, V))) → map(K, array(V))

  • Returns a multimap created from the given array of entries. Each key can be associated with multiple values.
  • Example:
Query:  SELECT multimap_from_entries(ARRAY[(1, 'x'), (2, 'y'), (1, 'z')])
Output: {1 -> ['x', 'z'], 2 -> ['y']}

map_concat(map1(K, V), map2(K, V), …, mapN(K, V)) → map(K, V)

  • Returns the union of all the given maps. If a key is found in multiple given maps, that key’s value in the resulting map comes from the last of those maps.
  • Example:
Query:  SELECT map_concat(map(ARRAY[1], ARRAY['a']), map(ARRAY[1, 2], ARRAY['b', 'c']))
Output: {1=b, 2=c}

map_filter(map(K, V), function(K, V, boolean)) → map(K, V)

  • Constructs a map from those entries of map for which function returns true.
  • Example:
Query:  SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true)
Output: {}
Query:  SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY['a', NULL, 'c']), (k, v) -> v IS NOT NULL)
Output: {10 -> a, 30 -> c}

map_keys(x(K, V)) → array(K)

  • Returns all the keys in the map x.
  • Example:
Query:  SELECT map_keys(map(ARRAY['k1','k2'], ARRAY['value1', 'value2']))
Output: [k1, k2]

map_values(x(K, V)) → array(V)

  • Returns all the values in the map x.
  • Example:
Query:  SELECT map_values(map(ARRAY['k1','k2'], ARRAY['value1', 'value2']))
Output: [value1, value2]

transform_keys(map(K1, V), function(K1, V, K2)) → map(K2, V)

  • Returns a map that applies function to each entry of map and transforms the keys.
  • Example:
Query:  SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1)
Output: {}
Query:  SELECT transform_keys(MAP(ARRAY[1, 2, 3], ARRAY['a', 'b', 'c']), (k, v) -> k + 1)
Output: {2 -> a, 3 -> b, 4 -> c}
Query:  SELECT transform_keys(MAP(ARRAY['a', 'b'], ARRAY[1, 2]), (k, v) -> k || CAST(v AS VARCHAR))
Output: {a1 -> 1, b2 -> 2}

transform_values(map(K, V1), function(K, V1, V2)) → map(K, V2)

  • Returns a map that applies function to each entry of map and transforms the values.
  • Example:
Query:  SELECT transform_values(MAP(ARRAY[], ARRAY[]), (k, v) -> v + 1)
Output: {}
Query:  SELECT transform_values(MAP(ARRAY[1, 2, 3], ARRAY[10, 20, 30]), (k, v) -> v + k)
Output: {1 -> 11, 2 -> 22, 3 -> 33}

Thanks for the read. Please feel free to reach out with your comments.

This story is authored by PV Subbareddy. Subbareddy is a Big Data Engineer specializing in Cloud Big Data Services and the Apache Spark Ecosystem.

Email Deliverability Analytics using SendGrid and AWS Big Data Services

In this post, we will run through a case study to set up an email deliverability analytics pipeline using SendGrid and AWS Big Data Services such as S3, Glue and Athena. When we send mails from SendGrid to recipients, we get responses (multiple response types are possible, such as processed, delivered, blocked, deferred etc.) from email service providers such as Gmail, Yahoo etc. We can use this response data to improve our email deliverability by analyzing it. This is achieved by logging the responses (via API Gateway and a Lambda function) into Amazon S3 and then analyzing them using Athena. The chain of events is put in place by using a webhook that triggers a POST request to AWS API Gateway on an event notification (response) from SendGrid. API Gateway is further configured to trigger a Lambda function, which writes the email response data into S3. We then use a Glue crawler to update the metadata in the data catalog, thereby making the data available for Athena to perform SQL-based analysis.
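To make the webhook payload concrete, here is a hypothetical JavaScript sketch (the same language as the Lambda function used later in this post) of the kind of event batch SendGrid POSTs to the endpoint, and the kind of per-event-type count Athena will eventually compute with SQL. The field names and values are illustrative, not an exact SendGrid schema:

```javascript
// Hypothetical sample of the event batch SendGrid POSTs to the webhook:
// an array of response objects, one per recipient event.
const sampleBatch = [
  { email: 'a@example.com', event: 'processed', timestamp: 1588682400 },
  { email: 'a@example.com', event: 'delivered', timestamp: 1588682405 },
  { email: 'b@example.com', event: 'deferred',  timestamp: 1588682410 }
];

// Counts responses by event type -- the kind of aggregation Athena
// will later run with SQL over the data landed in S3.
function countByEvent(batch) {
  return batch.reduce((acc, e) => {
    acc[e.event] = (acc[e.event] || 0) + 1;
    return acc;
  }, {});
}

console.log(countByEvent(sampleBatch));
// { processed: 1, delivered: 1, deferred: 1 }
```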

Without further ado, let’s set the ball rolling. Go to SendGrid and select Settings > Mail Settings, then click on Event Notifications.

We enable it by providing an endpoint and selecting the events for which we want to receive a response.

The above endpoint points to the AWS API Gateway (shown below), which accepts the POST request and, as you can see, triggers the Lambda function.

Now our Lambda function stores the event payload data in the S3 bucket.
Lambda code:

const AWS = require('aws-sdk');
var s3Bucket = new AWS.S3( { params: {Bucket: "Your-Bucket"} } );

exports.handler = (event, context, callback) => {
    console.log(event); // the array of SendGrid event notifications
    let x = "";
    // convert the event array to newline-delimited JSON,
    // so each response becomes one line in the S3 object
    event.map((item) => {
        x = x + JSON.stringify(item) + "\n";
    });
    let uuid = create_UUID();
    var filePath = "receivelogs/" + uuid;
    console.log(filePath);
    var data = {
        Key: filePath,
        Body: x
    };
    s3Bucket.putObject(data, function(err, data) {
        if (err) {
            console.log('Error uploading data: ', err);
            callback(err, null);
        } else {
            console.log('Successfully uploaded the response');
            callback(null, data);
        }
    });
};

// this function generates a version-4 UUID, used as the file name
function create_UUID(){
   var dt = new Date().getTime();
   var uuid = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
       var r = (dt + Math.random()*16)%16 | 0;
       dt = Math.floor(dt/16);
       return (c=='x' ? r :(r&0x3|0x8)).toString(16);
   });
   return uuid;
}
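The Lambda’s core transformation can be pulled out and exercised locally. The following sketch isolates the newline-delimited JSON conversion as a hypothetical helper named toNdjson (not in the original Lambda), alongside the same UUID generator used above for unique S3 keys:

```javascript
// Each SendGrid event becomes one JSON line (NDJSON), so Glue and Athena
// can read the resulting S3 objects line by line.
function toNdjson(events) {
  return events.map((item) => JSON.stringify(item) + '\n').join('');
}

// Same version-4 UUID generator used in the Lambda above.
function create_UUID() {
  var dt = new Date().getTime();
  return 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function (c) {
    var r = (dt + Math.random() * 16) % 16 | 0;
    dt = Math.floor(dt / 16);
    return (c == 'x' ? r : (r & 0x3 | 0x8)).toString(16);
  });
}

// Example: two events become two JSON lines under a unique S3 key.
const body = toNdjson([{ event: 'delivered' }, { event: 'bounce' }]);
console.log(body);
console.log('receivelogs/' + create_UUID());
```

One JSON object per line is what lets the Glue crawler infer a usable table schema from these objects later on.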

When you send a mail, SendGrid sends the response via a POST request to API Gateway, and the Lambda function then stores that response in S3.

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. We use a crawler to populate the AWS Glue Data Catalog with tables. Below is the step-by-step process to set up the Glue crawler to read an S3-based data source and make it available as a database table for AWS Athena based analytics.

In the step above, you may need to create a new IAM role that provides access to the underlying S3 data.

So in the steps above, we have concluded the setup for the crawler to fetch the underlying data on S3.

When you run this crawler on the S3-based data source, it updates the metadata of the objects in that path in the Glue data catalog. Athena can then run SQL queries against those objects in S3 using the metadata available in the data catalog. Since many business executives aren’t comfortable with SQL queries, a useful add-on to this data pipeline could be AWS QuickSight for a more BI-driven analysis.

Thanks for the read!

This story is authored by Santosh Kumar. He is an AWS Cloud Engineer.