Real Time Streaming Data Analytics using Amazon Kinesis Family

Amazon Kinesis Data Analytics

Amazon Kinesis Data Analytics (KDA) is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time. KDS reduces the complexity of building, managing and integrating streaming applications with other AWS services. SQL users can easily query streaming data or build entire streaming applications using templates and an interactive SQL editor. Java developers can quickly build sophisticated streaming applications using open source Java libraries and AWS integrations to transform and analyze data in real-time.

For deep dive into Amazon Kinesis Data Analytics, please go through the official docs.

Amazon Kinesis Data Streams

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.

For more details about Amazon Kinesis Data Streams, please go through the official docs.

Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk, enabling near real-time analytics with existing business intelligence tools and dashboards you’re already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security.

For more details about Amazon Kinesis Data Firehose, please go through the official docs.

To Create an Amazon Kinesis Data Stream using Console

  • Open the Kinesis console at https://console.aws.amazon.com/kinesis.
  • In the navigation bar, expand the Region selector and choose a Region.
  • Choose Create data stream.
  • On the Create Kinesis stream page, enter a name for your stream and the number of shards you need, and then click Create Kinesis stream.
    On the Kinesis streams page, your stream’s Status is shown as Creating while the stream is being created. When the stream is ready to use, the Status changes to Active.

Amazon Kinesis Data Generator

The Amazon Kinesis Data Generator (KDG) makes it easy to send data to Kinesis Streams or Kinesis Firehose.

While following this link, choose to Create a Cognito User with Cloud Formation.

  • Choose Create a Cognito User with Cloud Formation.
  • After choosing the above option, console redirects to the Cloud Formation Stack creation page. The console looks like the following.
  • Click on Next, provide the CloudFormation Stack Name and provide username, password details for creating Cognito User for Kinesis Data Generator. 
  • Click on Next and choose Create Stack.
  • After Status of Stack changes to Create complete, click on Outputs tab and open the link under the outputs section.
  • After opening the above link, provide the username and password created in earlier steps.
  • Select Region and Stream/delivery name as created.
    The Record template is 
{
    "sensor_id": {{random.number(50)}},
    "current_temperature": {{random.number(
        {
            "min":0,
            "max":150
        }
    )}},
    "location": "{{random.arrayElement(
        ["AUS","USA","UK"]
    )}}"
}

To Create the Kinesis Data Analytics Application

  • Open the Kinesis Data Analytics console at https://console.aws.amazon.com/kinesisanalytics.
  • Choose Create application.
  • On the Create application page, type an application name, type a description, choose SQL for the application’s Runtime setting, and then choose Create application.

Doing this creates a Kinesis data analytics application with a status of READY. The console shows the application hub where you can configure input and output.

In the next step, you configure input for the application. In the input configuration, you add a streaming data source to the application and discover a schema for an in-application input stream by sampling data on the streaming source.

Configure Streaming Source as Input to Kinesis Data Analytics Application

  • On the Kinesis Analytics applications page in the console, choose Connect streaming data.
  • Source section, where you specify a streaming source for your application. You can select an existing stream source or create one. By default the console names the in-application input stream that is created as INPUT_SQL_STREAM_001. For this exercise, keep this name as it appears.
    Stream reference name – This option shows the name of the in-application input stream that is created, SOURCE_SQL_STREAM_001. You can change the name of the stream.
  • Choose Discover Schema, which automatically discovers the schema of input stream.
  • Choose Save and continue.
    Now, we have an application with input configuration added to it. In the next step, we will add SQL code to perform some analytics on the data in-application input stream.

 Real-Time Analytics on Input Stream Data

  • On the Kinesis Analytics applications page in the console, choose Go to SQL editor.
  • In the Would you like to start running “ApplicationName”? dialog box, choose Yes, start application.
  • The console opens the SQL editor page. Review the page, including the buttons (Add SQL from templates, Save and run SQL) and various tabs.
  • Run Analytics on the input stream data using the following sample query. This Query detects an anomaly in the input stream and sends the anomaly data to anomaly_data_stream and normal data to output_data_stream. Load the following query in SQL editor and choose Save and run SQL.
CREATE OR REPLACE STREAM "anomaly_data_stream" (
	"sensor_id" INTEGER,
	"current_temperature" INTEGER, 
	"location" VARCHAR(16));

CREATE OR REPLACE  PUMP "STREAM_PUMP_ANOMALY" AS INSERT INTO "anomaly_data_stream"
SELECT STREAM "sensor_id",
				"current_temperature",
				"location"
FROM "SOURCE_SQL_STREAM_001" WHERE "current_temperature" > 100;

CREATE OR REPLACE STREAM "output_data_stream" (
	"sensor_id" INTEGER,
	"current_temperature" INTEGER, 
	"location" VARCHAR(16));

CREATE OR REPLACE  PUMP "STREAM_PUMP_OUTPUT" AS INSERT INTO "output_data_stream"
SELECT STREAM "sensor_id",
				"current_temperature",
				"location"
FROM "SOURCE_SQL_STREAM_001" WHERE "current_temperature" < 100;

It creates the in-application stream output_data_stream and anomaly_data_stream.
It creates the pump STREAM_PUMP_OUTPUT and STREAM_PUMP_ANOMALY, and uses it to select rows from SOURCE_SQL_STREAM_001 and insert them in the output_data_stream and anomaly_data_stream. You can see the results in the Real-time analytics tab.

  • The SQL editor has the following tabs:

    The Source data tab shows an in-application input stream data that is mapped to the streaming source. Choose the in-application stream, and you can see data coming in. ROWTIME – Each row in an in-application stream has a special column called ROWTIME. This column is the timestamp when Amazon Kinesis Data Analytics inserted the row in the first in-application stream (the in-application input stream that is mapped to the streaming source).

    The Real-time Analytics tab shows all the other in-application streams created by your application code. It also includes the error stream. Choose DESTINATION_SQL_STREAM to view the rows your application code inserted. 

    The Destination tab shows the external destination where Kinesis Data Analytics writes the query results. We haven’t configured any external destination for our application output yet. 

To create a delivery stream from Kinesis Data Firehose to Amazon S3

  • Open the Kinesis Data Firehose console at https://console.aws.amazon.com/firehose/.
  • Choose Create Delivery Stream. In this case, the name of the stream is anomaly-delivery-stream.
  • On the Destination page, choose the following options.
    • Destination – Choose Amazon S3.
    • Delivery stream name – Type a name for the delivery stream
    • S3 bucket – Choose an existing bucket, or choose New S3 Bucket. If you create a new bucket, type a name for the bucket and choose the region your console is currently using.
    • S3 prefix – Stream stores data in the provided prefix. For anomaly data, the prefix becomes 
      data/anomaly/year=!{timestamp:YYYY}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
    • S3 error prefix – errors in delivering stream to s3, stores in error prefix.
  • Choose Next.
  • On the Configuration page, leave the fields at the default settings. The only required step is to select an IAM role that enables Kinesis Data Firehose to access your resources, as follows:
    1. For IAM Role, choose Select an IAM role.
    2. In the drop-down menu, under Create/Update existing IAM role, choose Firehose delivery IAM role, leave the fields at their default settings, and then choose Allow.
  • Choose Next.
  • Review your settings, and then choose Create Delivery Stream.

The anomaly-delivery-stream created successfully. In the same way, create another Firehose stream named output-delivery-stream.

Configuring Application Output to Amazon Kinesis Data Firehose

We can optionally add an output configuration to the application, to persist everything written from an in-application stream to an external destination such as an Amazon Kinesis data stream, a Kinesis Data Firehose delivery stream, or an AWS Lambda function.

In this application, we are connecting the in-application stream to a Kinesis Data Firehose delivery stream.

In the Destination Tab, choose in-application stream as anomaly_data_stream and Firehose stream as anomaly-delivery-stream and select the format as JSON. In this way configure for output_data_stream as well.
You can see the following after configuring:

Data writes into S3 using Kinesis Firehose Delivery Stream. Now we can query the data in Athena by running a Crawler once on that path.

Thanks for the read. Hope it was helpful.

This story is authored by PV Subbareddy. Subbareddy is a Big Data Engineer specializing on Cloud Big Data Services and Apache Spark Ecosystem.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.