Setting Up a Data Lake on AWS Cloud Using LakeFormation

Setting up a Data Lake involves multiple steps such as collecting, cleansing, moving, and cataloging data, and then securely making that data available for downstream analytics and Machine Learning. AWS LakeFormation simplifies these processes and also automates certain processes like data ingestion. In this post, we shall be learning how to build a very simple data lake using LakeFormation with hypothetical retail sales data.

AWS Lake Formation provides its own permissions model that augments the AWS IAM permissions model. This centrally defined permissions model enables fine-grained access to data stored in data lake through a simple grant/revoke mechanism. These permissions are enforced at the table and column level on the data catalogue and are mapped to the underlying objects in S3. LakeFormation permissions are applicable across the full portfolio of AWS analytics and Machine Learning services, including Amazon Athena and Amazon Redshift.

So, let’s get on with the setup.

Adding an administrator

First and foremost step in using LakeFormation is to create an administrator. An administrator has full access to LakeFormation system and initial access to data configuration and access permissions. 

After adding an administrator, navigate to the Dashboard using the sidebar. This illustrates the typical process of Data lake setup.

Register location

From Register and Ingest sub menu in the sidebar, If you wish to setup data ingestion, that is, import unprocessed/landing data, AWS LakeFormation comes with in-house Blueprints that one could use to build Workflows. These workflows could be scheduled as per the needs of the end-user. Sources of data for these workflows can be a JDBC source, log files and many more. Learn more about importing data using workflows here.

If your ingestion process doesn’t involve any of the above mentioned ways and writes directly to S3, it’s alright. Either way we end up registering that S3 location as one of the Data Lake locations.

Once created you shall see its listing in the Data Lake locations.

You could not only access this location from here but also set permission to objects stored in that path. If preferred, one could register lake locations precisely for each processing zone and set permissions accordingly. I registered it to the whole bucket.

I created 2 retail datasets (.csv), one with 20 records and the other with 5 records. I have uploaded one of the datasets (20 records) to S3 with raw/retail_sales prefix.

Creating a Database

Lake Formation internally uses the Glue Data Catalog, so it shows all the databases available. From the Data Catalog sub menu in the sidebar, navigate to Databases to create and manage all the databases. I created a database called merchandise with default permissions.

Once created, you shall see its listing, and also manage, grant/revoke permissions and view tables in that DB.

Creating Crawlers and ETL jobs

From the Register and Ingest sub menu in the sidebar, navigate to Crawlers, Jobs to create and manage all Glue related services. Lake Formation redirects to AWS Glue and internally uses it. I created a crawler to get the metadata for objects residing in raw zone.

After running this crawler manually, now raw data can be queried from Athena.

I created an ETL job to run a transformation on this raw table data. 

All it does is change the class type of purchase date, which is from string class to date class. Creates partitions while writing to refined zone in parquet format. These partitions are created from the processing date but not the purchase date.

retail-raw-refined ETL job python script:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import datetime
from pyspark.sql.functions import *
from pyspark.sql.types import *
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import *

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "merchandise", table_name = "raw_retail_sales", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "merchandise", table_name = "raw_retail_sales", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("email_id", "string", "email_id", "string"), ("retailer_name", "string", "retailer_name", "string"), ("units_purchased", "long", "units_purchased", "long"), ("purchase_date", "string", "purchase_date", "date"), ("sale_id", "string", "sale_id", "string")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]

#convert glue object to sparkDF
sparkDF = datasource0.toDF()
sparkDF = sparkDF.withColumn('purchase_date', unix_timestamp(sparkDF.purchase_date, 'dd/MM/yyyy').cast(TimestampType()))

applymapping1 = DynamicFrame.fromDF(sparkDF, glueContext,"datafields")
# applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("email_id", "string", "email_id", "string"), ("retailer_name", "string", "retailer_name", "string"), ("units_purchased", "long", "units_purchased", "long"), ("purchase_date", "string", "purchase_date", "date"), ("sale_id", "string", "sale_id", "string")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://test-787/refined/retail_sales"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = dropnullfields3]
now =
path = "s3://test-787/refined/retail_sales/"+'year='+str(now.year)+'/month='+str(now.month)+'/day='+str('/'
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": path}, format = "parquet", transformation_ctx = "datasink4")

The lakeformation:GetDataAccess permission is needed for this job to work. I created a new policy named LakeFormationGetDataAccess and attached it to AWSGlueServiceRoleDefault role.

    "Version": "2012-10-17",
    "Statement": [
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*"

After running the job manually, it will load new transformed data with partitions in the refined zone as specified in the job.

I created another crawler to get the metadata for these objects residing in refined zone.

After running this crawler manually, now refined data can be queried from Athena.

You could now see the newly added partition columns (year, month, day).

Let us add some new raw data and see how our ETL job process that delta difference.

We only want to process new data and old data is either moved to archive location or deleted from raw zone, whatever is preferred.

Run the ETL job again. See new files being added into refined zone.

Load new partitions using msck repair table query.

Note: Try creating another IAM user and as an administrator in the LakeFormation, give this user limited access to the tables, try querying using Athena. See if the permissions are working.

Pros and cons of LakeFormation

The UI is made simple, all under one roof. Most of the times, one needs to keep multiple tabs open and opening S3 locations is troublesome. This is made easy by register data lake locations feature, one not only can access these locations directly but also revoke/grant permissions of the objects residing there. 

Managing permissions on an Object level in S3 is a hectic process. But with LakeFormation permissions can be managed at the data catalog level. This enables one to grant/revoke permissions to users or roles on a table/column level. These permissions are internally mapped to underlying objects sitting in S3.

Though managing permissions, data ingestion workflow are made easy, but still most of the Glue processes like ETL, Crawler, ML specific transformations have to be setup manually.

This story is authored by Koushik Busim. Koushik is a software engineer and a keen data science and machine learning enthusiast.

Serverless Architecture for Lightening Fast Distributed File Transfer on AWS Data Lake

Today, we are very excited to share our insights on setting up a serverless architecture for setting up a lightening fast way* to copy large number of objects across multiple folders or partitions in an AWS data lake on S3. Typically in a data lake, data is kept across various zones depending on data lifecycle. For example, as the data arrives from source, it can be kept in the raw zone and then post processing moved to a processed zone, so that the lake is ready for the next influx of data. The rate of object transfer is a crucial factor, as it affects the overall efficiency of the data processing lifecycle in the data lake.

*In our tests, we copied more than 300K objects ranging from 1KB to 10GB in size from the raw zone into the processed zone. Compared to the best known tool for hyper fast file transfer on AWS called s3s3mirror, we were able to finish this transfer of about 24GB of data in about 50% less time. More details have been provided at the end of the post.

We created a lambda invoke architecture that copies files/objects concurrently. The below picture accurately depicts it.

OMS (Orchestrator-Master-Slave) Lambda Architecture

For example, If we have an S3 bucket with the following folder structure with the actual objects further contained within this hierarchy of folders, sub-folders and partitions.

S3 file structure

Let us look at how we can use OMS Architecture (Orchestrator-Master-Slave) to achieve hyper-fast distributed/concurrent file transfer. The above architecture can be divided into two halves, Orchestrator-Master, Master-Slave.


The Orchestrator simply invokes a Master Lambda for each folder. Each Master then iterates the objects in that folder (including all sub-folders and partitions) and invokes a Slave Lambda for each object to copy it to the destination.

Orchestrator-Master Lambda invoke

Let us look at the Orchestrator Lambda code.

import os
import boto3
import json
from datetime import datetime

client_lambda = boto3.client('lambda')
master_lambda = "Source-to-Destination-File-Transfer-Master"

folder_names = ["folder1", "folder2", "folder3", "folder4", "folder5", "folder6", "folder7", "folder8", "folder9"]

def lambda_handler(event, context):
    t =
        for folder_name in folder_names:
            payload_data = {
              'folder_name': folder_name
            payload = json.dumps(payload_data)
                FunctionName = master_lambda,
                InvocationType = 'Event',
                LogType = 'None',
                Payload = payload
    except Exception as e:
        raise e


Master-Slave Lambda invoke

Let us look at the Master Lambda code.

import os
import boto3
import json
from botocore.exceptions import ClientError

s3 = boto3.resource('s3')
client_lambda = boto3.client('lambda')

source_bucket_name = 'source bucket name'
source_bucket = s3.Bucket(source_bucket_name)

slave_lambda = "Source-to-Destination-File-Transfer-Slave"

def lambda_handler(event, context):

        source_prefix = "" #add if any
        source_prefix = source_prefix + "/" + event['table_name'] + "/"

        for obj in source_bucket.objects.filter(Prefix = source_prefix):
            path = obj.key
            payload_data = {
               'file_path': path
            payload = json.dumps(payload_data)
                FunctionName = slave_lambda,
                InvocationType = 'Event',
                LogType = 'None',
                Payload = payload

    except Exception as e:
        raise e


Let us look at the Slave Lambda code.

import os
import boto3
import json
import re
from botocore.exceptions import ClientError

s3 = boto3.resource('s3')

source_prefix = "" #add if any
source_bucket_name = "source bucket name"
source_bucket = s3.Bucket(source_bucket_name )

destination_bucket_name = "destination bucket name"
destination_bucket = s3.Bucket(destination_bucket_name )

def lambda_handler(event, context):
        destination_prefix = "" #add if any
        source_obj = { 'Bucket': source_bucket_name, 'Key': event['file_path']}
        file_path = event['file_path']
        #copying file
        new_key = file_path.replace(source_prefix, destination_prefix)
        new_obj = source_bucket.Object(new_key)
    except Exception as e:
        raise e

You must ensure that these Lambda functions have been configured to meet the maximum execution time and memory limit constraints as per your case. We tested by setting the upper limit of execution time as 5 minutes and 1GB of available memory.

Calculating the Rate of File Transfer

To calculate the rate of file transfer we are printing start time at the beginning of Orchestrator Lambda execution. Once the file transfer is complete, we use another lambda to extract the last modified date attribute of the last copied object.


import json
import boto3
from datetime import datetime
from dateutil import tz

s3 = boto3.resource('s3')

destination_bucket_name = "destination bucket name"
destination_bucket = s3.Bucket(destination_bucket_name)
destination_prefix = "" #add if any

def lambda_handler(event, context):
    #initializing with some old date
    last_modified_date = datetime(1940, 7, 4).replace(tzinfo = tz.tzlocal()) 

    for obj in my_bucket.objects.filter(Prefix = destination_prefix):
        obj_date = obj.last_modified.replace(tzinfo = tz.tzlocal())
        if last_modified_date < obj_date:
            last_modified_date = obj_date
    print("end-time: ", last_modified_date)

Now we have both start-time from Orchestrator Lambda and end-time from Extract-last-modified Lambda, their difference is the time taken for file transfer.

Before writing this post, we copied 24.1GB of objects using the above architecture, results are shown in the following screenshots:

duration	=	end-time - start-time
		=	10:04:49 - 10:03:28
		=	00:01:21 (hh-mm-ss)

To check the efficiency of our OMS Architecture, we compared the results of OMS with s3s3mirror, a utility for mirroring content from one S3 bucket to another or to/from the local filesystem. Below screenshot has the file transfer stats of s3s3 for the same set of files:

As we see the difference was 1 minutes and 8 seconds for total data transfer of about 24GB, it can be much higher for large data sets if we add more optimizations. I have only shared a generalized view of the OMS Architecture, it can be further fine-tuned to specific needs and get a highly optimized performance. For instance, if you have partitions in each folder and the OMS Architecture could yield much better results if you invoke Master Lambda for each partition inside the folder instead of invoking the master just at the folder level.

Thanks for the read. Looking forward to your thoughts.

This story is co-authored by Koushik and Subbareddy. Koushik is a software engineer and a keen data science and machine learning enthusiast. Subbareddy is a Big Data Engineer specializing on Cloud Big Data Services and Apache Spark Ecosystem.

Machine Learning Operations (MLOps) Pipeline using Google Cloud Composer

In an earlier post, we had described the need for automating the Data Engineering pipeline for Machine Learning based systems. Today, we will expand the scope to setup a fully automated MLOps pipeline using Google Cloud Composer.

Cloud Composer

Cloud Composer is official defined as a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span across clouds and on-premises data centers. Built on the popular Apache Airflow open source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.

So let’s get on with the required steps to create this MLOps infrastructure on Google Cloud Platform

Creating a Cloud Composer Environment

Step1: Please enable the Cloud Composer API.

Step2: Go to create environment page in GCP console. Composer is available in Big Data section.

Step3: Click on create to start creating a Composer environment

Step4: Please select the Service account which has the required permissions to access GCS, Big Query, ML Engine and  Composer environment. The required roles for accessing Composer environment is Composer Administrator and Composer Worker. 
For more details about access control in Composer environment please see this.

Step5: Please use Python Version 3 and latest Image version.

Step6: Click on create. It will take about 15-20 minutes to create the environment. Once it completes, the environment page shall look like the following.

Click on Airflow to see Airflow WebUI. The Airflow WebUI looks as follows

DAGs folder is where our dag file is stored. DAG folder is nothing but a folder inside a GCS bucket which is created by the environment. To know more about the concept of DAG and general introduction to Airflow, please refer to this post.

You could see Composer related logs in Logging.

Step7: Please add the following PyPI packages in Composer environment.

Click on created environment and navigate to PYPI packages and click on edit to add packages

The required packages are:

# to read data from MongoDB
# to read data from firestore

Create a ML model

Step1: Please create a folder structure like the following on your instance.

└── trainer

Step2: Please place the following code in file, which shall upload the model to GCS bucket as shown below. This model would be used to create model versions as explained a bit later.

from import bigquery
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
from import storage
import datetime
import json
import pickle
client = bigquery.Client()
sql = '''

df = client.query(sql).to_dataframe()
df = df[['is_stressed', 'is_engaged', 'status']]

df['is_stressed'] = df['is_stressed'].fillna('n')
df['is_engaged'] = df['is_engaged'].fillna('n')
df['stressed'] = np.where(df['is_stressed']=='y', 1, 0)
df['engaged'] = np.where(df['is_engaged']=='y', 1, 0)
df['status'] = np.where(df['status']=='complete', 1, 0)

feature_cols = ['stressed', 'engaged']
X = df[feature_cols]
y = df.status
logreg = LogisticRegression(),y_train)
pkl_filename = "model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(logreg, file)
BUCKET_NAME=BUCKET_NAME# Upload the model to GCS
bucket = storage.Client().bucket(BUCKET_NAME)
file_path ='machine_learning/models/%Y%m%d_%H%M%S')
blob = bucket.blob('{}/{}'.format(

file_location = 'gs://{BUCKET_NAME}/{file_path}'.format(BUCKET_NAME=BUCKET_NAME, file_path=file_path)
file_config = json.dumps({'file_location': file_location})

bucket = storage.Client().bucket(COMPOSER_BUCKET)
blob = bucket.blob('data/file_config.json')

Step3: Create an empty file inside the trainer directory.

Step4: Please place the following code in file. The file contains required packages to execute code.

import setuptools



Step5: Packaging the code using the following command. It creates a gz file inside ml_model directory.

python3 sdist

Step6: The package name is the name that is specified in file. The package name becomes ml_model-1.0.tar.gz
Copy the package to gs://{your-GCS-bucket}/machine_learning/. This becomes the base directory for your machine learning activities described in this post.

Creating a DAG

In this use case, we have created a DAG file which exports some table data from a MongoDB instance into a GCS bucket and then creates a BigQuery table off of that exported data. It trains a model and creates version for that model. The DAG file supports full data extraction and daily data extraction explained in the code below using a variable tot_data. This variable is extracted from Airflow configurations set by the user. This process is also described later in this post.

Please place the following code in the DAG file.

import airflow
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta, datetime
from airflow.operators.python_operator import PythonOperator
import pprint
import json
import re

from pymongo import MongoClient
from import storage
from import blob
from import storage
import os

from airflow import models
from mlengine_operator import MLEngineTrainingOperator, MLEngineVersionOperator

ts =
today = str( + 'T00:00:00.000Z'
yester_day = str( - timedelta(days = 1)) + 'T00:00:00.000Z'

str_ts = ts.strftime('%Y_%m_%d_%H_%m_%S')

config = Variable.get("mongo_conf", deserialize_json=True)
host = config['host']
db_name = config['db_name']
table_name = config['table_name']
file_prefix = config['file_prefix']
bucket_name = config['bucket_name']
# file_path = file_prefix + '/' + table_name + '.json'
file_path = '{file_prefix}/{table_name}/{table_name}_{str_ts}.json'.format(file_prefix=file_prefix, str_ts=str_ts, table_name=table_name)
file_location = 'gs://' + bucket_name + '/' + file_prefix + '/' + table_name + '/' + table_name + '_*.json'
config['file_location'] = file_location
bq_dataset = config['bq_dataset']
tot_data = config['tot_data'].lower()

BUCKET_NAME = config['ml_configuration']['BUCKET_NAME']
BASE_DIR = config['ml_configuration']['BASE_DIR']
PACKAGE_NAME = config['ml_configuration']['PACKAGE_NAME']
TRAINER_BIN = os.path.join(BASE_DIR, 'packages', PACKAGE_NAME)
TRAINER_MODULE = config['ml_configuration']['TRAINER_MODULE']
RUNTIME_VERSION = config['ml_configuration']['RUNTIME_VERSION']
PROJECT_ID = config['ml_configuration']['PROJECT_ID']
MODEL_NAME = config['ml_configuration']['MODEL_NAME']

MODEL_FILE_BUCKET = config['ml_configuration']['MODEL_FILE_BUCKET']
model_file_loc = config['ml_configuration']['MODEL_FILE_LOCATION']

bucket = storage.Client().bucket(MODEL_FILE_BUCKET)
blob = bucket.get_blob(model_file_loc)
file_config = json.loads(blob.download_as_string().decode("utf-8"))
export_uri = file_config['file_location']

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
            out[name[:-1]] = x

    return out

def mongoexport():
        client = storage.Client()
        bucket = client.get_bucket(bucket_name)
        blob = bucket.blob(file_path)

        client = MongoClient(host)
        db = client[db_name]
        tasks = db[table_name]
        # if tot_data is set to 'yes' in airflow configurations, full data 
        # is processed.  
        if tot_data == 'no':
          query = {"edit_datetime": { "$gte": yester_day, "$lt": today}}
          data = tasks.find(query)
          data = tasks.find()
        emp_list = []
        for record in data:
                emp_list.append(json.dumps(record, default=str))
        flat_list =[]
        for data in emp_list:
        data = '\n'.join(json.dumps({re.sub('[^0-9a-zA-Z_ ]+', '', str(k)).lower().replace(' ', '_'): str(v) for k, v in record.items()}) for record in flat_list)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)

with DAG('ml_pipeline', schedule_interval=None, default_args=default_args) as dag:

    # priority_weight has type int in Airflow DB, uses the maximum.
    pymongo_export_op = PythonOperator(

    update_bq_table_op = BashOperator(
        bq rm -f {bq_dataset}.{table_name}
        bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values=True {bq_dataset}.{table_name} {file_location}
        '''.format(bq_dataset=bq_dataset, table_name=table_name, file_location=file_location)

    date_nospecial = '{{ execution_date.strftime("%Y%m%d") }}'
    date_min_nospecial = '{{ execution_date.strftime("%Y%m%d_%H%m") }}'
    uuid = '{{ macros.uuid.uuid4().hex[:8] }}'

    training_op = MLEngineTrainingOperator(
      job_id='{}_{}_{}'.format(table_name, date_nospecial, uuid),

    create_version_op = MLEngineVersionOperator(
          'name': 'version_{}_{}'.format(date_min_nospecial, uuid),
          'deploymentUri': export_uri,
          'runtimeVersion': RUNTIME_VERSION,
          'pythonVersion': '3.5',
          'framework': 'SCIKIT_LEARN',

    pymongo_export_op >> update_bq_table_op >> training_op >> create_version_op

Once file is created, please upload the file to DAGs folder. And also please add the following plugin dependency file named mlengine_operator in DAGs folder.
Place the following code in file.

import re

from apiclient import errors

from airflow.contrib.hooks.gcp_mlengine_hook import MLEngineHook
from airflow.exceptions import AirflowException
from airflow.operators import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.utils.log.logging_mixin import LoggingMixin

log = LoggingMixin().log

def _normalize_mlengine_job_id(job_id):

    # Add a prefix when a job_id starts with a digit or a template
    match ='\d|\{{2}', job_id)
    if match and match.start() is 0:
        job = 'z_{}'.format(job_id)
        job = job_id

    # Clean up 'bad' characters except templates
    tracker = 0
    cleansed_job_id = ''
    for m in re.finditer(r'\{{2}.+?\}{2}', job):
        cleansed_job_id += re.sub(r'[^0-9a-zA-Z]+', '_',
        cleansed_job_id += job[m.start():m.end()]
        tracker = m.end()

    # Clean up last substring or the full string if no templates
    cleansed_job_id += re.sub(r'[^0-9a-zA-Z]+', '_', job[tracker:])

    return cleansed_job_id

class MLEngineBatchPredictionOperator(BaseOperator):
    template_fields = [

    def __init__(self,
        super(MLEngineBatchPredictionOperator, self).__init__(*args, **kwargs)

        self._project_id = project_id
        self._job_id = job_id
        self._region = region
        self._data_format = data_format
        self._input_paths = input_paths
        self._output_path = output_path
        self._model_name = model_name
        self._version_name = version_name
        self._uri = uri
        self._max_worker_count = max_worker_count
        self._runtime_version = runtime_version
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

        if not self._project_id:
            raise AirflowException('Google Cloud project id is required.')
        if not self._job_id:
            raise AirflowException(
                'An unique job id is required for Google MLEngine prediction '

        if self._uri:
            if self._model_name or self._version_name:
                raise AirflowException('Ambiguous model origin: Both uri and '
                                       'model/version name are provided.')

        if self._version_name and not self._model_name:
            raise AirflowException(
                'Missing model: Batch prediction expects '
                'a model name when a version name is provided.')

        if not (self._uri or self._model_name):
            raise AirflowException(
                'Missing model origin: Batch prediction expects a model, '
                'a model & version combination, or a URI to a savedModel.')

    def execute(self, context):
        job_id = _normalize_mlengine_job_id(self._job_id)
        prediction_request = {
            'jobId': job_id,
            'predictionInput': {
                'dataFormat': self._data_format,
                'inputPaths': self._input_paths,
                'outputPath': self._output_path,
                'region': self._region

        if self._uri:
            prediction_request['predictionInput']['uri'] = self._uri
        elif self._model_name:
            origin_name = 'projects/{}/models/{}'.format(
                self._project_id, self._model_name)
            if not self._version_name:
                    'modelName'] = origin_name
                prediction_request['predictionInput']['versionName'] = \
                    origin_name + '/versions/{}'.format(self._version_name)

        if self._max_worker_count:
                'maxWorkerCount'] = self._max_worker_count

        if self._runtime_version:
                'runtimeVersion'] = self._runtime_version

        hook = MLEngineHook(self._gcp_conn_id, self._delegate_to)

        # Helper method to check if the existing job's prediction input is the
        # same as the request we get here.
        def check_existing_job(existing_job):
            return existing_job.get('predictionInput', None) == \

            finished_prediction_job = hook.create_job(
                self._project_id, prediction_request, check_existing_job)
        except errors.HttpError:

        if finished_prediction_job['state'] != 'SUCCEEDED':
            self.log.error('MLEngine batch prediction job failed: {}'.format(
            raise RuntimeError(finished_prediction_job['errorMessage'])

        return finished_prediction_job['predictionOutput']

class MLEngineModelOperator(BaseOperator):
    template_fields = [

    def __init__(self,
        super(MLEngineModelOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._model = model
        self._operation = operation
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

    def execute(self, context):
        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)
        if self._operation == 'create':
            return hook.create_model(self._project_id, self._model)
        elif self._operation == 'get':
            return hook.get_model(self._project_id, self._model['name'])
            raise ValueError('Unknown operation: {}'.format(self._operation))

class MLEngineVersionOperator(BaseOperator):
    template_fields = [

    def __init__(self,

        super(MLEngineVersionOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._model_name = model_name
        self._version_name = version_name
        self._version = version or {}
        self._operation = operation
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

    def execute(self, context):
        if 'name' not in self._version:
            self._version['name'] = self._version_name

        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)

        if self._operation == 'create':
            assert self._version is not None
            return hook.create_version(self._project_id, self._model_name,
        elif self._operation == 'set_default':
            return hook.set_default_version(self._project_id, self._model_name,
        elif self._operation == 'list':
            return hook.list_versions(self._project_id, self._model_name)
        elif self._operation == 'delete':
            return hook.delete_version(self._project_id, self._model_name,
            raise ValueError('Unknown operation: {}'.format(self._operation))

class MLEngineTrainingOperator(BaseOperator):
    template_fields = [

    def __init__(self,
        super(MLEngineTrainingOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._job_id = job_id
        self._package_uris = package_uris
        self._training_python_module = training_python_module
        self._training_args = training_args
        self._region = region
        self._scale_tier = scale_tier
        self._runtime_version = runtime_version
        self._python_version = python_version
        self._job_dir = job_dir
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to
        self._mode = mode

        if not self._project_id:
            raise AirflowException('Google Cloud project id is required.')
        if not self._job_id:
            raise AirflowException(
                'An unique job id is required for Google MLEngine training '
        if not package_uris:
            raise AirflowException(
                'At least one python package is required for MLEngine '
                'Training job.')
        if not training_python_module:
            raise AirflowException(
                'Python module name to run after installing required '
                'packages is required.')
        if not self._region:
            raise AirflowException('Google Compute Engine region is required.')

    def execute(self, context):
        job_id = _normalize_mlengine_job_id(self._job_id)
        training_request = {
            'jobId': job_id,
            'trainingInput': {
                'scaleTier': self._scale_tier,
                'packageUris': self._package_uris,
                'pythonModule': self._training_python_module,
                'region': self._region,
                'args': self._training_args,

        if self._runtime_version:
            training_request['trainingInput']['runtimeVersion'] = self._runtime_version

        if self._python_version:
            training_request['trainingInput']['pythonVersion'] = self._python_version

        if self._job_dir:
            training_request['trainingInput']['jobDir'] = self._job_dir

        if self._mode == 'DRY_RUN':
  'In dry_run mode.')
  'MLEngine Training job request is: {}'.format(

        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)

        # Helper method to check if the existing job's training input is the
        # same as the request we get here.
        def check_existing_job(existing_job):
            return existing_job.get('trainingInput', None) == \

            finished_training_job = hook.create_job(
                self._project_id, training_request, check_existing_job)
        except errors.HttpError:

        if finished_training_job['state'] != 'SUCCEEDED':
            self.log.error('MLEngine training job failed: {}'.format(
            raise RuntimeError(finished_training_job['errorMessage'])

Import variables from composer_conf.json file into Airflow Variables.
Go to Airflow WebUI → Admin → Variables → Browse to file path or configure variables manually.
Please place the following in composer_conf

  "mongo_conf": {
    "host": "mongodb://<instance-internal-ip>:27017",
    "db_name": "DBNAME",
    "table_name": "TABLENAME",
    "file_prefix": "Folder In GCS Bucket",
    "bq_dataset": "BigQuery Dataset",
    "bucket_name": "GCS Bucket",
    "tot_data": "yes",
    "ml_configuration": {
      "BUCKET_NAME": "GCS Bucket",
      "BASE_DIR": "gs://GCS Bucket/machine_learning/",
      "TRAINER_MODULE": "trainer.train",
      "RUNTIME_VERSION": "1.13",
      "PROJECT_ID": "GCP Project",
      "MODEL_FILE_BUCKET": "BUCKET CREATED BY Composer Environment",

Please store any configuration files or credentials file that are used by Composer in the data folder in the bucket created by Composer environment.

After configuring variables accordingly, you can see the DAG named ml_pipeline in the Airflow WebUI.

Please trigger the DAG file from Airflow WebUI. Once the DAG ran successfully. It looks like the following:

Thanks for the read and look forward to your comments.

This story is authored by PV Subbareddy. Subbareddy is a Big Data Engineer specializing on Cloud Big Data Services and Apache Spark Ecosystem.

Email Deliverability Analytics using SendGrid and AWS Big Data Services

Email Deliverability Analytics using SendGrid and AWS Big Data Services

In this post, we will run though a case study to setup an email deliverability analytics pipeline using SendGrid and AWS Big Data Services such as S3, Glue and Athena. To start off, when we send mails from SendGrid to recipients. we get responses (multiple response types are possible such as processed, delivered, blocked, deferred etc) from Email Service Providers such as gmail, yahoo etc. We could use this response data to improve our Email Deliverability by analyzing this email response data. This is achieved by logging these responses (via API Gateway and Lambda function) into Amazon S3 and then analyzing them using Athena. The chain of events is put in place by using a web hook that triggers a post request to AWS API Gateway on an event notification (response) from SendGrid. The API Gateway is further configured to trigger a Lambda Function which writes the email response data into S3. We then use Glue crawler to update the metadata in data catalogue, thereby making it available for Athena to perform SQL based analysis.

Without further ado, let’s set the ball rolling. Go to SendGrid and select Settings>Mail_Settings. Click on Event Notifications

We are gonna enable it by giving an Endpoint and select the Events for which you want to get a response. 

The above endpoint points to the AWS API Gateway (shown below) which is a POST request and it triggers the Lambda function as you can see.

Now our Lambda function stores the event payload data in S3 Bucket
Lambda code:

const AWS = require('aws-sdk')
    var s3Bucket = new AWS.S3( { params: {Bucket: "Your-Bucket"} } );
    exports.handler = (event, context, callback) => {
        console.log(event); // the response data
        let x = "";>{
            x = x + JSON.stringify(item) + "\n"
        let uuid = create_UUID();
        var filePath = "receivelogs/"+uuid;
        var data = {
            Key: filePath, 
            Body: x
        s3Bucket.putObject(data, function(err, data){
            if (err) { 
                console.log('Error uploading data: ', data);
                callback(err, null);
            } else {
                console.log('Successfully uploaded the response');
                callback(null, data);
// this function will generate Unique User ID. Used as FileName
function create_UUID(){
   var dt = new Date().getTime();
   var uuid = 'xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx'.replace(/[xy]/g, function(c) {
       var r = (dt + Math.random()*16)%16 | 0;
       dt = Math.floor(dt/16);
       return (c=='x' ? r :(r&0x3|0x8)).toString(16);
   return uuid;

When you send mail, the response is triggered from SendGrid via POST request to API Gateway and then the response gets stored in S3 via Lambda function.

AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. We use a crawler to populate the AWS Glue Data Catalog with tables. Below is the step-by-step process to setup the Glue crawler to read an S3 based data source and make it available as a database table for AWS Athena based analytics.

In the step above, you may need to create a new IAM role that provides access to the underlying S3 data.

So in the steps above, we have concluded the setup for the crawler to fetch the underlying data on S3.

When you run this crawler on the S3 based data source, it updates the metadata of objects in that path in Glue data catalogue. Now, Athena can query ( SQL operations) those objects in S3 using metadata available in data catalogue. A lot of business executives aren’t comfortable with SQL queries, perhaps an add-on to this data pipeline could be using AWS Quicksight for a more BI driven analysis.

Thanks for the read!

This story is authored by Santosh Kumar. He is an AWS Cloud Engineer.

Creating an Automated Data Engineering Pipeline for Batch Data in Machine Learning

A common use case in Machine Learning life cycle is to have access to the latest training data so as to prevent model deterioration. A lot of times data scientists find it cumbersome to manually export data from data sources such as relational databases or NoSQL data stores or even distributed data. This necessitates automating the data engineering pipeline in Machine Learning. In this post, we will describe how to set up this pipeline for batch data. This workflow is orchestrated via Airflow and can be set up to run at regular intervals: such as hourly, daily, weekly, etc depending on the specific business requirements.

Quick note – In case you are interested in building a real time data engineering pipeline for ML, please look at this post.

In this use case, we are going to export MongoDB data into Google BigQuery via Cloud Storage. The updated data in BigQuery is then made available in Jupyter Notebook as a Pandas Dataframe for downstream model building and analytics. As the pipeline automates the data ingestion and preprocessing, the data scientists always have access to the latest batch data in their Jupyter Notebooks hosted on Google AI Platform. 

We have a MongoDB service running in an instance and we have Airflow and mongoexport running on docker on another instance. Mongoexport is a utility that produces a JSON or CSV export of data stored in MongoDB. Now the data in MongoDB shall be extracted and transformed using mongoexport and loaded into CloudStorage. Airflow is used to schedule and orchestrate these exports. Once the data is available in CloudStorage it could be queried in BigQuery. We then get this data from BigQuery to Jupyter Notebook. Following is a step by step sequence of steps to set up this data pipeline.

You can create an instance in GCP by going to Compute Engine. Click on create instance.

sudo apt-get update
curl -fsSL -o
sudo usermod -aG docker $USER
sudo apt-get install -y python-pip
export AIRFLOW_HOME=~/airflow
sudo pip install apache-airflow
sudo pip install apache-airflow[postgres,s3]
airflow initdb
airflow webserver -p 8080 -D
airflow scheduler -D
sudo docker pull mongo
sudo docker run --name mongo_client -d mongo

Please run the file using ./ command (please make sure file is executable), which would install Docker, Airflow, pulls Mongo image and runs the mongo image in a container named mongo_client.

After installation, for Airflow webUIhttp://<public-ip-instance>:8080 (You may need to open port 8080 in the network just for your public IP)

Please make sure the Google service account in the running instance must have permissions for accessing Big Query and Cloud Storage. After installation, add the Airflow job Python file ( inside the airflow/dags folder.

Before running the Python file, please make sure that you create Dataset and create the table in BigQuery. Also change the appropriate values for the MongoDB source database, MongoDB source table, Cloud Storage destination bucket and BigQuery destination dataset in the Airflow job Python file ( Big Query destination table name is the same as the source table in Mongo DB.

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import json
from import json_normalize

# Following are default arguments which could be overridden
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(0),
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),

bucket_name = '<Your_Bucket>'
db_name = '<Database_Name>'
dataset = '<Dataset_Name>'
table_name = '<Table_Name>'

time_stamp =
cur_date = time_stamp.strftime("%Y-%m-%d")

# It will flatten the nested json
def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
            out[name[:-1]] = x

    return out

def convert_string(y):
    string_type = {}

    def convert(x, name=''):
        if type(x) is dict:
            for a in x:
                convert(str(x[a]), name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
            string_type[name[:-1]] = x

    return string_type

def json_flat():
    lines = [line.rstrip('\n') for line in open('/home/dev/'+ table_name + '-unformat.json')]
    flat_list = []
    for line in lines:
        line = line.replace("\"$", "\"")
        line = json.loads(line)
        except Exception as e:
    flatted_json = '\n'.join(i for i in flat_list)

    with open('/home/dev/' + table_name + '.json', 'a') as file:
    return flatted_json 

dag = DAG('mongoexport-daily-gcs-bq', default_args=default_args, params = {'cur_date': cur_date, 'db_name': db_name, 'table_name': table_name, 'dataset': dataset, 'bucket_name': bucket_name})
#exports provide a table data into docker container 
t1 = BashOperator(
    bash_command='sudo docker exec -i mongo_client sh -c "mongoexport --host=<instance_public_ip> --db {{params.db_name}} --collection {{params.table_name}} --out {{params.table_name}}-unformat.json"',

# copies exported file into instance

t2 = BashOperator(
    bash_command='sudo docker cp mongo_client:/{{params.table_name}}-unformat.json /home/dev/',

t3 = PythonOperator(
# copies the flatten data from cloud storage
t4 = BashOperator(
    bash_command='gsutil cp /home/dev/{{params.table_name}}.json gs://{{params.bucket_name}}/raw/{{params.table_name}}/date={{params.cur_date}}/',
t5 = BashOperator(
    bash_command='gsutil cp /home/dev/{{params.table_name}}.json gs://{{params.bucket_name}}/curated/{{params.table_name}}/',

# removes the existing bigquery table
t6 = BashOperator(
    bash_command='bq rm -f {{params.dataset}}.{{params.table_name}}',
# creates a table in bigquery
t7 = BashOperator(
    bash_command='bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON {{params.dataset}}.{{params.table_name}} gs://{{params.bucket_name}}/curated/{{params.table_name}}/{{params.table_name}}.json',
# removes data from container
t8 = BashOperator(
    bash_command='sudo docker exec -i mongo_client sh -c "rm -rf {{params.table_name}}*.json"',
# removes data from instance
t9 = BashOperator(
    bash_command='rm -rf /home/dev/{{params.table_name}}*.json',

t1 >> t2
t2 >> t3
t3 >> [t4, t5]
[t4, t5] >> t6
t6 >> t7
t7 >> [t8, t9]

Then run the python file using python <file-path>.py  

(example: python airflow/dags/

After running the python file, the dag name shows in Airflow webUI. And you could trigger the dag manually. Please make sure toggle button is in ON status

Once the job completes, the data is stored in the bucket and also available in the destination table in BigQuery. You could see the table is created in BigQuery. Click on querytable to perform SQL operations and you could see your results in the preview tab at the bottom.

Now, you could access the data in Jupyter Notebook from BigQuery. Search for notebook in GCP console. 

Run the below commands in Jupyter Notebook.

from import bigquery
client = bigquery.Client()
sql = """
df = client.query(sql).to_dataframe()

This loads the BigQuery data into Pandas dataframe and can be used for model creation as required. Later when the data pipeline is run as per schedule, the refreshed data would automatically be available in this Jupyter notebook via this SQL query.

Hope this helps you to automate your batch Data Engineering pipeline for Machine Learning. 

This story is co-authored by Santosh and Subbareddy. Santosh is an AWS Cloud Engineer and Subbareddy is a Big Data Engineer.

Real Time Data Engineering Pipeline for Machine Learning

Our focus in this post is to leverage Google Cloud Platform’s Big Data Services to build an end to end Data Engineering pipeline for streaming processes.

So what is Data Engineering?
Data Engineering is associated with data specifically around data delivery, storage and processing. The main goal is to provide a reliable infrastructure for data which includes operations such as collect, move, store and prepare data.

Most companies store their data in different formats across databases and as text files. This is where data engineers come in to picture, they build pipelines that transform this data into formats that data scientists could use.

Need for Data Engineering in Machine Learning:
Data engineers are responsible for:

  • Develop machine learning models.
  • Improve existing machine learning models.
  • Research and implement best practices to enhance existing machine learning infrastructure.
  • Developing, constructing, testing and maintaining architectures, such as databases and large-scale processing systems.
  • Analyzing large and complex data sets to derive valuable insights.

This is the reference architecture used to build the end to end pipe data pipeline :

Google Cloud Platform Data Engineering Pipeline for Streaming Processes

The Google Cloud Services used in above streaming process are:

  1. Cloud Firestore: Lets us store data in cloud so that we could sync it across all other devices and also share among multiple users. It is a NoSQL query document data which lets us store, query and sync.
  2. Cloud Function: A lightweight compute solution for developers to create single-purpose, stand-alone functions that respond to cloud events without the need to manage a server or runtime environment.
  3. Cloud Pub/Sub: A fully-managed real-time messaging service that allows you to send and receive messages across independent applications.
  4. Cloud Dataflow: A cloud-based data processing service for both batch and real-time data streaming applications. It enables developers to set up data processing pipelines for integrating, preparing and analyzing large data sets.
  5. Cloud Storage: A data storage service in which data is maintained, managed, backed up remotely and made available to users over a network.
  6. BigQuery: It was designed for analyzing data on the order of billions of rows, using a SQL-like syntax. It runs on the Google Cloud Storage infrastructure and could be accessed with a REST-oriented application programming interface (API).
  7. Jupyter notebook: An open source web application that you could use to create and share documents that contain live code, equations, visualizations, and text.

Create data engineering pipeline via Firestore Streaming

Step1: Add a new record in a collection (think of it as a table), say pubsub-event in firestore.

Step2: It triggers the cloud function named pubsub_event

Document Path: pubsub-event/{eventId}  listens for changes to all pubsub-event documents.

Below is the Cloud Function written in node js which triggers whenever there is a change in our source Firestore collection and publishes the data to Pub/Sub

const PubSub = require('@google-cloud/pubsub');
const pubsubClient = new PubSub();
const functions = require('firebase-functions');

exports.helloFirestore = functions.firestore
  .onCreate((snap, context) => {
    const event =;
    const payload_data = {};
    for (let key of Object.keys(event)) {
    	payload_data[key] = event[key];
    // The name for the new topic
    const topicName = 'pubsub-gcs';
    const dataBuffer = Buffer.from(JSON.stringify(payload_data));
    // Creates the new topic
    return pubsubClient
      .then(messageId => {
        console.log(`Message ${messageId} published.`);
        return messageId;
      .catch(err => {
        console.error('ERROR:', err);


Below is the dependencies of the Cloud Function.

  "name": "functions",
  "description": "Cloud Functions for Firebase",
  "scripts": {
    "serve": "firebase serve --only functions",
    "shell": "firebase functions:shell",
    "start": "npm run shell",
    "deploy": "firebase deploy --only functions",
    "logs": "firebase functions:log"
  "engines": {
    "node": "8"
  "dependencies": {
    "@google-cloud/pubsub": "^0.18.0",
    "firebase-admin": "~7.0.0",
    "firebase-functions": "^2.3.1"
  "devDependencies": {
    "firebase-functions-test": "^0.1.6"
  "private": true

Step3: Cloud Function pubsub_event publishes data to Pub/Sub topic projects/ProjectName/topics/pubsub-gcs

Step4: As shown above, create an export job : ps-to-text-pubsub-gcs (implemented via Dataflow). This job reads data every 5 minutes (configurable to other values as well) from Pub/Sub topic pubsub-gcs and dumps this into the destination bucket on GCS.

Click on run the job.

 Step6: Now, we have data in CloudStorage. We shall use BigQuery to perform all the data manipulation operations. But first we need to create dataset in BigQuery to query data from GCS into Bigquery.

Go to BigQuery and create dataset. So that we create our table to access that data.

The dataset shall be created. By clicking on the dataset you shall see an option to CREATE TABLE.

Click on CREATE TABLE then we shall get the data from CloudStorage. While setting up the required inputs as indicated below, please make sure that you select “Table type” as External Table. This ensures that BigQuery is able to automatically load new data as it comes into GCS.

To create table in BigQuery from CloudStorage. Click on the browse button and configure file path.

Files that are having pubsub-event-* as prefix. This prefix is very important as it makes sure that all subsequent data dumps into GCS destination folder are also picked automatically by BigQuery. Select the file format to be JSON. Check the auto-detect schema box. Then click create table.

Quick Tip: For reading nested json files in BigQuery, please go through this resource. Now the data which is present in CloudStorage is also available in BigQuery and you could run sql commands to manipulate the data.

Click on table you have created, accounts is my table name and click on query table to make SQL operations and you could see your results in the preview tab at the bottom.

Step7: Now, we are on to the last step to access this BigQuery data in Jupyter Notebooks and use that as the source data to train and build our ML models.

Search for notebook in GCP console. 

You shall see something like this 

Click on OPEN JUPYTERLAB then it will redirect you to notebook.

from import bigquery

client = bigquery.Client()

sql = """

df = client.query(sql).to_dataframe()

So in this way, we have built a data pipeline that continuously dumps data from Firestore into GCS every 5 minutes, which is then readily available in Jupyter Notebook via BigQuery for any downstream analytics and ML model building.

Look forward to your comments.

This story is co-authored by Santosh Kumar and PV Subbareddy. Santosh is a Software Engineer specializing on Cloud Services and DevOps. Subbareddy is a Big Data Engineer specializing on AWS Big Data Services and Apache Spark Ecosystem.

Processing Kinesis Data Streams with Spark Streaming

Solution Overview : In this blog, we are going to build a real time anomaly detection solution using Spark Streaming. Kinesis Data Streams would act as the input streaming source and the anomalous records would be written as Data Streams in DynamoDB.

Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events.

Data Streams

The unit of data stored by Kinesis Data Streams is a data record. A data stream represents a group of data records.

For deep dive into Kinesis Data Streams, please go through these official docs.

Kinesis Data Streams Producers

A producer puts data records into Amazon Kinesis Data Streams. For example, a web server sending log data to a Kinesis Data Stream is a producer.

For more details about Kinesis Data Streams Producers, please go through these official docs.

Kinesis Data Streams Consumers

A consumer, known as an Amazon Kinesis Data Streams application, is an application that you build to read and process data records from Kinesis Data Streams.

For more details about Kinesis Data Streams Consumers, please go through these official docs.

Creating a Kinesis Data Stream

Step1. Go to Amazon Kinesis console -> click on Create Data Stream

Step2. Give Kinesis Stream Name and Number of shards as per volume of the incoming data. In this case, Kinesis stream name as kinesis-stream and number of shards are 1.

Shards in Kinesis Data Streams

A shard is a uniquely identified sequence of data records in a stream. A stream is composed of one or more shards, each of which provides a fixed unit of capacity.

For more about shards, please go through these official docs.

Step3. Click on Create Kinesis Stream

Kinesis Data Streams can be connected with Kinesis Data Firehoseto write the streamsinto S3.

Configure Kinesis Data Streams with Kinesis Data Producers

The Amazon Kinesis Data Generator (KDG) makes it easy to send data to Kinesis Streams or Kinesis Firehose.

While following this link, choose to Create a Cognito User with Cloud Formation.

After selecting the above option, we will navigate to the Cloud Formation console:

Click on Next and provide Username and Password for Cognito User for Kinesis Data Generator.

Click on Next and Create Stack.

CloudFormation Stack is created.

Click on Outputs tab and open the link

After opening the link, enter the usernameand password of Cognito user.

After Sign In is completed, select the RegionStream and configure the number of records per second. Choose record template as your requirement.

In this case, the template data format is

{{name.firstName}},{{random.number({“min”:10, “max”:550})}},{{random.arrayElement([“OK”,”FAIL”,”WARN”] )}}

The template data looks like the following

You can send different types of dummy data to Kinesis Data Streams.

Kinesis Data Streams with Kinesis Data Producers are ready. Now we shall build a Spark Streaming application which consumes data streams from Kinesis Data Streams and dumps the output streams into DynamoDB.

Create DynamoDB Tables To Store Data Frame

Go to Amazon DynamoDB console -> Choose Create Table and name the table, in this case, data_dump

In the same way, create another table named anomaly_data. Make sure Kinesis Data streams and DynamoDb tables are in the same region.

Spark Streaming with Kinesis Data Streams

Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window.

For deep dive into Spark Streams, please go through docs.

In this case, the Scala programming language is used. Scala version is 2.11.12. Please install scala, sbt and spark.

Create a folder structure like the following

| -- src/main/scala/packagename/object
| -- build.sbt
| -- project/assembly.sbt

In this case, the structure looks like the following

After creating the folder structure,

Please replace build.sbt file with the following code. The following code will add the required dependencies like spark, spark kinesis assembly, spark streaming and many more.

name := "kinesis-spark-streams-dynamo"

version := "0.1"

scalaVersion := "2.11.12"

libraryDependencies += "com.audienceproject" %% "spark-dynamodb" % "0.4.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3"
libraryDependencies += "" % "guava" % "14.0.1"
libraryDependencies += "com.amazonaws" % "aws-java-sdk-dynamodb" % "1.11.466"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.4.3"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "2.4.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3"

assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs @ _*) => MergeStrategy.discard
case x => MergeStrategy.first

Please replace assembly.sbt file with the following code. This will add the assembly plugin which can be used for creating the jar.

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")

Please replace kinesis-spark-streams-dynamo file with the following code.

package com.wisdatum.kinesisspark

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import org.apache.spark._
import org.apache.spark.streaming._
import scala.collection.JavaConverters._
import org.apache.spark.streaming.kinesis.KinesisInputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream
import com.amazonaws.regions.RegionUtils
import org.apache.log4j.{Level, Logger}
import com.audienceproject.spark.dynamodb.implicits._

object KinesisSparkStreamsDynamo {
def getRegionNameByEndpoint(endpoint: String): String = {
val uri = new
throw new IllegalArgumentException(s"Could not resolve region for endpoint: $endpoint"))

def main(args: Array[String]) {

val rootLogger = Logger.getRootLogger()

val conf = new SparkConf().setAppName("KinesisSparkExample").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(1))
val Array(appName, streamName, endpointUrl, dynamoDbTableName) = args
val credentials = new DefaultAWSCredentialsProviderChain().getCredentials()

require(credentials != null,
"No AWS credentials found. Please specify credentials using one of the methods specified " +
val kinesisClient = new AmazonKinesisClient(credentials)
val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size
println("numShards are " + numShards)

val numStreams = numShards

val batchInterval = Milliseconds(100)

val kinesisCheckpointInterval = batchInterval

val regionName = getRegionNameByEndpoint(endpointUrl)

val anomalyDynamoTable = "data_anomaly"

println("regionName is " + regionName)

val kinesisStreams = (0 until numStreams).map { i =>

val unionStreams = ssc.union(kinesisStreams)

val inputStreamData = { byteArray =>
val Array(sensorId, temp, status) = new String(byteArray).split(",")
StreamData(sensorId, temp.toInt, status)

val inputStream: DStream[StreamData] = inputStreamData

inputStream.window(Seconds(20)).foreachRDD { rdd =>
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._

val inputStreamDataDF = rdd.toDF()

val dataDumpDF = spark.sql("SELECT * FROM hot_sensors ORDER BY currentTemp DESC")

val anomalyDf = spark.sql("SELECT * FROM hot_sensors WHERE currentTemp > 100 ORDER BY currentTemp DESC")

// To make sure data is not deleted by the time we query it interactively

case class StreamData(id: String, currentTemp: Int, status: String)

appName: The application name that will be used to checkpoint the Kinesis sequence numbers in the DynamoDB table.

  1. The application name must be unique for a given account and region.
  2. If the table exists but has incorrect checkpoint information (for a different stream, or old expired sequenced numbers), then there may be temporary errors.

The interval (e.g., Duration(2000) = 2 seconds) at which the Kinesis Client Library saves its position in the stream. For starters, set it to the same as the batch interval of the streaming application.

Valid Kinesis endpoints URL can be found here.

For more details about building KinesisInputDStream, please go through the documentation.

Configure AWS Credentials using environment variables or using aws configure command.

Make sure all the resources are under the same account and region. Region of CloudFormation Stack that was created is in us-west-2 even though all the resources are in another region, this would not affect the process.

Building Executable Jar

  • Open Terminal -> Go to project root directory, in this case 
  • Run sbt assembly

The jar has been packaged into project root directory/target/scala-2.11/XXXX.jar. Name of the jar is the name that provided in build.sbt file.

Run the Jar using spark-submit

  • Open Terminal -> Go to Spark bin directory
  • Run the following command, and it looks like
./bin/spark-submit ~/Desktop/kinesis-spark-streams-dynamo/target/scala-2.11/kinesis-spark-streams-dynamo-assembly-0.1.jar appName streamName endpointUrl dynamoDbTable

To know more about how to submit applications using spark-submit, please review this.

Arguments that are passed are highlighted in the above highlighted blue box. Place the arguments as needed.

Read Kinesis Data Streams in Spark Streams

  1. Go to Amazon Kinesis Data Generator-> Sign In using Cognito user
  2. Click on Send Data, it starts sending data to Kinesis Data Streams

Data would be sent to Kinesis Data Stream, in this case, kinesis-stream, it looks like this.

Monitoring Kinesis Data Streams

Go to Amazon Kinesis Console -> Choose Data streams -> Select created Data Stream -> click on Monitoring

The terminal looks like the following when it starts receiving the data from Kinesis Data Streams

The data_dump table has the whole data that is coming from Kinesis Data Streams. And the data in the data_dump table looks like

The data_anomaly table has data where currentTemp is greater than 100. Here the anomaly is temperature greater than 100. And the data in the data_anomaly table looks like

I hope this article was helpful in setting up Kinesis Data Streams that are consumed and processed using Spark Streaming and stored in DynamoDB.

This story is authored by P V Subbareddy. He is a Big Data Engineer specializing on AWS Big Data Services and Apache Spark Ecosystem.

Case Study – Apache Log Analysis using Logstash-Elasticsearch-Kibana (ELK) Stack

In the previous blog,  we loaded apache log data into Elasticsearch with Logstash.  Now our goal  is to read this data into Kibana to help us run some analytics use cases. Quick note – the entire log file will not only be read into Elasticsearch but will also be displayed onto the standard output. It takes about 3-4 minutes to display the entire log file. ( remove “ignore_older => 0” from the config file to read older logs). To cross check if the data has been loaded and indices have been created in Elasticsearch,  type the following in the browser http://localhost:9200/_cat/indices ( replace “localhost” by the server name that Elasticsearch is running on). This will show all the indexes created, logstash will create indexes that start as logstash-*. Once you find logstash indexes, its time to get them into Kibana.

Kibana accesses Elasticsearch indices using “index patterns”.  We specify the  pattern of the index name we are searching for, and create an index pattern for Kibana to fetch the data from Elasticsearch. If the difference between index name and index pattern is not immediately clear, please wait till we create index patterns in Kibana.

Log into Kibana from browser using http://localhost:5601/ (replace “localhost” by IP/name of the server Kibana is running on). Kibana home page will open up, if it doesn’t please check that Elasticsearch and Kibana are up and running on the server. In case you need to troubleshoot, please check the earlier post on troubleshooting kibana.

From Kibana home page (left side Menu), click on “Management->Index Patterns-> “+Create Index Pattern button. The following page opens up

In the Index Patterns field, type “logstash-*” and Kibana will display all the indexes in Elasticsearch whose name matches the given pattern. Click on “next” and choose “@timestamp” so we can filter our data by time.

Click on “Create index pattern” button and an index pattern will be created with all the fields being displayed

With index pattern created, we are ready to use apache_log data in Kibana. Click on “Discover” from left side Menu and choose logstash* from the drop down and all the data from the log will be displayed here. If you are using the same log as mine, initially you will not be able to see any data. That’s because the filter field on the right corner of the page will default to time “last 15 minutes”. Since, this log is an old one, click on the time and choose “Quick ” and then select “last 5 years” option and bingo! the log data shows up on the screen.

If the above setting is not clear, please check the screenshot below

In case you need a refresher on Kibana visualizations, check this out. You can use Discoverer to get a pie-chart of the different requests coming in. So let’s say you want to analyze the various request keywords for your web server traffic. This visualization shows the various requests (aggregate by “Terms” and field is “request.keyword”) that hit the apache server.

How does it help? Well, for websites with huge volume of traffic, this helps understand the pattern of resource consumption. Common questions that we can answer:

  • Is the new blog post garnering all the attention?
  • Is it the new pair of shoes that are being seen so frequently?
  • Are people interested in self help books or easy comedy?

Another use-case may be to analyze the HTTP response codes of the web server. We are pulling up the same pie-chart for the different response codes server has generated.

What do we infer from this visual? Well, is the web server able to provide a proper response as expected? Are we returning too many ‘page not found’ errors? Why do we have too many ‘authentication failed’ errors? Are a majority of users really forgetting their passwords or something malicious is going on?

In addition, we can also create dashboard level metrics for error code like so.

For time-series analysis, we need to click on Visualise->Time series->Visual Builder. Here, the screen is divided in two horizontal planes. In the bottom plane, choose “Panel Options” tab and type the index pattern as “logstash*” and the time series will show up as a graph like so

It shows the access rate for the given time period. Since, most of the log data is around the same time, let’s change the date (from Last 5 years) to around May 18, 2015 (we can change the date as below)

and the output changes like shown below. Here, the log data has been generated for every 5 times for the particular day selected.

Let’s say this is an access log for an online shopping website and a lot of users have accessed this on May 18 2015. Why? Probably because the company has come up with certain discounts or launched a new product.  If this data is considered in real time, we can visualize the number of people accessing the server currently. If its the festive season, and we are expecting a lot of traffic, we can also foresee when the servers will be stretched based on the historical pattern and act accordingly.  It also helps in marketing and sales: a lot of people are currently logged in, should I add an additional 5% discount to amp up my sales immediately?

If it’s a banking institution that the system is designed for, we can ask questions such as: Why are so many users trying to access the system at same time? Are they really bonafide users or some malware trying to break into the server? By installing a few plugins, we can also visualize which geographic area the requests are originating from . So, we will even get to know if requests are being made from a certain place. These visualizations are really powerful and user friendly and one doesn’t need to have a lot of technical expertise to use Kibana.

That’s about it on this one. I hope the blog posts in this series on ELK stack have been useful for the interested folks to sharpen their data analytics and visualizations chops.  

P.S. : some quick troubleshooting tips on Kibana index patterns:

What if the “create index pattern” page is displaying loading wheel indefinitely on clicking “create index pattern”?

Since Kibana opens in a webpage, we can use browser troubleshooting to see what’s wrong on our page. Right click on the page->Inspect->choose console tab. This shouldn’t show any errors, there can be log messages but not error. I had the forbidden error in red. On trying to refresh any index pattern, this error came up on screen Config: Error 403 Forbidden: blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];

This implies, the indexes are all read-only and hence no changes are possible. This happens when kibana runs out of space on the server it’s installed on. We ran out of disk space and had to add more space. Kibana forces read-only on the indexes but does not get them back to normal state in an out-of-space situation. We had to manually move them out like so

curl -XPUT http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'

(localhost to be replaced by your server IP/name). On completing successfully, it displays {“acknowledged:true”} . You can refresh Kibana from the webpage and try and create the index patterns now.

Introduction and Installation of Logstash

So far in this series on ELK (Elasticsearch-Logstash-Kibana) stack, we have talked about Elasticsearch and Kibana(here and here). It’s about time we look at the third pillar of this stack – Logstash.

So, what is Logstash?

Logstash is an open source data collection engine with real-time pipelining capabilities.  This implies that Logstash can dynamically put data from different sources together and clean/filter the data and send it to any output destinations we need. It’s a pluggable architecture that allows great flexibility. Logstash can take a wide variety of unstructured data and then clean it and transform it to gain near real-time insights immediately.

What is Grok?

As mentioned in Logstash documentation -“Grok is the bread and butter of Logstash filters and is used ubiquitously to derive structure out of unstructured data.”  A bit more details on Grok will help you understand Logstash better, please see :

We will now jump directly into Logstash installation.

Run the following at command line of your ubuntu server

$sudo apt-get update && sudo apt-get install logstash

Information about updated versions of packages is checked with “update” command and hence is run before every installation.

Once the installation is complete, we can test Logstash with a very basic Logstash pipeline (the concept of Logstash pipeline is like so : {inputs->filters->outputs}).

$/usr/share/logstash/bin/logstash -e 'input { stdin { } } output { stdout {} }'

The input is stdin and the output is stdout.  Once Logstash is up and running, we can enter “hello world” on the command line and Logstash prints it back with the timestamp, version and host. This confirms that the setup is working.

If you prefer to test with a sample log file, then we can do so by  downloading apache log file and creating the configuration file required for Logstash.

Elasticsearch documentation provides sample apache log files to test Logstash. To download, type the following at command prompt


The file can also be accessed from

The configuration file can be created like so

$sudo vi /etc/logstash/conf.d/logstash.conf

This will open an empty file. Press “Esc and then i” to get into “insert” mode and copy the following:

input {
       file {
           path => "/home/<YOUR_USER_NAME"/apache_logs"
           start_position => "beginning"
           ignore_older => 0
filter {
         grok {
                 match =>  { "message" => "%{COMBINEDAPACHELOG}" }
                 match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z " ]
output {
          elasticsearch {
                     hosts => [ "localhost:9200" ]
          stdout {
                  codec => rubydebug

Please replace the placeholders above with the relevant values for you. Once done, press “ESC + wq!”. This will save the data into the file and exit the editor.

A Logstash pipeline accepts inputs and filters of our choice and outputs to destination/destinations of our choice. This is exactly what we are defining in the config file above.  In the input section, we specified the file we want Logstash to monitor and then explicitly mentioned that we want it to start reading from the beginning. Otherwise, Logstash only checks for the updates (i,e. tails the file) and by mentioning ignore_older, we are telling Logstash to consider old files too.

In the output section, we are specifying that we want the output redirected to Elasticsearch that is running on port 9200 on our local machine. Apart from this, we also want it to display the output to standard output i,e. our screen. codec is the name of Logstash codec. Codecs provide a way to encode and decode input and output data.

Run Logstash with the config file like so

$/usr/share/logstash/bin/logstash  -f  /etc/logstash/conf.d/logstash.conf

(note – Logstash dumps a lot of messages on the screen and may take sometime to start up)

Successfully started pipeline implies Logstash started correctly and is listening to any changes on the log file. Any new updates to the log file will be displayed here.

For exhaustive list of options for the config file, please check the Elasticsearch official documentation –

Examples for config file can be referenced on –

Troubleshooting Logstash installation:

It’s common to run into installation issues while setting up Logstash. We have described some common cases here.

Unrecognized VM option ‘UseParNewGC’

This indicates that the Java version on the system is not compatible with Logstash. At the time of writing of this post, Logstash requires java 8 and is not compatible with higher versions. So, first let’s check the version present on our system

$java -version

To check all the java versions available on the system, you can type

$update-java-alternatives --list

This will display the available java versions. From the screenshot below, we can see that java version 8 is already present on the system.

So, we can either make version 8 as default version or export the java_home variable. We will export java_home and then run logstash.

$echo $JAVA_HOME

$export JAVA_HOME="/usr/lib/jvm/java-8-oracle"

echo $JAVA_HOME will display the path of java home if its set. A blank implies java home variable is not set.  This command gives the complete path to java installation and we can use it in the export command to set new java_home variable.

Once export runs successfully, you can use “echo” command as shown above and check if the java_home is set to version 8. You can then start Logstash and it should run successfully.

In case java version 8 is not available on your system, you can download either from oracle or from openjdk.

$sudo apt-get install oracle-java8-installer

Installation will dump a lot of messages on the screen as shown below

You can now export this path into java_home and run Logstash again and it should be good to go.

With this we are moving closer to our final post in the ELK series, wherein we will talk about a case study outlining a real life use case involving ELK stack. Stay tuned!

Introduction to Visualizations using Kibana with Elasticsearch

We talked about Kibana installation on Ubuntu in the previous blog post, let’s look at how to setup some basic visualizations using Kibana with Elasticsearch.

Log into Kibana using Once the Kibana page opens up, from the left side Menu, click on “Management” and then choose Index Patterns->create Index Pattern and fill in the data as shown below (Once you start typing, it should display “bank” as an option). Go ahead and click next on step 2 and the index will be created and the fields will be displayed.

All the fields that are part of the accounts.json file will be displayed here (the file can be downloaded as per steps in the first blog post of this Elasticsearch series)

Let’s get a basic understanding of what Kibana offers. From left sidebar Menu, choose the first option “Discover”. Kibana will load data for index “bank”. If you already have more than one index, you can choose from the dropdown as shown below

You can use the filters on top of the screen to fetch the data of your choice

You can type them in the field provided above as follows (we are querying for all those accounts where the balance is greater than 40000)

Now, let’s see how many accounts are present for different age groups. So, select  age field and click on visualize

You will get an age wise descending graph depicting the number of accounts present for a certain age with an account balance greater than 40,000

We can also use “Visualize” option from the left side Menu to create useful visualizations for the data. Let’s create a pie chart.

Click on Visualize->Pie

Choose split slices and then aggregation=Range, Field=balance, From and To fields you can give as below or however you wish your data to be divided. The  balance ranges are all shown in different colors, depicting the number of accounts in each balance range.

Let’s try another visualization. Click on Visualization->Area(in basic charts). choose the index “bank” and then under “buckets” choose “split chart”->Aggregation=Range, Field=balance, and enter the different ranges for balance. You get to see something like this

Hmm, not very useful.  Let’s change it a bit. Choose “X-axis”under “buckets” and then repeat the same data for aggregation, field and balance ranges.

Well, that’s definitely more readable. It’s obvious that the number of accounts are higher for 10000-20000 range than 0-10000. Then, there’s a very slight increase for range 20000-30000 and then it reduces for 30000-40000.  Kibana offers a rich suite of visualizations, we just need to learn how to select the right option to make sense of the data that we have.

We can read more about the visualizations, fields, parameters and advanced options offered by Kibana on the official documentation page. You can check it out here:

Now that we are familiar with Kibana interface, lets do some basic search operations using Elasticsearch and replicate the same using Kibana. To do this, we shall use the accounts.json file again. Let’s use that data and gather some insights.

So, let’s say the bank wishes to know about the number of customers it has in each age group. We need to group together data based on customers’ age. This type of clustering of data is called aggregation and is one of the most powerful ways for data analysis.

On the command line, you can invoke this aggregation using curl (on the server running Elasticsearch)

curl -H -XGET 'localhost:9200/bank/account/_search?size=0&pretty' -d '
  "aggs": {
           "age": {
                   "terms": {
                             "field": "age"

Description of the query:

_search in the Curl request says our request is about searching the data, Size=0 specifies that we do not want the entire output of matched documents printed onto our screen.  This makes sure that only the result of our aggregation is displayed onto the screen. “pretty” indicates, the output of the query should be in readable format, otherwise, we will have a log file kind of output which is user friendly.

“aggs” says its an aggregate function that we are requesting and “age” is the name of our custom aggregate and “terms” indicates the fields we will be using for running our aggregation.  “field” implies each field we will be using.

and the output will look like

The output shows that the data has been grouped by age along with the number of customers who fall in each age group. (“key” refers to the age of the customer and “doc_count” to the total number of customers who are of the mentioned age).

Let’s visualize the same output on Kibana (you can cross check with the data on the screen above)

To sort by the age of the customer, choose Alphabetical as shown in the screen below:

Let’s say we want to know the number of customers with bank balance greater than 10000, for different age groups. This requires us to filter our search for balance greater than 10,000 and then aggregate them by the age.

If you wish to see all the documents that are being considered for aggregation, you can run the above command without the “size=0” option. It will display all the documents being considered for aggregation along with the aggregation results.

The output will look like so (Each document whose search criteria is met, will be displayed):

Querying for the same data using Python

import requests
res = requests.get('http://localhost:9200')

from elasticsearch import Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

body = {
"query": {
"term": {
"account_number": 99
}'bank', doc_type='account', body=body)
for doc in output['hits']['hits']:
print("%s) %s" % (doc['_id'], doc['_source']))

I would highly recommend that you practice these visualizations (and some more) on your local dev machine so that you can get a much deeper understanding of these concepts.