Data Lake Archives

Understanding Partition Projections in AWS Athena

If you use AWS Athena to query large, highly partitioned tables on a daily basis, you know how difficult it is to maintain the partitions. As your partitions grow, you also need to update the metadata in the Glue Data Catalog, or else the new data isn’t scanned. Some of us even...

Engineering@ZenOfAI written 5 years ago

Building a data lake on AWS using Redshift Spectrum

In one of our earlier posts, we talked about setting up a data lake using AWS LakeFormation. Once the data lake is set up, we can use Amazon Athena to query data. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so...

Engineering@ZenOfAI written 5 years ago

Federated Querying across Relational, Non-relational, Object, and Custom Data Sources using Amazon Athena

Querying Data from DynamoDB in Amazon Athena. Amazon Athena now enables users to run SQL queries across data stored in relational, non-relational, object, and custom data sources. With federated querying, customers can submit a single SQL query that scans data from multiple sources running on-premises or hosted in the cloud. Athena executes federated queries using...

Engineering@ZenOfAI written 6 years ago

Real Time Streaming Data Analytics using Amazon Kinesis Family

Amazon Kinesis Data Analytics. Amazon Kinesis Data Analytics (KDA) is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time. KDA reduces the complexity of building, managing, and integrating streaming applications with other AWS services. SQL users can easily query streaming data or build...

Engineering@ZenOfAI written 6 years ago

AWS Machine Learning Data Engineering Pipeline for Batch Data

This post walks you through all the steps required to build a data engineering pipeline for batch data using AWS Step Functions. The sequence of steps works like so: the ingested data arrives as a CSV file in the landing zone of an S3-based data lake, which automatically triggers a Lambda function to...

Engineering@ZenOfAI written 6 years ago

Advanced Analytics – Presto Functions and Operators Quick Review

This post is a lot different from our earlier entries. Think of it as a reference flag post for people interested in a quick lookup for advanced analytics functions and operators used in modern data lake operations based on Presto. So you could, of course, use it in Presto installations, but also in some other...

Engineering@ZenOfAI written 6 years ago

Setting Up a Data Lake on AWS Cloud Using LakeFormation

Setting up a Data Lake involves multiple steps such as collecting, cleansing, moving, and cataloging data, and then securely making that data available for downstream analytics and Machine Learning. AWS LakeFormation simplifies these processes and also automates some of them, such as data ingestion. In this post, we shall be learning how to build a very simple...

Engineering@ZenOfAI written 6 years ago

Serverless Architecture for Lightning Fast Distributed File Transfer on AWS Data Lake

Today, we are very excited to share our insights on setting up a serverless architecture that provides a lightning-fast way* to copy a large number of objects across multiple folders or partitions in an AWS data lake on S3. Typically in a data lake, data is kept across various zones depending on data lifecycle....

Engineering@ZenOfAI written 6 years ago