Big Data Archives

Understanding Partition Projections in AWS Athena

If you use AWS Athena to query large, highly partitioned tables on a daily basis, you know how difficult it is to maintain the partitions. As your partitions grow, you also need to update the metadata in the Glue Data Catalog, or else the new data isn’t scanned. Some of us even... » read more

Engineering@ZenOfAI written 5 years ago

Handling Spaces in Column Names During Kinesis Firehose JSON-Parquet Data Transformation

Parquet is an open source file format for Hadoop. Parquet stores nested data structures in a flat columnar format. Compared to the traditional row-oriented approach to storing data, Parquet is more efficient in terms of storage and performance. A common industry standard is to use Parquet files in S3 to query... » read more

Engineering@ZenOfAI written 5 years ago

Building a data lake on AWS using Redshift Spectrum

In one of our earlier posts, we talked about setting up a data lake using AWS Lake Formation. Once the data lake is set up, we can use Amazon Athena to query data. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so... » read more

Engineering@ZenOfAI written 5 years ago

Federated Querying across Relational, Non-relational, Object, and Custom Data Sources using Amazon Athena

Querying Data from DynamoDB in Amazon Athena: Amazon Athena now enables users to run SQL queries across data stored in relational, non-relational, object, and custom data sources. With federated querying, customers can submit a single SQL query that scans data from multiple sources running on-premises or hosted in the cloud. Athena executes federated queries using... » read more

Engineering@ZenOfAI written 6 years ago

How to Customize QuickSight Dashboards for User Specific Data

We have been getting a lot of questions about how to customize a single QuickSight dashboard for user-specific data. We can accomplish this by filtering the dashboard data by login username using AWS QuickSight’s Row-Level Security. To further explain this use case, let’s consider the sales department in a company. Every day your team of... » read more

Engineering@ZenOfAI written 6 years ago

Machine Learning based Fuzzy Matching using AWS Glue ML Transforms

Machine Learning Transforms in AWS Glue: AWS Glue provides machine learning capabilities for creating custom transforms that perform ML-based fuzzy matching to deduplicate and cleanse your data. For this, we are going to use a transform named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even... » read more

Engineering@ZenOfAI written 6 years ago

Processing High Volume Big Data Concurrently with No Duplicates using AWS SQS

In this blog post, we’ll look at how to use AWS Simple Queue Service (Standard queue) to achieve high concurrency while processing with no duplicates. We also compare it with other AWS services such as DynamoDB, SQS FIFO queues, and Kinesis in terms of cost and performance. A simple use case for the below... » read more

Engineering@ZenOfAI written 6 years ago

Real Time Streaming Data Analytics using Amazon Kinesis Family

Amazon Kinesis Data Analytics: Amazon Kinesis Data Analytics (KDA) is the easiest way to analyze streaming data, gain actionable insights, and respond to your business and customer needs in real time. KDA reduces the complexity of building, managing, and integrating streaming applications with other AWS services. SQL users can easily query streaming data or build... » read more

Engineering@ZenOfAI written 6 years ago

AWS Machine Learning Data Engineering Pipeline for Batch Data

This post walks you through all the steps required to build a data engineering pipeline for batch data using AWS Step Functions. The sequence of steps works as follows: the ingested data arrives as a CSV file in the landing zone of an S3-based data lake, which automatically triggers a Lambda function to... » read more

Engineering@ZenOfAI written 6 years ago

Advanced Analytics – Presto Functions and Operators Quick Review

This post is quite different from our earlier entries. Think of it as a reference post for people who want a quick lookup of the advanced analytics functions and operators used in modern data lake operations based on Presto. So you could, of course, use it in Presto installations, but also in some other... » read more

Engineering@ZenOfAI written 6 years ago