Data Engineering Archives

Understanding Partition Projections in AWS Athena

If you use AWS Athena to query large, highly partitioned tables on a daily basis, you know how difficult it is to maintain the partitions. As your partitions grow, you also need to update the metadata in the Glue Data Catalog, or else the new data isn’t scanned. Some of us even... » read more

Engineering@ZenOfAI written 5 years ago

Handling Spaces in Column Names During Kinesis Firehose JSON-Parquet Data Transformation

Parquet is an open-source file format for Hadoop. Parquet stores nested data structures in a flat columnar format. Compared to the traditional row-oriented approach to storing data, Parquet is more efficient in terms of storage and performance. A common industry standard is to use Parquet files in S3 to query... » read more

Engineering@ZenOfAI written 5 years ago

Building a data lake on AWS using Redshift Spectrum

In one of our earlier posts, we talked about setting up a data lake using AWS Lake Formation. Once the data lake is set up, we can use Amazon Athena to query data. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so... » read more

Engineering@ZenOfAI written 5 years ago

Federated Querying across Relational, Non-relational, Object, and Custom Data Sources using Amazon Athena

Querying Data from DynamoDB in Amazon Athena: Amazon Athena now enables users to run SQL queries across data stored in relational, non-relational, object, and custom data sources. With federated querying, customers can submit a single SQL query that scans data from multiple sources running on-premises or hosted in the cloud. Athena executes federated queries using... » read more

Engineering@ZenOfAI written 6 years ago

How to Customize QuickSight Dashboards for User Specific Data

We have been getting a lot of queries on how to customize a single QuickSight dashboard for user-specific data. We can accomplish this by filtering the dashboard data by login username using AWS QuickSight’s Row-Level Security. To further explain this use case, let’s consider the sales department in a company. Every day your team of... » read more

Engineering@ZenOfAI written 6 years ago

Machine Learning based Fuzzy Matching using AWS Glue ML Transforms

Machine Learning Transforms in AWS Glue: AWS Glue provides machine learning capabilities for creating custom transforms that use machine-learning-based fuzzy matching to deduplicate and cleanse your data. For this we are going to use a transform named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even... » read more

Engineering@ZenOfAI written 6 years ago

Processing High Volume Big Data Concurrently with No Duplicates using AWS SQS

In this blog post, we’ll be looking at how one could leverage AWS Simple Queue Service (Standard queue) to achieve high concurrency while processing with no duplicates. We also compare it with other AWS services like DynamoDB, SQS FIFO queues, and Kinesis in terms of cost and performance. A simple use case for the below... » read more

Engineering@ZenOfAI written 6 years ago

AWS Machine Learning Data Engineering Pipeline for Batch Data

This post walks you through all the steps required to build a data engineering pipeline for batch data using AWS Step Functions. The sequence of steps works like so: the ingested data arrives as a CSV file in the landing zone of an S3-based data lake, which automatically triggers a Lambda function to... » read more

Engineering@ZenOfAI written 6 years ago

Serverless Architecture for Lightning Fast Distributed File Transfer on AWS Data Lake

Today, we are very excited to share our insights on setting up a serverless architecture that provides a lightning-fast way* to copy a large number of objects across multiple folders or partitions in an AWS data lake on S3. Typically in a data lake, data is kept across various zones depending on the data lifecycle.... » read more

Engineering@ZenOfAI written 6 years ago

Machine Learning Operations (MLOps) Pipeline using Google Cloud Composer

In an earlier post, we described the need for automating the Data Engineering pipeline for Machine Learning based systems. Today, we will expand the scope to set up a fully automated MLOps pipeline using Google Cloud Composer. Cloud Composer: Cloud Composer is officially defined as a fully managed workflow orchestration service that empowers you to... » read more

Engineering@ZenOfAI written 6 years ago