Apache Spark Archives -

Machine Learning based Fuzzy Matching using AWS Glue ML Transforms

Machine Learning Transforms in AWS Glue AWS Glue provides machine learning capabilities to create custom transforms to do Machine Learning based fuzzy matching to deduplicate and cleanse your data. For this we are going to use a transform named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even... » read more

Read

Engineering@ZenOfAI written 6 years ago

AWS Machine Learning Data Engineering Pipeline for Batch Data

This post walks you through all the steps required to build a data engineering pipeline for batch data using AWS Step Functions. The sequence of steps works like so : the ingested data arrives as a CSV file in a S3 based data lake in the landing zone, which automatically triggers a Lambda function to... » read more

Read

Engineering@ZenOfAI written 6 years ago

Processing Kinesis Data Streams with Spark Streaming

Solution Overview : In this blog, we are going to build a real time anomaly detection solution using Spark Streaming. Kinesis Data Streams would act as the input streaming source and the anomalous records would be written as Data Streams in DynamoDB. Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming... » read more

Read

Engineering@ZenOfAI written 6 years ago

Linear regression using Apache Spark MLlib

What is linear Regression? Wikipedia states – In statistics, linear regression is a linear approach to modeling the relationship between dependent variable and one or more independent variables. Linear regression is a basic and commonly used type of predictive analysis. Back to school math, every straight line can be represented by the equation: y = mx + b, where y is dependent variable... » read more

Read

Abhishek Singh written 6 years ago