Machine Learning Operations (MLOps) Pipeline using Google Cloud Composer

In an earlier post, we had described the need for automating the Data Engineering pipeline for Machine Learning based systems. Today, we will expand the scope to setup a fully automated MLOps pipeline using Google Cloud Composer.

Cloud Composer

Cloud Composer is official defined as a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span across clouds and on-premises data centers. Built on the popular Apache Airflow open source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.

So let’s get on with the required steps to create this MLOps infrastructure on Google Cloud Platform

Creating a Cloud Composer Environment

Step1: Please enable the Cloud Composer API.

Step2: Go to create environment page in GCP console. Composer is available in Big Data section.

Step3: Click on create to start creating a Composer environment

Step4: Please select the Service account which has the required permissions to access GCS, Big Query, ML Engine and  Composer environment. The required roles for accessing Composer environment is Composer Administrator and Composer Worker. 
For more details about access control in Composer environment please see this.

Step5: Please use Python Version 3 and latest Image version.

Step6: Click on create. It will take about 15-20 minutes to create the environment. Once it completes, the environment page shall look like the following.

Click on Airflow to see Airflow WebUI. The Airflow WebUI looks as follows

DAGs folder is where our dag file is stored. DAG folder is nothing but a folder inside a GCS bucket which is created by the environment. To know more about the concept of DAG and general introduction to Airflow, please refer to this post.

You could see Composer related logs in Logging.

Step7: Please add the following PyPI packages in Composer environment.

Click on created environment and navigate to PYPI packages and click on edit to add packages

The required packages are:

# to read data from MongoDB
pymongo==3.8.0
oauth2client==4.1.3
# to read data from firestore
google-cloud-firestore==1.3.0
firebase-admin==2.17.0
google-api-core==1.13.0

Create a ML model

Step1: Please create a folder structure like the following on your instance.

ml_model
├── setup.py
└── trainer
    ├── __init__.py
    └── train.py

Step2: Please place the following code in train.py file, which shall upload the model to GCS bucket as shown below. This model would be used to create model versions as explained a bit later.

from google.cloud import bigquery
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
from google.cloud import storage
import datetime
import json
import pickle
client = bigquery.Client()
sql = '''
SELECT *
FROM `<PROJECT_ID>.<DATASET>.<TABLENAME>`
'''

df = client.query(sql).to_dataframe()
df = df[['is_stressed', 'is_engaged', 'status']]

df['is_stressed'] = df['is_stressed'].fillna('n')
df['is_engaged'] = df['is_engaged'].fillna('n')
df['stressed'] = np.where(df['is_stressed']=='y', 1, 0)
df['engaged'] = np.where(df['is_engaged']=='y', 1, 0)
df['status'] = np.where(df['status']=='complete', 1, 0)

feature_cols = ['stressed', 'engaged']
X = df[feature_cols]
y = df.status
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
pkl_filename = "model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(logreg, file)
BUCKET_NAME=BUCKET_NAME# Upload the model to GCS
bucket = storage.Client().bucket(BUCKET_NAME)
file_path = datetime.datetime.now().strftime('machine_learning/models/%Y%m%d_%H%M%S')
blob = bucket.blob('{}/{}'.format(
    file_path,
    pkl_filename))
blob.upload_from_filename(pkl_filename)

file_location = 'gs://{BUCKET_NAME}/{file_path}'.format(BUCKET_NAME=BUCKET_NAME, file_path=file_path)
file_config = json.dumps({'file_location': file_location})

bucket = storage.Client().bucket(COMPOSER_BUCKET)
blob = bucket.blob('data/file_config.json')
blob.upload_from_string(file_config)

Step3: Create an empty init.py file inside the trainer directory.

Step4: Please place the following code in setup.py file. The setup.py file contains required packages to execute code.

import setuptools

REQUIRED_PACKAGES = [
    'pandas-gbq==0.3.0',
    'cloudml-hypertune',
    'google-cloud-bigquery==1.14.0',
    'urllib3'
]

setuptools.setup(
    name='ml_model',
    version='1.0',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    include_package_data=True,
    description='',
)

Step5: Packaging the code using the following command. It creates a gz file inside ml_model directory.

python3 setup.py sdist

Step6: The package name is the name that is specified in setup.py file. The package name becomes ml_model-1.0.tar.gz
Copy the package to gs://{your-GCS-bucket}/machine_learning/. This becomes the base directory for your machine learning activities described in this post.

Creating a DAG

In this use case, we have created a DAG file which exports some table data from a MongoDB instance into a GCS bucket and then creates a BigQuery table off of that exported data. It trains a model and creates version for that model. The DAG file supports full data extraction and daily data extraction explained in the code below using a variable tot_data. This variable is extracted from Airflow configurations set by the user. This process is also described later in this post.

Please place the following code in the DAG file.

import airflow
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta, datetime
from airflow.operators.python_operator import PythonOperator
import pprint
import json
import re

from pymongo import MongoClient
from google.cloud import storage
from google.cloud.storage import blob
from google.cloud import storage
import os

from airflow import models
from mlengine_operator import MLEngineTrainingOperator, MLEngineVersionOperator

ts = datetime.now()
today = str(ts.date()) + 'T00:00:00.000Z'
yester_day = str(ts.date() - timedelta(days = 1)) + 'T00:00:00.000Z'

str_ts = ts.strftime('%Y_%m_%d_%H_%m_%S')

config = Variable.get("mongo_conf", deserialize_json=True)
host = config['host']
db_name = config['db_name']
table_name = config['table_name']
file_prefix = config['file_prefix']
bucket_name = config['bucket_name']
# file_path = file_prefix + '/' + table_name + '.json'
file_path = '{file_prefix}/{table_name}/{table_name}_{str_ts}.json'.format(file_prefix=file_prefix, str_ts=str_ts, table_name=table_name)
file_location = 'gs://' + bucket_name + '/' + file_prefix + '/' + table_name + '/' + table_name + '_*.json'
config['file_location'] = file_location
bq_dataset = config['bq_dataset']
tot_data = config['tot_data'].lower()

BUCKET_NAME = config['ml_configuration']['BUCKET_NAME']
BASE_DIR = config['ml_configuration']['BASE_DIR']
PACKAGE_NAME = config['ml_configuration']['PACKAGE_NAME']
TRAINER_BIN = os.path.join(BASE_DIR, 'packages', PACKAGE_NAME)
TRAINER_MODULE = config['ml_configuration']['TRAINER_MODULE']
RUNTIME_VERSION = config['ml_configuration']['RUNTIME_VERSION']
PROJECT_ID = config['ml_configuration']['PROJECT_ID']
MODEL_NAME = config['ml_configuration']['MODEL_NAME']

MODEL_FILE_BUCKET = config['ml_configuration']['MODEL_FILE_BUCKET']
model_file_loc = config['ml_configuration']['MODEL_FILE_LOCATION']

bucket = storage.Client().bucket(MODEL_FILE_BUCKET)
blob = bucket.get_blob(model_file_loc)
file_config = json.loads(blob.download_as_string().decode("utf-8"))
export_uri = file_config['file_location']

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

def mongoexport():
        client = storage.Client()
        bucket = client.get_bucket(bucket_name)
        blob = bucket.blob(file_path)

        client = MongoClient(host)
        db = client[db_name]
        tasks = db[table_name]
        pprint.pprint(tasks.count_documents({}))
        # if tot_data is set to 'yes' in airflow configurations, full data 
        # is processed.  
        if tot_data == 'no':
          query = {"edit_datetime": { "$gte": yester_day, "$lt": today}}
          print(query)
          data = tasks.find(query)
        else:
          data = tasks.find()
        emp_list = []
        for record in data:
                emp_list.append(json.dumps(record, default=str))
        flat_list =[]
        for data in emp_list:
                flat_list.append((flatten_json(json.loads(data))))
        data = '\n'.join(json.dumps({re.sub('[^0-9a-zA-Z_ ]+', '', str(k)).lower().replace(' ', '_'): str(v) for k, v in record.items()}) for record in flat_list)
        blob.upload_from_string(data)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('ml_pipeline', schedule_interval=None, default_args=default_args) as dag:

    # priority_weight has type int in Airflow DB, uses the maximum.
    pymongo_export_op = PythonOperator(
        task_id='pymongo_export',
        python_callable=mongoexport,
        )

    update_bq_table_op = BashOperator(
        task_id='update_bq_table',
        bash_command='''
        bq rm -f {bq_dataset}.{table_name}
        bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values=True {bq_dataset}.{table_name} {file_location}
        '''.format(bq_dataset=bq_dataset, table_name=table_name, file_location=file_location)
        )

    date_nospecial = '{{ execution_date.strftime("%Y%m%d") }}'
    date_min_nospecial = '{{ execution_date.strftime("%Y%m%d_%H%m") }}'
    uuid = '{{ macros.uuid.uuid4().hex[:8] }}'

    training_op = MLEngineTrainingOperator(
      task_id='submit_job_for_training',
      project_id=PROJECT_ID,
      job_id='{}_{}_{}'.format(table_name, date_nospecial, uuid),
      package_uris=[os.path.join(TRAINER_BIN)],
      training_python_module=TRAINER_MODULE,
      training_args=[
          '--base-dir={}'.format(BASE_DIR),
          '--event-date={}'.format(date_nospecial),
      ],
      region='us-central1',
      runtime_version=RUNTIME_VERSION,
      python_version='3.5')

    create_version_op = MLEngineVersionOperator(
      task_id='create_version',
      project_id=PROJECT_ID,
      model_name=MODEL_NAME,
      version={
          'name': 'version_{}_{}'.format(date_min_nospecial, uuid),
          'deploymentUri': export_uri,
          'runtimeVersion': RUNTIME_VERSION,
          'pythonVersion': '3.5',
          'framework': 'SCIKIT_LEARN',
      },
      operation='create')

    pymongo_export_op >> update_bq_table_op >> training_op >> create_version_op

Once file is created, please upload the file to DAGs folder. And also please add the following plugin dependency file named mlengine_operator in DAGs folder.
Place the following code in mlengine_operator.py file.

import re

from apiclient import errors

from airflow.contrib.hooks.gcp_mlengine_hook import MLEngineHook
from airflow.exceptions import AirflowException
from airflow.operators import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.utils.log.logging_mixin import LoggingMixin

log = LoggingMixin().log


def _normalize_mlengine_job_id(job_id):

    # Add a prefix when a job_id starts with a digit or a template
    match = re.search(r'\d|\{{2}', job_id)
    if match and match.start() is 0:
        job = 'z_{}'.format(job_id)
    else:
        job = job_id

    # Clean up 'bad' characters except templates
    tracker = 0
    cleansed_job_id = ''
    for m in re.finditer(r'\{{2}.+?\}{2}', job):
        cleansed_job_id += re.sub(r'[^0-9a-zA-Z]+', '_',
                                  job[tracker:m.start()])
        cleansed_job_id += job[m.start():m.end()]
        tracker = m.end()

    # Clean up last substring or the full string if no templates
    cleansed_job_id += re.sub(r'[^0-9a-zA-Z]+', '_', job[tracker:])

    return cleansed_job_id


class MLEngineBatchPredictionOperator(BaseOperator):
   
    template_fields = [
        '_project_id',
        '_job_id',
        '_region',
        '_input_paths',
        '_output_path',
        '_model_name',
        '_version_name',
        '_uri',
    ]

    @apply_defaults
    def __init__(self,
                 project_id,
                 job_id,
                 region,
                 data_format,
                 input_paths,
                 output_path,
                 model_name=None,
                 version_name=None,
                 uri=None,
                 max_worker_count=None,
                 runtime_version=None,
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 *args,
                 **kwargs):
        super(MLEngineBatchPredictionOperator, self).__init__(*args, **kwargs)

        self._project_id = project_id
        self._job_id = job_id
        self._region = region
        self._data_format = data_format
        self._input_paths = input_paths
        self._output_path = output_path
        self._model_name = model_name
        self._version_name = version_name
        self._uri = uri
        self._max_worker_count = max_worker_count
        self._runtime_version = runtime_version
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

        if not self._project_id:
            raise AirflowException('Google Cloud project id is required.')
        if not self._job_id:
            raise AirflowException(
                'An unique job id is required for Google MLEngine prediction '
                'job.')

        if self._uri:
            if self._model_name or self._version_name:
                raise AirflowException('Ambiguous model origin: Both uri and '
                                       'model/version name are provided.')

        if self._version_name and not self._model_name:
            raise AirflowException(
                'Missing model: Batch prediction expects '
                'a model name when a version name is provided.')

        if not (self._uri or self._model_name):
            raise AirflowException(
                'Missing model origin: Batch prediction expects a model, '
                'a model & version combination, or a URI to a savedModel.')

    def execute(self, context):
        job_id = _normalize_mlengine_job_id(self._job_id)
        prediction_request = {
            'jobId': job_id,
            'predictionInput': {
                'dataFormat': self._data_format,
                'inputPaths': self._input_paths,
                'outputPath': self._output_path,
                'region': self._region
            }
        }

        if self._uri:
            prediction_request['predictionInput']['uri'] = self._uri
        elif self._model_name:
            origin_name = 'projects/{}/models/{}'.format(
                self._project_id, self._model_name)
            if not self._version_name:
                prediction_request['predictionInput'][
                    'modelName'] = origin_name
            else:
                prediction_request['predictionInput']['versionName'] = \
                    origin_name + '/versions/{}'.format(self._version_name)

        if self._max_worker_count:
            prediction_request['predictionInput'][
                'maxWorkerCount'] = self._max_worker_count

        if self._runtime_version:
            prediction_request['predictionInput'][
                'runtimeVersion'] = self._runtime_version

        hook = MLEngineHook(self._gcp_conn_id, self._delegate_to)

        # Helper method to check if the existing job's prediction input is the
        # same as the request we get here.
        def check_existing_job(existing_job):
            return existing_job.get('predictionInput', None) == \
                prediction_request['predictionInput']

        try:
            finished_prediction_job = hook.create_job(
                self._project_id, prediction_request, check_existing_job)
        except errors.HttpError:
            raise

        if finished_prediction_job['state'] != 'SUCCEEDED':
            self.log.error('MLEngine batch prediction job failed: {}'.format(
                str(finished_prediction_job)))
            raise RuntimeError(finished_prediction_job['errorMessage'])

        return finished_prediction_job['predictionOutput']


class MLEngineModelOperator(BaseOperator):
    template_fields = [
        '_model',
    ]

    @apply_defaults
    def __init__(self,
                 project_id,
                 model,
                 operation='create',
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 *args,
                 **kwargs):
        super(MLEngineModelOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._model = model
        self._operation = operation
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

    def execute(self, context):
        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)
        if self._operation == 'create':
            return hook.create_model(self._project_id, self._model)
        elif self._operation == 'get':
            return hook.get_model(self._project_id, self._model['name'])
        else:
            raise ValueError('Unknown operation: {}'.format(self._operation))


class MLEngineVersionOperator(BaseOperator):
    
    template_fields = [
        '_model_name',
        '_version_name',
        '_version',
    ]

    @apply_defaults
    def __init__(self,
                 project_id,
                 model_name,
                 version_name=None,
                 version=None,
                 operation='create',
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 *args,
                 **kwargs):

        super(MLEngineVersionOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._model_name = model_name
        self._version_name = version_name
        self._version = version or {}
        self._operation = operation
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

    def execute(self, context):
        if 'name' not in self._version:
            self._version['name'] = self._version_name

        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)

        if self._operation == 'create':
            assert self._version is not None
            return hook.create_version(self._project_id, self._model_name,
                                       self._version)
        elif self._operation == 'set_default':
            return hook.set_default_version(self._project_id, self._model_name,
                                            self._version['name'])
        elif self._operation == 'list':
            return hook.list_versions(self._project_id, self._model_name)
        elif self._operation == 'delete':
            return hook.delete_version(self._project_id, self._model_name,
                                       self._version['name'])
        else:
            raise ValueError('Unknown operation: {}'.format(self._operation))


class MLEngineTrainingOperator(BaseOperator):
    
    template_fields = [
        '_project_id',
        '_job_id',
        '_package_uris',
        '_training_python_module',
        '_training_args',
        '_region',
        '_scale_tier',
        '_runtime_version',
        '_python_version',
        '_job_dir'
    ]

    @apply_defaults
    def __init__(self,
                 project_id,
                 job_id,
                 package_uris,
                 training_python_module,
                 training_args,
                 region,
                 scale_tier=None,
                 runtime_version=None,
                 python_version=None,
                 job_dir=None,
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 mode='PRODUCTION',
                 *args,
                 **kwargs):
        super(MLEngineTrainingOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._job_id = job_id
        self._package_uris = package_uris
        self._training_python_module = training_python_module
        self._training_args = training_args
        self._region = region
        self._scale_tier = scale_tier
        self._runtime_version = runtime_version
        self._python_version = python_version
        self._job_dir = job_dir
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to
        self._mode = mode

        if not self._project_id:
            raise AirflowException('Google Cloud project id is required.')
        if not self._job_id:
            raise AirflowException(
                'An unique job id is required for Google MLEngine training '
                'job.')
        if not package_uris:
            raise AirflowException(
                'At least one python package is required for MLEngine '
                'Training job.')
        if not training_python_module:
            raise AirflowException(
                'Python module name to run after installing required '
                'packages is required.')
        if not self._region:
            raise AirflowException('Google Compute Engine region is required.')

    def execute(self, context):
        job_id = _normalize_mlengine_job_id(self._job_id)
        training_request = {
            'jobId': job_id,
            'trainingInput': {
                'scaleTier': self._scale_tier,
                'packageUris': self._package_uris,
                'pythonModule': self._training_python_module,
                'region': self._region,
                'args': self._training_args,
            }
        }

        if self._runtime_version:
            training_request['trainingInput']['runtimeVersion'] = self._runtime_version

        if self._python_version:
            training_request['trainingInput']['pythonVersion'] = self._python_version

        if self._job_dir:
            training_request['trainingInput']['jobDir'] = self._job_dir

        if self._mode == 'DRY_RUN':
            self.log.info('In dry_run mode.')
            self.log.info('MLEngine Training job request is: {}'.format(
                training_request))
            return

        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)

        # Helper method to check if the existing job's training input is the
        # same as the request we get here.
        def check_existing_job(existing_job):
            return existing_job.get('trainingInput', None) == \
                training_request['trainingInput']

        try:
            finished_training_job = hook.create_job(
                self._project_id, training_request, check_existing_job)
        except errors.HttpError:
            raise

        if finished_training_job['state'] != 'SUCCEEDED':
            self.log.error('MLEngine training job failed: {}'.format(
                str(finished_training_job)))
            raise RuntimeError(finished_training_job['errorMessage'])

Import variables from composer_conf.json file into Airflow Variables.
Go to Airflow WebUI → Admin → Variables → Browse to file path or configure variables manually.
Please place the following in composer_conf

{
  "mongo_conf": {
    "host": "mongodb://<instance-internal-ip>:27017",
    "db_name": "DBNAME",
    "table_name": "TABLENAME",
    "file_prefix": "Folder In GCS Bucket",
    "bq_dataset": "BigQuery Dataset",
    "bucket_name": "GCS Bucket",
    "tot_data": "yes",
    "ml_configuration": {
      "BUCKET_NAME": "GCS Bucket",
      "BASE_DIR": "gs://GCS Bucket/machine_learning/",
      "PACKAGE_NAME": "PACKAGE NAME FROM setup.py FILE in ML",
      "TRAINER_MODULE": "trainer.train",
      "RUNTIME_VERSION": "1.13",
      "PROJECT_ID": "GCP Project",
      "MODEL_FILE_BUCKET": "BUCKET CREATED BY Composer Environment",
      "MODEL_FILE_LOCATION": "data/MODEL LOCATION FILE",
      "MODEL_NAME": "MODEL_NAME"
    }
  }

Please store any configuration files or credentials file that are used by Composer in the data folder in the bucket created by Composer environment.

After configuring variables accordingly, you can see the DAG named ml_pipeline in the Airflow WebUI.

Please trigger the DAG file from Airflow WebUI. Once the DAG ran successfully. It looks like the following:

Thanks for the read and look forward to your comments.

This story is authored by PV Subbareddy. Subbareddy is a Big Data Engineer specializing on Cloud Big Data Services and Apache Spark Ecosystem.

Face Recognition App In React Native using AWS Rekognition

In this blog we are going to build an app for registering faces and verifying faces using Amazon Rekognition in React Native.

Installing dependencies:

Let’s go to React Native Docs, select React Native CLI Quickstart and select our appropriate Development OS and the Target OS as Android, as we are going to build an android application.

Follow the docs for installing dependencies, after installing create a new React Native Application. Use the command line interface to generate a new React Native project called FaceRegister.

react-native init FaceRegister

Preparing the Android device:

We shall need an Android device to run our React Native Android app. This can be either a physical Android device, or more commonly, we can use an Android Virtual Device (AVD) which allows us to emulate an Android device on our computer (using Android Studio).

Either way, we shall need to prepare the device to run Android apps for development.
If you have a physical Android device, you can use it for development in place of an AVD by connecting it to your computer using a USB cable and following the instructions here.

If you are using a virtual device follow this link. I shall be using physical android device.
Now go to the command line and run react-native run-android inside your React Native app directory:

cd FaceRegister
react-native run-android

If everything is set up correctly, you should see your new app running in your physical device or Android emulator.

In your system, you should see a folder named FaceRegister created. Now open FaceRegister folder with your favorite code editor and create a file called Register.js. We need an input box for the username or id for referring the image and a placeholder to preview the captured image and a submit button to register.

Open your Register.js file and copy the below code:

import React from 'react';
import { StyleSheet, View, Text, TextInput, Image, ScrollView, TouchableHighlight } from 'react-native';

class LoginScreen extends React.Component {
    constructor(props){
       super(props);
       this.state =  {
           username : '',
           capturedImage : ''
       };
   }

  
   render() {
       return (
           <View style={styles.MainContainer}>
               <ScrollView>
                   <Text style= {{ fontSize: 20, color: "#000", textAlign: 'center', marginBottom: 15, marginTop: 10 }}>Register Face</Text>
              
                   <TextInput
                       placeholder="Enter Username"
                       onChangeText={UserName => this.setState({username: UserName})}
                       underlineColorAndroid='transparent'
                       style={styles.TextInputStyleClass}
                   />
                   {this.state.capturedImage !== "" && <View style={styles.imageholder} >
                       <Image source={{uri : this.state.capturedImage}} style={styles.previewImage} />
                   </View>}
                  

                   <TouchableHighlight style={[styles.buttonContainer, styles.captureButton]}>
                       <Text style={styles.buttonText}>Capture Image</Text>
                   </TouchableHighlight>

                   <TouchableHighlight style={[styles.buttonContainer, styles.submitButton]}>
                       <Text style={styles.buttonText}>Submit</Text>
                   </TouchableHighlight>
               </ScrollView>
           </View>
       );
   }
}

const styles = StyleSheet.create({
   MainContainer: {
       marginTop: 60
   },
   TextInputStyleClass: {
     textAlign: 'center',
     marginBottom: 7,
     height: 40,
     borderWidth: 1,
     margin: 10,
     borderColor: '#D0D0D0',
     borderRadius: 5 ,
   },
   inputContainer: {
     borderBottomColor: '#F5FCFF',
     backgroundColor: '#FFFFFF',
     borderRadius:30,
     borderBottomWidth: 1,
     width:300,
     height:45,
     marginBottom:20,
     flexDirection: 'row',
     alignItems:'center'
   },
   buttonContainer: {
     height:45,
     flexDirection: 'row',
     alignItems: 'center',
     justifyContent: 'center',
     marginBottom:20,
     width:"80%",
     borderRadius:30,
     marginTop: 20,
     marginLeft: 5,
   },
   captureButton: {
     backgroundColor: "#337ab7",
     width: 350,
   },
   buttonText: {
     color: 'white',
     fontWeight: 'bold',
   },
   horizontal: {
     flexDirection: 'row',
     justifyContent: 'space-around',
     padding: 10
   },
   submitButton: {
     backgroundColor: "#C0C0C0",
     width: 350,
     marginTop: 5,
   },
   imageholder: {
     borderWidth: 1,
     borderColor: "grey",
     backgroundColor: "#eee",
     width: "50%",
     height: 150,
     marginTop: 10,
     marginLeft: 90,
     flexDirection: 'row',
     alignItems:'center'
   },
   previewImage: {
     width: "100%",
     height: "100%",
   }
});

export default LoginScreen;

Now import your Register file in your App.js file which is located in your project root folder. Open your App.js file and replace it with the below code:

import React, {Component} from 'react';
import {View} from 'react-native';
import LoginScreen from './LoginScreen';

class App extends Component {
   render() {
       return (
       <View>
           <LoginScreen />
       </View>
       );
   }
}

export default App;

Now run your app again. Run below command in the project directory:

react-native run-android

You can see a Text input for username and two buttons one(Capture image) for capturing an image and another(Submit) for submitting the details as shown below:

Let’s add the functionality to preview the captured image. We have a package called react-native-image-picker that enables to capture a picture from the device’s camera or to upload an image from the gallery. Go to the command line, in the project directory run the below command to install react-native-image-picker library:

yarn add react-native-image-picker || npm install --save react-native-image-picker

react-native link react-native-image-picker

Add the required permissions in the AndroidManifest.xml file which is located at android/app/src/main/:

<uses-permission android:name="android.permission.CAMERA" />
<uses-permission android:name="android.permission.WRITE_EXTERNAL_STORAGE"/>

For more information about this package follow this link.
Now add the below code in your Register.js file.

import React from 'react';
...
...
import ImagePicker from "react-native-image-picker"; //import this

class LoginScreen extends React.Component {
    constructor(props){
      ...
   }

//Add the below method...

   captureImageButtonHandler = () => {
       ImagePicker.showImagePicker({title: "Pick an Image", maxWidth: 800, maxHeight: 600}, (response) => {
           console.log('Response = ', response);
           // alert(response)
           if (response.didCancel) {
               console.log('User cancelled image picker');
           } else if (response.error) {
               console.log('ImagePicker Error: ', response.error);
           } else if (response.customButton) {
               console.log('User tapped custom button: ', response.customButton);
           } else {
               // You can also display the image using data:
               const source = { uri: 'data:image/jpeg;base64,' + response.data };
          
               this.setState({capturedImage: response.uri, base64String: source.uri });
           }
       });
   }
  
   render() {
       return (
           <View style={styles.MainContainer}>
               ...
               ...
             // Add onPress property to capture image button //
           
              <TouchableHighlight style={[styles.buttonContainer, styles.loginButton]} onPress={this.captureImageButtonHandler}>
                       <Text style={styles.loginText}>Capture Image</Text>
               </TouchableHighlight>
              ...
              ...   
           </View>
       );
   }
}

const styles = StyleSheet.create({
...
...
...

});

export default LoginScreen;

Add the captureImageButtonHandler() method in the file and add the onPress property to Capture Image button to call this method. After updating the code, reload your app. Now you can access your camera and gallery by clicking on the Capture Image button. Once you capture an image you can see the preview of that image on your screen as below:

Now we need to register the captured image by storing it in S3 bucket.

I have created an API in API Gateway from AWS console which invokes a lambda function (register-face). All I have to do is to send a POST request to the API url endpoint from client-side.

In the below image I created two resources one for adding faces and another for searching face:

This is the lambda function called register-face that is invoked when we click on the Submit button.

const AWS = require('aws-sdk')
var rekognition = new AWS.Rekognition()
var s3Bucket = new AWS.S3( { params: {Bucket: "<bucket-name>"} } );
var fs = require('fs');

exports.handler = (event, context, callback) => {
    let parsedData = JSON.parse(event)
    let encodedImage = parsedData.Image;
    var filePath = "registered/" + parsedData.name;
    console.log(filePath)
    let decodedImage = new Buffer(encodedImage.replace(/^data:image\/\w+;base64,/, ""),'base64')
    var data = {
        Key: filePath, 
        Body: decodedImage,
        ContentEncoding: 'base64',
        ContentType: 'image/jpeg'
    };
    s3Bucket.putObject(data, function(err, data){
        if (err) { 
            console.log('Error uploading data: ', data);
            callback(err, null);
        } else {
            console.log('succesfully uploaded the image!');
            callback(null, data);
        }
    });
};

In the above code, I am storing the image in the registered folder (prefix) in the S3 bucket.

Just uploading faces in S3 bucket is not enough, we need to create a collection in an AWS region to store the registered faces from S3 bucket. Because we also need to add the verification or recognition process whether the face is registered or not. For that, we shall be using Amazon Rekognition to search faces in the collection. In Amazon Rekognition there is an operation called SearchFacesByImage which searches the image from the collection. Go through the Searching Faces in a Collection to know more. 
Add the below code to the register-face lambda function.

var params ={
        CollectionId: "<collection-id>", 
        DetectionAttributes: [], 
        ExternalImageId: parsedData.name, 
        Image: {
            S3Object: {
                Bucket: "<bucket-name>", 
                Name: filePath
            }
        }
    }
    setTimeout(function () {
        rekognition.indexFaces(params, function(err, data) {
            if (err){
                console.log(err, err.stack); // an error occurred
                callback(err)
            }
            else{
                console.log(data); // successful response
                callback(null,data);
            }
        });
    }, 3000);

So, the final lambda function looks as below:

const AWS = require('aws-sdk')
var rekognition = new AWS.Rekognition()
var s3Bucket = new AWS.S3( { params: {Bucket: "<bucket-name>"} } );
var fs = require('fs');

exports.handler = (event, context, callback) => {
    console.log(event);
    console.log(typeof event);
    console.log(JSON.parse(event));
    let parsedData = JSON.parse(event)
    let encodedImage = parsedData.Image;
    var filePath = "registered/" + parsedData.name;
    console.log(filePath)
    let buf = new Buffer(encodedImage.replace(/^data:image\/\w+;base64,/, ""),'base64')
    var data = {
        Key: filePath, 
        Body: buf,
        ContentEncoding: 'base64',
        ContentType: 'image/jpeg'
    };
    s3Bucket.putObject(data, function(err, data){
        if (err) { 
            console.log('Error uploading data: ', data);
            callback(err, null);
        } else {
            console.log('succesfully uploaded the image!');
            // callback(null, data);
        }
    });
    var params ={
        CollectionId: "face-collection", 
        DetectionAttributes: [], 
        ExternalImageId: parsedData.name, 
        Image: {
            S3Object: {
                Bucket: "face-recognise-test", 
                Name: filePath
            }
        }
    }
    setTimeout(function () {
        rekognition.indexFaces(params, function(err, data) {
            if (err){
                console.log(err, err.stack); // an error occurred
                callback(err)
            }
            else{
                console.log(data);           // successful response
                callback(null,data);
            }
        });
    }, 3000);
};

In this lambda function initially, I’m storing the image(face) in S3 bucket and then adding the same face to collection from S3 bucket.

Let’s get back to our client-side and install the aws-amplify library in our project root directory from the command line with below commands:

npm install --save aws-amplify
npm install --save aws-amplify-react-native
(or)
yarn add aws-amplify
yarn add aws-amplify-react-native

Now add the below code in Register.js file:

import React from 'react';
...
...
import Amplify, {API} from "aws-amplify";
Amplify.configure({
   API: {
       endpoints: [
           {
               name: "<API-name>",
               endpoint: "<your endpoint url>"
           }
       ]
   }
});

class Registration extends React.Component {
    constructor(props){
      ...
      ...
    }
   submitButtonHandler = () => {
       if (this.state.username == '' || this.state.username == undefined || this.state.username == null) {
           alert("Please Enter the Username");
       } else if (this.state.userId == '' || this.state.userId == undefined || this.state.userId == null) {
           alert("Please Enter the UserId");
       } else if(this.state.capturedImage == '' || this.state.capturedImage == undefined || this.state.capturedImage == null) {
           alert("Please Capture the Image");
       } else {
           const apiName = "<API-name>";
           const path = "<your path>";
           const init = {
               headers : {
                   'Accept': 'application/json',
                   "X-Amz-Target": "RekognitionService.IndexFaces",
                   "Content-Type": "application/x-amz-json-1.1"
               },
               body : JSON.stringify({
                   Image: this.state.base64String,
                   name: this.state.username
               })
           }
          
           API.post(apiName, path, init).then(response => {
               alert(JSON.stringify(response))
           });
       }
   }
   
   render() {
       if(this.state.image!=="") {
           // alert(this.state.image)
       }
       return (
           <View style={styles.MainContainer}>
               <ScrollView>
...
...

                   <TouchableHighlight style={[styles.buttonContainer, styles.signupButton]} onPress={this.submitButtonHandler}>
                       <Text style={styles.signupText}>Submit</Text>
                   </TouchableHighlight>
...
...   
            </ScrollView>
           </View>
       );
   }
}

In the above code, we added configuration to the API Gateway using amplify and created a method called submitButtonHandler() where we are going to do a POST request to the lambda function to register the face when the user clicks on the submit button. So, we have added the onPress property to submit button which calls the submitButtonHandler().

Here is the complete code for Register.js file:

import React, {Component} from 'react';
import { StyleSheet, View, Text, TextInput, Image, ScrollView, TouchableHighlight } from 'react-native';
import ImagePicker from "react-native-image-picker";
import Amplify, {API} from "aws-amplify";
Amplify.configure({
    API: {
        endpoints: [
            {
                name: "<api-name>",
                Endpoint: "<your endpoint url>"
            }
        ]
    }
});

class Registration extends Component {
  
    constructor(props){
        super(props);
        this.state =  {
            username : '',
            capturedImage : ''
        };
        // this.submitButtonHandler = this.submitButtonHandler.bind(this);
    }

    captureImageButtonHandler = () => {
        ImagePicker.showImagePicker({title: "Pick an Image", maxWidth: 800, maxHeight: 600}, (response) => {
            console.log('Response = ', response);
            // alert(response)
            if (response.didCancel) {
                console.log('User cancelled image picker');
            } else if (response.error) {
                console.log('ImagePicker Error: ', response.error);
            } else if (response.customButton) {
                console.log('User tapped custom button: ', response.customButton);
            } else {
                // You can also display the image using data:
                const source = { uri: 'data:image/jpeg;base64,' + response.data };
            
                this.setState({capturedImage: response.uri, base64String: source.uri });
            }
        });
    }

    submitButtonHandler = () => {
        if (this.state.username == '' || this.state.username == undefined || this.state.username == null) {
            alert("Please Enter the Username");
        } else if(this.state.capturedImage == '' || this.state.capturedImage == undefined || this.state.capturedImage == null) {
            alert("Please Capture the Image");
        } else {
            const apiName = "<api-name>";
            const path = "<your path>";
            const init = {
                headers : {
                    'Accept': 'application/json',
                    "X-Amz-Target": "RekognitionService.IndexFaces",
                    "Content-Type": "application/x-amz-json-1.1"
                },
                body : JSON.stringify({ 
                    Image: this.state.base64String,
                    name: this.state.username
                })
            }
            
            API.post(apiName, path, init).then(response => {
                alert(response);
            });
        }
    }

    render() {
        if(this.state.image!=="") {
            // alert(this.state.image)
        }
        return (
            <View style={styles.MainContainer}>
                <ScrollView>
                    <Text style= {{ fontSize: 20, color: "#000", textAlign: 'center', marginBottom: 15, marginTop: 10 }}>Register Face</Text>
                
                    <TextInput
                        placeholder="Enter Username"
                        onChangeText={UserName => this.setState({username: UserName})}
                        underlineColorAndroid='transparent'
                        style={styles.TextInputStyleClass}
                    />


                    {this.state.capturedImage !== "" && <View style={styles.imageholder} >
                        <Image source={{uri : this.state.capturedImage}} style={styles.previewImage} />
                    </View>}

                    <TouchableHighlight style={[styles.buttonContainer, styles.captureButton]} onPress={this.captureImageButtonHandler}>
                        <Text style={styles.buttonText}>Capture Image</Text>
                    </TouchableHighlight>

                    <TouchableHighlight style={[styles.buttonContainer, styles.submitButton]} onPress={this.submitButtonHandler}>
                        <Text style={styles.buttonText}>Submit</Text>
                    </TouchableHighlight>
                </ScrollView>
            </View>
        );
    }
}

const styles = StyleSheet.create({
    TextInputStyleClass: {
      textAlign: 'center',
      marginBottom: 7,
      height: 40,
      borderWidth: 1,
      margin: 10,
      borderColor: '#D0D0D0',
      borderRadius: 5 ,
    },
    inputContainer: {
      borderBottomColor: '#F5FCFF',
      backgroundColor: '#FFFFFF',
      borderRadius:30,
      borderBottomWidth: 1,
      width:300,
      height:45,
      marginBottom:20,
      flexDirection: 'row',
      alignItems:'center'
    },
    buttonContainer: {
      height:45,
      flexDirection: 'row',
      alignItems: 'center',
      justifyContent: 'center',
      marginBottom:20,
      width:"80%",
      borderRadius:30,
      marginTop: 20,
      marginLeft: 5,
    },
    captureButton: {
      backgroundColor: "#337ab7",
      width: 350,
    },
    buttonText: {
      color: 'white',
      fontWeight: 'bold',
    },
    horizontal: {
      flexDirection: 'row',
      justifyContent: 'space-around',
      padding: 10
    },
    submitButton: {
      backgroundColor: "#C0C0C0",
      width: 350,
      marginTop: 5,
    },
    imageholder: {
      borderWidth: 1,
      borderColor: "grey",
      backgroundColor: "#eee",
      width: "50%",
      height: 150,
      marginTop: 10,
      marginLeft: 90,
      flexDirection: 'row',
      alignItems:'center'
    },
    previewImage: {
      width: "100%",
      height: "100%",
    }
});

export default Registration;

Now reload your application and register the image(face).

After registering successfully you will receive an alert message as below:

Now go to your S3 bucket and check if the image is stored as below:

And also check in your collection using below command from your command line:

aws rekognition list-faces --collection-id "<your collection id>"

You will get a JSON data with list of faces that are registered as output. So, the registration process is working successfully. Now we need to add the verification/searchface process to our application. I created another lambda function (searchFace) for face verification. Here is the code for face verification lambda function.

const AWS = require('aws-sdk')
var rekognition = new AWS.Rekognition()
var s3Bucket = new AWS.S3( { params: {Bucket: "<bucket-name>"} } );
var fs = require('fs');

exports.handler = (event, context, callback) => {
    let parsedData = JSON.parse(event)
    let encodedImage = parsedData.Image;
    var filePath = parsedData.name + ".jpg";
    console.log(filePath)
    let decodedImage = new Buffer(encodedImage.replace(/^data:image\/\w+;base64,/, ""),'base64')
    var data = {
        Key: filePath, 
        Body: decodedImage,
        ContentEncoding: 'base64',
        ContentType: 'image/jpeg'
    };
    s3Bucket.putObject(data, function(err, data){
        if (err) { 
            console.log('Error uploading data: ', data);
            callback(err);
        } else {
            console.log('succesfully uploaded the image!');
            // callback(null, data);
        }
    });
    var params2 ={
        CollectionId: "<collectio-id>", 
        FaceMatchThreshold: 85, 
        Image: {
            S3Object: {
                Bucket: "<bucket-name>", 
                Name: filePath
            }
        }, 
        MaxFaces: 5
    }
    setTimeout(function () {
        rekognition.searchFacesByImage(params2, function(err, data) {
            if (err){
                console.log(err, err.stack); // an error occurred
                callback(err)
            }
            else{
                console.log(data);           // successful response
                callback(null,data);
            }
        });
    }, 2000);
};

In the above lambda function, we are using SearchFacesByImage. It searches the image from the collection. The response will be a JSON object. Now create a new file called Verification.js in your project root directory and copy the below code in it:

import React, {Component} from 'react';
import { StyleSheet, View, Text, TextInput, Image, ScrollView, TouchableHighlight } from 'react-native';
import ImagePicker from "react-native-image-picker";
import Amplify, {API} from "aws-amplify";
Amplify.configure({
   API: {
       endpoints: [
           {
               name: "<API-name>",
               endpoint: "<your endpoint url>"
           }
       ]
   }
});

class Verification extends Component {
    constructor(props){
       super(props);
       this.state =  {
           username: ''
           capturedImage : ''
       };
   }

   captureImageButtonHandler = () => {
       ImagePicker.showImagePicker({title: "Pick an Image", maxWidth: 800, maxHeight: 600}, (response) => {
           console.log('Response = ', response);
           // alert(response)
           if (response.didCancel) {
               console.log('User cancelled image picker');
           } else if (response.error) {
               console.log('ImagePicker Error: ', response.error);
           } else if (response.customButton) {
               console.log('User tapped custom button: ', response.customButton);
           } else {
               // You can also display the image using data:
               const source = { uri: 'data:image/jpeg;base64,' + response.data };
          
               this.setState({capturedImage: response.uri, base64String: source.uri });
           }
       });
   }

   verification = () => {
       if(this.state.capturedImage == '' || this.state.capturedImage == undefined || this.state.capturedImage == null) {
           alert("Please Capture the Image");
       } else {
           const apiName = "<api-name>";
           const path = "<your path>";
          
           const init = {
               headers : {
                   'Accept': 'application/json',
                   "X-Amz-Target": "RekognitionService.SearchFacesByImage",
                   "Content-Type": "application/x-amz-json-1.1"
               },
               body : JSON.stringify({
                   Image: this.state.base64String,
                   name: this.state.username
               })
           }
          
           API.post(apiName, path, init).then(response => {
               if(JSON.stringify(response.FaceMatches.length) > 0) {
                   alert(response.FaceMatches[0].Face.ExternalImageId)
               } else {
                   alert("No matches found.")
               }
           });
       }
   }

  
  
    render() {
       if(this.state.image!=="") {
           // alert(this.state.image)
       }
       return (
           <View style={styles.MainContainer}>
               <ScrollView>
                   <Text style= {{ fontSize: 20, color: "#000", textAlign: 'center', marginBottom: 15, marginTop: 10 }}>Verify Face</Text>
              
                   {this.state.capturedImage !== "" && <View style={styles.imageholder} >
                       <Image source={{uri : this.state.capturedImage}} style={styles.previewImage} />
                   </View>}

                   <TouchableHighlight style={[styles.buttonContainer, styles.captureButton]} onPress={this.captureImageButtonHandler}>
                       <Text style={styles.buttonText}>Capture Image</Text>
                   </TouchableHighlight>

                   <TouchableHighlight style={[styles.buttonContainer, styles.verifyButton]} onPress={this.verification}>
                       <Text style={styles.buttonText}>Verify</Text>
                   </TouchableHighlight>
               </ScrollView>
           </View>
       );
   }
}

const styles = StyleSheet.create({
   container: {
     flex: 1,
     backgroundColor: 'white',
     alignItems: 'center',
     justifyContent: 'center',
   },
   buttonContainer: {
     height:45,
     flexDirection: 'row',
     alignItems: 'center',
     justifyContent: 'center',
     marginBottom:20,
     width:"80%",
     borderRadius:30,
     marginTop: 20,
     marginLeft: 5,
   },
   captureButton: {
     backgroundColor: "#337ab7",
     width: 350,
   },
   buttonText: {
     color: 'white',
     fontWeight: 'bold',
   },
   verifyButton: {
     backgroundColor: "#C0C0C0",
     width: 350,
     marginTop: 5,
   },
   imageholder: {
     borderWidth: 1,
     borderColor: "grey",
     backgroundColor: "#eee",
     width: "50%",
     height: 150,
     marginTop: 10,
     marginLeft: 90,
     flexDirection: 'row',
     alignItems:'center'
   },
   previewImage: {
     width: "100%",
     height: "100%",
   }
});

export default Verification;

In the above code there are two buttons one (Capture image) for capturing the face that needs to be verified and another (Verify) for verifying the captured face if it is registered or not. When a user clicks on Verify button verification() method will be called in which we make a POST request (invoking searchFace lambda function via API gateway).

Now we are having two screens one for Registration and another for Verification.Let’s add navigation between two screens using react-navigation. The first step is to install react-navigation in your project:

npm install --save react-navigation

The second step is to install react-native-gesture-handler:

yarn add react-native-gesture-handler
# or with npm
# npm install --save react-native-gesture-handler

Now we need to link our react-native with react-native-gesture-handler:

react-native link react-native-gesture-handler

After that go back to your App.js file and replace it with the below code:

import React, {Component} from 'react';
import {View, Text, TouchableHighlight, StyleSheet} from 'react-native';
import Registration from './Registration';
import {createStackNavigator, createAppContainer} from 'react-navigation';
import Verification from './Verification';

class HomeScreen extends React.Component {
   render() {
       return (
           <View style={{ flex: 1, alignItems: "center" }}>
               <Text style= {{ fontSize: 30, color: "#000", marginBottom: 50, marginTop: 100 }}>Register Face ID</Text>
               <TouchableHighlight style={[styles.buttonContainer, styles.button]} onPress={() => this.props.navigation.navigate('Registration')}>
                   <Text style={styles.buttonText}>Registration</Text>
               </TouchableHighlight>
               <TouchableHighlight style={[styles.buttonContainer, styles.button]} onPress={() => this.props.navigation.navigate('Verification')}>
                   <Text style={styles.buttonText}>Verification</Text>
               </TouchableHighlight>
           </View>
       );
   }
}

const MainNavigator = createStackNavigator(
   {
       Home: {screen: HomeScreen},
       Registration: {screen: Registration},
       Verification: {screen: Verification}
   },
   {
       initialRouteName: 'Home',
   }
);

const AppContainer = createAppContainer(MainNavigator);

export default class App extends Component {
   render() {
       return <AppContainer />;
   }
}

const styles = StyleSheet.create({
   buttonContainer: {
       height:45,
       flexDirection: 'row',
       alignItems: 'center',
       justifyContent: 'center',
       marginBottom:20,
       width:"80%",
       borderRadius:30,
       marginTop: 20,
       marginLeft: 5,
   },
   button: {
       backgroundColor: "#337ab7",
       width: 350,
       marginTop: 5,
   },
   buttonText: {
       color: 'white',
       fontWeight: 'bold',
   },
})

Now reload your app and you could see your home screen as below:

When user clicks on Registration button it navigates to Registration screen as below:

When user clicks on Verification button it navigates to Verification screen as below:

Now let’s check the verification process.

Step 1: Navigate to Verification screen.

Step 2: Capture the registered image(face).

Step 3: Click on the verify button.

If everything is fine then you will receive an alert message with the face name as below:

If there are no face matches with the captured face then the user receives an alert message as “No matches found”.

Thanks for the read, I hope it was useful.

This story is authored by Venu Vaka. Venu is a software engineer and machine learning enthusiast.

Creating an Automated Data Engineering Pipeline for Batch Data in Machine Learning

A common use case in Machine Learning life cycle is to have access to the latest training data so as to prevent model deterioration. A lot of times data scientists find it cumbersome to manually export data from data sources such as relational databases or NoSQL data stores or even distributed data. This necessitates automating the data engineering pipeline in Machine Learning. In this post, we will describe how to set up this pipeline for batch data. This workflow is orchestrated via Airflow and can be set up to run at regular intervals: such as hourly, daily, weekly, etc depending on the specific business requirements.

Quick note – In case you are interested in building a real time data engineering pipeline for ML, please look at this post.

In this use case, we are going to export MongoDB data into Google BigQuery via Cloud Storage. The updated data in BigQuery is then made available in Jupyter Notebook as a Pandas Dataframe for downstream model building and analytics. As the pipeline automates the data ingestion and preprocessing, the data scientists always have access to the latest batch data in their Jupyter Notebooks hosted on Google AI Platform. 

We have a MongoDB service running in an instance and we have Airflow and mongoexport running on docker on another instance. Mongoexport is a utility that produces a JSON or CSV export of data stored in MongoDB. Now the data in MongoDB shall be extracted and transformed using mongoexport and loaded into CloudStorage. Airflow is used to schedule and orchestrate these exports. Once the data is available in CloudStorage it could be queried in BigQuery. We then get this data from BigQuery to Jupyter Notebook. Following is a step by step sequence of steps to set up this data pipeline.

You can create an instance in GCP by going to Compute Engine. Click on create instance.

Install.sh:

sudo apt-get update
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
sudo usermod -aG docker $USER
sudo apt-get install -y python-pip
export AIRFLOW_HOME=~/airflow
sudo pip install apache-airflow
sudo pip install apache-airflow[postgres,s3]
airflow initdb
airflow webserver -p 8080 -D
airflow scheduler -D
sudo docker pull mongo
sudo docker run --name mongo_client -d mongo

Please run the install.sh file using ./install.sh command (please make sure file is executable), which would install Docker, Airflow, pulls Mongo image and runs the mongo image in a container named mongo_client.

After installation, for Airflow webUIhttp://<public-ip-instance>:8080 (You may need to open port 8080 in the network just for your public IP)


Please make sure the Google service account in the running instance must have permissions for accessing Big Query and Cloud Storage. After installation, add the Airflow job Python file (mongo-export.py) inside the airflow/dags folder.

Before running the Python file, please make sure that you create Dataset and create the table in BigQuery. Also change the appropriate values for the MongoDB source database, MongoDB source table, Cloud Storage destination bucket and BigQuery destination dataset in the Airflow job Python file (mongo-export.py). Big Query destination table name is the same as the source table in Mongo DB. 

Mongo-export.py:

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import json
from pandas.io.json import json_normalize

# Following are default arguments which could be overridden
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(0),
    'email': ['airflow@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

bucket_name = '<Your_Bucket>'
db_name = '<Database_Name>'
dataset = '<Dataset_Name>'
table_name = '<Table_Name>'


time_stamp = datetime.now()
cur_date = time_stamp.strftime("%Y-%m-%d")

# It will flatten the nested json
def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

def convert_string(y):
    string_type = {}

    def convert(x, name=''):
        if type(x) is dict:
            for a in x:
                convert(str(x[a]), name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            string_type[name[:-1]] = x

    convert(y)
    return string_type


def json_flat():
    lines = [line.rstrip('\n') for line in open('/home/dev/'+ table_name + '-unformat.json')]
    flat_list = []
    for line in lines:
        line = line.replace("\"$", "\"")
        line = json.loads(line)
        try:
            flat_list.append(json.dumps(convert_string(flatten_json(line))))
        except Exception as e:
            print(e)
    flatted_json = '\n'.join(i for i in flat_list)

    with open('/home/dev/' + table_name + '.json', 'a') as file:
        file.write(flatted_json)
    return flatted_json 

dag = DAG('mongoexport-daily-gcs-bq', default_args=default_args, params = {'cur_date': cur_date, 'db_name': db_name, 'table_name': table_name, 'dataset': dataset, 'bucket_name': bucket_name})
#exports provide a table data into docker container 
t1 = BashOperator(
    task_id='mongoexport_to_container',
    bash_command='sudo docker exec -i mongo_client sh -c "mongoexport --host=<instance_public_ip> --db {{params.db_name}} --collection {{params.table_name}} --out {{params.table_name}}-unformat.json"',
    dag=dag)

# copies exported file into instance

t2 = BashOperator(
    task_id='cp_from_container_instance',
    bash_command='sudo docker cp mongo_client:/{{params.table_name}}-unformat.json /home/dev/',
    dag=dag)

t3 = PythonOperator(
    task_id='flattening_json',
    python_callable=json_flat,
    dag=dag)
# copies the flatten data from cloud storage
t4 = BashOperator(
    task_id='cp_from_instance_gcs',
    bash_command='gsutil cp /home/dev/{{params.table_name}}.json gs://{{params.bucket_name}}/raw/{{params.table_name}}/date={{params.cur_date}}/',
    dag=dag)
# 
t5 = BashOperator(
    task_id='cp_from_instance_gcs_daily_data',
    bash_command='gsutil cp /home/dev/{{params.table_name}}.json gs://{{params.bucket_name}}/curated/{{params.table_name}}/',
    dag=dag)

# removes the existing bigquery table
t6 = BashOperator(
    task_id='remove_bq_table',
    bash_command='bq rm -f {{params.dataset}}.{{params.table_name}}',
    dag=dag)
# creates a table in bigquery
t7 = BashOperator(
    task_id='create_bq_table',
    bash_command='bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON {{params.dataset}}.{{params.table_name}} gs://{{params.bucket_name}}/curated/{{params.table_name}}/{{params.table_name}}.json',
    dag=dag)
# removes data from container
t8 = BashOperator(
    task_id='remove_file_from_container',
    bash_command='sudo docker exec -i mongo_client sh -c "rm -rf {{params.table_name}}*.json"',
    dag=dag)
# removes data from instance
t9 = BashOperator(
    task_id='remove_file_from_instance',
    bash_command='rm -rf /home/dev/{{params.table_name}}*.json',
    dag=dag)

t1 >> t2
t2 >> t3
t3 >> [t4, t5]
[t4, t5] >> t6
t6 >> t7
t7 >> [t8, t9]

Then run the python file using python <file-path>.py  

(example: python airflow/dags/mongo-export.py).

After running the python file, the dag name shows in Airflow webUI. And you could trigger the dag manually. Please make sure toggle button is in ON status

Once the job completes, the data is stored in the bucket and also available in the destination table in BigQuery. You could see the table is created in BigQuery. Click on querytable to perform SQL operations and you could see your results in the preview tab at the bottom.

Now, you could access the data in Jupyter Notebook from BigQuery. Search for notebook in GCP console. 

Run the below commands in Jupyter Notebook.

from google.cloud import bigquery
client = bigquery.Client()
sql = """
SELECT * FROM 
`<project-name>.<dataset-name>.<table-name>`
"""
df = client.query(sql).to_dataframe()
df.head(10)

This loads the BigQuery data into Pandas dataframe and can be used for model creation as required. Later when the data pipeline is run as per schedule, the refreshed data would automatically be available in this Jupyter notebook via this SQL query.

Hope this helps you to automate your batch Data Engineering pipeline for Machine Learning. 

This story is co-authored by Santosh and Subbareddy. Santosh is an AWS Cloud Engineer and Subbareddy is a Big Data Engineer.

Create a Language Translation Mobile App using React Native and Google APIs

In this blog, we are going to learn how to create a simple React Native based Language Translation Android app with Speech to Text and Text to Speech capabilities powered by Google APIs.

Installing dependencies:

Go to React Native Docs, select React Native CLI Quickstart and select your Development OS and Target OS -> Android, as we are going to build an Android application.

Follow the docs for installing dependencies and create a new React Native Application. Use the command line interface to generate a new React Native project called “Translator“:

react-native init Translator

You should see a folder named Translator created. Now open Translator folder with your favourite code editor and create a file called Translator.js. We need an input box for text that needs to be translated and another output section to display the translated text. We also need a select box that lists different languages to choose from for translation. Let’s create a json file, call it languages.json.

Go to languages.json file and copy the code below:

{
   "auto": "Auto Detect",
   "af": "Afrikaans",
   "sq": "Albanian",
   "am": "Amharic",
   "ar": "Arabic",
   "hy": "Armenian",
   "az": "Azerbaijani",
   "eu": "Basque",
   "be": "Belarusian",
   "bn": "Bengali",
   "bs": "Bosnian",
   "bg": "Bulgarian",
   "ca": "Catalan",
   "ceb": "Cebuano",
   "ny": "Chichewa",
   "zh-cn": "Chinese Simplified",
   "zh-tw": "Chinese Traditional",
   "co": "Corsican",
   "hr": "Croatian",
   "cs": "Czech",
   "da": "Danish",
   "nl": "Dutch",
   "en": "English",
   "eo": "Esperanto",
   "et": "Estonian",
   "tl": "Filipino",
   "fi": "Finnish",
   "fr": "French",
   "fy": "Frisian",
   "gl": "Galician",
   "ka": "Georgian",
   "de": "German",
   "el": "Greek",
   "gu": "Gujarati",
   "ht": "Haitian Creole",
   "ha": "Hausa",
   "haw": "Hawaiian",
   "iw": "Hebrew",
   "hi": "Hindi",
   "hmn": "Hmong",
   "hu": "Hungarian",
   "is": "Icelandic",
   "ig": "Igbo",
   "id": "Indonesian",
   "ga": "Irish",
   "it": "Italian",
   "ja": "Japanese",
   "jw": "Javanese",
   "kn": "Kannada",
   "kk": "Kazakh",
   "km": "Khmer",
   "ko": "Korean",
   "ku": "Kurdish (Kurmanji)",
   "ky": "Kyrgyz",
   "lo": "Lao",
   "la": "Latin",
   "lv": "Latvian",
   "lt": "Lithuanian",
   "lb": "Luxembourgish",
   "mk": "Macedonian",
   "mg": "Malagasy",
   "ms": "Malay",
   "ml": "Malayalam",
   "mt": "Maltese",
   "mi": "Maori",
   "mr": "Marathi",
   "mn": "Mongolian",
   "my": "Myanmar (Burmese)",
   "ne": "Nepali",
   "no": "Norwegian",
   "ps": "Pashto",
   "fa": "Persian",
   "pl": "Polish",
   "pt": "Portuguese",
   "ma": "Punjabi",
   "ro": "Romanian",
   "ru": "Russian",
   "sm": "Samoan",
   "gd": "Scots Gaelic",
   "sr": "Serbian",
   "st": "Sesotho",
   "sn": "Shona",
   "sd": "Sindhi",
   "si": "Sinhala",
   "sk": "Slovak",
   "sl": "Slovenian",
   "so": "Somali",
   "es": "Spanish",
   "su": "Sundanese",
   "sw": "Swahili",
   "sv": "Swedish",
   "tg": "Tajik",
   "ta": "Tamil",
   "te": "Telugu",
   "th": "Thai",
   "tr": "Turkish",
   "uk": "Ukrainian",
   "ur": "Urdu",
   "uz": "Uzbek",
   "vi": "Vietnamese",
   "cy": "Welsh",
   "xh": "Xhosa",
   "yi": "Yiddish",
   "yo": "Yoruba",
   "zu": "Zulu"
}

Translator.js (modify file), copy the code below:

import React, { Component } from 'react';
import { View, TextInput, StyleSheet, TouchableOpacity, TouchableHighlight, Text, Picker, Image } from 'react-native';
import Languages from './languages.json';

export default class Translator extends Component {

   constructor(props) {
       super(props);
       this.state = {
           languageFrom: "",
           languageTo: "",
           languageCode: 'en',
           inputText: "",
           outputText: "",
           submit: false,
       };
   }

   render() {
       return (
           <View style = {styles.container}>
               <View style={styles.input}>
                   <TextInput
                       style={{flex:1, height: 80}}
                       placeholder="Enter Text"
                       underlineColorAndroid="transparent"
                       onChangeText = {inputText => this.setState({inputText})}
                       value={this.state.inputText}
                   />
               </View>

               <Picker
               selectedValue={this.state.languageTo}
               onValueChange={ lang => this.setState({languageTo: lang, languageCode: lang})}
               >
                   {Object.keys(Languages).map(key => (
                       <Picker.Item label={Languages[key]} value={key} />
                   ))}
               </Picker>

               <View style = {styles.output}>
                  {/* output text displays here.. */}
               </View>
               <TouchableOpacity
                   style = {styles.submitButton}
                   onPress = {this.handleTranslate}
               >
                   <Text style = {styles.submitButtonText}> Submit </Text>
               </TouchableOpacity>
           </View>
       )
   }
}

const styles = StyleSheet.create({
   container: {
       paddingTop: 53
   },
   input: {
       flexDirection: 'row',
       justifyContent: 'center',
       alignItems: 'center',
       backgroundColor: '#fff',
       borderWidth: .5,
       borderColor: '#000',
       // height: 40,
       borderRadius: 5 ,
       margin: 10
   },
   output: {
       flexDirection: 'row',
       justifyContent: 'center',
       alignItems: 'center',
       backgroundColor: '#fff',
       borderWidth: .5,
       borderColor: '#000',
       borderRadius: 5 ,
       margin: 10,
       height: 80,
   },
   submitButton: {
       backgroundColor: '#7a42f4',
       padding: 10,
       margin: 15,
       borderRadius: 5 ,
       height: 40,
   },
   submitButtonText:{
       color: 'white'
   },
})

Now import your Translator.js in to App.js file.
Replace your App.js file with below code

import React, {Component} from 'react';
import {View} from 'react-native';
import Translator from './Translator';

export default class App extends Component {
   render() {
       return (
       <View>
           <Translator />
       </View>
       );
   }
}

Preparing the Android device

You will need an Android device to run your React Native Android app. This can be either a physical Android device, or more commonly, you can use an Android Virtual Device (AVD) which allows you to emulate an Android device on your computer (using Android Studio).

Either way, you will need to prepare the device to run Android apps for development.

Using a physical device

If you have a physical Android device, you can use it for development in place of an AVD by connecting it to your computer using a USB cable and following the instructions here.

If you are using virtual device follow this link.

Now go to command line and run react-native run-android inside your React Native app directory:

cd Translator
react-native run-android

If everything is set up correctly, you should see your new app running in your physical device or Android emulator shortly as below.

That’s great. We got the basic UI for our Translator app. Now we need to translate the input text into the selected language on submit. In React Native we have a library called react-native-power-translator for translating the text.

Let’s install the react-native-power-translator library. Go to the project root directory in command line and run the below command:

npm i react-native-power-translator --save

Usage:

import { PowerTranslator, ProviderTypes, TranslatorConfiguration, TranslatorFactory } from 'react-native-power-translator';

//Example
TranslatorConfiguration.setConfig('Provider_Type', 'Your_API_Key','Target_Language', 'Source_Language');

//Fill with your own details
TranslatorConfiguration.setConfig(ProviderTypes.Google, 'xxxx','fr');
  • PowerTranslator: a simple component to translate your texts.
  • ProviderTypes: type of cloud provider you want to use. There are two providers you can specify. ProviderTypes.Google for Google translate and ProviderTypes.Microsoft for Microsoft translator text cloud service.
  • TranslatorFactory: It returns a suitable translator instance, based on your configuration.
  • TranslatorConfiguration: It is a singleton class that keeps the translator configuration.

Now add the following code in your Translator.js file:

In the above code I’m using Google provider. You can use either Google or Microsoft provider.

Save all the files and run your app in the command line again and you can see a working app with translates text from one language to another as below.

import React, { Component } from 'react';
...
...
import { PowerTranslator, ProviderTypes, TranslatorConfiguration, TranslatorFactory } from 'react-native-power-translator';

export default class Translator extends Component {
...
...
render() {
       TranslatorConfiguration.setConfig(ProviderTypes.Google,’XXXX’, this.state.languageCode);
       return (
             ...
             ...
             ...
             <View style = {styles.output}>
                  {/* output text displays here.. */}
              {this.state.submit && <PowerTranslator  text={this.state.inputText} />}
              </View>

             ...
...
    
}
}

In the below image you can see the text that converted from English to French.

In Android devices you can download different language keyboards. So that you can translate your local language to other languages.

In Android devices you can download different language keyboards. So that you can translate your local language to other languages.

For speech to text we have a library called react-native-android-voice. Let’s install this library in our project.
Go to command line and navigate to project root directory and run the below command:

npm install --save react-native-android-voice

After installing successfully please follow the steps in this link for linking the library to your android project.

Once you completed linking libraries to your Android project, let’s start implementing it in our Translator.js file.

Let’s add a mic icon in our input box. When user taps on mic icon the speech feature will be enabled, there is a library called react-native-vector-icons. For installation follow the steps in this link.

In this project I’m using Ionicons icons, you can change it in iconFontNames in your android/app/build.gradle file as:

project.ext.vectoricons = [
   iconFontNames: [ 'Ionicons.ttf' ] // Name of the font files you want to copy
]

Now add the following code in Translator.js file.

import React, { Component } from 'react';
...
...
import Icon from "react-native-vector-icons/Ionicons";
import SpeechAndroid from 'react-native-android-voice';

export default class Translator extends Component {
constructor(props) {
      super(props);
      this.state = {
          languageFrom: "",
          ....
          ....
          micOn: false, //Add this
      };
      this._buttonClick = this._buttonClick.bind(this); //Add this
  }

...
async _buttonClick(){
       await this.setState({micOn: true})
       try{
           var spokenText = await SpeechAndroid.startSpeech("", SpeechAndroid.ENGLISH);
           await this.setState({inputText: spokenText});
           await ToastAndroid.show(spokenText , ToastAndroid.LONG);
       }catch(error){
           switch(error){
               case SpeechAndroid.E_VOICE_CANCELLED:
                   ToastAndroid.show("Voice Recognizer cancelled" , ToastAndroid.LONG);
                   break;
               case SpeechAndroid.E_NO_MATCH:
                   ToastAndroid.show("No match for what you said" , ToastAndroid.LONG);
                   break;
               case SpeechAndroid.E_SERVER_ERROR:
                   ToastAndroid.show("Google Server Error" , ToastAndroid.LONG);
                   break;
           }
       }
       this.setState({micOn: false})
   }

render() {
       TranslatorConfiguration.setConfig(ProviderTypes.Google,'XXXX', this.state.languageCode);
       return (
             <View style = {styles.container}>
              <View style={styles.input}>
                  <TextInput
                      ...
                      ...
                      ...
                  />
                  <TouchableOpacity onPress={this._buttonClick}>
                       {this.state.micOn ? <Icon size={30} name="md-mic" style={styles.micStyle}/> : <Icon size={30} name="md-mic-off" style={styles.micStyle}/>}
                   </TouchableOpacity>
              </View>
...
...
</View>
    )
}
}

const styles = StyleSheet.create({
  container: {
      paddingTop: 53
  },
...
...
...
  micStyle: {
      padding: 10,
      margin: 5,
      alignItems: 'center'
  }
})

After adding the code correctly, save all the changes and run your app. Now you can see a mic icon in the text input box which allows speech to text feature.

In the above code we are calling a function called _buttonClick() which contains speech to text logic. This will automatically start recognizing and adjusting for the English Language. You can use different languages for speech, you can check here for more information.

Now we successfully implemented speech to text to our Translator app. Let’s add text to speech feature which will turn the translated text into speech. For that we have a library called react-native-tts which converts text to speech.

Install react-native-tts in our project. Go to the command line and navigate to project root directory and run the following command:

npm install --save react-native-tts
react-native link react-native-tts

First command will install the library.
Second command will link the library to your android project.

Now add the following code in your Translator.js file

import React, { Component } from 'react';
...
...
import Icon from "react-native-vector-icons/Ionicons";
import SpeechAndroid from 'react-native-android-voice';

export default class Translator extends Component {
constructor(props) {
      super(props);
      this.state = {
          languageFrom: "",
          ...
          ...
          micOn: false, //Add this
      };
      this._buttonClick = this._buttonClick.bind(this); //Add this
  }


handleTranslate = () => {
       this.setState({submit: true})
       const translator = TranslatorFactory.createTranslator();
       translator.translate(this.state.inputText).then(translated => {
           // alert(translated)
           Tts.getInitStatus().then(() => {
               Tts.speak(translated);
           });
           Tts.stop();
       });
   }
...

render() {
         ...
    )
}
}

In the above code we have added the text to speech logic in handleTranslate function that called when submit button clicked.

Now our final Translator.js file will look like below:

import React, { Component } from 'react';
import { PowerTranslator, ProviderTypes, TranslatorConfiguration, TranslatorFactory } from 'react-native-power-translator';
import { View, TextInput, StyleSheet, TouchableOpacity, TouchableHighlight, Text, Picker, Image } from 'react-native';
import Icon from "react-native-vector-icons/Ionicons";
import Tts from 'react-native-tts';
import Languages from './languages.json';
import SpeechAndroid from 'react-native-android-voice';

export default class Translator extends Component {

   constructor(props) {
       super(props);
       this.state = {
           languageFrom: "",
           languageTo: "",
           languageCode: 'en',
           inputText: "",
           outputText: "",
           submit: false,
           micOn: false,
       };
       this._buttonClick = this._buttonClick.bind(this);
   }
   handleTranslate = () => {
       this.setState({submit: true})
       const translator = TranslatorFactory.createTranslator();
       translator.translate(this.state.inputText).then(translated => {
           Tts.getInitStatus().then(() => {
               Tts.speak(translated);
           });
           Tts.stop();
       });
   }
   async _buttonClick(){
       await this.setState({micOn: true})
       try{
           var spokenText = await SpeechAndroid.startSpeech("", SpeechAndroid.DEFAULT);
           await this.setState({inputText: spokenText});
           await ToastAndroid.show(spokenText , ToastAndroid.LONG);
       }catch(error){
           switch(error){
               case SpeechAndroid.E_VOICE_CANCELLED:
                   ToastAndroid.show("Voice Recognizer cancelled" , ToastAndroid.LONG);
                   break;
               case SpeechAndroid.E_NO_MATCH:
                   ToastAndroid.show("No match for what you said" , ToastAndroid.LONG);
                   break;
               case SpeechAndroid.E_SERVER_ERROR:
                   ToastAndroid.show("Google Server Error" , ToastAndroid.LONG);
                   break;
           }
       }
       this.setState({micOn: false})
   }

   render() {
       TranslatorConfiguration.setConfig(ProviderTypes.Google, 'XXXXXXXXX', this.state.languageCode);
       return (
           <View style = {styles.container}>
               <View style={styles.input}>
                   <TextInput
                       style={{flex:1, height: 80}}
                       placeholder="Enter Text"
                       underlineColorAndroid="transparent"
                       onChangeText = {inputText => this.setState({inputText})}
                       value={this.state.inputText}
                   />
                   <TouchableOpacity onPress={this._buttonClick}>
                       {this.state.micOn ? <Icon size={30} name="md-mic" style={styles.ImageStyle}/> : <Icon size={30} name="md-mic-off" style={styles.ImageStyle}/>}
                   </TouchableOpacity>
               </View>

               <Picker
               selectedValue={this.state.languageTo}
               onValueChange={ lang => this.setState({languageTo: lang, languageCode: lang})}
               >
                   {Object.keys(Languages).map(key => (
                       <Picker.Item label={Languages[key]} value={key} />
                   ))}
               </Picker>

               <View style = {styles.output}>
                   {this.state.submit && <PowerTranslator text={this.state.inputText} />}
                   {/* onTranslationEnd={this.textToSpeech} */}
               </View>
               <TouchableOpacity
                   style = {styles.submitButton}
                   onPress = {this.handleTranslate}
               >
                   <Text style = {styles.submitButtonText}> Submit </Text>
               </TouchableOpacity>
           </View>
       )
   }
}

const styles = StyleSheet.create({
   container: {
       paddingTop: 53
   },
   input: {
       flexDirection: 'row',
       justifyContent: 'center',
       alignItems: 'center',
       backgroundColor: '#fff',
       borderWidth: .5,
       borderColor: '#000',
       // height: 40,
       borderRadius: 5 ,
       margin: 10
   },
   output: {
       flexDirection: 'row',
       justifyContent: 'center',
       alignItems: 'center',
       backgroundColor: '#fff',
       borderWidth: .5,
       borderColor: '#000',
       borderRadius: 5 ,
       margin: 10,
       height: 80,
   },
   ImageStyle: {
       padding: 10,
       margin: 5,
       alignItems: 'center'
   },
   submitButton: {
       backgroundColor: '#7a42f4',
       padding: 10,
       margin: 15,
       borderRadius: 5 ,
       height: 40,
   },
   submitButtonText:{
       color: 'white'
   },
})

Make sure you have replaced ‘XXXXXX’ with your Google/Microsoft API-Key in TranslatorConfiguration in render method.

That’s it. Now we have a Language Translator, Speech to Text, Text to Speech features in our Translator application. We are ready to go now. Reload / Run your app and you can see a fully functional app.

When user taps on mic icon, an Android speech recognizer popup will be displayed as below.

If user didn’t speak or google doesn’t recognize the speech then it shows up as below:

Once Google recognizes speech then select a language to which you need to translate to and click the submit button, so that you would receive a translated text as speech.

That’s it folks!

This story is authored by Venu Vaka. Venu is a software engineer and machine learning enthusiast.

Linear regression using Apache Spark MLlib

What is linear Regression?

Wikipedia states – In statistics, linear regression is a linear approach to modeling the relationship between  dependent variable and one or more  independent variables.

Linear regression is a basic and commonly used type of predictive analysis.

Back to school math, every straight line can be represented by the equation: y = mx + b, where y is dependent variable and X is the independent variable on which y depends. 

How can we use regression for a real life use case?! Let’s take an example – what if I have data of past 5-10 years on the quantity of wheat produced annually. With Linear regression, I will be able to predict what the wheat production would be this year or a year from now.

Why is prediction so important? It helps us plan and be prepared.  It helps us deal with unforeseen situations. In the context of above example, a country will know the quantity of wheat it can export/import in a year.  It means, if global demand seems much lower than the quantity we foresee producing,  we can help farmers choose some other crop, since less demand means a bare minimum selling rate.

Some additional use cases:

  1. A consistent increase in Blood pressure and sugar levels, Is the patient heading towards heart attack?
  2. A distinctive seismic activity, are we in for a tsunami /earth quake?
  3. Inward migration of birds increasing on yearly basis. Are a certain species of trees responsible for alluring birds?
  4. Will a certain steel stock move up or down this year?

These are all some basic uses cases of Linear regression model. We call it regression, because we will be predicting continuous values (as opposed to a yes or no result). Graphically, a linear equation (with one dependent and one independent variable) looks similar to this

So, for every X , we can derive the value of Y from the equation.

What happens if Y is dependent on , not just one independent variable (X) but few more variables.  The graph above will have not just X,Y axis but Z axis too to represent the second independent variable. More than two independent variables are hard to depict graphically. But, can be represented quite easily with equation as

y=   β + β1×1 + β2×2….

where β 1,β2…βn are coefficients of X1, X2,X3

β  is the y-intercept

What this implies for our wheat production example is:

y(total wheat produced in a year) =  β +  β1*Total available land in acres + β2 * rainfall received for that year +  β3 * fertiliser availability in the market….. Here, rainfall received, fertiliser availability are the added independent variables apart from the size of land in acres.

Where does Machine Learning come into picture?

In the above equation – β1, β2, β3…are all coefficients whose value we need to compute to form the linear equation.  This is where learning happens. Based on past data, ML learns the values of these coefficients through a number of iterations.

How?

To start with, we feed in data of past few years to ML packages, this is called training data, since it helps train the ML model. Through a large chunk of data and a number of iterations, the model learns the values of each of the coefficients.

We can then evaluate the model, ML packages offers a lot of functions and related parameters to know how the model has performed. Based on these values we decide if the model has learnt “enough” and then use this on the data we need to predict for (test data).

What is the error people talk about in ML?

If all data points (Y1,Y2,Y3….) form a perfect linear line as shown above, we can derive an exact output (prediction) we are looking. But, in the real world, this is not the case. The data points do not exactly form a line.  They are a bit scattered on the graph. So what do we do?  We make a line in such a way that its at a least possible distance from the points as shown below:

this is where root mean square error or any other error reduction method is used. So, we use the chosen method, and come out with a line (which best depicts the given data points)as output. Based on this line we predict the outcome, this is where our final values come from.

Let’s look at a very basic example of linear regression in PySpark

I have downloaded the dataset from  https://www.kaggle.com/leonbora/analytics-vidhya-loan-prediction

  1. creating a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('lr_example').getOrCreate()

2.  Importing Linear Regression packages

from pyspark.ml.regression import LinearRegression

3.  We will read the input data and see its structure

data = spark.read.csv("train.csv",inferSchema=True,header=True)
data.printSchema()
#output looks similar to this
root
 |-- Loan_ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Married: string (nullable = true)
 |-- Dependents: string (nullable = true)
 |-- Education: string (nullable = true)
 |-- Self_Employed: string (nullable = true)
 |-- ApplicantIncome: integer (nullable = true)
 |-- CoapplicantIncome: double (nullable = true)
 |-- LoanAmount: integer (nullable = true)
 |-- Loan_Amount_Term: integer (nullable = true)
 |-- Credit_History: integer (nullable = true)
 |-- Property_Area: string (nullable = true)
 |-- Loan_Status: string (nullable = true)

Machine Learning packages expect only numerical inputs and cannot accept strings. A lot of packages are present to help us transform our data to suit ML requirements. This, we will look at this aspect in another post. As of now, we will just consider two numerical fields from above data schema( ‘ApplicantIncome, CoapplicantIncome’) and we will try to predict ‘LoanAmount’ with these inputs.

4. All our input data needs to be in the form of Vectors, an input to ML packages.  So, lets get this sorted first:

from pyspark.ml.feature import (VectorAssembler, VectorIndexer)
#lets define what inputs will go into our vector and give a name for the output of it
lassembler = VectorAssembler(
    inputCols=['ApplicantIncome','CoapplicantIncome'],
    outputCol='features')

5. We will now transform our data to ML standard input

output = lassembler.transform(data)

If you have downloaded the same set, running the above command will output something similar to:

Error is pretty clear, while transforming data ML has encountered nulls in the data and is asking for a clarification of what needs to be done. We have two options, one, clean the data of null values and feed it back or tell ML packages to skip the null values. We will take the second route by adding an extra line in step 5

output = lassembler.setHandleInvalid("skip").transform(data)

Now, the code executes.  Let’s see what’s there in the “features” output

output.select("features").show()
#output looks similar to below
+-----------------+
|         features|
+-----------------+
|  [4583.0,1508.0]|
|     [3000.0,0.0]|
|  [2583.0,2358.0]|
|     [6000.0,0.0]|
|  [5417.0,4196.0]|
|  [2333.0,1516.0]|
|  [3036.0,2504.0]|
|  [4006.0,1526.0]|
|[12841.0,10968.0]|
|   [3200.0,700.0]|
|  [2500.0,1840.0]|
|  [3073.0,8106.0]|
|  [1853.0,2840.0]|
|  [1299.0,1086.0]|
|     [4950.0,0.0]|
|     [3596.0,0.0]|
|     [3510.0,0.0]|
|     [4887.0,0.0]|
|  [2600.0,3500.0]|
|     [7660.0,0.0]|
+-----------------+
only showing top 20 rows

6. Let’s now feed this input Vector and our prediction value(Y), which is ‘LoanAmount’

loan_data = output.select('features','LoanAmount')
loan_data.show()
#and Output is
+-----------------+----------+
|         features|LoanAmount|
+-----------------+----------+
|  [4583.0,1508.0]|       128|
|     [3000.0,0.0]|        66|
|  [2583.0,2358.0]|       120|
|     [6000.0,0.0]|       141|
|  [5417.0,4196.0]|       267|
|  [2333.0,1516.0]|        95|
|  [3036.0,2504.0]|       158|
|  [4006.0,1526.0]|       168|
|[12841.0,10968.0]|       349|
|   [3200.0,700.0]|        70|
|  [2500.0,1840.0]|       109|
|  [3073.0,8106.0]|       200|
|  [1853.0,2840.0]|       114|
|  [1299.0,1086.0]|        17|
|     [4950.0,0.0]|       125|
|     [3596.0,0.0]|       100|
|     [3510.0,0.0]|        76|
|     [4887.0,0.0]|       133|
|  [2600.0,3500.0]|       115|
|     [7660.0,0.0]|       104|
+-----------------+----------+
only showing top 20 rows

7. Standard practice is to divide the test data into 2 parts- train the ML model with the first and let it predict the values on second set, so we can crosscheck the performance.

#splitting train and test data into 70% and 30% of total data available
train_data,test_data = loan_data.randomSplit([0.7,0.3])

#finally, calling the linear regression package we imported
lr = LinearRegression(labelCol='LoanAmount')
#we are specifying our 'Y' by explicitly mentioned it with 'labelCol'

#fitting the model to train data set
lrModel = lr.fit(train_data)

When you run the above, you should get an output like below

The error says – “Params must be either a param map…”. Reason for this error is our dependent variable (LoanAmount) has null values and ML cannot fit a model which as null as output values.  There are lot of ways to clean this kind of data, we will not consider null values in our example.  Let’s filter out null ‘LoanAmount’ values when we read the data from csv itself like so:

data = spark.read.csv("train.csv",inferSchema=True,header=True)
#we will add a filter to remove the null values from our dependent variable
data = data.filter("LoanAmount is not NULL")
data.printSchema()

Repeat the steps above and the error will go away. So, our Linear regression Model is ready now.

8. Let’s check the residuals (residual = observed value(the value in our input) – predicted value (value predicted by the model) of Y).  Each data point will have one residual.  So, number of residuals will be equal to the number of records we fed as input.

test_results = lrModel.evaluate(test_data)
test_results.residuals.show()
#output looks similar to below
+--------------------+
|           residuals|
+--------------------+
|5.684341886080801...|
|-1.42108547152020...|
|-4.26325641456060...|
|-1.42108547152020...|
|-1.42108547152020...|
|-2.84217094304040...|
|-1.42108547152020...|
|-1.42108547152020...|
|-1.42108547152020...|
|-1.42108547152020...|
|                 0.0|
|-1.42108547152020...|
|-2.13162820728030...|
|-1.42108547152020...|
|-2.84217094304040...|
|                 0.0|
|2.842170943040400...|
|-1.42108547152020...|
|-1.42108547152020...|
|-3.55271367880050...|
+--------------------+
only showing top 20 rows

9. Voila, we can now use this model to predict output of our test data

predictions_data = test_data.select('features')
predictions = lrModel.transform(predictions_data)
predictions.show()
#output looks similar to below
+----------------+------------------+
|        features|        prediction|
+----------------+------------------+
|  [150.0,1800.0]| 102.2878796393209|
| [1299.0,1086.0]|104.99451369589991|
| [1378.0,1881.0]|113.43100472341747|
| [1500.0,1800.0]|113.66768427384854|
|[1600.0,20000.0]|292.40273753359924|
| [1782.0,2232.0]|120.26729293510716|
| [1800.0,1213.0]|110.45902065483665|
| [1820.0,1719.0]|115.57340183734365|
| [1820.0,1769.0]|116.06211641088294|
|[1836.0,33837.0]| 429.6389670546782|
|    [1880.0,0.0]| 99.27716389393049|
| [1907.0,2365.0]|122.62095931502981|
| [1926.0,1851.0]|117.75713371242067|
| [2031.0,1632.0]|116.50165979633736|
| [2130.0,6666.0]|166.53996206680586|
| [2132.0,1591.0]|116.95229182239609|
| [2137.0,8980.0]| 189.2166789246058|
|    [2221.0,0.0]|102.15161824976303|
|    [2237.0,0.0]|102.28649000839447|
| [2253.0,2033.0]|122.29249632713373|
+----------------+------------------+
only showing top 20 rows

That’s it folks.