Creating a Chatbot for Healthcare in React Native using Dialogflow

In this blog, we shall learn how to build an AI virtual assistant or a Chatbot using React Native and Dialogflow API.

Why are chatbots important?
A chatbot is a piece of software that conducts a conversation through voice-based or text-based methods. Chatbots offer companies new opportunities to improve customer engagement and operational efficiency by reducing the typical cost of customer service.


What is Dialogflow?
Dialogflow (previously known as API.AI) is a Natural Language Processing (NLP) platform which can be greatly helpful to build conversational applications for a company’s customers in various languages and also across multiple platforms. Dialogflow enables developers to create text-based and voice conversation interfaces for responding to customer queries in different languages.

Why Dialogflow?
There are different chatbot SDKs like Dialogflow, Amazon Lex, IBM Watson, Microsoft Bot Framework, etc. The reasons why we chose Dialogflow are:

  1. Dialogflow supports multiple platforms.
  2. Dialogflow works across device types such as wearables and phones.
  3. Dialogflow also supports multiple languages.

How does Dialogflow work?
In Dialogflow, the typical flow of any conversation involves these steps:

  1. The user provides an input.
  2. The Dialogflow agent parses that input and matches it to an intent.
  3. The agent returns a response to the user.

Setting up Dialogflow account:

Navigate to the console from the official website. You will be prompted to sign in with Google; go ahead and sign in. After successfully signing in, you will see the dashboard.

Before we dive into the platform and start building the bot/agent, let us learn about the terms used in Dialogflow.

After signing in, you will see a Create Agent tab. An agent is nothing but the bot that you would like to create. Give it a name of your choice and click on the Create button. After the agent is created successfully, you will see multiple tabs on the left side of the screen, such as:

  1. Intents
  2. Entities
  3. Fulfillment etc

Intents:
An Intent is a specific action that the user can invoke by using one of the defined terms in the Dialogflow console. 

For example, the user could ask “What’s the time?” or “What is today’s date?” If these phrases are defined within the console, they will be detected by Dialogflow and the intents defined under them will get triggered.

You can create an intent by clicking on create intent as shown below.

You will see some default intents already available. We can create new intents here.

Entities:
An Entity is a property that Dialogflow can use to answer the user’s request. An entity is usually a keyword within the request, such as a name, date, or time.

Dialogflow has a rich set of predefined entities and also has an option that enables the developer to define custom entities as well.

Fulfillment:
When the user provides input, Dialogflow processes it, including any entities it contains. To fulfill the user’s request, Dialogflow may need to request information from a webhook. The user input, along with the extracted entities, is sent to the webhook so that the required information can be retrieved. Once Dialogflow receives the information from the webhook, it sends the response back to the user in the desired manner.

For example, if the user wants to know about weather conditions, a webhook could be used to fetch the weather information and pass it on to the user.
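To make the idea concrete, here is a minimal sketch of what a fulfillment webhook could look like, written in Python with Flask purely for illustration (the HealthBot built later in this post only uses static responses and does not need one). The weather lookup and the geo-city parameter are hypothetical; Dialogflow V2 POSTs the matched parameters under queryResult and reads the reply text from fulfillmentText.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    # Dialogflow POSTs a JSON body; matched parameters live under queryResult.parameters
    req = request.get_json(silent=True, force=True)
    params = req.get('queryResult', {}).get('parameters', {})
    city = params.get('geo-city', 'your city')  # hypothetical @sys.geo-city parameter
    # In a real webhook you would call a weather API here; we return a canned reply instead
    reply = 'The weather in {} looks clear today.'.format(city)
    # Dialogflow reads the reply text from the fulfillmentText field
    return jsonify({'fulfillmentText': reply})

if __name__ == '__main__':
    app.run(port=8080)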

Response:
It is the content which Dialogflow sends back to the user once the user’s query is processed.

Creating a ChatBot for Health care:

Now that we have learnt about some basic terms of Dialogflow, let us start building a chatbot (in this case Healthbot) which helps the user (patient) to schedule an appointment with a specific doctor in an organization.

Let’s go ahead and create an agent first. Here we are creating an agent with the name HealthBot.

After clicking the create button, the HealthBot agent would be created. It would look like below.

You could see some default intents there. We can create our own intents here. So, let’s move forward and start creating the intents.

The intent we will be creating here is “Schedule an Appointment”.

Save the intent after creating it. In the Training Phrases section, we can add our own training phrases to train the agent.

When we add a particular training phrase, Dialogflow looks for predefined entities in the phrase; if any are found, it highlights them as shown.

Add a few other related training phrases and click on Save.

Next, in the Action and Parameters section, we can mark @sys.person, @sys.date, and @sys.time as required by checking the Required checkbox. We can also define prompts for the required fields, so that if the user does not provide any one of them, the defined prompt is shown, asking the user to provide the missing parameter.

The prompts for the entities can be defined by clicking Define prompts under Prompts. Below are the prompts for the respective entities.

Next we have to add the response in the Response section.

After receiving all the required parameters from the user, we can phrase a response as shown.

Now we have to create a front end app using React Native which would communicate with the HealthBot agent.

Let’s go to the React Native Docs, select React Native CLI Quickstart, and select the appropriate development OS and the target OS as Android, as we are going to build an Android application.

Follow the docs for installing dependencies, then create a new react native application. Use the command line interface to create a new react native project.

react-native init <project-name>

Using the commands below, you can run the app on an Android device. You should see the default welcome page.

cd <project-name>
npm install
react-native run-android

Note: If you face an issue like “Failed to install the app. Make sure you have the Android development environment set up”, traverse to the <project-name>/android folder, create a file named local.properties, and add the Android SDK path in it as shown here.

sdk.dir = Your Android SDK Path

We also need to install some dependencies using the command below.

npm install react-native-gifted-chat react-native-dialogflow --save

We are using the react-native-gifted-chat package as it provides a customizable and complete chat UI.

We are also using react-native-dialogflow so that we can bridge our app with Google Dialogflow’s SDK. 

For our app to communicate with the Dialogflow agent, we need to configure a few things. For that, create a .js file in your project root folder (in this case, env.js).
We need to configure a few values in the env.js file.

To get these values, open the agent settings by clicking on the gear icon beside the agent name on the left side of the screen, then click on the Service Account link as shown in the image.

After clicking the link, you will be shown a table called Service accounts for project “<Agent Name>”. Click on Actions and select the Create key option. A prompt will appear asking you to choose a key type. Select JSON and click Create. A JSON file will be downloaded. Copy the contents of the JSON file into env.js.

Your env.js file would look like below.

env.js

export const dialogflowConfig = {
  "type": "service_account",
  "project_id": "Health-bot",
  "private_key_id": "xxxx",
  "private_key": "-----BEGIN PRIVATE KEY-----\n xxxx\n-----END PRIVATE KEY-----\n",
  "client_email": "xxxx",
  "client_id": "xxxx",
  "auth_uri": "xxxx",
  "token_uri": "xxxx",
  "auth_provider_x509_cert_url": "xxxx",
  "client_x509_cert_url": "xxxx"
}

Now go to the <project-name> directory and open App.js. Modify its contents as below.

App.js

import React, { Component } from 'react';
import {View} from 'react-native';
import { GiftedChat } from 'react-native-gifted-chat';
import { Dialogflow_V2 } from 'react-native-dialogflow';
import { dialogflowConfig } from './env';

const BOT_USER = {
  _id: 2,
  name: 'Health Bot',
  avatar: 'https://previews.123rf.com/images/iulika1/iulika11909/iulika1190900021/129697389-medical-worker-health-professional-avatar-medical-staff-doctor-icon-isolated-on-white-background-vec.jpg'
};
class App extends Component {

  state = {
    messages: [
      {
        _id: 1,
        text: 'Hi! I am the Healthbot 🤖.\n\nHow may I help you today?',
        createdAt: new Date(),
        user: BOT_USER
      }
    ]
  };

  componentDidMount() {
    Dialogflow_V2.setConfiguration(
      dialogflowConfig.client_email,
      dialogflowConfig.private_key,
      Dialogflow_V2.LANG_ENGLISH_US,
      dialogflowConfig.project_id
    );
  }

  onSend(messages = []) {
    this.setState(previousState => ({
      messages: GiftedChat.append(previousState.messages, messages)
    }));

    let message = messages[0].text;
    Dialogflow_V2.requestQuery(
      message,
      result => this.handleGoogleResponse(result),
      error => console.log(error)
    );
  }

  handleGoogleResponse(result) {
    let text = result.queryResult.fulfillmentMessages[0].text.text[0];
    this.sendBotResponse(text);
  }

  sendBotResponse(text) {
    let msg = {
      _id: this.state.messages.length + 1,
      text,
      createdAt: new Date(),
      user: BOT_USER
    };

    this.setState(previousState => ({
      messages: GiftedChat.append(previousState.messages, [msg])
    }));
  }

  render() {
    return (
      <View style={{ flex: 1, backgroundColor: '#fff' }}>
        <GiftedChat
          messages={this.state.messages}
          onSend={messages => this.onSend(messages)}
          user={{
            _id: 1
          }}
        />
      </View>
    );
  }
}
export default App;

When the App component mounts, componentDidMount() is invoked, where we set the Dialogflow configuration as given below.


componentDidMount() {
    Dialogflow_V2.setConfiguration(
      dialogflowConfig.client_email,
      dialogflowConfig.private_key,
      Dialogflow_V2.LANG_ENGLISH_US,
      dialogflowConfig.project_id
    );
  }

When you click on send, the onSend() method is triggered: the user’s message is appended to the messages in state and a request is sent to Dialogflow using Dialogflow_V2.requestQuery. If the request is successful, the handleGoogleResponse() method is triggered.

onSend(messages = []) {
    this.setState(previousState => ({
      messages: GiftedChat.append(previousState.messages, messages)
    }));

    let message = messages[0].text;
    Dialogflow_V2.requestQuery(
      message,
      result => this.handleGoogleResponse(result),
      error => console.log(error)
    );
  }

handleGoogleResponse() extracts the text from the response and calls the sendBotResponse() method, which appends the bot’s reply to the messages in state, as shown below.

handleGoogleResponse(result) {
    let text = result.queryResult.fulfillmentMessages[0].text.text[0];
    this.sendBotResponse(text);
  }

  sendBotResponse(text) {
      let msg = {
        _id: this.state.messages.length + 1,
        text,
        createdAt: new Date(),
        user: BOT_USER
      };

      this.setState(previousState => ({
        messages: GiftedChat.append(previousState.messages, [msg])
      }));
    }

Below are the images of the app running on an Android device.

That’s it folks, we hope it was fun and useful.

This story is authored by Dheeraj Kumar and Santosh Kumar. Dheeraj is a software engineer specializing in React Native and React based frontend development. Santosh specializes in Cloud Services based development.

Machine Learning Operations (MLOps) Pipeline using Google Cloud Composer

In an earlier post, we had described the need for automating the Data Engineering pipeline for Machine Learning based systems. Today, we will expand the scope to setup a fully automated MLOps pipeline using Google Cloud Composer.

Cloud Composer

Cloud Composer is officially described as a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span across clouds and on-premises data centers. Built on the popular Apache Airflow open source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.

So let’s get on with the required steps to create this MLOps infrastructure on Google Cloud Platform.

Creating a Cloud Composer Environment

Step1: Please enable the Cloud Composer API.

Step2: Go to the Create Environment page in the GCP console. Composer is available under the Big Data section.

Step3: Click on Create to start creating a Composer environment.

Step4: Please select the service account which has the required permissions to access GCS, BigQuery, ML Engine and the Composer environment. The required roles for accessing the Composer environment are Composer Administrator and Composer Worker.
For more details about access control in the Composer environment, please see this.

Step5: Please use Python version 3 and the latest image version.

Step6: Click on create. It will take about 15-20 minutes to create the environment. Once it completes, the environment page shall look like the following.

Click on Airflow to open the Airflow WebUI. The Airflow WebUI looks as follows:

The DAGs folder is where our DAG file is stored. It is simply a folder inside the GCS bucket created by the environment. To know more about the concept of a DAG and for a general introduction to Airflow, please refer to this post.

You can see Composer-related logs in Logging.

Step7: Please add the following PyPI packages to the Composer environment.

Click on the created environment, navigate to the PYPI PACKAGES tab, and click on Edit to add packages.

The required packages are:

# to read data from MongoDB
pymongo==3.8.0
oauth2client==4.1.3
# to read data from firestore
google-cloud-firestore==1.3.0
firebase-admin==2.17.0
google-api-core==1.13.0

Create an ML model

Step1: Please create a folder structure like the following on your instance.

ml_model
├── setup.py
└── trainer
    ├── __init__.py
    └── train.py

Step2: Please place the following code in the train.py file, which will upload the model to a GCS bucket as shown below. This model will be used to create model versions as explained a bit later.

from google.cloud import bigquery
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np
from google.cloud import storage
import datetime
import json
import pickle
client = bigquery.Client()
sql = '''
SELECT *
FROM `<PROJECT_ID>.<DATASET>.<TABLENAME>`
'''

df = client.query(sql).to_dataframe()
df = df[['is_stressed', 'is_engaged', 'status']]

df['is_stressed'] = df['is_stressed'].fillna('n')
df['is_engaged'] = df['is_engaged'].fillna('n')
df['stressed'] = np.where(df['is_stressed']=='y', 1, 0)
df['engaged'] = np.where(df['is_engaged']=='y', 1, 0)
df['status'] = np.where(df['status']=='complete', 1, 0)

feature_cols = ['stressed', 'engaged']
X = df[feature_cols]
y = df.status
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
pkl_filename = "model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(logreg, file)
# Upload the model to GCS
BUCKET_NAME = '<YOUR_GCS_BUCKET>'           # bucket that stores the trained model
COMPOSER_BUCKET = '<YOUR_COMPOSER_BUCKET>'  # bucket created by the Composer environment (used below)
bucket = storage.Client().bucket(BUCKET_NAME)
file_path = datetime.datetime.now().strftime('machine_learning/models/%Y%m%d_%H%M%S')
blob = bucket.blob('{}/{}'.format(
    file_path,
    pkl_filename))
blob.upload_from_filename(pkl_filename)

file_location = 'gs://{BUCKET_NAME}/{file_path}'.format(BUCKET_NAME=BUCKET_NAME, file_path=file_path)
file_config = json.dumps({'file_location': file_location})

bucket = storage.Client().bucket(COMPOSER_BUCKET)
blob = bucket.blob('data/file_config.json')
blob.upload_from_string(file_config)
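Optionally, you can sanity-check the pickled model before relying on it in the pipeline. This is just a quick local sketch; the feature values are hypothetical.

import pickle

# Load the model that train.py pickled and run a single prediction
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

# Hypothetical input: stressed=1, engaged=0 (the two feature columns used above)
print(model.predict([[1, 0]]))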

Step3: Create an empty __init__.py file inside the trainer directory.

Step4: Please place the following code in the setup.py file. The setup.py file lists the packages required to execute the training code.

import setuptools

REQUIRED_PACKAGES = [
    'pandas-gbq==0.3.0',
    'cloudml-hypertune',
    'google-cloud-bigquery==1.14.0',
    'urllib3'
]

setuptools.setup(
    name='ml_model',
    version='1.0',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    include_package_data=True,
    description='',
)

Step5: Package the code using the following command. It creates a .tar.gz file inside the dist folder of the ml_model directory.

python3 setup.py sdist

Step6: The package name is derived from the name and version specified in the setup.py file, so the package becomes ml_model-1.0.tar.gz.
Copy the package to gs://{your-GCS-bucket}/machine_learning/. This becomes the base directory for the machine learning activities described in this post.
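For example, assuming the package was built into the default dist folder, it can be copied with a command like the following (note the packages subfolder, which matches the TRAINER_BIN path used in the DAG later in this post):

gsutil cp dist/ml_model-1.0.tar.gz gs://{your-GCS-bucket}/machine_learning/packages/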

Creating a DAG

In this use case, we have created a DAG file which exports some table data from a MongoDB instance into a GCS bucket and then creates a BigQuery table from that exported data. It then trains a model and creates a version for that model. The DAG file supports both full data extraction and daily data extraction, controlled in the code below by a variable tot_data. This variable is read from the Airflow Variables set by the user. This process is also described later in this post.

Please place the following code in the DAG file.

import airflow
from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta, datetime
from airflow.operators.python_operator import PythonOperator
import pprint
import json
import re

from pymongo import MongoClient
from google.cloud import storage
import os

from airflow import models
from mlengine_operator import MLEngineTrainingOperator, MLEngineVersionOperator

ts = datetime.now()
today = str(ts.date()) + 'T00:00:00.000Z'
yester_day = str(ts.date() - timedelta(days = 1)) + 'T00:00:00.000Z'

str_ts = ts.strftime('%Y_%m_%d_%H_%M_%S')

config = Variable.get("mongo_conf", deserialize_json=True)
host = config['host']
db_name = config['db_name']
table_name = config['table_name']
file_prefix = config['file_prefix']
bucket_name = config['bucket_name']
# file_path = file_prefix + '/' + table_name + '.json'
file_path = '{file_prefix}/{table_name}/{table_name}_{str_ts}.json'.format(file_prefix=file_prefix, str_ts=str_ts, table_name=table_name)
file_location = 'gs://' + bucket_name + '/' + file_prefix + '/' + table_name + '/' + table_name + '_*.json'
config['file_location'] = file_location
bq_dataset = config['bq_dataset']
tot_data = config['tot_data'].lower()

BUCKET_NAME = config['ml_configuration']['BUCKET_NAME']
BASE_DIR = config['ml_configuration']['BASE_DIR']
PACKAGE_NAME = config['ml_configuration']['PACKAGE_NAME']
TRAINER_BIN = os.path.join(BASE_DIR, 'packages', PACKAGE_NAME)
TRAINER_MODULE = config['ml_configuration']['TRAINER_MODULE']
RUNTIME_VERSION = config['ml_configuration']['RUNTIME_VERSION']
PROJECT_ID = config['ml_configuration']['PROJECT_ID']
MODEL_NAME = config['ml_configuration']['MODEL_NAME']

MODEL_FILE_BUCKET = config['ml_configuration']['MODEL_FILE_BUCKET']
model_file_loc = config['ml_configuration']['MODEL_FILE_LOCATION']
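# file_config.json is written by train.py in the ML model section above; its
# 'file_location' value points at the GCS folder that holds model.pkl and is used
# below as the deploymentUri when creating a new model version.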

bucket = storage.Client().bucket(MODEL_FILE_BUCKET)
blob = bucket.get_blob(model_file_loc)
file_config = json.loads(blob.download_as_string().decode("utf-8"))
export_uri = file_config['file_location']

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out
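# For example, flatten_json({'a': 1, 'b': {'c': 2}, 'd': [3, 4]})
# returns {'a': 1, 'b_c': 2, 'd_0': 3, 'd_1': 4}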

def mongoexport():
        client = storage.Client()
        bucket = client.get_bucket(bucket_name)
        blob = bucket.blob(file_path)

        client = MongoClient(host)
        db = client[db_name]
        tasks = db[table_name]
        pprint.pprint(tasks.count_documents({}))
        # if tot_data is set to 'yes' in airflow configurations, full data 
        # is processed.  
        if tot_data == 'no':
          query = {"edit_datetime": { "$gte": yester_day, "$lt": today}}
          print(query)
          data = tasks.find(query)
        else:
          data = tasks.find()
        emp_list = []
        for record in data:
                emp_list.append(json.dumps(record, default=str))
        flat_list =[]
        for data in emp_list:
                flat_list.append((flatten_json(json.loads(data))))
        data = '\n'.join(json.dumps({re.sub('[^0-9a-zA-Z_ ]+', '', str(k)).lower().replace(' ', '_'): str(v) for k, v in record.items()}) for record in flat_list)
        blob.upload_from_string(data)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG('ml_pipeline', schedule_interval=None, default_args=default_args) as dag:

    pymongo_export_op = PythonOperator(
        task_id='pymongo_export',
        python_callable=mongoexport,
        )

    update_bq_table_op = BashOperator(
        task_id='update_bq_table',
        bash_command='''
        bq rm -f {bq_dataset}.{table_name}
        bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON --ignore_unknown_values=True {bq_dataset}.{table_name} {file_location}
        '''.format(bq_dataset=bq_dataset, table_name=table_name, file_location=file_location)
        )

    date_nospecial = '{{ execution_date.strftime("%Y%m%d") }}'
    date_min_nospecial = '{{ execution_date.strftime("%Y%m%d_%H%M") }}'
    uuid = '{{ macros.uuid.uuid4().hex[:8] }}'

    training_op = MLEngineTrainingOperator(
      task_id='submit_job_for_training',
      project_id=PROJECT_ID,
      job_id='{}_{}_{}'.format(table_name, date_nospecial, uuid),
      package_uris=[os.path.join(TRAINER_BIN)],
      training_python_module=TRAINER_MODULE,
      training_args=[
          '--base-dir={}'.format(BASE_DIR),
          '--event-date={}'.format(date_nospecial),
      ],
      region='us-central1',
      runtime_version=RUNTIME_VERSION,
      python_version='3.5')

    create_version_op = MLEngineVersionOperator(
      task_id='create_version',
      project_id=PROJECT_ID,
      model_name=MODEL_NAME,
      version={
          'name': 'version_{}_{}'.format(date_min_nospecial, uuid),
          'deploymentUri': export_uri,
          'runtimeVersion': RUNTIME_VERSION,
          'pythonVersion': '3.5',
          'framework': 'SCIKIT_LEARN',
      },
      operation='create')
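    # Task ordering: export from MongoDB, rebuild the BigQuery table, submit the
    # AI Platform training job, then deploy the trained model as a new version.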

    pymongo_export_op >> update_bq_table_op >> training_op >> create_version_op

Once the file is created, please upload it to the DAGs folder. Also, please add the following plugin dependency file named mlengine_operator.py to the DAGs folder.
Place the following code in the mlengine_operator.py file.

import re

from apiclient import errors

from airflow.contrib.hooks.gcp_mlengine_hook import MLEngineHook
from airflow.exceptions import AirflowException
from airflow.operators import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.utils.log.logging_mixin import LoggingMixin

log = LoggingMixin().log


def _normalize_mlengine_job_id(job_id):

    # Add a prefix when a job_id starts with a digit or a template
    match = re.search(r'\d|\{{2}', job_id)
    if match and match.start() == 0:
        job = 'z_{}'.format(job_id)
    else:
        job = job_id

    # Clean up 'bad' characters except templates
    tracker = 0
    cleansed_job_id = ''
    for m in re.finditer(r'\{{2}.+?\}{2}', job):
        cleansed_job_id += re.sub(r'[^0-9a-zA-Z]+', '_',
                                  job[tracker:m.start()])
        cleansed_job_id += job[m.start():m.end()]
        tracker = m.end()

    # Clean up last substring or the full string if no templates
    cleansed_job_id += re.sub(r'[^0-9a-zA-Z]+', '_', job[tracker:])

    return cleansed_job_id
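# For example (hypothetical job ids):
#   _normalize_mlengine_job_id('tasks_20200101_ab12cd34') returns 'tasks_20200101_ab12cd34'
#   _normalize_mlengine_job_id('2020-01-01-job') returns 'z_2020_01_01_job'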


class MLEngineBatchPredictionOperator(BaseOperator):
   
    template_fields = [
        '_project_id',
        '_job_id',
        '_region',
        '_input_paths',
        '_output_path',
        '_model_name',
        '_version_name',
        '_uri',
    ]

    @apply_defaults
    def __init__(self,
                 project_id,
                 job_id,
                 region,
                 data_format,
                 input_paths,
                 output_path,
                 model_name=None,
                 version_name=None,
                 uri=None,
                 max_worker_count=None,
                 runtime_version=None,
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 *args,
                 **kwargs):
        super(MLEngineBatchPredictionOperator, self).__init__(*args, **kwargs)

        self._project_id = project_id
        self._job_id = job_id
        self._region = region
        self._data_format = data_format
        self._input_paths = input_paths
        self._output_path = output_path
        self._model_name = model_name
        self._version_name = version_name
        self._uri = uri
        self._max_worker_count = max_worker_count
        self._runtime_version = runtime_version
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

        if not self._project_id:
            raise AirflowException('Google Cloud project id is required.')
        if not self._job_id:
            raise AirflowException(
                'An unique job id is required for Google MLEngine prediction '
                'job.')

        if self._uri:
            if self._model_name or self._version_name:
                raise AirflowException('Ambiguous model origin: Both uri and '
                                       'model/version name are provided.')

        if self._version_name and not self._model_name:
            raise AirflowException(
                'Missing model: Batch prediction expects '
                'a model name when a version name is provided.')

        if not (self._uri or self._model_name):
            raise AirflowException(
                'Missing model origin: Batch prediction expects a model, '
                'a model & version combination, or a URI to a savedModel.')

    def execute(self, context):
        job_id = _normalize_mlengine_job_id(self._job_id)
        prediction_request = {
            'jobId': job_id,
            'predictionInput': {
                'dataFormat': self._data_format,
                'inputPaths': self._input_paths,
                'outputPath': self._output_path,
                'region': self._region
            }
        }

        if self._uri:
            prediction_request['predictionInput']['uri'] = self._uri
        elif self._model_name:
            origin_name = 'projects/{}/models/{}'.format(
                self._project_id, self._model_name)
            if not self._version_name:
                prediction_request['predictionInput'][
                    'modelName'] = origin_name
            else:
                prediction_request['predictionInput']['versionName'] = \
                    origin_name + '/versions/{}'.format(self._version_name)

        if self._max_worker_count:
            prediction_request['predictionInput'][
                'maxWorkerCount'] = self._max_worker_count

        if self._runtime_version:
            prediction_request['predictionInput'][
                'runtimeVersion'] = self._runtime_version

        hook = MLEngineHook(self._gcp_conn_id, self._delegate_to)

        # Helper method to check if the existing job's prediction input is the
        # same as the request we get here.
        def check_existing_job(existing_job):
            return existing_job.get('predictionInput', None) == \
                prediction_request['predictionInput']

        try:
            finished_prediction_job = hook.create_job(
                self._project_id, prediction_request, check_existing_job)
        except errors.HttpError:
            raise

        if finished_prediction_job['state'] != 'SUCCEEDED':
            self.log.error('MLEngine batch prediction job failed: {}'.format(
                str(finished_prediction_job)))
            raise RuntimeError(finished_prediction_job['errorMessage'])

        return finished_prediction_job['predictionOutput']


class MLEngineModelOperator(BaseOperator):
    template_fields = [
        '_model',
    ]

    @apply_defaults
    def __init__(self,
                 project_id,
                 model,
                 operation='create',
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 *args,
                 **kwargs):
        super(MLEngineModelOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._model = model
        self._operation = operation
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

    def execute(self, context):
        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)
        if self._operation == 'create':
            return hook.create_model(self._project_id, self._model)
        elif self._operation == 'get':
            return hook.get_model(self._project_id, self._model['name'])
        else:
            raise ValueError('Unknown operation: {}'.format(self._operation))


class MLEngineVersionOperator(BaseOperator):
    
    template_fields = [
        '_model_name',
        '_version_name',
        '_version',
    ]

    @apply_defaults
    def __init__(self,
                 project_id,
                 model_name,
                 version_name=None,
                 version=None,
                 operation='create',
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 *args,
                 **kwargs):

        super(MLEngineVersionOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._model_name = model_name
        self._version_name = version_name
        self._version = version or {}
        self._operation = operation
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to

    def execute(self, context):
        if 'name' not in self._version:
            self._version['name'] = self._version_name

        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)

        if self._operation == 'create':
            assert self._version is not None
            return hook.create_version(self._project_id, self._model_name,
                                       self._version)
        elif self._operation == 'set_default':
            return hook.set_default_version(self._project_id, self._model_name,
                                            self._version['name'])
        elif self._operation == 'list':
            return hook.list_versions(self._project_id, self._model_name)
        elif self._operation == 'delete':
            return hook.delete_version(self._project_id, self._model_name,
                                       self._version['name'])
        else:
            raise ValueError('Unknown operation: {}'.format(self._operation))


class MLEngineTrainingOperator(BaseOperator):
    
    template_fields = [
        '_project_id',
        '_job_id',
        '_package_uris',
        '_training_python_module',
        '_training_args',
        '_region',
        '_scale_tier',
        '_runtime_version',
        '_python_version',
        '_job_dir'
    ]

    @apply_defaults
    def __init__(self,
                 project_id,
                 job_id,
                 package_uris,
                 training_python_module,
                 training_args,
                 region,
                 scale_tier=None,
                 runtime_version=None,
                 python_version=None,
                 job_dir=None,
                 gcp_conn_id='google_cloud_default',
                 delegate_to=None,
                 mode='PRODUCTION',
                 *args,
                 **kwargs):
        super(MLEngineTrainingOperator, self).__init__(*args, **kwargs)
        self._project_id = project_id
        self._job_id = job_id
        self._package_uris = package_uris
        self._training_python_module = training_python_module
        self._training_args = training_args
        self._region = region
        self._scale_tier = scale_tier
        self._runtime_version = runtime_version
        self._python_version = python_version
        self._job_dir = job_dir
        self._gcp_conn_id = gcp_conn_id
        self._delegate_to = delegate_to
        self._mode = mode

        if not self._project_id:
            raise AirflowException('Google Cloud project id is required.')
        if not self._job_id:
            raise AirflowException(
                'An unique job id is required for Google MLEngine training '
                'job.')
        if not package_uris:
            raise AirflowException(
                'At least one python package is required for MLEngine '
                'Training job.')
        if not training_python_module:
            raise AirflowException(
                'Python module name to run after installing required '
                'packages is required.')
        if not self._region:
            raise AirflowException('Google Compute Engine region is required.')

    def execute(self, context):
        job_id = _normalize_mlengine_job_id(self._job_id)
        training_request = {
            'jobId': job_id,
            'trainingInput': {
                'scaleTier': self._scale_tier,
                'packageUris': self._package_uris,
                'pythonModule': self._training_python_module,
                'region': self._region,
                'args': self._training_args,
            }
        }

        if self._runtime_version:
            training_request['trainingInput']['runtimeVersion'] = self._runtime_version

        if self._python_version:
            training_request['trainingInput']['pythonVersion'] = self._python_version

        if self._job_dir:
            training_request['trainingInput']['jobDir'] = self._job_dir

        if self._mode == 'DRY_RUN':
            self.log.info('In dry_run mode.')
            self.log.info('MLEngine Training job request is: {}'.format(
                training_request))
            return

        hook = MLEngineHook(
            gcp_conn_id=self._gcp_conn_id, delegate_to=self._delegate_to)

        # Helper method to check if the existing job's training input is the
        # same as the request we get here.
        def check_existing_job(existing_job):
            return existing_job.get('trainingInput', None) == \
                training_request['trainingInput']

        try:
            finished_training_job = hook.create_job(
                self._project_id, training_request, check_existing_job)
        except errors.HttpError:
            raise

        if finished_training_job['state'] != 'SUCCEEDED':
            self.log.error('MLEngine training job failed: {}'.format(
                str(finished_training_job)))
            raise RuntimeError(finished_training_job['errorMessage'])

Import the variables from the composer_conf.json file into Airflow Variables.
Go to Airflow WebUI → Admin → Variables → browse to the file path, or configure the variables manually.
Please place the following in composer_conf.json:

{
  "mongo_conf": {
    "host": "mongodb://<instance-internal-ip>:27017",
    "db_name": "DBNAME",
    "table_name": "TABLENAME",
    "file_prefix": "Folder In GCS Bucket",
    "bq_dataset": "BigQuery Dataset",
    "bucket_name": "GCS Bucket",
    "tot_data": "yes",
    "ml_configuration": {
      "BUCKET_NAME": "GCS Bucket",
      "BASE_DIR": "gs://GCS Bucket/machine_learning/",
      "PACKAGE_NAME": "PACKAGE NAME FROM setup.py FILE in ML",
      "TRAINER_MODULE": "trainer.train",
      "RUNTIME_VERSION": "1.13",
      "PROJECT_ID": "GCP Project",
      "MODEL_FILE_BUCKET": "BUCKET CREATED BY Composer Environment",
      "MODEL_FILE_LOCATION": "data/MODEL LOCATION FILE",
      "MODEL_NAME": "MODEL_NAME"
    }
  }
}

Please store any configuration or credentials files that are used by Composer in the data folder of the bucket created by the Composer environment.

After configuring variables accordingly, you can see the DAG named ml_pipeline in the Airflow WebUI.

Please trigger the DAG from the Airflow WebUI. Once the DAG has run successfully, it looks like the following:

Thanks for the read and look forward to your comments.

This story is authored by PV Subbareddy. Subbareddy is a Big Data Engineer specializing in Cloud Big Data Services and the Apache Spark Ecosystem.

Creating an Automated Data Engineering Pipeline for Batch Data in Machine Learning

A common use case in the Machine Learning life cycle is to have access to the latest training data so as to prevent model deterioration. A lot of times data scientists find it cumbersome to manually export data from data sources such as relational databases, NoSQL data stores, or even distributed data. This necessitates automating the data engineering pipeline in Machine Learning. In this post, we will describe how to set up this pipeline for batch data. This workflow is orchestrated via Airflow and can be set up to run at regular intervals, such as hourly, daily, or weekly, depending on the specific business requirements.

Quick note – In case you are interested in building a real time data engineering pipeline for ML, please look at this post.

In this use case, we are going to export MongoDB data into Google BigQuery via Cloud Storage. The updated data in BigQuery is then made available in Jupyter Notebook as a Pandas Dataframe for downstream model building and analytics. As the pipeline automates the data ingestion and preprocessing, the data scientists always have access to the latest batch data in their Jupyter Notebooks hosted on Google AI Platform. 

We have a MongoDB service running on one instance, and Airflow and mongoexport running on Docker on another instance. Mongoexport is a utility that produces a JSON or CSV export of data stored in MongoDB. The data in MongoDB is extracted and transformed using mongoexport and loaded into Cloud Storage. Airflow is used to schedule and orchestrate these exports. Once the data is available in Cloud Storage, it can be queried in BigQuery. We then get this data from BigQuery to a Jupyter Notebook. Following is a step-by-step sequence to set up this data pipeline.

You can create an instance in GCP by going to Compute Engine. Click on create instance.

install.sh:

sudo apt-get update
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
sudo usermod -aG docker $USER
sudo apt-get install -y python-pip
export AIRFLOW_HOME=~/airflow
sudo pip install apache-airflow
sudo pip install apache-airflow[postgres,s3]
airflow initdb
airflow webserver -p 8080 -D
airflow scheduler -D
sudo docker pull mongo
sudo docker run --name mongo_client -d mongo

Please run the install.sh file using the ./install.sh command (make sure the file is executable). It installs Docker and Airflow, pulls the Mongo image, and runs it in a container named mongo_client.

After installation, the Airflow web UI is available at http://<public-ip-instance>:8080 (you may need to open port 8080 in the network just for your public IP).


Please make sure the Google service account on the running instance has permissions to access BigQuery and Cloud Storage. After installation, add the Airflow job Python file (mongo-export.py) inside the airflow/dags folder.

Before running the Python file, please make sure that you create the dataset and the table in BigQuery. Also change the values for the MongoDB source database, MongoDB source table, Cloud Storage destination bucket and BigQuery destination dataset in the Airflow job Python file (mongo-export.py). The BigQuery destination table name is the same as the source table name in MongoDB.

mongo-export.py:

import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
import json

# Following are default arguments which could be overridden
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': airflow.utils.dates.days_ago(0),
    'email': ['airflow@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

bucket_name = '<Your_Bucket>'
db_name = '<Database_Name>'
dataset = '<Dataset_Name>'
table_name = '<Table_Name>'


time_stamp = datetime.now()
cur_date = time_stamp.strftime("%Y-%m-%d")

# It will flatten the nested json
def flatten_json(y):
    out = {}
    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

def convert_string(y):
    string_type = {}

    def convert(x, name=''):
        if type(x) is dict:
            for a in x:
                convert(str(x[a]), name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                convert(str(a), name + str(i) + '_')
                i += 1
        else:
            string_type[name[:-1]] = x

    convert(y)
    return string_type
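# convert_string is applied after flatten_json and casts every value to a string,
# e.g. {'a': 1, 'b_c': 2} becomes {'a': '1', 'b_c': '2'}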


def json_flat():
    lines = [line.rstrip('\n') for line in open('/home/dev/'+ table_name + '-unformat.json')]
    flat_list = []
    for line in lines:
        line = line.replace("\"$", "\"")
        line = json.loads(line)
        try:
            flat_list.append(json.dumps(convert_string(flatten_json(line))))
        except Exception as e:
            print(e)
    flatted_json = '\n'.join(i for i in flat_list)

    with open('/home/dev/' + table_name + '.json', 'a') as file:
        file.write(flatted_json)
    return flatted_json 

dag = DAG('mongoexport-daily-gcs-bq', default_args=default_args, params = {'cur_date': cur_date, 'db_name': db_name, 'table_name': table_name, 'dataset': dataset, 'bucket_name': bucket_name})
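# values placed in params are rendered into the bash_command templates below as {{ params.<name> }}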
# exports the table data into the docker container
t1 = BashOperator(
    task_id='mongoexport_to_container',
    bash_command='sudo docker exec -i mongo_client sh -c "mongoexport --host=<instance_public_ip> --db {{params.db_name}} --collection {{params.table_name}} --out {{params.table_name}}-unformat.json"',
    dag=dag)

# copies exported file into instance

t2 = BashOperator(
    task_id='cp_from_container_instance',
    bash_command='sudo docker cp mongo_client:/{{params.table_name}}-unformat.json /home/dev/',
    dag=dag)

t3 = PythonOperator(
    task_id='flattening_json',
    python_callable=json_flat,
    dag=dag)
# copies the flattened data to the raw path in cloud storage
t4 = BashOperator(
    task_id='cp_from_instance_gcs',
    bash_command='gsutil cp /home/dev/{{params.table_name}}.json gs://{{params.bucket_name}}/raw/{{params.table_name}}/date={{params.cur_date}}/',
    dag=dag)
# copies the flattened data to the curated path in cloud storage
t5 = BashOperator(
    task_id='cp_from_instance_gcs_daily_data',
    bash_command='gsutil cp /home/dev/{{params.table_name}}.json gs://{{params.bucket_name}}/curated/{{params.table_name}}/',
    dag=dag)

# removes the existing bigquery table
t6 = BashOperator(
    task_id='remove_bq_table',
    bash_command='bq rm -f {{params.dataset}}.{{params.table_name}}',
    dag=dag)
# creates a table in bigquery
t7 = BashOperator(
    task_id='create_bq_table',
    bash_command='bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON {{params.dataset}}.{{params.table_name}} gs://{{params.bucket_name}}/curated/{{params.table_name}}/{{params.table_name}}.json',
    dag=dag)
# removes data from container
t8 = BashOperator(
    task_id='remove_file_from_container',
    bash_command='sudo docker exec -i mongo_client sh -c "rm -rf {{params.table_name}}*.json"',
    dag=dag)
# removes data from instance
t9 = BashOperator(
    task_id='remove_file_from_instance',
    bash_command='rm -rf /home/dev/{{params.table_name}}*.json',
    dag=dag)
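# Task graph: export from Mongo (t1) -> copy the export to the instance (t2) -> flatten (t3),
# then copy to the raw and curated GCS paths in parallel (t4, t5), rebuild the BigQuery
# table (t6, t7), and finally clean up the container and instance (t8, t9).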

t1 >> t2
t2 >> t3
t3 >> [t4, t5]
[t4, t5] >> t6
t6 >> t7
t7 >> [t8, t9]

Then run the Python file using python <file-path>.py (for example: python airflow/dags/mongo-export.py).

After running the Python file, the DAG name shows up in the Airflow web UI, and you can trigger the DAG manually. Please make sure the toggle button is in the ON status.

Once the job completes, the data is stored in the bucket and is also available in the destination table in BigQuery. You can see that the table is created in BigQuery. Click on Query Table to perform SQL operations, and you can see your results in the preview tab at the bottom.

Now, you can access the BigQuery data in a Jupyter Notebook. Search for Notebook in the GCP console.

Run the below commands in Jupyter Notebook.

from google.cloud import bigquery
client = bigquery.Client()
sql = """
SELECT * FROM 
`<project-name>.<dataset-name>.<table-name>`
"""
df = client.query(sql).to_dataframe()
df.head(10)

This loads the BigQuery data into a Pandas dataframe, which can be used for model creation as required. Later, when the data pipeline runs as per the schedule, the refreshed data is automatically available in this Jupyter notebook via the same SQL query.

Hope this helps you to automate your batch Data Engineering pipeline for Machine Learning. 

This story is co-authored by Santosh and Subbareddy. Santosh is an AWS Cloud Engineer and Subbareddy is a Big Data Engineer.

Real Time Data Engineering Pipeline for Machine Learning

Our focus in this post is to leverage Google Cloud Platform’s Big Data Services to build an end-to-end Data Engineering pipeline for streaming processes.

So what is Data Engineering?
Data Engineering deals specifically with data delivery, storage, and processing. The main goal is to provide reliable infrastructure for data, which includes operations such as collecting, moving, storing, and preparing data.

Most companies store their data in different formats across databases and as text files. This is where data engineers come into the picture: they build pipelines that transform this data into formats that data scientists can use.

Need for Data Engineering in Machine Learning:
Data engineers are responsible for:

  • Developing machine learning models.
  • Improving existing machine learning models.
  • Researching and implementing best practices to enhance the existing machine learning infrastructure.
  • Developing, constructing, testing and maintaining architectures, such as databases and large-scale processing systems.
  • Analyzing large and complex data sets to derive valuable insights.

This is the reference architecture used to build the end-to-end data pipeline:

Google Cloud Platform Data Engineering Pipeline for Streaming Processes

The Google Cloud Services used in the above streaming process are:

  1. Cloud Firestore: Lets us store data in the cloud so that we can sync it across devices and share it among multiple users. It is a NoSQL document database that lets us store, query, and sync data.
  2. Cloud Function: A lightweight compute solution for developers to create single-purpose, stand-alone functions that respond to cloud events without the need to manage a server or runtime environment.
  3. Cloud Pub/Sub: A fully-managed real-time messaging service that allows you to send and receive messages across independent applications.
  4. Cloud Dataflow: A cloud-based data processing service for both batch and real-time data streaming applications. It enables developers to set up data processing pipelines for integrating, preparing and analyzing large data sets.
  5. Cloud Storage: A data storage service in which data is maintained, managed, backed up remotely and made available to users over a network.
  6. BigQuery: It was designed for analyzing data on the order of billions of rows, using a SQL-like syntax. It runs on Google Cloud infrastructure and can be accessed with a REST-oriented application programming interface (API).
  7. Jupyter notebook: An open source web application that you could use to create and share documents that contain live code, equations, visualizations, and text.

Creating a data engineering pipeline via Firestore streaming

Step1: Add a new record to a collection (think of it as a table), say pubsub-event, in Firestore.

Step2: This triggers the Cloud Function named pubsub_event.

The document path pubsub-event/{eventId} listens for changes to all pubsub-event documents.

Below is the Cloud Function, written in Node.js, which triggers whenever a new document is created in our source Firestore collection and publishes the data to Pub/Sub.

const PubSub = require('@google-cloud/pubsub');
const pubsubClient = new PubSub();
const functions = require('firebase-functions');

exports.helloFirestore = functions.firestore
  .document("pubsub-event/{eventId}")
  .onCreate((snap, context) => {
    const event = snap.data();
    const payload_data = {};
    for (let key of Object.keys(event)) {
    	payload_data[key] = event[key];
    }
    console.log(JSON.stringify(payload_data))
    // The name for the new topic
    const topicName = 'pubsub-gcs';
    const dataBuffer = Buffer.from(JSON.stringify(payload_data));
    // Publishes the payload to the topic
    return pubsubClient
      .topic(topicName)
      .publisher()
      .publish(dataBuffer)
      .then(messageId => {
        console.log(`Message ${messageId} published.`);
        return messageId;
      })
      .catch(err => {
        console.error('ERROR:', err);
      });

  });

Below are the dependencies of the Cloud Function.

{
  "name": "functions",
  "description": "Cloud Functions for Firebase",
  "scripts": {
    "serve": "firebase serve --only functions",
    "shell": "firebase functions:shell",
    "start": "npm run shell",
    "deploy": "firebase deploy --only functions",
    "logs": "firebase functions:log"
  },
  "engines": {
    "node": "8"
  },
  "dependencies": {
    "@google-cloud/pubsub": "^0.18.0",
    "firebase-admin": "~7.0.0",
    "firebase-functions": "^2.3.1"
  },
  "devDependencies": {
    "firebase-functions-test": "^0.1.6"
  },
  "private": true
}

Step3: The Cloud Function pubsub_event publishes the data to the Pub/Sub topic projects/ProjectName/topics/pubsub-gcs.

Step4: As shown above, create an export job: ps-to-text-pubsub-gcs (implemented via Dataflow). This job reads data every 5 minutes (configurable to other values as well) from the Pub/Sub topic pubsub-gcs and dumps it into the destination bucket on GCS.

Step5: Click on Run job to start the Dataflow job.

Step6: Now we have data in Cloud Storage. We shall use BigQuery to perform all the data manipulation operations. But first we need to create a dataset in BigQuery to query the data from GCS.

Go to BigQuery and create a dataset, so that we can create our table to access that data.

The dataset will be created. By clicking on the dataset you will see an option to CREATE TABLE.

Click on CREATE TABLE; we shall get the data from Cloud Storage. While setting up the required inputs as indicated below, please make sure that you select “Table type” as External table. This ensures that BigQuery is able to automatically pick up new data as it comes into GCS.

To create the table in BigQuery from Cloud Storage, click on the browse button and configure the file path.

Select the files that have pubsub-event-* as a prefix. This prefix is very important, as it makes sure that all subsequent data dumps into the GCS destination folder are also picked up automatically by BigQuery. Select the file format to be JSON, check the auto-detect schema box, and then click Create table.


Quick Tip: For reading nested JSON files in BigQuery, please go through this resource. Now the data present in Cloud Storage is also available in BigQuery, and you can run SQL commands to manipulate the data.

Click on the table you have created (accounts is my table name), then click on Query Table to perform SQL operations, and you can see your results in the preview tab at the bottom.

Step7: Now, we are on to the last step to access this BigQuery data in Jupyter Notebooks and use that as the source data to train and build our ML models.

Search for notebook in GCP console. 

You shall see something like this 

Click on OPEN JUPYTERLAB and it will redirect you to the notebook. Run the following code there:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT * FROM 
`<project-name>.<dataset-name>.<table-name>`
"""

df = client.query(sql).to_dataframe()
df.head(10)

So in this way, we have built a data pipeline that continuously dumps data from Firestore into GCS every 5 minutes, which is then readily available in Jupyter Notebook via BigQuery for any downstream analytics and ML model building.

Look forward to your comments.

This story is co-authored by Santosh Kumar and PV Subbareddy. Santosh is a Software Engineer specializing in Cloud Services and DevOps. Subbareddy is a Big Data Engineer specializing in AWS Big Data Services and the Apache Spark Ecosystem.