A Beginner’s Guide to Airflow

Airflow is used to create code pipeline where we can schedule and monitor our workflows. A workflow can be a collection of tasks to be executed like a flowchart.

It is like an orchestra conductor that controls all different data processing tools/tasks under one roof.

Why Airflow?

  • Open source.
  • Can handle upstream/downstream in an elegant way.
  • Jobs can pass parameters to other jobs downstream.
  • Ease of deployment of workflow.
  • Easy to reprocess historical jobs by date or rerun for specific intervals.
  • Integration with a lot of infrastructure.
  • Job testing through airflow itself.
  • Accessibility of log files and other meta-data through the web GUI.
  • Data sensors to trigger Directed Acyclic Graph (DAG) when data arrives.

Airflow Architecture:

Metadata: It stores the state of tasks and workflow.
Scheduler: It is a user DAG, with the state of tasks in metadata database. It decides what needs to execute.
Executor: It is a message queuing process. Decides which worker will execute each task.

Setting up airflow:

Use the following commands to install airflow

$ sudo pip install apache-airflow

You can install extra features like

$ sudo pip install apache-airflow[postgres,s3]

Airflow requires database to be initiated before you run tasks

$ airflow initdb

After installation the folder structure will be something like this.

If you don’t see ‘dags’ folder then you can create by yourself and name it ‘dags’. You shall dump all your python task files in that folder.

To start the web server

$ Airflow webserver -p 8080

Goto https://localhost:8080 to see airflow GUI. You should see something like this:

Airflow relies on 4 cores:

  1. DAG.
  2. Tasks.
  3. Scheduler.
  4. X-com.

Directed Acyclic Graph (DAG):

It is a collection of all tasks which you want to run in an organized way that shows their relationships and dependencies.

Basically a DAG is just a python file, which is used to organise tasks and set their execution content. DAGs do not perform any actual computation. Basically a DAG task contains 2 type of things:

  1. Operators
    It determines what actually gets done. It triggers a certain action like running a bash command, executing a python function, etc.
    There are different kinds of operators like:
    Bash operator: Executes a bash command.
    Python operator: Calls an arbitrary python function.
    Hive operator: Executes hql code or hive script in specific hive database.
    Bigquery operator: Executes Google BigQuery SQL query in a specific BigQuery database.
  2. Sensors : A certain type of operator that will keep running until a certain criteria is met.

Tasks :
Tasks are the elements that actually “do work” which we want to be performed. It is our job to write the configuration and organise tasks in specific order to create data pipeline.

Once the operator is instantiated, it is referred to as a “task”. An operator describes single task in a workflow

Scheduler :
It monitors all tasks and all dags. And trigger the task instance whose dependencies have met.
We create new python file in dags folder
Importing packages in our python file:

import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

Next setting default arguments

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': dt.datetime(2019, 6, 17, 8, 40, 00),
    'retries': 1,
}

Depends_on_past: we will be in a situation where  a task t1 is dependent on its previous execution. In that case we will use this.

Start_date: starting date of execution
Defining our tasks, dags, functions

def greet():
    print("Hello how are you ?")
    return 'greeted'

def response():
    return "im fine what about you?"

with DAG('YourDagName', default_args=default_args, schedule_interval='@daily',) as dag:
    opr_hello = BashOperator(task_id='say_Hi',bash_command='echo "Hi!!"')
    opr_greet = PythonOperator(task_id='greet', python_callable=greet)
    opr_sleep = BashOperator(task_id='sleep_me', bash_command='sleep 2')
    opr_respond = PythonOperator(task_id='respond', python_callable=response)

In this, I am using 2 simple functions and with dag passing my default arguments and also schedule time of execution.

After that we created 4 tasks in which 2 are python and other 2 are bash. If we want to execute a python function we need to use python operator and also the same with the bash operator. Python operator will take at least 2 arguments which is task id and python callable and for bash, task id and bash command.

Defining task dependencies or order in which the tasks should be executed in.

Task dependencies are set using

  • The set_upstream and set_downstream operator.
  • The bitshift operator <<  and >>.

t1.set_downstream(t2)
t1>>t2

The above 2 lines mean the same. This means t2 will depend on t1 to run successfully.

opr_hello >> opr_greet >> opr_sleep >> opr_respond

If you wrap up the whole code in one file and put that file in dags folder then you can see your dag in Airflow GUI. If you click on your DAG you will see tree view and many other options to view your data in different forms.

Dagrun are dags that runs at certain time.
TaskInstances are task belongs to dagrun.
Each dagrun and task instance is associated with an entry in airflow’s metadata database that logs their state(eg: queued, running, failed, skipped, etc).
X-com: XCom which means cross communication allow airflow tasks to send and receive data/status. XCom data are stored in the airflow database in the form of key-value pairs.

That’s all guys. Thanks for the read!

This story is co-authored by Santosh Kumar and PV Subbareddy. Santosh is a Software Engineer specializing on Cloud Services and DevOps. Subbareddy is a Big Data Engineer specializing on AWS Big Data Services and Apache Spark Ecosystem.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.