Simplifying Data Pipelines with Apache Airflow

Data forms the foundation of every operation in an organization, yet data pipelines can be cumbersome and complicated. An automated framework that orchestrates a sequence of tasks and moves the data along can greatly simplify an organization's operations. This post elaborates on how Apache Airflow automates the data pipeline process programmatically using directed acyclic graphs (DAGs).

Apache Airflow was developed by Maxime Beauchemin, a data engineer at Airbnb. The project has been open source since its first commit and remains under active development. You can track its progress on the GitHub repository: https://github.com/apache/incubator-airflow.

Airflow comes with templates that help you create a sequence of tasks. These tasks cover a wide range of operations such as running a shell script, querying a database, sending an email or transferring data.

Terminology
DAG: A Directed Acyclic Graph (DAG) defines the sequence of tasks. It also sets several parameters, such as the order of the tasks and the frequency at which they should run.

Operator: The tasks in a DAG are instantiated using operator classes. Operators can trigger a shell command, connect to an SQL database, run a Python function, send an email, and so on.

DAG run: When a DAG is executed, it is called a DAG run.

Task Instance: When an Operator is executed during a DAG run, Airflow calls it a task instance.

Creating Airflow DAGs
All DAG definition files are stored in the dags folder of the Airflow directory. To create a DAG definition, create a .py (Python) file in the dags folder.

We will start by coding an elementary DAG that executes a shell command to write the date and time to a .txt file.

We import modules in accordance with the tasks we will create. Here we import the DAG class, which we will use to create a DAG instance, and BashOperator, which we will use to instantiate a task that runs a shell command.
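A minimal sketch of the imports for such a DAG file, assuming the Airflow 1.x module paths:

```python
# Import the DAG class and the BashOperator used in this example,
# plus datetime helpers for scheduling arguments.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
```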

A DAG object is created using several arguments. If you want to create several DAGs with common arguments, you can put those arguments in a Python dictionary and reuse it when instantiating each DAG object. You can schedule DAG runs using cron expressions; here we have set the DAG to run every minute (“* * * * *”).
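Continuing from the imports above, a sketch of such a shared-arguments dictionary and the DAG object; the specific values (owner name, start date, DAG id) are illustrative:

```python
# Common arguments that can be shared across several DAGs.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2018, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

# The cron expression "* * * * *" schedules a DAG run every minute.
dag = DAG(
    'print_date_dag',
    default_args=default_args,
    schedule_interval='* * * * *',
)
```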

The tasks inside a DAG are created using operators. Here we use BashOperator to run a shell command that writes the time and date to a .txt file, as shown in the sketch below.
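A sketch of the task, completing the DAG file started above; the output path is just an example:

```python
# A single task that appends the current date and time to a text file.
print_date = BashOperator(
    task_id='print_date',
    bash_command='date >> /tmp/date_output.txt',
    dag=dag,
)
```

Once this file is saved in the dags folder, the Airflow scheduler picks it up and triggers the task every minute according to the cron schedule above.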

Web UI
The Airflow web UI contains several components that make the workflow easier to follow. When you start the Airflow web server, you are greeted with the DAG list. This screen shows the owner and schedule of each DAG along with the status of its tasks and DAG runs.


On clicking the name of a DAG, Airflow shows the hierarchy of its tasks using the Tree View and Graph View.

The UI also provides key features to manually trigger a DAG or pause it using an on/off toggle, along with other shortcuts that take you to the details of each DAG.

Conclusion
Apache Airflow comes with a wide range of functionality to smooth the workflow. Its active development makes it a project to look forward to for meeting the data pipelining needs of your organization.

References
Airflow Documentation: https://airflow.apache.org/