Apache Airflow

Recently I had the opportunity to play with this scheduler for data pipeline tasks. It is really simple to set up, whether on bare metal, in Docker, or in Kubernetes using the Helm chart, according to this description: https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html

This software lets you create data pipelines that extract data, decorate it, and save it in a different place. Everything is done with Python scripts, which let you define a DAG (directed acyclic graph) made up of a set of tasks. Tasks use operators as templates. An operator can create an EMR or Databricks cluster and then run a chosen job on it. It can also fetch data from an external source such as a database or a Kafka topic, or invoke a Bash command or a Python function. These operators give you quite a lot of flexibility when composing even complex DAGs.
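To make the idea of a DAG of tasks more concrete, here is a minimal sketch in plain Python. It is not Airflow's API: it only uses the standard library's `graphlib` to run three hypothetical tasks (`extract`, `decorate`, `save`) in dependency order, which is roughly the job Airflow's scheduler does for real operators. The task names, the sample rows, and the shared `ctx` dictionary are all assumptions made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical tasks for an extract -> decorate -> save pipeline.
# Each task reads from and writes to a shared context dictionary.
def extract(ctx):
    # Stand-in for reading from a database or a Kafka topic.
    ctx["rows"] = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]


def decorate(ctx):
    # Add a derived field to every extracted row.
    ctx["rows"] = [{**row, "name_upper": row["name"].upper()} for row in ctx["rows"]]


def save(ctx):
    # Stand-in for writing the rows to a different place.
    ctx["saved"] = list(ctx["rows"])


TASKS = {"extract": extract, "decorate": decorate, "save": save}

# The graph: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract": set(),
    "decorate": {"extract"},
    "save": {"decorate"},
}


def run_pipeline():
    """Run every task once, in an order that respects the dependencies."""
    ctx = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx


if __name__ == "__main__":
    order, ctx = run_pipeline()
    print("execution order:", order)
    print("saved rows:", ctx["saved"])
```

In a real Airflow DAG file, the functions would be wrapped in operators and the dependencies declared between the tasks, with Airflow taking care of scheduling, retries, and running tasks on workers.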

Here is a short tutorial on this topic:

