Recently I had the opportunity to play with Apache Airflow, a scheduler for data-pipeline tasks. It is very simple to set up, whether on bare metal, as a Docker container, or in Kubernetes via the official Helm chart, following this description: https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html
This software lets you create data pipelines that extract data, decorate it, and save it in a different place. Everything is done with Python scripts, which define a DAG (directed acyclic graph) with a set of tasks. Tasks use operators as templates. An operator can create an EMR or Databricks cluster and then run a chosen job on it. It can also fetch data from an external source such as a database or a Kafka topic, or invoke a bash command or a Python function. These operators give you quite good flexibility when composing even complex DAGs.
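To make this concrete, here is a minimal sketch of such a DAG (the DAG id, schedule, and task bodies are made up for illustration). It chains an extract step, a Python decorate step, and a save step using the stock BashOperator and PythonOperator:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def decorate_data(**context):
    """Hypothetical transform step: enrich the extracted data."""
    print("decorating data...")


with DAG(
    dag_id="example_pipeline",            # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",           # called `schedule` in newer Airflow releases
    catchup=False,
) as dag:
    # Extract: BashOperator runs an arbitrary shell command.
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data...'",
    )

    # Decorate: PythonOperator invokes a plain Python function.
    decorate = PythonOperator(
        task_id="decorate",
        python_callable=decorate_data,
    )

    # Save: another bash step standing in for the load.
    save = BashOperator(
        task_id="save",
        bash_command="echo 'saving data...'",
    )

    # The >> operator wires tasks into the DAG's dependency graph.
    extract >> decorate >> save
```

In a real pipeline the echo commands would be replaced by provider operators, for example one that spins up an EMR or Databricks cluster and submits the chosen job to it.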
Here is a short tutorial on this topic: