
Orchestration Tools: Choose the Right Tool for the Job

August 14, 2024
Avril Aysha

Orchestration tools help you automate, deploy, and monitor your workflows. With an orchestrator, you and your team can quickly build repeatable processes that are easy to run remotely, observe, and troubleshoot if they fail.

This article is a guide to choosing the right orchestration tool for your project. This is intentionally not a list of Top {lucky-number} Orchestration Tools with Prefect conveniently slipped in there. We like to assume our readers are intelligent humans who can research a list of tools by themselves just fine.

Instead, think of this piece as a roadmap or an atlas. It maps out the landscape of available orchestration tools and helps you understand how they relate to each other. The article is structured with questions that you can use to differentiate orchestration tools.

By answering these 7 questions, you will be better positioned to find the right tool for your use case:

  • Do I want to write Python code?
  • Do I need to see what goes wrong with my code?
  • Do I need to scale my deployed infrastructure?
  • When do I need to trigger my work?
  • How much time do I want to spend configuring the tool?
  • Do I want an open-source solution?
  • What am I orchestrating exactly?

Defining Orchestration Tools

There are lots of different kinds of orchestration tools out there. It can be helpful to break this down into subcategories to clarify what it is we’re talking about:

  • Workflow orchestration: The orchestration and automation of programmatic workflows often containing business logic, including tasks and processes.
  • Data orchestration: A subset of workflow orchestration specifically built for processing data workflows. It involves orchestrating workflows that perform data extraction, transformation and storage.
  • Container orchestration: The management, deployment, scaling, and networking of containerized applications, using tools such as Kubernetes and Docker.
  • Cloud orchestration: The management, deployment, and scaling of cloud services and resources.
  • Microservices orchestration: The orchestration of microservices interactions to deliver a cohesive application or service.

This article will focus on orchestration tools for data workflows.

Orchestration Tools for Data

This article will illustrate the Atlas of Orchestration Tools by looking specifically at orchestration tools for data projects. We will arrange a selection of orchestration tools according to their functionality and use case relevance. This method of analysis and comparison can be applied to other types of orchestration tools as well.

Specifically, we’ll look at the following data orchestration tools, listed alphabetically:

  • Airflow
  • Celery
  • cron
  • Dagster
  • Kestra
  • Prefect
  • Temporal

🗒️ These tools are all open-source. We discuss closed-source enterprise solutions briefly in the “Do I want an open-source solution?” section below.

We selected these orchestration tools because they illustrate a wide range of functionality and use cases. These tools are popular but they are certainly not the only orchestration tools out there. By exploring these specific orchestration tools in detail, you will gain an understanding of the landscape of available tooling that will help you place other tools you might come across.

Orchestration Tools for Complex Workloads

Data workloads usually involve more than pure data operations. To run a scheduled data analysis job, you will often also have to spin up some remote infrastructure (cloud orchestration), configure software environments and dependencies (container orchestration), and perhaps run some non-data background tasks (microservices orchestration).

Because of these hybrid requirements, some data orchestration tools like Prefect are designed to support other types of orchestration. We will refer to this combination of tasks as complex workloads throughout this article. Supporting complex workloads gives users the flexibility to orchestrate whatever they need.

Choosing an Orchestration Tool

Of course, you’re reading this on the Prefect website. And we wouldn’t be where we are if we didn’t believe we have a great product…for the right use cases. After answering these questions, maybe you’ll decide to give Prefect a shot. In that case, we’d love to talk to you. But hopefully you will also know when one of the other orchestration tools is likely to be a better fit for your use case.

The point is: the 7 guiding questions are the takeaway here, not the specific tools themselves. You know your specific problem best.

Let’s jump in! 🪂

Do I want to write Python code?

Data orchestration tools offer different syntaxes, APIs, and SDKs for different kinds of users. One important factor in choosing your data orchestration tool will be the programming language you want to use to define your workflows.

Do you want to use Python to define your data workflows?

Yes, I want to write only Python code.

Airflow, Celery, Dagster and Prefect all allow you to define your data workflows in Python code. If you’re familiar with Python and/or your team is already using it, one of these orchestration tools will probably be the best choice for you. With decorators and native Python scripting, we believe Prefect does stand out here.
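
To give a feel for the decorator-based style, here is a minimal Prefect flow (a sketch with invented task names and data; Dagster, Airflow and Celery each have their own Python APIs):

```python
from prefect import flow, task

@task
def fetch_orders() -> list[dict]:
    # Placeholder data; a real task might query an API or a database.
    return [{"id": 1, "total": 42.0}, {"id": 2, "total": 17.5}]

@task
def total_revenue(orders: list[dict]) -> float:
    return sum(order["total"] for order in orders)

@flow
def daily_report():
    orders = fetch_orders()
    print(f"Total revenue: {total_revenue(orders)}")

if __name__ == "__main__":
    daily_report()
```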

No, I prefer to define my workflows in YAML.

Kestra defines its workflows in YAML. This can be a good choice if you’re not a Python user or need to collaborate with team members who do not know Python.

No, I want to write other coding languages like Go, TypeScript, etc.

Temporal supports Python as well as other languages like Java, Go, TypeScript and PHP. Celery also supports language interoperability through webhooks.

No, I know cron and don’t want to learn anything else.

Cron uses its own concise syntax for scheduling jobs (for example, 0 2 * * * runs a command every day at 2:00 AM). It’s simple, which makes it easy to learn, but it is also limited in functionality.

Do I need to see what goes wrong with my code?

Code is never meant to fail – but it always does at some point. When your scheduled code fails, do you want to be able to see where and why it failed? In other words, do you care about logging, debugging and observability?

Yes, I need detailed logging and automated triage to quickly get back on track. When something fails, a backup process should launch automatically.

Prefect is designed for detailed and actionable observability as well as automated backups. When your pipeline fails, your Prefect dashboard gives you clear information on what went wrong and how you can get back on track. You can define automatic retry logic to handle expected cases. You can also keep track of metrics like uptime to make sure you are meeting SLAs.

Prefect’s transactional semantics ensure your failures are isolated. The automation engine automatically runs a backup job when something fails. This way you can avoid expensive downtime and engineering hours spent manually retriggering failed jobs.
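
As a rough sketch of the retry behavior described above (the task logic is invented; retries and retry_delay_seconds are arguments of Prefect's task decorator):

```python
import random

from prefect import flow, task

@task(retries=3, retry_delay_seconds=10)
def flaky_extract() -> list[int]:
    # Simulate a transient upstream failure; Prefect retries the task automatically.
    if random.random() < 0.3:
        raise RuntimeError("temporary upstream outage")
    return [1, 2, 3]

@flow(log_prints=True)
def nightly_job():
    rows = flaky_extract()
    print(f"extracted {len(rows)} rows")

if __name__ == "__main__":
    nightly_job()
```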

Yes, I need detailed logging and actionable information to quickly get back on track. I’m happy to manually trigger my own retries.

Dagster, Temporal and Kestra all give you detailed logging and observability for your workflows. They all offer visibility into workflow states and histories, usually through a web UI. Dagster also provides data-specific alerting which is helpful to ensure data consistency and quality.

Yes, but I only need basic logging. I’m happy to spend hours or days figuring out what went wrong and how to fix it.

Airflow and Celery provide only basic notification functionality by default. This tells you that something failed, but it does not give you much context on why it failed or how to fix it, and it doesn’t give you detailed downstream dependency analysis. It’s possible to extend Airflow and Celery observability with additional plug-ins, but this will require additional configuration hours.

No, I don’t need any logging or observability. My code never fails, or I’m not running critical work and failures are okay.

Cron does not give you any logging or observability by default: logging must be configured manually, typically by redirecting stdout and stderr to log files. Cron offers only very basic error handling through log inspection; there is no native support for retries or detailed failure diagnostics.

Observability has varying definitions, particularly in the data space. In the context of workflow orchestration, observability is still early: today it is mostly reactive rather than proactive, though it is steadily getting closer to the level of observability already available for data.

Do I need to scale my deployed infrastructure?

To deploy your workflows to production, you will need to consider the type of infrastructure your code will run on. Are you happy for your job to just run on the machine that triggers the orchestration job, i.e. locally? Or do you need to be able to scale to external (usually, cloud) infrastructure? If so, do you need to be able to install packages, set environment variables, or configure hardware settings for that infrastructure? These questions become increasingly important as your workloads scale.

Yes, I need to run my code on external (cloud) infrastructure. My infrastructure should auto-scale and I need to be able to customize infrastructure per job run.

Prefect and Temporal allow you to run workflows on dynamically provisioned, ephemeral hardware that scales with your workloads. In the case of Prefect, this can be either on infrastructure managed by Prefect or on your own cloud. Using Prefect Workers, you can also define dynamic infrastructure configurations (both hardware and software) per job run. This gives you a lot of control over where and how your scalable workloads run.

This is technically also feasible in Airflow but it requires additional layers of plug-ins and is often painful and time-consuming.
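
A rough sketch of how this can look with Prefect (the work pool and image names are placeholders, and the exact deployment arguments depend on your Prefect version and setup):

```python
from prefect import flow

@flow
def score_transactions():
    print("scoring the latest batch")

if __name__ == "__main__":
    # Deploy the flow to a work pool served by a worker (e.g. a Kubernetes worker);
    # the worker provisions infrastructure per run based on the pool's job template.
    score_transactions.deploy(
        name="score-transactions-prod",
        work_pool_name="my-k8s-pool",        # placeholder work pool
        image="my-registry/score:latest",    # placeholder container image
    )
```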

Yes, I need to run my code on external (cloud) infrastructure. All my jobs can run on the same infrastructure.

Prefect, Temporal, Dagster, Kestra and Celery all allow you to easily deploy your workflow code to external infrastructure in the cloud.

Again, this is technically also feasible in Airflow but it requires additional layers of plug-ins and is often painful and time-consuming.

No, just run my code wherever my job gets triggered.

All orchestration tools support local execution of your workloads.

Cron does not natively support dynamic scaling or external deployment. You will have to write and manage separate scripts manually to spin up cloud resources. This is generally not advised because of poor oversight and cost management.

When do I need to trigger my work?

The most basic form of orchestration is running tasks on a predefined time schedule. However, this is often not enough; you may need to trigger work based on specific events, such as user actions, document uploads, database updates, etc. Some workflows even require support for real-time streaming orchestration.

I need to be able to trigger work in real-time.

Prefect and Temporal support working with real-time event data. You can configure jobs to listen to streaming sources like Kafka topics and trigger work based on the incoming data with very low latency.
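
A rough sketch of that pattern (this uses the kafka-python client with an invented topic and broker address; in practice you might rely on Prefect events and automations or Temporal signals instead):

```python
from kafka import KafkaConsumer  # kafka-python client

from prefect import flow

@flow
def handle_transaction(payload: bytes):
    print(f"processing event of {len(payload)} bytes")

if __name__ == "__main__":
    consumer = KafkaConsumer("transactions", bootstrap_servers="localhost:9092")
    for message in consumer:
        # Trigger a flow run for each incoming event on the topic.
        handle_transaction(message.value)
```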

I need to trigger my work based on events.

Prefect, Temporal, Kestra, Dagster, and Celery all support triggering work based on events, such as clicks, database updates, etc. New tasks can be launched based on these events.

I need to trigger my work according to a fixed time schedule.

All the orchestration tools listed in this article support scheduling tasks according to a fixed time schedule. They wouldn’t be much use as orchestrators if they didn’t.
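
For example, in Prefect a cron schedule can be attached when serving a flow (a minimal sketch; the schedule below runs the flow every day at 06:00):

```python
from prefect import flow

@flow
def morning_refresh():
    print("refreshing dashboards")

if __name__ == "__main__":
    # Serve the flow with a fixed cron schedule: every day at 06:00.
    morning_refresh.serve(name="morning-refresh", cron="0 6 * * *")
```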

How much time do I want to spend configuring the tool?

I want a tool that is easy to manage and onboard.

If you’re a software or data engineer already working in Python, then Temporal and Prefect will probably feel most familiar to you. Kestra is YAML-based and Dagster’s asset-based framework introduces some new concepts, so these two can have a somewhat steeper learning curve.

Prefect, Dagster, Kestra and Temporal all provide users with clear and insightful web UIs that are intuitive to use. By prioritizing observability, fault tolerance and an easy debugging experience, these tools also minimize the amount of time spent on managing the tool once it’s in place.

I’m happy to spend a lot of time learning how to use the tool and manage its quirks.

Tools like Airflow, Celery and cron are less polished. Occasionally this can be a feature if you want lower-level access. Most of the time, it’s a pain and a waste of time better spent on actually doing data processing.

I want a tool with a large active community to support me.

Of all the tools listed in this article, Airflow currently has the largest community. This is partly due to it being the oldest solution in the market. While community size is difficult to measure, Dagster and Prefect probably come in tied for second place.

Do I want an open-source solution?

We believe the future is open-source. Open-source gives you flexibility without the risk of vendor lock-in. You also benefit from a rich and helpful community. That’s why all the orchestration tools we discuss are open-source.

There are also many closed-source orchestration tools out there. They are often less flexible and try to lock you into using their own connectors for adjacent tooling. Open-source solutions are generally just a layer on top of your existing infrastructure. This gives you the freedom to customize and adapt the tool to more use cases and organizational changes.

So…what am I orchestrating exactly?

To bring all the information together, we’ve put together a few example workflows to highlight the differences between orchestration tools. For each use case, we will describe the orchestration process broadly, highlight the features you will probably need and provide recommendations on which orchestration tool you might want to use and which you probably want to avoid. Spoiler alert: you probably want to avoid cron in all cases, as it is really just a scheduler.

Here are some example use cases to help you begin to identify the most important differentiators between orchestration tools:

I am orchestrating complex data engineering, cleaning or ETL/ELT jobs

Imagine you work for a small e-commerce company that wants to analyze user behavior on their website to optimize their marketing strategies. The company website gets about 1,000 visitors per day.

Your data orchestration process could look something like this:

  • Extract user clickstream data from the website at the end of the day.
  • Use PySpark to filter out bot traffic and duplicate events.
  • Transform the cleaned data to calculate metrics like session duration and conversion rates.
  • Load the transformed data into cloud object storage like S3.
  • Analyze the data and provide data visualization reports ready by the start of the next day.
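
To make this concrete, the steps above might translate into a flow along these lines (a simplified sketch with invented function names; the PySpark job and S3 upload are stand-ins):

```python
from prefect import flow, task

@task(retries=2)
def extract_clickstream(day: str) -> list[dict]:
    # Placeholder: pull the day's raw click events from the website backend.
    return [{"user": "a", "event": "click"}, {"user": "bot-1", "event": "click"}]

@task
def clean_events(events: list[dict]) -> list[dict]:
    # Placeholder for the PySpark job that drops bot traffic and duplicates.
    return [e for e in events if not e["user"].startswith("bot")]

@task
def compute_metrics(events: list[dict]) -> dict:
    return {"sessions": len(events)}

@task
def load_to_s3(metrics: dict, day: str):
    # Placeholder: write the metrics to object storage for the reporting step.
    print(f"would upload {metrics} for {day}")

@flow
def nightly_clickstream_etl(day: str):
    raw = extract_clickstream(day)
    cleaned = clean_events(raw)
    metrics = compute_metrics(cleaned)
    load_to_s3(metrics, day)
```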

You will probably need:

  • Ability to run complex workloads
  • Access data from multiple sources/sinks
  • Observability and monitoring
  • Fault tolerance and automatic retries

You probably don’t need:

  • Scalable infrastructure
  • Ability to spin up custom infrastructure per job
  • Data lineage
  • Real-time streaming support

Recommend: Prefect, Kestra

Also possible: Dagster, Airflow

Probably not: cron, Celery, Temporal

Why?

Prefect and Kestra are both designed to automate complex workloads that involve data manipulation tasks as well as other code operations, such as connecting to external sources. Prefect is Python-based; Kestra uses YAML.

You could also use Dagster or Airflow. It may be more difficult to manage the non-data-specific tasks with Dagster. Third-party integrations (like Apache Kafka) are not as easy with Airflow, and you have less flexibility to customize executors per task run.

Cron is too simple for this job - it will not support external sources or event-based scheduling. Temporal is not primarily built for data processing. Celery requires lots of setup for monitoring across systems.

I am orchestrating pure data analysis jobs that run inside of my data warehouse.

Suppose you work in an operations team that monitors key operational metrics across multiple departments. Analysts run many simultaneous SQL queries on a central dataset that gets updated multiple times a day.

Your data orchestration process could look something like this:

  • A separate data engineering team pulls and cleans data from various operational systems, such as manufacturing, logistics, and customer service.
  • A central operational dataset is updated in a data warehouse like BigQuery several times a day.
  • Operations managers and analysts run a set of predefined SQL queries every hour to monitor performance, identify issues, and make data-driven decisions.

You will probably need:

  • Data-specific alerting
  • Data lineage
  • Observability and monitoring

You probably don’t need:

  • Scalable infrastructure
  • Ability to spin up custom infrastructure per job
  • Ability to run complex workloads
  • Fault tolerance and automatic retries
  • Real-time streaming support

Recommend: Dagster

Also possible: Airflow, Prefect, Kestra

Probably not: cron, Celery, Temporal

Why?

Dagster is built as a data-centric orchestration tool. Features like data lineage and data-specific alerting give you fine-grained visibility over where data comes from, how it is transformed and where it ends up. Traceability ensures that if an issue arises (e.g., an unexpected value in a key metric), the team can quickly trace back through the transformations and identify the root cause. This is crucial for maintaining data quality and reliability in real-time monitoring.
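
As a rough illustration of Dagster’s asset-based, lineage-aware style (a sketch with invented asset names; the warehouse read is a stand-in):

```python
from dagster import Definitions, asset

@asset
def operational_events() -> list[dict]:
    # Placeholder: in practice this might read the central table from BigQuery.
    return [{"department": "logistics", "late_shipments": 3}]

@asset
def department_kpis(operational_events: list[dict]) -> dict:
    # Dagster infers the dependency on operational_events from the argument name,
    # which is what powers the lineage graph in its UI.
    return {row["department"]: row["late_shipments"] for row in operational_events}

defs = Definitions(assets=[operational_events, department_kpis])
```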

In-warehouse scheduling solutions like dbt Cloud and Snowflake Tasks can be a good option for simple orchestration, especially if your data is already in one of these vendor warehouses.

You can definitely also use Airflow, Prefect and Kestra for this use case. They are all fully-featured data orchestration tools that will get the job done. You will probably have less detailed traceability on your data assets.

Cron is too simple for this job - it will not support the data processing features you need. Temporal is also not built with data processing as a priority. Celery will be hard to set up because you’ll need to spend time hardcoding the database connections.

I am orchestrating hybrid jobs that require flexible infrastructure

Let’s say you work for a financial institution that needs to detect fraudulent transactions in real-time. You need to spin up external infrastructure that auto-scales with the amount of data coming in.

Your data orchestration process could look something like this:

  • Stream transaction data from various sources into Google Cloud Pub/Sub.
  • Spin up Kubernetes clusters on Google Kubernetes Engine (GKE).
  • Deploy Docker containers running real-time analytics and machine learning models for fraud detection using TensorFlow.
  • Automatically scale the GKE clusters based on the transaction throughput, ensuring enough processing power to handle peak loads without wasting resources during low activity periods.

You will probably need:

  • Scalable cloud infrastructure
  • Ability to run complex workloads
  • Ability to spin up custom infrastructure per job
  • Observability and monitoring
  • Fault tolerance and automatic retries
  • Real-time streaming support

You probably don’t need:

  • Data lineage

Recommend: Prefect

Also possible: Kestra, Temporal

Probably not: cron, Celery, Airflow, Dagster

Why?

Prefect lets you run any kind of Python work, including cloud resource and container provisioning. Prefect Work Pools allow you to execute tasks on different environments or resource configurations. This flexibility ensures that tasks requiring high computational power (e.g., running machine learning models) are allocated to appropriately configured resources, optimizing both performance and cost. Prefect also integrates with major cloud providers and can orchestrate tasks within your own cloud environments.
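
One way this shows up in code is triggering a deployment that targets a specific work pool (a sketch; run_deployment is Prefect's helper for starting a run of an existing deployment, and the deployment name and parameters here are invented):

```python
from prefect.deployments import run_deployment

# Start a run of a deployment that targets, say, a GPU-backed Kubernetes work pool.
# The worker attached to that pool provisions the infrastructure for this run.
run_deployment(
    name="fraud-detection/score-stream",   # placeholder "flow-name/deployment-name"
    parameters={"window_minutes": 5},
)
```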

Kestra and Temporal also allow you to flexibly scale infrastructure depending on the computational load. They are less intuitive for a Python data engineer when it comes to defining custom environments and infrastructure configurations per job.

Automatically scaling remote infrastructure is much harder with Dagster, Airflow and Celery. It’s near impossible to do reliably with cron.

I am orchestrating microservices with lots of small background tasks

Imagine you’re working at a car-sharing app that requires proof of identification for account verification. When the user uploads their ID (driver’s license, etc.), you want to read the information automatically and verify its authenticity.

Your data orchestration process could look something like this:

  • Document upload: The user's application begins with the upload of their driver's license. You store the upload in an Amazon S3 bucket.
  • Notification trigger: Upon successful upload, the application sends a notification to an AWS Simple Notification Service (SNS) topic. This acts as a trigger for the subsequent steps.
  • Data processing via containers: You spin up a container in Amazon Elastic Container Service (ECS) to process the request. This container contains the code to extract necessary information from the uploaded driver's license.
  • Information analysis: The Docker container sends the extracted data to the underwriting team's internal tool for further analysis and decision-making.
  • Decision communication: Once the underwriting team makes a decision, an application on their end sends a notification back to the database to update the information. It also sends an email to the user updating them on the status of their request.
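
A sketch of how the processing step might be expressed as a flow (the helper names and statuses are invented; the S3/SNS/ECS wiring is left out):

```python
from prefect import flow, task

@task(retries=3, retry_delay_seconds=5)
def extract_license_fields(s3_key: str) -> dict:
    # Placeholder for the extraction logic that runs inside the ECS container.
    return {"name": "Jane Doe", "license_no": "D123-4567"}

@task
def send_to_underwriting(fields: dict) -> str:
    # Placeholder: post the extracted fields to the underwriting team's internal tool.
    return "pending-review"

@task
def notify_user(status: str):
    print(f"emailing user: verification status is {status}")

@flow
def verify_identity(s3_key: str):
    fields = extract_license_fields(s3_key)
    status = send_to_underwriting(fields)
    notify_user(status)
```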

You will probably need:

  • Ability to run complex workloads
  • Scalable infrastructure
  • Observability and monitoring
  • Fault tolerance and automatic retries

You probably don’t need:

  • Data-specific alerting
  • Data lineage
  • Ability to spin up custom infrastructure per job

Recommend: Prefect, Temporal

Also possible: Celery, cron

Probably not: Airflow, Dagster, Kestra

Why?

Prefect and Temporal are built to support microservice orchestration. The ability to run any Python code together with the ability to launch customizable and scalable external infrastructure give you the features you need to execute this kind of workflow successfully.

Cron and Celery can be a good choice for very basic microservice orchestration. However, both of these technologies offer only very limited functionality, with no native support for auto-scaling, monitoring or fault tolerance.

Dagster and Kestra are not designed for running arbitrary code to launch, scale, and shut down microservices. It would be possible but painful to do in Airflow.

Orchestration Tools for Your Data Workflows

It’s important to choose the right orchestration tool for your data projects. You know your specific use case needs best and this article has given you a method for comparing different orchestration tools so you can make the right choice.

You can use the following attributes to differentiate orchestration tools:

  • Do you want to work in Python?
  • Do you need detailed and actionable observability functionality?
  • Do you need to scale your deployments and how?
  • Do you need support for event-based or real-time task triggering?
  • Do you want to use a developer-centric product?
  • Do you want to use an open-source product?

Your answers to these questions, together with the exact nature of your workflows, will help you narrow down the options to choose the best tool for your use case. Whether you're managing complex data engineering tasks, orchestrating pure data analysis within a data warehouse, handling hybrid jobs with flexible infrastructure needs, or managing microservices with numerous background tasks, there's going to be an orchestration tool that will meet your needs.

And if you’ve got lots of time on your hands, you can always check out the awesome-pipelines repo on GitHub for at least 100 more options to choose from.

If you’re interested in seeing how Prefect can help you build resilient data workflows like Cash App and Cox Automotive do, get in touch with us.


Thank you to Simon Späti for helpful comments on an early draft of this article.