Workflow Orchestration

What I Talk About When I Talk About Orchestration

September 10, 2024
Chris White
CTO

Introduction

In my years working with data platforms and talking to engineers, data scientists, and business leaders, I've noticed a common pattern: the term "orchestration" often elicits a mix of vague nods, furrowed brows, and the occasional explanation framed entirely in terms of tooling rather than value. Most often it's blindly equated with "automation," a term only a few degrees away from being so general that it loses all meaning.

Because of this, I've seen teams struggle to justify the adoption of a first-class orchestrator, often falling back on the age-old engineer's temptation: "We'll just build it ourselves." It's a siren song I know well, having been lured by it myself many times. The idea seems simple enough – string together a few scripts, add some error handling, and voilà! An orchestrator is born. But here's the rub: those homegrown solutions have a habit of growing into unwieldy systems of their own, transforming the nature of one’s role from getting something done to maintaining a grab bag of glue code.

Orchestration is about bringing order to this complexity. It's about creating systems that don't just execute blindly, but understand their purpose and can inform us, in detail, what they're doing and why. It's about moving away from brittle, ad-hoc solutions that break in new and exciting ways every other Tuesday, towards robust, observable systems that we can rely on, debug effectively, and evolve as our needs change.

In this post, I want to take you on a tour through my own understanding of orchestration. We'll explore what I think it means, why I think it matters (especially when dealing with data), how it's usually implemented, and the challenges that come with it. My hope is that by the end, whether you're a seasoned data engineer or an executive trying to make sense of your tech stack, you'll have a clearer picture of what I talk about when I talk about orchestration.

Defining Orchestration

At its core, I like to define orchestration in data platforms as the automated execution and documentation of predefined, goal-oriented workflows. Think of orchestration as a self-documenting expert system designed to accomplish well-defined objectives (which in my world are often data-centric objectives). It knows the goal, understands the path to achieve it, and – crucially – keeps a detailed log of its journey. Let’s break this down:

Goal-oriented

The fact that workflows are goal-oriented is a crucial distinction from other forms of software: most application software exposes interfaces that let users pursue any goal within a supported set of features. Orchestration, in subtle contrast, is always aimed at achieving a particular outcome, such as generating a report, updating a model, or transforming a dataset. Because of this, an orchestrator typically owns the complete life cycle of an operation, from triggering event to completion.

Pre-defined steps

In the world of orchestration, the steps to achieve the goal are determined in advance, though they may adapt during execution. These steps and their triggering rules are almost always encapsulated in a central object typically called a “workflow”, or “flow”.

It's a common misconception that these steps must be represented as a Directed Acyclic Graph (DAG) prior to execution. While it's essentially true that all terminating software can be represented as a DAG post hoc, it does not follow that users need to confront this fact in order to define an orchestratable workflow. Conditionals, dynamically defined steps, and loops are much easier to express in a standard programming language than in a Domain Specific Language (DSL) that forces users to contort these concepts into a pre-baked DAG.
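
To make this concrete, here is a minimal sketch of a workflow written in ordinary Python, using Prefect's flow and task decorators purely for illustration. The source path and partition names are invented; the point is that the branch taken and the number of load steps are only decided at run time, yet the workflow remains fully orchestratable.

```python
from prefect import flow, task

@task
def list_partitions(source: str) -> list[str]:
    # In a real workflow this might query an API or object store;
    # here the partitions are hard-coded for illustration.
    return [f"{source}/2024-09-0{day}" for day in (1, 2, 3)]

@task
def load_partition(partition: str) -> int:
    print(f"loading {partition}")
    return 1  # pretend we loaded one batch

@task
def send_alert(message: str) -> None:
    print(f"ALERT: {message}")

@flow
def nightly_ingest(source: str = "s3://example-bucket/events"):  # hypothetical path
    # Dynamically defined steps: one load task per discovered partition.
    loaded = [load_partition(p) for p in list_partitions(source)]
    # A plain Python conditional -- no DAG DSL required.
    if sum(loaded) == 0:
        send_alert(f"no data loaded from {source}")

if __name__ == "__main__":
    nightly_ingest()
```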

The essence of pre-defined steps in orchestration is that they allow us to anticipate potential failure points and implement appropriate error handling and recovery mechanisms. In addition, knowing the steps in advance allows for optimization of resource allocation and ensures that the process can be repeated consistently. None of these characteristics preclude mutability or flexibility in workflow definition.

Automated Execution

An orchestrated process runs without requiring human intervention, though it may involve human tasks or approvals at certain stages. I often think of an orchestrator as a stand-in for a human doing rote tasks every day: creating a report based on fresh data every week is strictly within the realm of a human’s capabilities; in fact, it’s far easier for a human to adapt to failure modes in the moment. But as operations scale and software becomes more powerful, it’s unreasonable to spend so much time on tasks whose execution can be automated.
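
For contrast, here is a caricature of the kind of hand-rolled loop that automated execution usually replaces (the report function and interval are invented). An orchestrator takes over exactly this responsibility, but adds retries, state, and a record of every run.

```python
import time
from datetime import datetime, timedelta

def generate_weekly_report() -> None:
    # Stand-in for the rote task a person might otherwise do by hand.
    print(f"report generated at {datetime.now():%Y-%m-%d %H:%M}")

def run_forever(interval: timedelta = timedelta(weeks=1)) -> None:
    # A deliberately naive scheduler: no retries, no state, and no
    # record of past runs -- the gaps an orchestrator exists to fill.
    next_run = datetime.now()
    while True:
        if datetime.now() >= next_run:
            generate_weekly_report()
            next_run = datetime.now() + interval
        time.sleep(60)
```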

This is one of the dimensions on which orchestrators vary the most, in my experience. Some orchestrators lean almost exclusively into making intelligent, automated decisions about the most efficient resources on which to execute a given step or workflow; others require workflow authors to adhere to relatively rigid rules about how and where steps and workflows can run. There is no one-size-fits-all approach here, and there are usually trade-offs in product focus along this spectrum.

Self-documenting

One of orchestration’s most critical distinctions from basic automation is that orchestration systems are self-documenting. This self-documentation means an orchestrator keeps clear records of each step and workflow execution across various inputs and triggers. The system keeps a comprehensive log of what it does, when, and potentially even why, enabling observability and auditability.

It is tempting to call this aspect of orchestration “observability”, but I think it is different: to illustrate, consider how errors might get surfaced in the two types of systems. Because an observability system is ignorant of outcome, it may display or even alert on a count of errors across a program’s history. An orchestrator - in contrast - may know that 90% of those errors were successfully retried and therefore only concern itself with the ones that resulted in goal failure.
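
Here is a toy sketch of that distinction, using an invented record shape rather than any particular product's schema: an outcome-aware system looks only at the final attempt of each task.

```python
from dataclasses import dataclass

@dataclass
class TaskRunRecord:
    # Simplified version of what an orchestrator's state layer records
    # for every attempt of every task.
    task_name: str
    attempt: int
    succeeded: bool

def goal_failures(records: list[TaskRunRecord]) -> list[TaskRunRecord]:
    # A generic error counter would flag every failed attempt; an
    # orchestrator only surfaces tasks whose final attempt failed.
    final: dict[str, TaskRunRecord] = {}
    for record in records:
        existing = final.get(record.task_name)
        if existing is None or record.attempt > existing.attempt:
            final[record.task_name] = record
    return [r for r in final.values() if not r.succeeded]

runs = [
    TaskRunRecord("extract", attempt=1, succeeded=False),  # transient, retried
    TaskRunRecord("extract", attempt=2, succeeded=True),
    TaskRunRecord("load", attempt=1, succeeded=False),      # true goal failure
]
assert [r.task_name for r in goal_failures(runs)] == ["load"]
```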

This aspect of orchestration emphasizes its role not just in making things happen, but in providing a clear, auditable path from intention to outcome or to the first moment of “true” failure.

Key Components of Orchestration

To support goal-oriented, self-documenting workflows, orchestration systems typically include these essential components: a workflow definition, an execution engine, a state management layer and an observability layer.

  • Workflow Definition: this component allows engineers to specify the steps and goals of their workflows, success criteria for tasks, error handling, and retry/execution strategies. Many orchestrators rely on DSLs while others support native programming languages such as Python. Orchestrators that avoid heavy DSLs are often the ones that support the most complex workflow topologies, potentially at the expense of more domain-specific feature sets.
  • Execution Engine: this component is responsible for interpreting the workflow definition and executing it, along with any runtime and failure-handling strategies. Execution engines in the orchestration space range from rigid centralized schedulers to loosely coupled event-driven components. One aspect crucial for orchestration systems to function properly is that the business logic a workflow encapsulates should not compete with the orchestrator itself for resources; this is often summarized as "separate orchestration from compute" (a toy sketch of this separation follows this list).
  • State Management: this component tracks workflow and task status, maintains a history of all executions, and stores metadata about each run. This is a critical piece of the self-documenting aspect of orchestration systems, and is often the one that is the most difficult to build and maintain if building a homegrown solution. The state management layer must be consistent and allow for easy resumption of failed executions.
  • Observability: this component combines the self-documenting state layer with additional metrics and metadata, enabling fast debugging and insight generation. As will be said many times in this post, orchestration observability is distinct from traditional monitoring - this is why an observability layer remains a key component of an orchestrator and not something that can usually be swapped out with an off-the-shelf observability provider.
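
To make the "separate orchestration from compute" point concrete, here is a toy sketch (the chunk IDs and the transform are invented): the orchestrating process only decides what to run and records state, while the heavy lifting happens in separate worker processes, or, in real systems, on entirely different machines.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def transform_chunk(chunk_id: int) -> int:
    # Business logic: CPU-heavy work that should not compete with the
    # orchestrator's own process for resources.
    return sum(i * i for i in range(chunk_id * 100_000))

def orchestrate(chunk_ids: list[int]) -> dict[int, str]:
    # The orchestrator schedules work, tracks state, and reacts to
    # results; it does not execute the business logic itself.
    state = {chunk_id: "scheduled" for chunk_id in chunk_ids}
    with ProcessPoolExecutor() as workers:
        futures = {workers.submit(transform_chunk, c): c for c in chunk_ids}
        for future in as_completed(futures):
            chunk_id = futures[future]
            try:
                future.result()
                state[chunk_id] = "completed"
            except Exception:
                state[chunk_id] = "failed"
    return state

if __name__ == "__main__":
    print(orchestrate([1, 2, 3]))
```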

The Role of an Orchestrator

When discussing the role of orchestration in modern data and software engineering, three critical aspects stand out to me: resilience, observability, and standardization. These factors not only define the value of orchestration but also underscore why adopting a dedicated orchestrator can sometimes be the right choice for organizations dealing with complex, homegrown automated processes.

Resilience

Allow me to be a little pedantic here: by "resilience" I mean the ability of a system to recover from unknown or unforeseen failure modes; it does not imply that the system never fails. A failure may occur, but the system can be quickly diagnosed, fixed, and redesigned to avoid the same failure mode in the future. To emphasize this distinction I sometimes quip that resilient pipelines are the backbone of reliable data platforms.

Orchestration plays a crucial role in building resilient automated systems that can absorb failures and resume operating effectively. This is achieved through many of the mechanisms already described: intelligent retry mechanisms, failure isolation, observability and state management for fast recovery from the point of failure.
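
As one concrete, hedged illustration: in Prefect, retry behavior can be declared on the task itself, so a transient network failure is absorbed without ever becoming a goal failure. The endpoint and retry values below are invented.

```python
import httpx
from prefect import flow, task

@task(retries=3, retry_delay_seconds=30)
def fetch_source(url: str) -> dict:
    # Transient failures (timeouts, 5xx responses) are retried by the
    # orchestrator; only a fourth consecutive failure fails the flow.
    response = httpx.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

@flow
def refresh_dataset():
    payload = fetch_source("https://example.com/api/data")  # hypothetical endpoint
    print(f"fetched {len(payload)} top-level keys")

if __name__ == "__main__":
    refresh_dataset()
```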

Observability

Orchestration observability is distinct from traditional monitoring. One example already mentioned is that an orchestrator can distinguish between a transient error that was successfully retried, and an error that halted the execution of a workflow. While standard tools provide snapshots of system health, orchestration observability offers a continuous, end-to-end view of complex, often long-running operations.

Consider a nightly ETL process that extracts data from multiple sources, transforms it through several stages, and loads it into a data warehouse. Traditional monitoring might show that each component—the extraction process, the transformation service, and the loading process—is functioning. Orchestration observability, however, tracks the entire data flow: from initial extraction, through each transformation step, to final loading, even as the process spans hours. It reveals not just that each part works, but how they cohesively coordinate to achieve the desired outcome of migrating data from one place to another.
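
As a small sketch of what that end-to-end view enables (the stage names and timings are invented), the orchestrator's run history lets you ask questions about the whole run rather than about each component in isolation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class StageRun:
    # Illustrative records for one nightly flow run, as they might be
    # read back from an orchestrator's state layer.
    name: str
    started: datetime
    finished: datetime

def run_summary(stages: list[StageRun]) -> str:
    # End-to-end questions: how long did the whole run take, and which
    # stage dominated it? Per-component health checks cannot answer these.
    total = stages[-1].finished - stages[0].started
    slowest = max(stages, key=lambda s: s.finished - s.started)
    return f"total runtime {total}, slowest stage '{slowest.name}'"

t0 = datetime(2024, 9, 10, 1, 0)
print(run_summary([
    StageRun("extract", t0, t0 + timedelta(minutes=20)),
    StageRun("transform", t0 + timedelta(minutes=20), t0 + timedelta(hours=2)),
    StageRun("load", t0 + timedelta(hours=2), t0 + timedelta(hours=2, minutes=15)),
]))
```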

Standardization

In my opinion this aspect of orchestration is one of the most under-appreciated benefits of adopting an orchestrator. Orchestrators bring standardization to your data and engineering processes, and this standardization has far-reaching effects on your organization's efficiency, collaboration, and ability to scale. Consistent workflow patterns make it easy to enforce best practices, collaborate across teams, and generate audit reports. Centralized configuration management improves governance of the infrastructure and systems your workflows touch. And of course, internal standards and centralized governance simplify the onboarding of new team members.

The standardization enforced by orchestrators also naturally leads to improved scalability and adaptability. Whether you're increasing data volume, increasing execution frequency, or expanding to new use cases, a consistent structure and runtime make it easier to adapt and scale. In addition, as your tech stack evolves, a good orchestrator can act as an abstraction layer that allows you to swap out underlying technologies without completely rewriting your workflows. This adaptability is crucial in today's rapidly changing technology landscape.
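
One way to read the abstraction-layer point: if workflow steps are written against an interface rather than a vendor SDK, swapping the underlying technology becomes a configuration change. Here is a toy sketch with invented class and table names.

```python
from typing import Protocol

class WarehouseClient(Protocol):
    # The workflow depends only on this shape, not on any vendor SDK.
    def load(self, table: str, rows: list[dict]) -> None: ...

class SnowflakeClient:
    def load(self, table: str, rows: list[dict]) -> None:
        print(f"loading {len(rows)} rows into Snowflake table {table}")

class BigQueryClient:
    def load(self, table: str, rows: list[dict]) -> None:
        print(f"loading {len(rows)} rows into BigQuery table {table}")

def publish_metrics(client: WarehouseClient, rows: list[dict]) -> None:
    # The workflow step stays the same no matter which client is injected.
    client.load("daily_metrics", rows)

publish_metrics(SnowflakeClient(), [{"day": "2024-09-09", "value": 42}])
```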

Role Ambiguity

This view of orchestration's role, particularly its emphasis on self-documenting, goal-oriented workflows and robust state management, points towards a similarity with durable execution frameworks. In my opinion the primary distinction lies in the goal-orientation of each paradigm: durable execution is about managing the state and continuity of an application with no pre-defined outcome. Durable execution frameworks often expose "orchestration" features to end users who never see the orchestrator's observability layer, and whose goals were not pre-defined; only the steps to achieve them were. While this may feel abstract and nuanced, in practice it's not so subtle. To force an analogy, it's like comparing microwaves with stove tops: sure, they both heat food, but when I'm in the market for one I'm definitely not in the market for the other.

Challenges and Limitations of Orchestration

While orchestration offers numerous benefits for managing complex data workflows, it's not without its challenges and limitations. Understanding these is crucial for making informed decisions about when and how to implement orchestration in your data platform.

  • Learning Curve: Orchestration systems can have steep learning curves, especially for teams new to the concept. For this reason I usually recommend that teams new to orchestration start with a greenfield project to learn the ropes before undertaking a migration.
  • Over-engineering: There's always a risk of applying orchestration to scenarios where simpler solutions would suffice. Sometimes the answer to a potential orchestration challenge really is to use cron. I think this risk is inversely proportional to the number of automated processes, so once a team is managing more than a couple of workflows it’s a good idea to at least consider an upgrade.
  • Performance Overhead: Orchestration systems can introduce performance overhead, especially for simpler tasks. This typically comes as a consequence of the state management layer and the supported execution engines / runtime environments.
  • Vendor Lock-in: If I've successfully made the case for orchestration as a critical platform component, then I've equally made the case that it risks vendor lock-in. For this reason you should probably avoid orchestrators that don't offer a feature-rich open source version.
  • Multi-Cloud and Hybrid Environments: As organizations increasingly adopt multi-cloud and hybrid cloud strategies, orchestrating workflows across diverse environments presents unique challenges. Most cloud providers offer some form of orchestration, but if you are multi-cloud you should probably consider a cloud-agnostic orchestration solution. Support for such scenarios ranges greatly across orchestration tools: some require a separate instance of the orchestrator in each cloud environment, while others have lighter-weight solutions.

Let’s wrap it up

When I first encountered orchestration, it was just another vague tech term. I had to fumble through systems, fail a few times, and gradually piece together what it meant in practice. It's okay if it still feels a bit abstract to you – it did for me for a long time.

But when I talk about orchestration now, I'm talking about something specific. It's not just about making things run; it's about making them run with purpose, transparency and recoverability. And I don't pretend that orchestration is a silver bullet. It comes with its own challenges and isn't the right solution for every problem. Sometimes a simple script is all you need.

In the end, orchestration is an approach, a way of managing the complexities of workflows. When applied thoughtfully, it can bring a bit more clarity to the often murky world of automated processes. And in my book, a little more clarity is always welcome.

Want to learn more about how Prefect takes these ideas and turns them into practice? Come join us on GitHub or Slack to find out!