
When That Adhoc Script Turns Into a Production Data Pipeline

July 23, 2024
Zack Proser

So you’re a data engineer, analyst, data scientist, or ML engineer. Your daily battles likely involve composing brittle third-party APIs, wrestling with janky ETL services, and descending into the abyss of Python libraries (or rather, labyrinths).


Your latest journey likely began with a simple request: "Can you pull some numbers for a presentation?" What started as a quick Python script soon evolved into a critical component of your organization's decision-making process. Before you knew it, your "one-time thing" became a regular report, feeding dashboards and informing strategies across multiple departments. Let's explore how we got here and crucially how we can mitigate the mayhem in our data pipelines.

Workflow orchestration separates proofs of concept from production pipelines

We’ve seen this story before. The VP of Finance asked you to pull some numbers for a presentation. "Just a one-time thing," they said.

You whipped up a quick Python script to extract the data from your CRM, crunch some numbers, and generate a simple report.

The presentation was a hit. So much so that the VP asked if you could run it again next week. "No problem," you thought, setting up a cron job to automate the process.

But then things started to escalate:

  • Week 3: Marketing wants access to the same data for their campaign analysis.
  • Week 5: The product team needs similar metrics, but slightly tweaked, for feature prioritization.
  • Week 7: The CEO mentions your "insightful reports" in the all-hands meeting.
  • Week 9: Finance asks if you can include revenue projections based on your data.

Before you know it, your "quick script" is straining to coordinate interconnected processes and feed dashboards, reports, and critical business decisions across the company, and you start each morning kicking the jobs that failed overnight.

Suddenly, your initial prototype is production. Your “quick script” is a mission-critical service, and failures and data quality issues are more apparent than ever. How can you quickly productionize your data pipeline so you can move on to other tasks?

You could develop a cloud-native solution with infrastructure as code tools or script against AWS or Azure. After all, you tackle complex data problems for a living. But let's be real: do you really want to spend your valuable time wrestling with infrastructure when you could be extracting insights from data?

Why can’t you just write robust pipelines in Python, using abstractions that handle all of these tasks for you? You need a resilient system that can keep up with the complexity and scale of what your project has become. You need proper orchestration.

Workflow orchestration is the missing superpower separating an initial proof of concept from the reliable production data engineering systems you’re proud of.

Now, I know what you're thinking. "Great, another complex system to learn." But here's the secret: good orchestration doesn't add complexity—it tames it. It's about working smarter, not harder. Here's how data teams should approach orchestration.

Rethinking data workflows: the resilient orchestration mindset

Embrace flexible scheduling

Forget rigid, time-based schedules. Modern data workflows need to be responsive. You may need to start a process when new data arrives or when a certain condition is met. With the right orchestration tool, you can:

  • Trigger workflows based on events (new files, API calls, database changes)
  • Dynamically adjust processing based on data volume or system load
  • Easily modify schedules without rewriting your entire pipeline
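
For instance, here is a minimal sketch of flexible scheduling with Prefect, assuming the flow.serve() method and its cron argument; the flow, deployment name, and schedule are illustrative placeholders. Moving to an interval- or event-driven trigger later is a change to this one call, not a rewrite of the pipeline.

from prefect import flow, task


@task
def pull_numbers():
    print("Pulling the latest numbers...")


@flow
def weekly_report():
    pull_numbers()


if __name__ == "__main__":
    # Serve the flow on a cron schedule (6 AM every Monday) instead of
    # maintaining a hand-rolled cron job on a forgotten server.
    weekly_report.serve(name="weekly-report", cron="0 6 * * 1")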

Build resilience into your DNA

In data engineering, failure is not just an option—it's a certainty. APIs will go down. Data will be malformed. Servers will crash. But with orchestration, you can build pipelines that bend instead of break:

  • Automatically retry failed tasks with intelligent backoff strategies
  • Implement circuit breakers to prevent cascading failures
  • Create self-healing workflows that can recover from interruptions
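
For example, here is a minimal sketch of an intelligent backoff strategy, assuming Prefect's exponential_backoff helper from prefect.tasks; the endpoint URL and task name are placeholders.

import httpx
from prefect import task
from prefect.tasks import exponential_backoff


@task(retries=3, retry_delay_seconds=exponential_backoff(backoff_factor=10))
def fetch_metrics(url: str = "https://api.example.com/metrics") -> dict:
    # A non-2xx response raises an exception, which Prefect treats as a
    # failed attempt and retries with exponentially increasing delays.
    response = httpx.get(url)
    response.raise_for_status()
    return response.json()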

Your self-healing workflows will be humming along, handling issues before they become problems. Friends don’t let friends manually restart failed jobs over and over, ok?

Yes, even small tasks need orchestration

I can hear you asking, "I've got this simple script that runs once a day. Why bother with orchestration?"

Here's the thing: in data engineering, there's no such thing as "set it and forget it." Even the smallest tasks can spiral into complexity faster than you can say "cron job." That's where orchestration comes in, saving you from:

Silent failures

With orchestration, you'll know exactly when and why something goes wrong.

Dependency hell

That "simple" script now relies on three different data sources. Orchestration helps you manage these dependencies cleanly without the spooky spaghetti code.

Scaling headaches

As your data grows, so does the processing time required. Orchestration empowers you to scale using well-defined and tested primitives without obtaining AWS certifications.

Debugging nightmares

Instead of digging through log files like it's 1999, orchestration gives you a clear picture of what happened, when, and why.
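
As a small illustration, assuming Prefect's get_run_logger (the task body and record shapes are contrived), every log line below is captured and attached to the specific task run that emitted it, so you can see what happened, when, and why without grepping server logs.

from prefect import flow, task, get_run_logger


@task
def clean_records(records: list[dict]) -> list[dict]:
    logger = get_run_logger()
    logger.info("Received %d records", len(records))
    cleaned = [r for r in records if r.get("amount") is not None]
    logger.warning("Dropped %d malformed records", len(records) - len(cleaned))
    return cleaned


@flow
def nightly_pipeline():
    clean_records([{"amount": 42}, {"amount": None}])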

Remember, every data pipeline you build is a time bomb of complexity waiting to explode, and orchestration is your bomb squad.

How do we adopt orchestration when we start from a series of scripts, API calls, task runners, and glue code?

The best-kept secret of teams who successfully orchestrate their data pipelines? They do not struggle to do so.

They choose a robust and easy-to-use framework that meets them where they are, allowing them to code using languages they already know and turn a simple Python function into a cloud-native, distributed, retriable task using decorators:

from prefect import flow, task


@task
def my_task():
    print("Hello, I'm a task")


@flow
def my_flow():
    my_task()

The best orchestration tools are obsessively focused on user experience. They only ask you to import a library and sprinkle some decorators into your existing functions to convert them from a headache to a resilient cloud-native pipeline.

The most successful data engineering teams remain focused on their tasks' logic, correctness, and efficiency. They offload the tedium and minutiae of cloud-based automation, retries, alerts, and monitoring to the orchestration framework.

With the mystique of high-performing data engineering teams roundly dispelled, let’s examine the core components they include in their production data pipelines.

The components of a well-orchestrated data pipeline

Intelligent workflow coordination

Good orchestration frameworks expose simple primitives you can wrap around your existing Python code: a task, a discrete unit of work, and a flow, a series of tasks that runs on a schedule or in response to some condition, like a webhook, an upstream message, a build, or an end-user action.

If the solution you’re using today to group related tasks and workflows requires anything more than a few extra lines of code, drop it. Automation doesn’t have to be difficult.

Resilient Systems

Good orchestration systems simplify making your code robust to failure scenarios. If your code is littered with ugly HTTP response code parsing, that’s a good sign you have insufficient retry and error handling primitives.

Of course, you could keep manually parsing every non-standard API response you’re dealing with throughout your pipeline code. (Someone on your team may enjoy swooping in to devsplain the obscure timeouts in CloudWatch logs.)

Or, you could just annotate your existing code with a clear directive controlling how many times it should be retried and how:

import httpx
from prefect import task


@task(retries=2, retry_delay_seconds=5)
def get_data_task(
    url: str = "https://api.brittle-service.com/endpoint"
) -> dict:
    response = httpx.get(url)

    # If the response status code is anything but a 2xx, httpx will raise
    # an exception. This task doesn't handle the exception, so Prefect will
    # catch it, retry up to two more times, and then consider the task run failed.
    response.raise_for_status()
    return response.json()

Now we’re cooking.

Scalability

Successful data engineering teams focus on extracting insights and competitive advantages from their proprietary data, not manually provisioning or maintaining EC2 instances.

Good orchestration frameworks can also assist you here—effortlessly scaling your workloads in the cloud according to simple descriptions in your task and flow annotations.
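
As a rough sketch of what that looks like, assuming Prefect's .submit() interface and its default concurrent task runner (process_partition and the partition names are hypothetical), fanning work out is a matter of submitting tasks, not provisioning machines.

from prefect import flow, task


@task
def process_partition(partition: str) -> int:
    # Imagine expensive per-partition work here
    return len(partition)


@flow
def process_all(partitions: list[str]) -> int:
    # .submit() hands each task run to the flow's task runner, so partitions
    # are processed concurrently. Swapping in a distributed runner (for
    # example via the prefect-dask or prefect-ray packages) is a change to
    # the @flow decorator, not to the business logic.
    futures = [process_partition.submit(p) for p in partitions]
    return sum(f.result() for f in futures)


if __name__ == "__main__":
    process_all(["2024-07-01", "2024-07-02", "2024-07-03"])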

The other key thing that separates winning data teams from those perpetually under pressure from their own pipelines? Attitude.

The most successful data teams turn pain points into opportunities for automation. If you have the right tools, being a small team of data scientists or even a solo first hire is not a reason to panic but an opportunity to leverage automation.

Effective automation pays off exponentially. The sooner you implement it, the greater the performance gains you can realize by freeing up your time to do what human scientists do best: creatively tackle large and complex problems.

It’s essential to offload the sticky bits, such as API call timeouts, error logging, instance restarts, and upstream system downtime, to a comprehensive solution that can return hours of focused work to your weekly schedule.

When you build resilient systems this way, you earn confidence in your outputs and trust from other teams in your organization.

This is the path out of regular firefighting, burnout, low morale, and poor retention, and toward the stability the best teams use to innovate and push the business forward.

The most effective data teams sleep well at night because their systems are observable

Once your pipeline is humming along in production, earning you accolades, new business, and hockey-stick-shaped revenue growth, you still need to keep tight tabs on exactly how it’s performing to identify bottlenecks and get early warnings about failures or data quality issues.

Effective orchestration frameworks make implementing this kind of observability effortless. Imagine decomposing your complex production pipeline into several Python functions decorated with task and flow and then calling a visualize method to see your entire system rendered as a graph.
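
Here is a minimal sketch of that idea, assuming Prefect's Flow.visualize() method, which renders the task graph without executing the tasks (it requires Graphviz to be installed; the task names are placeholders).

from prefect import flow, task


@task
def extract() -> list[dict]:
    return []


@task
def transform(rows: list[dict]) -> list[dict]:
    return rows


@task
def load(rows: list[dict]) -> None:
    print(f"Loaded {len(rows)} rows")


@flow
def etl():
    load(transform(extract()))


if __name__ == "__main__":
    # Renders the flow's task graph as an image instead of running the tasks
    etl.visualize()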

The happiest data teams count on this level of interpretability in their systems to help themselves, their partner teams, and their organizations quickly get up to speed with complex production workflows that may comprise thousands of individual nodes.

The next time you’re in an all-hands meeting and someone asks clarifying questions about a system you’re responsible for, imagine linking them to a fully rendered graph of the entire flow, along with a tasteful emoji, of course.

Prefect makes data pipelines resilient. Let us show you how, or try the product yourself.