Engineering

Stop Making Your Data Team Learn Kubernetes

January 16, 2025
Brendan O'Leary
VP, Developer Relations

As a developer (or "DevOps person") who's spent years bridging the gap between engineering teams and platform operators, I've noticed a consistent pattern in how successful organizations structure their operations teams - regardless of what code they're shipping. Whether those teams are building microservices, training ML models, or transforming data, the organizations that thrive have cracked a common approach to supporting them.

🦺 Aside: I once inherited a system using Jenkins to manage Python data pipelines, requiring updates in the code, Jenkins config, and metadata service for any change—highlighting how the wrong tool adds complexity.

It's not about having the fanciest tools or the most cutting-edge tech stack. It's about finding that sweet spot where platform teams empower others without drowning them in infrastructure complexity. And nowhere is this challenge more evident than with data teams, who too often end up stuck learning Kubernetes when they should be focusing on data science and analysis.

Before we dive into why data teams often get the short end of the infrastructure stick, let's step back and look at the bigger picture. I've noticed three core components that every successful platform team gets right - regardless of who they're supporting. Understanding these fundamentals will help explain why data teams in particular tend to suffer from common platform anti-patterns, and more importantly, how we can fix them.

Think of platform team success like a playground with a fence: kids can play freely and creatively within bounds that keep them safe. I’m going to start by explaining the fence, then get to how to enable the “kids” to really have fun later on.

The Core Components 🔍

Every modern engineering team - whether it's made up of software engineers, data engineers, ML engineers, or really anyone who touches software - needs three fundamental pieces:

  1. A place to store and ship code & data
  2. Compute ("infrastructure") to execute the code on
  3. Tools to orchestrate, monitor, and report on operations of the code

At first glance, this three-part framework might seem obvious - almost too simple. But in my experience working with teams across different domains, it's not the components themselves that create complexity - it's how we divide responsibility for them.

The Traditional Team Structure Challenge ⛩

Traditionally, these responsibilities might be divided among different teams:

  • Platform teams handling infrastructure and compute resources
  • Data teams focusing on business analytics and insights

But there's a catch. Even with dedicated platform teams, the "last mile" of data team requirements often falls through the cracks. Why? Because most platform teams are optimized for containerized applications or long-running services – not what data teams actually need. And what do data teams need? They need infrastructure that can handle:

  • Batch jobs that run on schedules but need different amounts of compute each time
  • Resource-intensive operations that burst CPU and memory usage for short periods
  • Dynamic workflows where one job kicks off varying numbers of downstream jobs
  • Efficient retry mechanisms for when external data sources fail
  • Cost optimization for expensive operations like ML training runs or GPU tasks

❗ And most importantly, they need all this without having to become infrastructure experts

Traditional platform tooling, built around the concept of always-on services with predictable resource usage, just doesn't fit these patterns. It's like trying to use a hammer when you really need a screwdriver - sure, you might eventually get the job done, but it's going to be messy and inefficient.

The Domain Dilemma 😓

Let's be real - every specialized engineering team has their own unique challenges that standard infrastructure just wasn't built for. Take security teams (I see you, AppSec folks 👋) - they need serious audit logging and compliance tracking. Or ML teams wrestling with GPU orchestration. And don't get me started on gaming teams trying to wrangle massive binary assets through their pipelines (git was NOT made for this). Instead of playing freely in their data sandbox, teams end up spending their time building and maintaining the playground infrastructure itself.

Let’s zero in for a minute on data teams running traditional batch ETL, because this is one place where the challenge is particularly acute. When data teams find themselves without purpose-built orchestration solutions, they typically face three less-than-ideal options:

1. Build a Custom Scheduling System 🏗️

import time

# What starts as a "simple" scheduling solution...
# (We've all been here, right?)
# jobs_queue, is_time_to_run, and run_job are placeholders for your own code
def custom_scheduler():
    while True:  # Famous last words
        for job in jobs_queue:
            if is_time_to_run(job):
                try:
                    run_job(job)
                except Exception:
                    # Uh oh, now we need error handling
                    # And retry logic
                    # And alerting
                    # And monitoring
                    # And...you get the idea
                    pass
        time.sleep(60)  # poll again in a minute

What starts as a "quick fix" inevitably grows into a complex internal tool requiring dedicated maintenance. It always starts innocently enough. You set up a few cron jobs to run your data pipelines. But then reality hits: your jobs start failing because they're all hitting the database at midnight. No problem, you think - you'll just build a simple queue to throttle the jobs.


That works for a while, until you realize you need to monitor that queue length to make sure things aren't backing up. Then jobs start failing, so you add some retry logic. Of course, you need alerts when things go wrong, and some way to track resource usage so you can debug the failures. Before you know it, you're also handling infrastructure failures and managing complex job dependencies.


What started as a few cron entries has now ballooned into thousands of lines of homegrown scheduling logic. I've seen teams end up with entire systems that need their own dedicated maintenance team - just to handle the infrastructure that was supposed to make their lives easier. While this might sound extreme, it's a pattern I've watched play out over and over. The worst part? Every team ends up building the same complex system, just slightly differently.

2. Embrace Complexity in Code 🤯

# Data processing code gets buried under infrastructure concerns
import concurrent.futures

def process_data():
    with concurrent.futures.ThreadPoolExecutor() as executor:
        tasks = []
        for chunk in data_chunks:
            task = executor.submit(
                process_chunk,
                chunk,
                retries=3,
                backoff=exponential_backoff,
                resource_limits=compute_constraints,
            )
            tasks.append(task)

        # Now handle partial failures, monitoring, logging...

Think about what data scientists actually need to do: process large datasets in parallel to train models faster, or analyze data across multiple sources simultaneously. It sounds straightforward until you hit scale. Suddenly, you're processing millions of records that don't fit in memory, or you need to parallelize across multiple machines, or your source data occasionally times out.

That simple data processing logic balloons into complex infrastructure code. Data scientists end up spending weeks wrestling with parallel processing frameworks and retry logic instead of analyzing data. ML engineers who should be tuning models are instead debugging distributed computing issues. Oh, and by the way, it costs real money every time a failed training run has to be retried on expensive hardware.

The worst part? Every team ends up rebuilding these same patterns from scratch, usually less efficiently than battle-tested solutions.

3. Adapt Ill-Fitting Tools 🔧

Teams have tried to adapt everything from Jenkins to Kubernetes CronJobs to Airflow to handle data workflows. And yes, these tools are powerful - Kubernetes can scale pods, Jenkins can handle complex pipelines, Airflow understands DAGs. But they force data teams to think in infrastructure terms instead of data terms. Let me show you what I mean.

With Kubernetes CronJobs, your data team needs to:

  1. Write YAML to define resource limits without knowing their actual memory needs
  2. Manually implement retry logic for failed jobs
  3. Figure out how to pass data between pods
  4. Debug pod logs instead of seeing their data pipeline state

With Jenkins, they need to:

  1. Split data pipelines into multiple jobs for parallel processing
  2. Maintain separate data logic and pipeline configuration
  3. Handle data sharing between pipeline steps

Even Airflow, built specifically for data pipelines, has challenges:

  1. Teams need to learn Airflow-specific DAG syntax and operators
  2. Dynamic workflows require complex workarounds
  3. Basic Python operations often require custom operators
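
To make that first point concrete, here's roughly what a trivial daily job looks like as an Airflow DAG (a minimal sketch assuming Airflow 2.x-style imports; the DAG and task names are illustrative) - the scheduling scaffolding already rivals the data logic:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def transform():
    # The actual data logic is a fraction of the file
    print("transforming records...")

with DAG(
    dag_id="daily_transform",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="transform", python_callable=transform)

And that's the happy path - dynamic fan-out or passing data between tasks means reaching for XComs, custom operators, or both.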

These tools weren't built with modern data workflows in mind. When your data scientist wants to do something simple like "process this dataset in parallel, retry failures, and only run after validation" - they spend more time configuring infrastructure than writing data logic.

The real cost of these approaches isn't just in the code - it's in the ongoing maintenance burden, the cognitive overhead for data teams, and the opportunity cost of not focusing on actual data problems.

A Better Way Forward 🚀

The solution? It's about rethinking how we approach data platform architecture. Instead of forcing data teams to adapt to traditional infrastructure patterns, we need orchestration tools that natively understand both the infrastructure and data science worlds.

Here's what this looks like in practice:

# Instead of this (plus a bunch of Kubernetes YAML)
def data_process():
    # 100 lines of infrastructure setup
    # Set up logging and monitoring
    logger = setup_cloudwatch_logging()
    metrics = setup_prometheus_metrics()

    # Configure infrastructure settings
    memory_limit = get_memory_config()
    cpu_limit = get_cpu_config()
    timeout = get_timeout_config()

    # Set up retry logic and error handling
    retries = 0
    max_retries = 3
    backoff = ExponentialBackoff(initial=1, multiplier=2)

    # ........

    # 10 lines of actual data logic

    pass

Instead, we can have a work pool already set up to handle and connect all of that, abstracting away the infrastructure and the logging, and allowing us to just choose to run our data flow there:
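
Here's a minimal sketch of what that can look like with Prefect (assuming Prefect 2.x-style decorators; the task bodies, work pool name, and image are illustrative):

from prefect import flow, task

@task(retries=3, retry_delay_seconds=30)
def extract(source: str) -> list:
    # In real life: pull from an API, a warehouse, or object storage
    return [{"source": source, "value": 42}]

@task
def transform(records: list) -> list:
    # The actual data logic lives here, not infrastructure plumbing
    return [{**r, "value": r["value"] * 2} for r in records]

@flow(log_prints=True)
def data_process(source: str = "example-source"):
    records = extract(source)
    results = transform(records)
    print(f"processed {len(results)} records")
    return results

if __name__ == "__main__":
    # Run it locally while developing...
    data_process()
    # ...then hand it to a work pool the platform team already manages
    # (names here are illustrative):
    # data_process.deploy(name="daily", work_pool_name="k8s-prod",
    #                     image="my-registry/data-process:latest")

Retries, logging, and scheduling live in the decorators and the work pool configuration rather than in the body of the data code - which is exactly the separation described next.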

This is what a well-designed playground looks like - all the safety equipment is already installed and maintained, letting the kids focus on what they do best: play and explore.

The Ideal Separation of Concerns ✨

When implemented correctly, this approach creates a clean separation of responsibilities:

  • Data teams focus on writing efficient, powerful data processing code
  • Platform teams manage orchestration and infrastructure
  • Both teams get what they need without compromising

👌 Now data teams can focus on writing efficient, powerful data processing code that they can monitor and scale

Finding The Right Balance ⚖️

Just like a great playground needs both sturdy equipment and room to play, great platform tooling needs both robust infrastructure and space for innovation. When organizations get this balance right, you see the same kind of energy and creativity you'd see in a well-designed playground - teams confidently exploring and building within safe boundaries.

The impact shows up in all the key metrics platform teams care about: significant drops in infrastructure-related support tickets as data teams become self-sufficient, faster recovery times when things go wrong because teams can focus on fixing their data logic instead of debugging infrastructure, and - most tellingly - data scientists spending less time wrestling with infrastructure and more time doing actual data science. The key is choosing tools that respect this natural separation of concerns while providing the flexibility both teams need to succeed. But there's more to this story than just separation of concerns.

Engineers face a constant tension between rapid iteration and maintaining stability. The right abstractions don't just separate concerns - they create safe spaces for innovation within guardrails. Similarly, data teams should be able to experiment and iterate rapidly without worrying about taking down production systems or needing to understand every detail of the underlying compute layer.

Managing Complexity 🧠

Here's a truth that often gets overlooked: not all complexity can be eliminated, but it can be properly placed. When we talk about "managing complexity," what we're really discussing is:

  • Moving infrastructure complexity to teams equipped to handle it
  • Letting domain experts focus on their core challenges
  • Creating intuitive interfaces between these layers

Think of it like playground design - the complex engineering of safe equipment is handled by experts, while kids get simple, intuitive interfaces (slides, swings, climbing frames) that let them focus on play and exploration.

The Path Forward 🧗

The future of data platforms (and, really, all specialized engineering platforms) isn't about forcing domain experts to become infrastructure experts or platform teams to become domain experts. It's about creating an environment where:

  • Each team can excel at what they do best
  • Innovation can happen safely and quickly
  • Complexity lives where it belongs
  • Tools bridge gaps naturally without creating new problems

At the end of the day, this isn't about eliminating complexity - we all know that's impossible. It's about building and maintaining a playground where everyone can safely do their best work. Your data scientists should be free to play in their data sandbox - doing data science, not spending their nights installing safety equipment or wrestling with Kubernetes configs at 2 AM. Perhaps Henning Holgersen, data engineering consultant at Webstep, put it best: “convoluted isn’t how I want to spend my time.”

🦆 We might be biased, but we believe Prefect is a great set of abstractions to help manage complexity and put teams in control of what they need.

If you want to learn more, join us next week - January 22nd and 23rd - for Prefect’s Winter Summit.

We’ll be talking about the future of workflow orchestration and how a Pythonic orchestrator enables data teams and platform teams to both enjoy their time on the playground.