
In the ever-evolving world of data engineering, one truth stands out: change is constant and failure is inevitable. But what if we could build systems that adapt to unexpected failures gracefully, and embrace these situations as a pathway to resilience? Let's explore how principles from resilience engineering can improve how we build data pipelines.
Before diving into resilience patterns, we need to challenge a common misconception about software design. Many believe that design is about accomplishing tasks or meeting requirements. However, consider this thought experiment: if you had to build an application with completely known, unchanging requirements that would never need modification, would you need design patterns at all? You could simply brute force the solution without concern for structure, architecture or readability.
This reveals a crucial insight: software design isn't about the present—it's about the future. As Sandi Metz eloquently puts it, "practical design does not anticipate what will happen. It merely accepts that something will, and that in the present you cannot know what."
In data engineering, change isn't just common—it's relentless. Consider the typical scenarios: upstream schemas drift, data volumes grow by orders of magnitude, source APIs change or disappear, and business requirements shift mid-project.
As Heraclitus might say if he were a data engineer: "No person ever steps in the same data stream twice, for it's not the same stream, and they're not the same person."
Resilience engineering emerged from studying major industrial disasters like Three Mile Island and the Challenger explosion. Two fundamental principles from Charles Perrow's Normal Accident Theory (1984) are particularly relevant: interactive complexity (components interact in unplanned, unforeseen ways) and tight coupling (effects propagate quickly, leaving little slack to intervene). Perrow argued that systems exhibiting both properties will inevitably suffer "normal accidents."
This field continued to evolve through the study of high-reliability organizations in the 1990s and entered software engineering in the 2000s, pioneered by Netflix's chaos engineering and Google's site reliability engineering practices.
A system's reliability isn't just the sum of its parts. Two reliable components can create an unreliable system, while unreliable components can sometimes combine to create reliability. Think of a memory-leaking application that becomes reliable through scheduled Kubernetes restarts.
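The scheduled-restart trick can be expressed as a small piece of Kubernetes configuration. The sketch below is illustrative only: the `leaky-etl` deployment name, the schedule, and the `restart-bot` service account are hypothetical, and the service account would need RBAC permission to patch deployments.

```yaml
# Illustrative sketch: restart a hypothetical leaky-etl Deployment nightly,
# so its memory leak never reaches the point of failure.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-leaky-etl
spec:
  schedule: "0 3 * * *"  # every night at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: restart-bot  # hypothetical; needs RBAC rights to patch deployments
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl
              command: ["kubectl", "rollout", "restart", "deployment/leaky-etl"]
```

The point is not that restarts are elegant, but that reliability is a property of the whole system, scheduler included.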
When building data pipelines, we can't predict every failure mode. However, we can implement four key strategies to handle unexpected issues gracefully.
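One widely used building block for absorbing transient failures is retrying with exponential backoff and jitter. A minimal Python sketch (the function names and numbers here are illustrative, not from the talk):

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=0.05):
    """Call fn(); on failure, back off exponentially (with jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Jitter spreads retries out, avoiding synchronized retry storms.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))

# Hypothetical flaky extract step: fails twice, then succeeds.
calls = {"n": 0}
def flaky_extract():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient upstream hiccup")
    return ["row1", "row2"]

result = with_retries(flaky_extract)
print(result)  # ['row1', 'row2'] (after two retries)
```

Retrying buys resilience against transient faults, but note the escape hatch: after `max_attempts`, the failure is surfaced rather than hidden.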
Systems should be designed to bend rather than break when reaching their limits. This means balancing two competing principles.
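As a concrete illustration of bending rather than breaking, consider a bounded buffer that sheds the oldest events under load and records what it dropped, instead of growing without bound or crashing. This is an illustrative sketch, not a pattern from the talk:

```python
from collections import deque

class SheddingBuffer:
    """Bounded buffer that degrades gracefully under load.

    When full, it evicts the oldest event and counts the loss, so the
    pipeline bends (with measurable degradation) rather than breaks.
    """
    def __init__(self, capacity):
        self.events = deque(maxlen=capacity)  # deque evicts oldest when full
        self.dropped = 0

    def push(self, event):
        if len(self.events) == self.events.maxlen:
            self.dropped += 1  # record the degradation instead of failing
        self.events.append(event)

buf = SheddingBuffer(capacity=3)
for i in range(10):
    buf.push(i)

print(list(buf.events), buf.dropped)  # [7, 8, 9] 7
```

The `dropped` counter matters as much as the buffer itself: graceful degradation you can't observe is just silent data loss.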
The "irony of automation" (Lisanne Bainbridge, 1983) is that automating the routine work makes the remaining human role harder, not easier: operators are called on only for the rare, complex failures the automation couldn't handle, precisely when their hands-on skills have atrophied. Automation also adds its own complexity and failure modes. To counter this, keep humans meaningfully in the loop: make automated decisions observable, provide manual overrides, and escalate to a person when the automation runs out of options.
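One way to keep automation honest is to let it give up visibly: after a bounded number of attempts, failed records go to a dead-letter list for human review rather than being retried forever. A sketch (the names are illustrative, not from the talk):

```python
def process_batch(records, transform, max_attempts=2):
    """Apply transform to each record; route repeated failures to a
    dead-letter list for human review instead of retrying forever."""
    succeeded, dead_letter = [], []
    for record in records:
        for attempt in range(max_attempts):
            try:
                succeeded.append(transform(record))
                break
            except Exception as exc:
                if attempt == max_attempts - 1:
                    # Automation gives up visibly: a human decides what's next.
                    dead_letter.append((record, str(exc)))
    return succeeded, dead_letter

ok, dlq = process_batch(["1", "2", "oops", "4"], int)
print(ok)   # [1, 2, 4]
print(dlq)  # the failed record and its error message, for human review
```

The dead-letter list is the manual override made concrete: the pipeline keeps flowing, and a person gets exactly the cases the automation couldn't resolve.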
As Grace Hopper famously warned, "The most dangerous phrase in the English language is 'we've always done it this way.'" Building resilient data pipelines isn't about preventing all failures—it's about creating systems that can adapt, recover, and learn from failures.
The key is to embrace change as a constant companion rather than an unwelcome guest. By applying principles from resilience engineering, we can build data pipelines that don't just survive change—they thrive on it.
View the recording of the talk below.