Push and Pull Architecture: Event Triggers Vs. Sensors in Data Pipelines
Application data often needs to refresh in a matter of seconds, like when you're trying to reserve Taylor Swift concert tickets. Most data pipelines, by contrast, run in hourly batches at best. Some pipelines need to react to data changes in seconds, and for those pipelines schedules can be the antihero.
There are several ways to ensure your workflows get you the data you need when you need it. Running workflows on a fixed schedule often fails to meet business needs and wastes money. In this article, we'll take a deeper look at schedules and their alternatives. You'll learn the architectural differences between pulling tasks with a sensor and pushing them with a trigger, and you'll see when to consider a transition to event-driven pipelines to ensure you get your concert tickets on time.
Push or pull
In web development, a pull architecture is request-driven: the client sends the request, and the server responds accordingly. In the data engineering world, the client would be a data orchestrator sending requests to determine when to execute a particular task. Some orchestrators use sensors to send these requests to external services and handle their responses.
Sensors use a pull architecture. They continuously occupy compute resources and halt the execution of a pipeline until they receive the response they are waiting for.
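To make that concrete, here's a minimal sketch of the idea in Python. It isn't any particular orchestrator's API; a sensor boils down to a blocking poll loop, and the check it waits on is passed in as a hypothetical callable:

```python
import time
from typing import Callable


def wait_for(
    condition: Callable[[], bool],
    poke_interval: float = 60.0,
    timeout: float = 3600.0,
) -> None:
    """Block until condition() is true, re-checking every poke_interval seconds.

    The worker running this loop stays occupied the entire time, which is
    the defining cost of a pull-based sensor.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return  # condition met: downstream tasks can now run
        time.sleep(poke_interval)
    raise TimeoutError("sensor timed out before its condition was met")


# Usage (with a hypothetical check): block this pipeline until new rows appear.
# wait_for(lambda: table_has_new_rows(since_id=42), poke_interval=30)
```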
A push architecture is event-driven: the server pushes data to clients as updates become available. With our data orchestrator acting as a client, instructions for which task to execute next would be pushed to the client by a data asset when it changes state.
Event-driven pipelines use a push architecture. To implement one, an orchestrator needs the ability to receive events from other systems and translate them into a form it can act on.
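Here's a hedged sketch of that receiving side, using Flask purely for illustration (any HTTP framework works). The endpoint path, payload fields, and `dispatch` helper are all made up:

```python
from flask import Flask, request

app = Flask(__name__)


@app.route("/webhooks/warehouse", methods=["POST"])
def receive_warehouse_event():
    """Accept a pushed event and translate it into the orchestrator's vocabulary."""
    payload = request.get_json()
    # Map the external system's fields onto an internal event the
    # orchestrator knows how to act on (names here are illustrative).
    internal_event = {
        "event": f"warehouse.query.{payload.get('query_type', 'unknown').lower()}",
        "resource": {"id": payload.get("query_id")},
    }
    dispatch(internal_event)  # hand off to trigger matching / scheduling
    return "", 204


def dispatch(event: dict) -> None:
    """Placeholder for the orchestrator's trigger-matching step."""
    print(f"received event: {event['event']}")
```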
Sensors to pull or events to push data?
Say you need to monitor your organization's Snowflake utilization and build two data pipelines to do so: one that tracks total queries made, and another that sends you a message every time one of those queries drops an existing table.
We'll use this simple example to show the differences between the two approaches.
Use sensors (pulling) to stay efficient without overloading systems
A pull-based, request-driven architecture is the best choice for the first example pipeline, which aggregates total Snowflake queries. Depending on the organization, scheduling this pipeline to run at regular intervals (every hour, for example) means it would run on nights and weekends when no one is executing queries. That's a waste of orchestration resources.
Even worse, unnecessarily running a data pipeline that increments a count after every query could cause a significant load on an orchestration system during peak hours. This could snowball into a need for unnecessarily complex process synchronization tools to guard against multiple queries triggering updates to a global count at the same time.
In contrast, with polling, an orchestrator can skip runs if no new queries are made, and act as a catalyst for micro-batch processing if many are executed in a short period; a sketch follows the pros and cons below. (Note: this involves some storage of state, but most orchestrators handle this with variables or data stores.) A pull architecture is useful when the work being polled is very dynamic and the downstream operations are not critical enough to require processing every event as it happens.
Pros of pull architecture: Avoids running no-op pipelines, more efficient than scheduling
Cons of pull architecture: Could miss critical events between polls
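As noted above, skipping runs and micro-batching require a little persisted state. Here's a minimal sketch, with `total_query_count` standing in for a real check (say, against Snowflake's QUERY_HISTORY view) and the counter held only in memory for illustration:

```python
import random
import time


def total_query_count(previous: int) -> int:
    """Stand-in for a real check, e.g. querying Snowflake's QUERY_HISTORY view."""
    return previous + random.randint(0, 10)  # fake, monotonically non-decreasing


def poll(interval_seconds: float = 5.0, cycles: int = 3) -> None:
    """Poll on an interval, skipping no-op runs and micro-batching bursts."""
    last_count = 0  # a real orchestrator would persist this in a variable or data store
    for _ in range(cycles):
        current = total_query_count(last_count)
        if current == last_count:
            print("no new queries -- skipping this run")
        else:
            # A burst of queries since the last poll collapses into one micro-batch.
            print(f"processing {current - last_count} new queries as one micro-batch")
            last_count = current
        time.sleep(interval_seconds)


if __name__ == "__main__":
    poll()
```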
Use event triggers (pushing) for lower latency alerts and dependent runs
A different approach is necessary for real-time issues, and dropping a table could bring entire systems down. A push-based, event-driven architecture is better for the second example, where we send an alert when a DROP TABLE query runs. Running a pipeline on a schedule to check for these queries could miss critical information between runs. Polling the warehouse may catch these occurrences faster than scheduled checks, but a pull-based architecture constantly checking for DROP TABLE queries would put unnecessary strain on Snowflake and could significantly increase warehousing spend to monitor a situation that may never occur.
Pushing this information from the server (in our case, Snowflake) to the client (a data orchestrator) allows automated actions to be taken in response to the event while reducing requests to Snowflake; see the sketch after the pros and cons below. Note that these latency and efficiency gains require the orchestrator to take on the added complexity of handling and translating incoming events.
Pros of push architecture: Capture and process each event as it happens
Cons of push architecture: Overhead in processing pushed events
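Continuing the sketch, the push side needs only a handler that runs once per delivered event, with no polling and no schedule. The event shape and `send_alert` hook are hypothetical:

```python
def handle_query_event(event: dict) -> None:
    """React to a single pushed query event; no polling, no schedule."""
    sql = event.get("query_text", "")
    if sql.strip().upper().startswith("DROP TABLE"):
        # Hypothetical alerting hook -- Slack, PagerDuty, email, etc.
        send_alert(f"DROP TABLE detected: {sql!r} by {event.get('user', 'unknown')}")


def send_alert(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real notification integration


# Usage: the orchestrator invokes the handler once per pushed event.
handle_query_event({"query_text": "DROP TABLE analytics.orders", "user": "svc_etl"})
```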
A hybrid architecture is also possible, and often found in complex systems.
Implement critical automation and alerting with Prefect
In Prefect, event-driven workflows and alerting are implemented with Automations and Webhooks. Automations let you configure actions that Prefect executes automatically in response to an event. Webhooks expose a unique URL that receives events from external systems and translates them into Prefect events that can trigger an Automation. For example, an alert or a workflow can fire when a file lands in an S3 bucket, a database is updated, or a ticket order arrives.
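As a rough sketch of how this wires together (assuming a recent Prefect 3.x release; the event name and the webhook that would emit it are illustrative), a served flow can declare a trigger so it runs whenever a matching event arrives:

```python
from prefect import flow
from prefect.events import DeploymentEventTrigger


@flow(log_prints=True)
def alert_on_drop_table(query_text: str):
    print(f"ALERT: drop statement detected -> {query_text}")


if __name__ == "__main__":
    # Serve the flow with an event trigger: when a webhook translates an
    # incoming Snowflake notification into an event named (for example)
    # "snowflake.query.drop-table", Prefect starts a run automatically.
    alert_on_drop_table.serve(
        name="drop-table-alert",
        triggers=[
            DeploymentEventTrigger(
                expect={"snowflake.query.drop-table"},  # illustrative event name
                parameters={"query_text": "{{ event.resource.id }}"},
            )
        ],
    )
```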
Ultimately, the question becomes: is your data pipeline reactive or proactive? While scheduled pipelines have their place, both push and pull architectures can unlock more efficient development. The choice between a pull or push architecture hinges on the specific needs of your data pipeline. By considering the trade-offs between latency, efficiency, and implementation complexity, you can select the approach that best aligns with your data’s needs. Prefect is optimized for the needs of all sorts of critical data work, empowering you to build resilient workflows in any situation.
Let us show you how event-triggered pipelines can help you optimize your compute cost in Prefect - book a time to chat with us.