
How One Education Travel Company Unblocks Data Science Teams With Prefect

September 05, 2024
Taylor Curran
Senior Sales Engineer
Mike Grabbe
Principal Data Engineer, EF Education First

If you or your kids ever traveled with a school group, there’s a good chance it was organized through EF Educational Tours, a leader in travel-based learning that creates educational tour packages for teachers and students. A large company with over 10,000 employees, EF works with a massive amount of travel data to respond to real-time changes in its industry, whether airline price fluctuations or transit strikes in Europe.

EF faced a familiar challenge: many data scientists and analysts, spread across many departments, all need to work with this data, but the data platform team has far fewer data engineers. As a result, a data scientist wanting to integrate a new data source couldn’t always get help from a data engineer.

Mike Grabbe, Principal Data Engineer on EF’s Data Platform team, explained, “The data analysts and data scientists could write a Python script to fetch data, and maybe they had access to a server and could upload their script and assign it a cron schedule.” But beyond that, they quickly ran into problems implementing data-focused scripts on their own, specifically around task orchestration:

  • 📁Where will the script be deployed to run daily?
  • 🚨How will it alert when it fails?
  • 🔄Will it retry if it fails?
  • 🤒How is its health monitored?
  • 🌊When it completes successfully, how does this trigger the refresh of downstream dependencies (models, reports)?
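Without an orchestrator, each of these concerns becomes plumbing that every script author has to hand-roll. A minimal sketch of that burden, assuming a hypothetical `fetch_data` ingestion function (the alerting here is just a print; a real script would page someone or post to Slack):

```python
import time

def fetch_data() -> list:
    # Hypothetical stand-in for a data scientist's ingestion logic.
    return [{"route": "BOS-LHR", "price": 412.50}]

def run_with_retries(task, max_retries: int = 3, delay_seconds: float = 1.0):
    """Hand-rolled retry loop with a crude failure alert -- the kind of
    boilerplate every standalone script needs again and again."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_retries:
                # Crude "alerting": in practice this would notify a human.
                print(f"ALERT: {task.__name__} failed after {attempt} attempts: {exc}")
                raise
            time.sleep(delay_seconds)

rows = run_with_retries(fetch_data)
```

And even with all of that written, the script still has no answer for scheduling, health monitoring, or triggering downstream refreshes.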

EF’s journey with task orchestration

EF’s data platform team was responsible for thinking about these kinds of problems, and they had implemented a variety of solutions in the past. Before 2016, EF used SQL Server as part of its data platform and orchestrated everything through SQL Server jobs. This had several downsides: it was step-based, could only handle serial processing, and SQL jobs were a tool only database admins could use.

In 2016, EF made the leap to the cloud and started using AWS Lambdas and AWS Step Functions to orchestrate data integrations. “It was great to leave servers behind and go serverless, but we found there was a lot of boilerplate config we needed to include with each integration,” Grabbe said. “Complex jobs required even more complex nested JSON configs, and those tools really only work well when everything you do is inside the AWS service ecosystem. But data scientists and report builders were still locked out, and still dependent on my team.”

Empowering data scientists to self-serve with Prefect

Two years ago, EF held a task orchestrator selection process, and the data platform team selected Prefect. Prefect proved able to manage every task that the data engineering team was responsible for, and crucially, it allowed all orchestration to be done through Python with minimal overhead—a must-have for EF’s team.

“Prefect has become the brain of my data engineering team, but it's also a tool accessible to and used by our data scientists,” Grabbe said. “They can self-deploy their scripts, run them on a schedule—with retries, a slick user interface, and alerting built in. Whatever they have Python do, Prefect can schedule and run it for them.” The power of a self-service data orchestration platform means shipping models and reports faster, satisfying requests from business users.

Prefect offers both observability and orchestration features:

  • Orchestration: Gets your Python code off your laptop, to run when and where you need it to run with the proper timing, requirements, and visibility. At EF, the team sets up dedicated work pools for orchestration to meet their strict security needs, but Prefect workflows can also run as a simple local process with one a single method.
  • Observability: Ensures that Python code is actually running when you expect it to run. This includes workflow metrics, alerts, and monitoring infrastructure hosted by Prefect so that users can trust their workflows.

This model empowers data scientists to work independently and lets the data platform team focus on what matters most.

Use case: Querying historical data

In EF’s data stack, around 50 Fivetran connectors run constantly, bringing live raw data from various EF backend systems into Snowflake. The problem: existing data is overwritten when new data arrives, so backtesting models against previous data is nearly impossible. For training and backtesting models, data scientists at EF need access to past data, sometimes years old. Unfortunately, Snowflake’s Time Travel feature only goes back 90 days.

To solve this, EF’s data platform team used Prefect to stitch together different tools in their stack:

Every night at midnight, a Prefect flow runs that identifies all of EF’s Fivetran schemas in Snowflake, builds schema clone commands in SQL, and executes them. All schema clones are created in a separate database.

  • The source Fivetran schemas update throughout the day, but the cloned schemas are frozen in time as of midnight, giving a consistent basis for reporting and analysis.
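The clone step above boils down to building one SQL statement per schema. A sketch, assuming hypothetical database and schema names (in practice the schema list would come from querying Snowflake’s metadata, and Snowflake clones are zero-copy, so the nightly snapshot costs nothing until data diverges):

```python
def build_clone_commands(schemas: list, target_db: str = "CLONE_DB") -> list:
    """Build Snowflake CLONE statements for each Fivetran schema.
    CREATE OR REPLACE makes the nightly run idempotent: yesterday's
    clone is swapped out for a fresh one as of midnight."""
    return [
        f"CREATE OR REPLACE SCHEMA {target_db}.{schema} CLONE RAW_DB.{schema};"
        for schema in schemas
    ]

# Hypothetical Fivetran schema names, for illustration only.
commands = build_clone_commands(["FIVETRAN_SALESFORCE", "FIVETRAN_STRIPE"])
```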

After the schema-cloning flow completes, a second Prefect flow runs that orchestrates a dynamically configured dbt snapshot command.

  • EF runs queries against their Fivetran schemas in Snowflake to extract all the table metadata needed to power the dbt snapshot command: table name, column names, and primary keys.
    • dbt snapshot converts a source table into a Type-2 Slowly Changing Dimension table so that prior record states are never lost.
  • Prefect generates the dbt project configuration and then executes the dbt snapshot command.
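Generating the snapshot configuration from table metadata can be sketched as templating a dbt snapshot block per table (the database, schema, and table names here are hypothetical; dbt’s `check` strategy compares the listed columns and writes a new row whenever they change, producing the Type-2 SCD history):

```python
def render_snapshot(table: str, primary_key: str, columns: list) -> str:
    """Render a dbt snapshot block from table metadata extracted
    from Snowflake: table name, primary key, and column names."""
    check_cols = ", ".join(f"'{c}'" for c in columns)
    return (
        f"{{% snapshot {table.lower()}_snapshot %}}\n"
        f"{{{{ config(target_schema='snapshots', unique_key='{primary_key}', "
        f"strategy='check', check_cols=[{check_cols}]) }}}}\n"
        f"select * from RAW_DB.FIVETRAN_SALESFORCE.{table}\n"
        f"{{% endsnapshot %}}"
    )

# Hypothetical table metadata, as it might come back from Snowflake.
snapshot_sql = render_snapshot("OPPORTUNITY", "ID", ["STAGE", "AMOUNT"])
```

Once every table has a rendered snapshot block, running `dbt snapshot` appends the day’s changes to the history tables.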

Tools such as SQL, Fivetran, Snowflake, and dbt all have their purpose in the modern data stack—and Prefect has helped EF unlock their best qualities and easily stitch them together so data scientists could query years of historical data without racking up a huge Snowflake bill. The data scientists building the models are also using Prefect to quickly get their work into production with simple Python scripts.

“Our job is to provide data analysts and data scientists the data they need to create data products that drive business value,” Grabbe said. “And beyond that, we focus on enabling our data scientists by removing roadblocks and giving them powerful tools that make their jobs easier. Prefect is allowing us to achieve these objectives.”

Curious how companies like EF Educational Tours, Cash App, and others build resilient data platforms with Prefect? Book a demo with our team.