Engineering
July 23, 2024

What Happened When We Replaced Airflow for DAG Orchestration?

Lisa Huynh
Engineer at Eppo and former principal/staff engineer at Storyblocks, AddThis, and Shef

This is the final post in a three-part series on our migration from Airflow to our custom DAG orchestration system, built to improve scalability (more tasks!).

In this post, we’ll go over the results of using our custom system (Paddle). For additional context, see the previous articles in this series.

The results

Performance

With the new system, we have already been able to double our old workload without suffering any bottlenecks.

Our pipelines have also been running noticeably faster because we no longer pay the overhead of starting a new Kubernetes pod and initializing the application before processing begins. For instance, the P99 duration for one Airflow task was 6 minutes, whereas in Paddle the same task takes 1 minute.

Visibility

We wanted a UI that could easily be used to troubleshoot our pipelines. Data related to our DAG runs and tasks was already being stored, so we just needed to surface it so that other team members, technical or not, could use it.

To quickly spin up a dashboard for Paddle, we decided to use Retool. For those unfamiliar, Retool is a platform for building internal apps out of prebuilt components. At Eppo, we had already been using it for other internal tools. With one engineer, we were able to get a basic version of our dashboard up within a day!

The dashboard lists out basic information on the DAG runs and tasks, with the ability to drill down to see logs, errors and warehouse queries linked to a task.

With Airflow, the scheduler was unaware of application errors, so we had to dig into the logs in the UI to find the stack trace. Now, we can see exactly which SQL failed for a task, no log spelunking necessary.
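The post doesn’t publish Paddle’s schema, but conceptually each stored task record carries enough context to debug it from the dashboard alone. A minimal sketch of what such a record might look like, with purely hypothetical field names:

```typescript
// Hypothetical shape of a stored task record; Paddle's actual schema isn't
// published, so these field names are illustrative only.
interface TaskRunRecord {
  dagRunId: string;
  taskId: string;
  status: "queued" | "running" | "succeeded" | "failed";
  startedAt?: Date;
  finishedAt?: Date;
  warehouseQuery?: string; // the exact SQL that ran (or failed)
  errorMessage?: string;   // application-level error, surfaced directly in the dashboard
}
```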

Being able to switch between environments and version the Retool application made development a breeze. Beyond the initial implementation, Retool has made it easy to quickly implement feedback, even by users not directly working on Paddle. For quality of life, we’ve iteratively added more functionality such as retrying through the UI, dynamically starting DAGs, and more.

Cost Savings

We didn’t originally set out with the goal of reducing cost. However, switching away from Cloud Composer (Google’s managed Airflow offering) brought some real savings.

Almost 50% of our Google Cloud spend was dedicated to Cloud Composer alone. Combined with the Kubernetes cluster we used for task execution, that was 70% of our spend.

Of course, the new system has its own running costs. We’re still running a Kubernetes cluster, and we spun up a separate Redis instance for our queueing framework.

After we switched over fully in February, our infrastructure costs netted out to roughly a 57% saving, mostly driven by moving off of Cloud Composer.

Issues we ran into

Along the way, we did work through some hiccups.

Lifecycle management

When a worker receives a call to shut down (such as during a new deployment), we want it to finish processing its current task before exiting. However, we noticed that workers would sometimes get stuck in a ‘Terminating’ state long after the logs indicated they were done with their task.

wtf-node and netstat ended up being useful for tracking down the cause: workers in this state typically had open Redshift connections (and deeper digging surfaced the type of configuration that led to them being held open). With that, we cleaned up our connection management and added safeguards. Since we use NestJS, we were able to do much of this in the lifecycle hooks it provides, as sketched below.
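The exact cleanup isn’t shown in the post, but as a rough sketch of the pattern: a NestJS provider can implement OnApplicationShutdown (with app.enableShutdownHooks() turned on in main.ts), which is a natural place to drain warehouse connection pools. The pg Pool below is a stand-in for whatever Redshift client is actually in use.

```typescript
import { Injectable, OnApplicationShutdown } from "@nestjs/common";
import { Pool } from "pg"; // stand-in for the actual Redshift client

@Injectable()
export class WarehouseConnectionService implements OnApplicationShutdown {
  // A shared connection pool; Redshift speaks the Postgres protocol.
  private readonly pool = new Pool();

  query(sql: string) {
    return this.pool.query(sql);
  }

  // Called by NestJS on shutdown signals once app.enableShutdownHooks() is set
  // in main.ts. Draining the pool here keeps lingering connections from holding
  // the pod in a 'Terminating' state.
  async onApplicationShutdown(signal?: string) {
    await this.pool.end();
  }
}
```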

Network limits

As we were scaling up, we found ourselves hitting connection errors (Error: connect ETIMEDOUT). After some digging, we found that the Cloud NAT had correlating allocation errors and dropped packets. We had too many workers for the number of IPs we had, so the Cloud NAT was unable to allocate ports, which caused our external network requests to fail. To increase capacity, we added more IPs to our Cloud NAT.

Rate limiting ourselves

In the new system, we can quickly scale up to run many tasks at once. However, it’s not always desirable to do everything at once: a warehouse can only support so much concurrency. We’ve found that for some customers, pipelines actually finish faster when we limit how many of their tasks are running at a time.
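The post doesn’t detail how the cap is enforced, but since the queueing layer is already backed by Redis, one straightforward approach is a per-warehouse counter used as a semaphore. A minimal sketch, assuming ioredis and a hypothetical cap:

```typescript
import Redis from "ioredis";

const redis = new Redis(); // assumes the existing Redis instance is reachable here
const MAX_CONCURRENT_TASKS = 5; // hypothetical per-warehouse cap

// Try to claim a slot before starting a warehouse-bound task.
async function tryAcquireSlot(warehouseId: string): Promise<boolean> {
  const key = `paddle:concurrency:${warehouseId}`;
  const inFlight = await redis.incr(key);
  if (inFlight > MAX_CONCURRENT_TASKS) {
    await redis.decr(key); // over the cap: give the slot back and requeue the task
    return false;
  }
  return true;
}

// Release the slot once the task finishes (or fails).
async function releaseSlot(warehouseId: string): Promise<void> {
  await redis.decr(`paddle:concurrency:${warehouseId}`);
}
```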

Future work

There’s always more work we can do to improve the system. A few items we’ve thought of for the future include:

Versioning of DAGs/workloads

It’s possible for the structure of a DAG, or the implementation of its tasks, to change while a run of an older version is still processing. This could cause issues if the latest definition is not backwards compatible. We haven’t run into an issue with this yet, but it could happen.
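One way to address this (the post doesn’t prescribe an approach) would be to pin each run to the version of the DAG definition it started with; the types below are illustrative only.

```typescript
// Illustrative only: pin each run to the DAG definition it started with, so
// in-flight runs keep executing the structure they began under.
interface DagDefinition {
  dagId: string;
  version: number;     // bumped whenever the structure or task code changes
  taskIds: string[];
}

interface DagRun {
  dagRunId: string;
  dagId: string;
  pinnedVersion: number; // resolved once, when the run is created
}
```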

Deferrable tasks

Currently, our tasks remain active while waiting for warehouse queries to complete. We could reduce our compute by deferring queries we know will take a significant amount of time, resuming processing only once the query completes, which frees up the worker to handle other tasks in the meantime.
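As a sketch of what deferral could look like: instead of blocking a worker on a long query, kick the query off asynchronously and enqueue a delayed job that checks on it later. BullMQ is an assumption here, since the post only says the queueing framework is backed by Redis; the queue name and job shape are hypothetical.

```typescript
import { Queue } from "bullmq";

// Assumed queue name, job shape, and connection details.
const taskQueue = new Queue("paddle-tasks", {
  connection: { host: "localhost", port: 6379 },
});

// Rather than holding a worker while a long warehouse query runs, hand the
// worker back and schedule a lightweight status check for later.
async function deferQuery(taskId: string, queryHandle: string): Promise<void> {
  await taskQueue.add(
    "check-query-status",
    { taskId, queryHandle },
    { delay: 60_000 } // poll again in a minute; tune to the expected query runtime
  );
}
```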

Smarter retries

Our tasks retry up to a configured number of attempts. We could make this smarter by taking the type of error into account: some errors, such as a SQL syntax error, need manual intervention and will fail the same way on every retry.
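A sketch of the kind of classification we have in mind; the specific error codes and patterns below are illustrative, not Paddle’s actual rules.

```typescript
// Illustrative error classifier: retry transient failures, surface
// deterministic ones for manual intervention instead of burning attempts.
function isRetryable(err: Error & { code?: string }): boolean {
  // Transient network problems (like the ETIMEDOUT errors above) are worth retrying.
  if (err.code === "ETIMEDOUT" || err.code === "ECONNRESET") return true;
  // A SQL syntax error will fail the same way every time.
  if (/syntax error/i.test(err.message)) return false;
  // Default: fall back to the configured attempt count.
  return true;
}
```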

Intelligent scheduling

Spikes in workload in our system can often be traced to several scheduled DAGs kicking off at once. Instead of adding more nodes and workers on demand, we could predict the increase in resource demand and scale up preemptively.
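For example, a first cut could look at the schedules themselves: count the DAG runs due in the next window and pre-scale workers to match. The numbers below are purely illustrative.

```typescript
// Illustrative only: estimate how many workers to have warm before a spike,
// given the start times of upcoming scheduled DAG runs.
const AVG_TASKS_PER_DAG = 20;  // assumed average fan-out per DAG
const TASKS_PER_WORKER = 4;    // assumed concurrency per worker

function desiredWorkerCount(upcomingStarts: Date[], windowMinutes = 15): number {
  const cutoff = Date.now() + windowMinutes * 60_000;
  const dagsStartingSoon = upcomingStarts.filter((t) => t.getTime() <= cutoff).length;
  return Math.ceil((dagsStartingSoon * AVG_TASKS_PER_DAG) / TASKS_PER_WORKER);
}
```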

Conclusion

Overall, the shift to Paddle has not only met but exceeded our expectations, setting a strong foundation for future growth and updates in our workflow orchestration processes. We began with requirements derived from our experience with Airflow, and developed a system specifically tailored to our unique situation: managing dynamic workloads run through our monolithic NodeJS application.

Thanks for following our journey through this series (here are Part 1 and Part 2 again). We hope our insights can guide others considering similar migrations.
