This is the final post in a three-part series on our migration from Airflow to our custom DAG orchestration system in order to improve the system's scalability (more tasks!).
In this post, we’ll go over the results of using our custom system (Paddle). For additional context, see the previous articles in this series (Part 1 and Part 2).
With the new system, we have already been able to double our old workload without suffering any bottlenecks.
Our pipelines have also been running noticeably faster, since we no longer pay the overhead of starting a new Kubernetes pod and initializing the application before processing begins. For instance, the P99 duration for one Airflow task was 6 minutes, whereas in Paddle the same task takes 1 minute.
We wanted a UI that could easily be used to troubleshoot our pipelines. Data related to our DAG runs and tasks was already being stored, so we just needed to surface it in a way that other team members, technical or not, could use.
To quickly spin up a dashboard for Paddle, we decided to use Retool. For those unfamiliar, Retool is a platform for building internal tools out of its provided building blocks. At Eppo, we had already been using it for other internal tooling. With one engineer, we got a basic version of our dashboard up within a day!
The dashboard lists out basic information on the DAG runs and tasks, with the ability to drill down to see logs, errors and warehouse queries linked to a task.
With Airflow, the scheduler was unaware of application errors, so we had to dig through the logs in its UI to find the stack trace. Now we can see exactly which SQL query failed for a task, no log spelunking necessary.
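As a rough illustration of the approach (the names and record shape here are hypothetical, not Paddle’s actual schema), the key is simply recording the SQL alongside any error at execution time:

```typescript
// Hypothetical wrapper around warehouse query execution. Recording the SQL
// and the error on the task run is what lets the dashboard show exactly
// which query failed, without digging through logs.
interface TaskQueryRecord {
  taskRunId: string;
  sql: string;
  status: 'succeeded' | 'failed';
  error?: string;
}

async function executeTrackedQuery(
  taskRunId: string,
  sql: string,
  runQuery: (sql: string) => Promise<unknown>,
  saveRecord: (record: TaskQueryRecord) => Promise<void>,
): Promise<unknown> {
  try {
    const result = await runQuery(sql);
    await saveRecord({ taskRunId, sql, status: 'succeeded' });
    return result;
  } catch (err) {
    await saveRecord({
      taskRunId,
      sql,
      status: 'failed',
      error: err instanceof Error ? err.message : String(err),
    });
    throw err; // still surface the failure to the task runner
  }
}
```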
Being able to switch between environments and version the Retool application made development a breeze. Beyond the initial implementation, Retool has made it easy to quickly incorporate feedback, even from users not directly working on Paddle. For quality of life, we’ve iteratively added functionality such as retrying tasks through the UI and dynamically starting DAGs.
We didn’t originally set out with the goal of reducing cost. However, switching away from Cloud Composer (Google’s managed Airflow offering) yielded some real savings.
Almost 50% of our Google Cloud spending went to Cloud Composer alone. Combined with the Kubernetes cluster we used for task execution, that accounted for 70% of our spending.
Of course, the new system has its own running costs. We’re still operating a Kubernetes cluster, and we spun up a separate Redis instance for our queueing framework.
After February, when we switched over fully, our infrastructure costs netted out to a saving of around 57%, driven mostly by moving off of Cloud Composer.
Along the way, we did work through some hiccups.
When a worker receives a shutdown signal (such as during a new deployment), we want it to finish processing its current task before exiting. However, we noticed that workers would sometimes be stuck in a ‘Terminating’ state long after the logs indicated they had finished their task.
wtf-node and netstat ended up being useful for tracking down that workers in this state typically still had open Redshift connections (and deeper digging revealed the kind of configuration that led to this). With that, we cleaned up our connection management and added safeguards. Since we use NestJS, we were able to do much of this in the lifecycle hooks it provides.
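As a minimal sketch of that kind of safeguard, assuming a pg connection pool (the service itself is illustrative, not our actual code):

```typescript
import { Injectable, OnApplicationShutdown } from '@nestjs/common';
import { Pool } from 'pg';

// Illustrative provider owning a warehouse connection pool. NestJS calls
// onApplicationShutdown when the process receives SIGTERM, so the pool is
// drained instead of leaving sockets open that keep the pod stuck in a
// 'Terminating' state.
@Injectable()
export class WarehousePoolService implements OnApplicationShutdown {
  private readonly pool = new Pool({
    // Keep idle connections from lingering indefinitely.
    idleTimeoutMillis: 30_000,
  });

  query(sql: string, params?: unknown[]) {
    return this.pool.query(sql, params);
  }

  async onApplicationShutdown(signal?: string) {
    // signal is e.g. 'SIGTERM' during a rolling deployment
    await this.pool.end();
  }
}
```

For the hook to fire, enableShutdownHooks() has to be called on the Nest application, and the pod’s termination grace period needs to leave enough time for the in-flight task to finish.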
As we were scaling up, we found ourselves hitting connection errors (Error: connect ETIMEDOUT). After some digging, we found that the Cloud NAT had correlated allocation errors and dropped packets. We had too many workers for the number of IPs we had, so the Cloud NAT was unable to allocate ports, which caused our external network requests to fail. To increase capacity, we added more IP addresses to our Cloud NAT.
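For rough intuition on the sizing (based on Cloud NAT’s documented defaults rather than our exact configuration): each NAT IP exposes 64,512 usable source ports, and with the default minimum of 64 ports per VM that covers around a thousand nodes on paper, but workers holding many concurrent outbound connections exhaust their port allocations much sooner, so IP capacity has to track connection volume rather than node count.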
In the new system we can quickly scale up to run a lot of tasks. However, it’s not always desirable to do everything all at once: a warehouse can only support so much concurrency. We’ve found that for some customers, their pipelines finish faster if we limit the number of tasks running for them at any one time.
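A stripped-down sketch of the idea (single-process and with a made-up cap; in practice the counters would live in Redis so every worker shares the same view):

```typescript
// Illustrative per-customer concurrency gate. A worker checks canStartTask
// before claiming a task for a customer and otherwise leaves it queued.
const MAX_RUNNING_TASKS_PER_CUSTOMER = 4; // hypothetical cap

const runningByCustomer = new Map<string, number>();

export function canStartTask(customerId: string): boolean {
  return (runningByCustomer.get(customerId) ?? 0) < MAX_RUNNING_TASKS_PER_CUSTOMER;
}

export async function runWithLimit<T>(customerId: string, task: () => Promise<T>): Promise<T> {
  runningByCustomer.set(customerId, (runningByCustomer.get(customerId) ?? 0) + 1);
  try {
    return await task();
  } finally {
    runningByCustomer.set(customerId, (runningByCustomer.get(customerId) ?? 0) - 1);
  }
}
```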
There’s always more work we can do to improve the system. A few items we’ve thought of for the future include:
Versioning of DAGs/workloads
It’s possible for the structure of a DAG, or the implementation of its tasks, to change while an old version is still processing. This could cause issues if the latest DAG definition is not backwards compatible. We haven’t run into this yet, but it could happen.
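One way this could be handled (a sketch of an approach we haven’t built, with deliberately simplified types) is to pin each run to the definition version it started with:

```typescript
// Sketch of pinning a run to the DAG definition it started with, so tasks
// scheduled later in the run still resolve against a compatible structure.
interface DagDefinition {
  dagId: string;
  version: number;
  tasks: string[]; // task names in execution order, simplified
}

interface DagRun {
  runId: string;
  dagId: string;
  definitionVersion: number; // frozen when the run starts
}

const definitions = new Map<string, DagDefinition[]>(); // dagId -> registered versions

function startRun(dagId: string, runId: string): DagRun {
  const versions = definitions.get(dagId) ?? [];
  const latest = versions[versions.length - 1];
  if (!latest) throw new Error(`No definition registered for ${dagId}`);
  return { runId, dagId, definitionVersion: latest.version };
}

function definitionForRun(run: DagRun): DagDefinition {
  const match = (definitions.get(run.dagId) ?? []).find(
    (d) => d.version === run.definitionVersion,
  );
  if (!match) throw new Error(`Missing version ${run.definitionVersion} of ${run.dagId}`);
  return match;
}
```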
Deferrable tasks
Currently our tasks remain active while waiting for warehouse queries to complete. We could reduce our compute by deferring queries we know will take a significant amount of time, freeing the worker to process other tasks and only resuming once the query finishes.
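A rough sketch of what deferral could look like, assuming the warehouse exposes some form of asynchronous query submission and status polling (the interfaces below are hypothetical):

```typescript
// Sketch of a deferrable task: submit the query, release the worker, and
// re-enqueue a lightweight "check" task that polls until the query finishes.
interface WarehouseClient {
  submitQueryAsync(sql: string): Promise<string>; // returns a query id
  getQueryStatus(queryId: string): Promise<'running' | 'succeeded' | 'failed'>;
}

interface TaskQueue {
  enqueue(taskName: string, payload: unknown, delayMs: number): Promise<void>;
}

export async function startLongQuery(warehouse: WarehouseClient, queue: TaskQueue, sql: string) {
  const queryId = await warehouse.submitQueryAsync(sql);
  // The worker is now free; a follow-up task resumes once the query is done.
  await queue.enqueue('checkQuery', { queryId }, 60_000);
}

export async function checkQuery(
  warehouse: WarehouseClient,
  queue: TaskQueue,
  payload: { queryId: string },
) {
  const status = await warehouse.getQueryStatus(payload.queryId);
  if (status === 'running') {
    await queue.enqueue('checkQuery', payload, 60_000); // poll again later
  } else if (status === 'failed') {
    throw new Error(`Query ${payload.queryId} failed`);
  }
  // On success, downstream processing can be scheduled from here.
}
```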
Smarter retries
Our tasks retry based on a configured number of attempts. We could make this smarter by taking into account variables like the type of error - some errors need manual intervention to be fixed (for example, a SQL syntax error), so retrying them is wasted work.
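For instance, the retry policy could classify errors before requeueing (a sketch; the error classes and string matching are purely illustrative):

```typescript
// Sketch of classifying errors to decide whether a retry is worthwhile.
type ErrorClass = 'retryable' | 'fatal';

function classifyError(err: Error): ErrorClass {
  const message = err.message.toLowerCase();
  // Transient infrastructure problems are worth retrying.
  if (message.includes('etimedout') || message.includes('connection reset')) {
    return 'retryable';
  }
  // A SQL syntax error will fail identically on every attempt.
  if (message.includes('syntax error')) {
    return 'fatal';
  }
  return 'retryable'; // default to today's behavior
}

function shouldRetry(err: Error, attempt: number, maxAttempts: number): boolean {
  return classifyError(err) === 'retryable' && attempt < maxAttempts;
}
```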
Intelligent scheduling
Spikes in workload in our system can often be traced to several scheduled DAGs kicking off at once. Instead of adding more nodes and workers on demand, we could predict the increase in resource demand and scale up preemptively.
Overall, the shift to Paddle has not only met but exceeded our expectations, setting a strong foundation for future growth and updates in our workflow orchestration processes. We began with requirements derived from our experience with Airflow, and developed a system specifically tailored to our unique situation: managing dynamic workloads run through our monolithic NodeJS application.
Thanks for following our journey through this series (here are Part 1 and Part 2 again). We hope our insights can guide others considering similar migrations.