For the fourth edition of Customer Stories, we spoke to Valentino Volonghi, the CTO of AdRoll and a member of its founding team. AdRoll is the most widely used prospecting and retargeting platform in the world. They have 100,000 customers across 35 countries and process right around 70 billion requests a day.
(Editor’s note: 70 billion requests a day is a very large number, but seeing it written out might not capture just how big it is. Let’s consider an example: if you had 70 billion pennies, you’d be buried in pennies. Then if someone gave you 70 billion more the next day, you’d be even more buried. And that would continue on into the future forever while you simply shook your head and marveled at how many pennies there were.)
One domino knocks down the next, which knocks down the next, which knocks down the next. And before you know it, you’ve got a mess on your hands.
With any large interconnected system you have to be on the lookout for the domino effect. One mistake can set off a chain of events, spreading through an entire system and causing catastrophic failure.
As CTO of AdRoll, I oversee all of the technology that goes into running our business on thousands of globally distributed machines. It’s my job to make sure that, when the unexpected happens (and it will), it is isolated and managed before it sets off the domino effect. That makes real-time monitoring of our systems a critical aspect of our business.
Here at AdRoll, we help thousands of companies drive people to their websites through retargeting and prospecting ads. They use AdRoll to target people who visited their sites but didn’t convert and show them an ad to get them to come back. Or they can get new visitors to their sites by targeting people who are similar to their existing customers.
But, to do that, every day AdRoll needs to handle over 70 billion requests from all over the internet and all across the globe. It’s an almost unfathomable number of requests that need to be processed. And, since each one of those requests needs to be handled in 100ms or less, our infrastructure needs to be globally distributed.
Today, we have AdRoll deployed on as many as 3,000 different machines across the globe. At the scale of AdRoll, if an issue spreads through our system and creates just a 1% error rate, that amounts to 700 million errors a day. An error rate like that could easily cost us well over a million dollars a day.
With that many requests happening on that many machines, all around the world, every machine needs to be monitored for the unexpected, so when that first domino falls, we know and can react.
Each machine runs through a complicated network of decisions to buy and deliver ads, and we log that entire flow. That translates into about seven trillion events every day. These events are the core of our monitoring; they tell us how AdRoll is operating.
Each metric flows into a collection point where it can be displayed on dashboards and monitored, so that we can be alerted any time something unexpected happens. Whether it is an exception we need to know about or a specific machine deviating from the norm and behaving differently from the rest, we want to know.
Over the years, how we monitor and react to errors in our infrastructure has developed and matured. Today, we work with a few really key partners that help us get a real-time look at how AdRoll is operating. They allow us to process the more than a million data points per second that our infrastructure generates and to alert us when there are anomalies in the infrastructure.
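One way to spot a machine "deviating from the norm," as described above, is to compare each machine's metrics against the rest of the fleet. This is a minimal illustrative sketch, not AdRoll's actual pipeline: it flags machines whose error rate is a robust (median-based) outlier relative to their peers. All names and thresholds are hypothetical.

```python
import statistics

def find_outliers(error_rates, threshold=3.5):
    """Return machine ids whose modified z-score exceeds the threshold.

    Uses the median and the median absolute deviation (MAD), which are
    far less distorted by a single misbehaving machine than the mean.
    """
    rates = list(error_rates.values())
    med = statistics.median(rates)
    mad = statistics.median(abs(r - med) for r in rates)
    if mad == 0:
        return []  # fleet is perfectly uniform; nothing to flag
    return [
        machine
        for machine, rate in error_rates.items()
        # 0.6745 scales the MAD so the score is comparable to a z-score
        if 0.6745 * abs(rate - med) / mad > threshold
    ]

# One host misbehaving among an otherwise healthy fleet:
fleet = {"us-east-1a": 0.010, "us-east-1b": 0.011,
         "eu-west-1a": 0.009, "ap-south-1a": 0.250}
print(find_outliers(fleet))  # → ['ap-south-1a']
```

A median-based score is the natural choice here: with only a handful of machines, a single bad host would drag a mean-based z-score toward itself and mask its own anomaly.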
Sentry is particularly helpful because it records errors while the machines are running. That means even when a machine crashes, we can still turn to Sentry to piece together the chain of events that led to an issue in our code. The behavior is logged. This also means that we don’t have to hesitate in killing a machine if it’s acting up and spitting out errors. We know that every error and its full context is saved inside Sentry.
A pillar of our monitoring and incident response philosophy is that instances are not to be coddled. A machine that is generating a problem related to the app can just be killed and rebooted, and nobody is going to cry over it. That philosophy and the ability to act on it is crucial in stopping the bleeding as soon as an issue presents itself, as soon as that first domino falls. This allows us to isolate an issue before it has the chance to set off a domino effect across the system.
Knowing that an issue is controlled gives our engineering team breathing room to approach any issue a little bit more calmly. This is the basis for our Blue-Green deployment strategy.
All of this comes together to create not just the infrastructure at AdRoll, but also our monitoring and response philosophy.
We need to be able to deliver billions of requests across the globe, but there isn’t room for error. And we need to stay on top of every machine and on top of every change to the growing system.
We know that first domino is going to fall; it’s bound to at our scale. But when it does, we’re going to have the monitoring in place to know about it and the process to isolate it and manage it so it doesn’t knock over the next domino.
Because, while dominoes falling across a table might be fun to watch, dominoes falling across your various applications running on infrastructure around the globe are no fun to clean up after.
If a system is not allowed to change, it is very unlikely that it will develop unexpected behavior. That’s an important tenet of our application. It is immutable. We don’t modify what’s already live in production.
The best way to introduce changes in an immutable infrastructure is with what we call blue-green deploys. This means that when we roll out the new version of our software, we do it on a separate set of machines.
The old version is blue. It continues running on the existing production machines. The new version is green. It gets rolled out on a number of new machines and can be scaled up to have more and more traffic as we begin to trust it more over time.
Effectively, the blue-green deployment infrastructure allows us to easily automate the rollout of a new version of AdRoll. And if something doesn’t work out, we can just roll it back entirely without causing any particular pain to the system.
By having two systems running at the same time, it is very easy for us to flip between them to see how their behavior and their metrics are moving, to see if there is any change in their exception patterns, and if there is anything that is unexpected during the release process. And if anything unexpected happens, in just the matter of a few seconds we can switch back to the old blue version and completely remove the new green version.
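The mechanics described above can be sketched in a few lines. This is a hypothetical toy router, not AdRoll's implementation: it sends a configurable fraction of requests to the green fleet, and rolling back is just resetting that fraction to zero. All names are illustrative.

```python
import random

class BlueGreenRouter:
    """Toy traffic splitter between a blue (old) and green (new) fleet."""

    def __init__(self, blue_hosts, green_hosts):
        self.blue = blue_hosts
        self.green = green_hosts
        self.green_weight = 0.0  # start with all traffic on blue

    def shift_to_green(self, fraction):
        """Ramp traffic to the green fleet as trust grows (0.0 to 1.0)."""
        self.green_weight = max(0.0, min(1.0, fraction))

    def rollback(self):
        """Instant rollback: every request returns to the blue fleet."""
        self.green_weight = 0.0

    def route(self):
        """Pick a host for one request according to the current split."""
        fleet = self.green if random.random() < self.green_weight else self.blue
        return random.choice(fleet)

router = BlueGreenRouter(["blue-1", "blue-2"], ["green-1", "green-2"])
router.shift_to_green(0.1)  # canary: 10% of traffic on the new version
router.rollback()           # something looks wrong: back to blue at once
assert router.route().startswith("blue")
```

The key property is that rollback touches only a single piece of routing state; the old machines never stopped running, so switching back takes seconds rather than requiring a redeploy.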
Blue-green deployments are obviously too in-depth a concept to break down at the end of a short interview. There are a number of resources for learning more about implementing blue-green deployments, including this excellent breakdown presented at AWS re:Invent.