For the fifth edition of Customer Stories we spoke to Stephen Boak, a Senior Product Designer at Datadog. With turn-key integrations, Datadog seamlessly aggregates metrics and events across the full devops stack, giving you full visibility into your application. Sentry also just so happens to integrate with Datadog.
(Editor’s note: You like data? How about cute icons of dogs? Well, friends, do we have the company for you. One of the great joys of highlighting customers is that there’s always a chance you’ve recently written a separate blog post about their product. In Datadog’s case, we’ve done exactly that. If you’d like a detailed look at our integration, read this post from October.)
Over seven years and four companies, I’ve seen a lot of change in the monitoring industry. What we’re looking at and how we’re looking at it is completely different from when I started. The infrastructure we monitor moved from physical data centers to cloud computing. Physical hosts became virtual machines, which became containers.
Over that same time, not a lot has changed in the way users actually interact with monitoring products. But with the introduction of AI, all of that is going to change.
When it comes down to it, monitoring is ultimately about user experience. When we’re watching a backend system and when we’re looking at performance metrics to understand how the system is behaving, we’re really just trying to understand the interaction between our system and our users.
At Datadog, we make sure we never lose sight of that. Our objective isn’t to track CPU usage across a thousand different servers. The objective is to let our customers know whether their users are happy or not. Are their pages loading quickly? Are they able to do all of the things they want to do? And monitoring CPU usage is one way to try and answer those questions.
Our customers use Datadog to figure out how they can make their product work better for their users. And, as a Senior Product Designer, I’m working with engineering teams across the company, but I’m representing how the user is interacting with our product and what they’re getting from the experience.
And with tools like Sentry we’re able to monitor the user experience of our products. When we see latency metrics and page speed and when we know where and how errors are happening, we have a much greater understanding of the experience users are having and how we can improve it.
The basic rules of how we construct our monitoring have always been the same. A user tells the monitoring product to look at something specific, like the amount of disk space left on a machine. Then the user instructs it to take some step when it hits a certain threshold, like sending an alert when the disk is 90% full.
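Stripped to its essence, that kind of rule is just a tiny predicate. Here’s a minimal sketch in Python; the function name and the 90% default are illustrative, not any product’s actual API:

```python
def should_alert(used_bytes: int, total_bytes: int, threshold: float = 0.90) -> bool:
    """Classic threshold monitoring: fire when usage crosses a fixed line.

    A real monitoring agent would sample `used_bytes` on a schedule and
    route the alert to a paging or notification service.
    """
    return used_bytes / total_bytes >= threshold


# Example: a 100 GB disk with 92 GB used trips the 90% rule.
if should_alert(92, 100):
    print("ALERT: disk is over 90% full")
```

The point isn’t the code, it’s the shape: the human decides both what to watch and where the line is, and the machine only compares numbers.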
We create this extensive set of instructions for monitoring our systems so we can understand how they are working based on their external output. And that has just been the experience of monitoring. But our systems are getting more and more complex, with microservices and integrations and dependencies. As we introduce more parts into the system, we accumulate more and more things to monitor. These increasingly complex systems and the “if it moves, monitor it” mentality have made the job of understanding how a system is working much more difficult. It feels like an inevitable march toward watching more and more metrics, with more and more thresholds.
All of this has meant a lot of human effort going into watching. It has meant more people watching increasingly complex dashboards, trying to diagnose the system in real time, and more and more people getting woken up in the middle of the night by alerts. And it’s gotten very expensive.
But with AI, we can turn the tables.
All of these monitoring rules we’ve put in place have people thinking like machines. If this metric does X, do Y. If the disk space hits 90%, send an alert. Based on that output, humans make a diagnosis, and only then can we do what we do best and come up with a solution to the problem. But AI can flip the model and get machines thinking more like humans. If the machine can diagnose the system instead of just watching and sending alerts at some threshold, then we can fundamentally change how people use monitoring products.
Think about the simple disk space example again. Today Datadog uses something called forecast monitoring. That means not just looking at where a metric is, but understanding how it is changing. In an orderly but volatile system, total disk space could regularly pass a threshold without being in any danger of running out. At the same time, disk space could be under a certain threshold but growing in an irregular way and very quickly be on the verge of running out. Intelligent monitoring can understand usage and know when there’s a problem and when everything’s fine.
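One way to picture the forecasting idea is a simple trend fit over recent usage samples, projecting when the disk would actually fill. This is a simplified sketch of the concept, not Datadog’s implementation; the `(time, usage)` sample format and the function itself are hypothetical:

```python
def forecast_time_to_full(samples, capacity):
    """Fit a least-squares line to (time, usage) samples and return the
    predicted time at which usage reaches `capacity`.

    Returns None when usage is flat or shrinking, in which case there is
    nothing to alert on, no matter how close to the threshold we are.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    # Slope of the least-squares trend line through the samples.
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    if slope <= 0:
        return None  # steady or decreasing usage: no forecast needed
    intercept = mean_u - slope * mean_t
    return (capacity - intercept) / slope


# Disk growing 10 GB per hour from 10 GB: a 100 GB disk fills at hour 9.
print(forecast_time_to_full([(0, 10), (1, 20), (2, 30)], capacity=100))
```

Note what changed versus the threshold rule: the output isn’t “the metric crossed a line,” it’s “at this rate, you run out at this time,” which is already halfway to a diagnosis.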
In this way, AI isn’t just sending an alert, it’s sending a diagnosis. And with that, the people maintaining the system can go right to work finding a solution for the problem.
The old user experience of monitoring has always been about watching and diagnosing. But that’s changing with AI.
As our monitoring solutions are able to make more and more of these diagnoses, all of the manual processes we’ve built up over time can start to fade away. The experience of using monitoring products is fundamentally changing. Machines can do what machines do best, and humans can do what humans do best.
And that frees people to focus on what matters most: getting to the root cause of the problem and working on the fix, to give their users a better experience.
When it comes to designing for user experience, there’s no better audience than your users themselves. I’ll give you an example.
We were working on a new dashboard widget for monitor status. What it would do is provide customers the ability to see across lots of different monitors: how many are up, how many are down, and what the overall status is across the board.
We came up with three different new visual designs for it. One was little pills lined up in a row, one looked more like a calendar, and the last was horizontal bars with red and green stripes indicating when there was uptime and downtime.
For us, the two that seemed the most interesting were the calendar and the bars. We felt like the calendar gave you a real sense of orientation. You could see if outages were occurring on a certain day of the week: maybe Fridays are a problem, maybe there’s a particular time of day that’s a problem; the calendar did a really good job of that.
Meanwhile the bars were… a little bit weird, you know. It’s just this horizontal bar that tells you where exactly downtime occurs. It’s not as clear or detailed as the calendar. But the nicest thing about it is the precision. If you have ten minutes of downtime, it’s this really tiny stripe. And with the calendar, if you have ten minutes of downtime on Friday, Friday is red. The entire day looks red.
Regardless, we thought the orientation and simplicity of the calendar would win over customers. But we started putting it in front of them and they immediately reacted badly. And the reason was that even if there is virtually no downtime, just a few minutes, the entire day looks red and bad. Customers didn’t like this because it was something they were going to put in front of their managers. And when their managers look at this dashboard and they see, oh Friday is bad, then that makes the team look bad and automatically makes the outage look worse. There was a real perception problem.
And so they preferred the precision of the bars that would show a more accurate view of just how long the downtime was, even at the loss of simplicity and detail. They wanted just the minimum amount of downtime represented and the highest precision possible.
What does this experience tell you to do? Always, always, always test your assumptions. Create multiple designs of the same thing and see how customers react to them. Those reactions will help guide you to think through how you build your product.