Vetting Your Pager
At Sentry, we receive a million requests a minute to process and store crashes from all around the world. It’s our Operations Team’s responsibility to ensure that everything goes right with these requests, and also to avoid burning themselves out while dealing with everything that goes wrong.
We collect fifty thousand custom metrics in DataDog, but alert on fewer than fifty of them. James Cunningham leads our internal observability initiative, creating and maintaining those alerts.
In this talk, he discusses the full lifecycle of an alert at Sentry, including:
- How we collect such a wide variety of metrics efficiently
- How we justify a metric’s degree of accuracy
- Why we define a metric’s logical purpose
- How alerts evolve from metrics and justify their existence
- What happens when an engineer actually gets paged
Includes the most interesting questions from the closing Q&A.