Key Results
- 10-15% increase in developer productivity
- Features like Discover, custom tags, and dashboards were essential
- 600+ engineers rely on Sentry to ship code without drowning in errors
- 20-30% faster incident resolution → Less downtime, fewer interruptions
Solutions
Error Monitoring, Tracing, Alerting, Slack, Custom Tags
600+ Engineers, 1 Tool: Anthropic's Sentry Story
Few teams in the world are working on challenges more technically complex than the ones Anthropic tackles every day.
Working at the forefront of AI and AI safety research, Anthropic deals every day with enormous data sets, massive distributed jobs, and highly specialized hardware. Their engineers tackle everything from GPU memory constraints to low-level compute optimizations, all while continuously shipping the kind of code that would have been inconceivable until just a few years ago.
As the complexity of a system grows, so does the complexity of fixing it when it breaks.
When Anthropic’s existing infrastructure monitoring tools weren’t keeping up, they turned to Sentry to help them find and fix issues faster. Getting bugs and crashes out of the way faster means more time to focus on the research that matters.
“Sentry played a significant role in helping us develop Sonnet,” says Anthropic Systems Lead Nova DasSarma, referring to one of the company’s most advanced AI models.
The challenge: Overwhelming log volume
In AI research, time isn’t just money — it’s progress. When you’re working at the cutting edge of your industry, finding and fixing bugs can’t take days.
A single node failure can affect hundreds or thousands of servers… Before Sentry, we’d crash loop, often due to hardware failures, but had no telemetry at the node level to reject bad hardware.
Nova, Systems Lead
As their infrastructure scaled, so did their issues:
- Overwhelming log volume that their existing monitoring tools couldn’t handle
- No visibility into node-level hardware issues
- Difficulty correlating errors across distributed systems
- Limited ability to track error ownership — when something breaks, who should know first?
As the models they were training grew larger, these problems became critical. With thousands of GPUs running simultaneously, every minute spent debugging meant wasted resources and stalled research.
We were hitting failures that would take us days to debug. With thousands of servers involved, getting that down to hours with Sentry was crucial for keeping our large-scale training jobs running efficiently.
Nova, Systems Lead
Their existing setup, which had worked fine for smaller jobs, wasn’t built for this scale:
The previous tool had hard throttling limits, and we were generating way more logs than it could handle.
Nova, Systems Lead
They needed a real-time debugging solution that could handle their scale without requiring them to manually piece together failures.
The solution: Immediate visibility at scale
Anthropic’s machine learning infrastructure operates at a scale where small failures quickly become massive disruptions. During the training of Claude 1, the infrastructure monitoring they had in place struggled to keep up.
This was the biggest job we had ever done. Before, most jobs spanned a couple of nodes. But when you train large models, a single node failure can affect thousands of servers.
Nova, Systems Lead
Some teammates had used Sentry at a previous company, so when cascading failures across their GPU cluster made debugging nearly impossible, they knew Sentry could help.
With no time to waste, they integrated Sentry into a single job to test its impact.
We just slapped the Sentry SDK into our job and redeployed.
Nova, Systems Lead
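For a Python job, that kind of drop-in integration can be as small as the sketch below. This is a generic illustration rather than Anthropic's code: the DSN, environment, and tag names are placeholders.

```python
# Minimal sketch of wiring the Sentry SDK into a batch job.
# The DSN, environment, and tag names are hypothetical placeholders.
import os
import sentry_sdk

sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],   # project DSN, injected at deploy time
    environment="training",         # keep training jobs separate from other services
    traces_sample_rate=0.1,         # sample a fraction of transactions for tracing
)

# Tag every event from this process with job-level metadata so failures
# can be grouped and routed later.
sentry_sdk.set_tag("job_name", os.environ.get("JOB_NAME", "unknown"))
sentry_sdk.set_tag("node_hostname", os.uname().nodename)

def main():
    ...  # the training loop itself

if __name__ == "__main__":
    try:
        main()
    except Exception:
        # Unhandled exceptions are also captured automatically by the SDK;
        # capturing explicitly here just makes the intent visible.
        sentry_sdk.capture_exception()
        raise
```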
The results were immediate:
- Exceptions surfaced in real time, eliminating manual log searching.
- Node-level correlations made it easy to pinpoint failing hardware.
- Problematic nodes were removed in hours instead of days.
We went from being stuck in crash loops for days to pulling bad nodes in hours and getting the job running again.
Nova, Systems Lead
What began as an experiment paid for with one developer’s credit card evolved into an enterprise-wide solution, transforming how they debug large-scale AI training.
Building a smarter debugging system
Sentry worked well for Anthropic out of the box, but as they scaled, they leveraged its flexibility to customize their monitoring for the demands of large-scale ML training.
Anthropic developed custom exception handling for GPU-related errors, synthetic events to track Kubernetes preemptions, and job-based error tracking to tie failures to specific experiments.
They also implemented a detailed tagging system to capture critical metadata specific to their needs — hardware type, data sources, service dependencies, and job ownership — providing deeper context for debugging.
The way we use Sentry is job-oriented rather than release-oriented… Errors are tagged based on the job, so they’re automatically assigned to the right developer [within Anthropic].
Nova, Systems Lead
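The sketch below shows one way job-oriented tagging and a synthetic preemption event might look with the Python SDK. The tag names, exception class, and signal handling are illustrative assumptions, not Anthropic's implementation.

```python
# Illustrative sketch: job-oriented tags plus a synthetic event for a
# Kubernetes preemption. All names here are assumptions for the example.
import signal
import sentry_sdk

class GPUError(RuntimeError):
    """Hypothetical wrapper so suspected GPU failures group separately."""

def tag_job(job_id: str, owner: str, hardware: str, data_source: str) -> None:
    # Attach the metadata used to group failures by job and route them.
    sentry_sdk.set_tag("job_id", job_id)
    sentry_sdk.set_tag("owner", owner)            # drives assignment to the right developer
    sentry_sdk.set_tag("hardware_type", hardware)
    sentry_sdk.set_tag("data_source", data_source)

def on_preemption(signum, frame):
    # Kubernetes sends SIGTERM before preempting a pod; record it as a
    # synthetic event so preemptions show up alongside real exceptions.
    sentry_sdk.capture_message("Job preempted by Kubernetes", level="warning")
    sentry_sdk.flush(timeout=5)  # make sure the event is sent before the pod dies

signal.signal(signal.SIGTERM, on_preemption)

def run_step():
    try:
        ...  # one training step
    except RuntimeError as exc:
        # Re-wrap suspected hardware errors so they are easy to isolate in Sentry.
        raise GPUError(f"GPU failure: {exc}") from exc
```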
Between custom alerts, tags, dashboards, and Sentry’s open APIs — we have the flexibility to extend Sentry how we need, while maintaining the ease of use Sentry provides for the entire team.
Nova, Systems Lead
For example, we’re a big Slack company, so we do a lot of alert forwarding.
Nova, Systems Lead
We’ve also built custom dashboards that combine Sentry data with our logging systems through the API. Being able to query against the Sentry API is very valuable for us.
Nova, Systems Lead
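As a rough illustration, pulling issue data from Sentry's REST API into an external dashboard or log pipeline can look like the sketch below. The organization and project slugs, the token, and the query string are placeholders; the exact parameters available are documented in Sentry's API reference.

```python
# Rough sketch of querying the Sentry API to combine issue data with other
# logging systems. Slugs, token, and query values are placeholders.
import os
import requests

SENTRY_TOKEN = os.environ["SENTRY_API_TOKEN"]
ORG = "my-org"          # placeholder organization slug
PROJECT = "training"    # placeholder project slug

resp = requests.get(
    f"https://sentry.io/api/0/projects/{ORG}/{PROJECT}/issues/",
    headers={"Authorization": f"Bearer {SENTRY_TOKEN}"},
    params={
        "query": "is:unresolved job_id:exp-1234",  # filter by a custom tag (hypothetical job id)
        "statsPeriod": "14d",                      # look-back window
    },
    timeout=30,
)
resp.raise_for_status()

for issue in resp.json():
    # Feed title, event count, and last-seen time into a custom dashboard.
    print(issue["title"], issue["count"], issue["lastSeen"])
```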
By structuring their error monitoring around individual ML jobs instead of traditional software releases, Anthropic made debugging more efficient, reducing downtime and keeping experiments on track.
Expanding Sentry’s role in their workflow
Anthropic relies on Sentry to track exceptions, assign errors, and analyze failures in real time across all of the primary languages used by Anthropic’s research teams, including Python, Rust, and C++.
Bringing all debugging context into one place has significantly improved issue resolution.
Sentry gives our developers one place that will have all the information they need to debug an issue.
Nova, Systems Lead
Key issues Sentry helps fix, faster
1. Data Processing Issues
For research workloads running across thousands of servers, data validation is crucial. Bad data can corrupt entire experiments. Sentry allows engineers to identify and isolate faulty sequences before they cause widespread issues.
“When we send locals to Sentry, we can immediately see if there’s poison data or a sequence that needs to be removed,” says Nova. “It’s pretty easy for us to track down because we can see exactly which file needs attention.”
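The snippet below sketches that general idea under assumed names: a data-validation check whose local variables, such as the file path and sequence ID, travel with the captured exception, since the Python SDK includes local variables by default in recent versions.

```python
# Hedged sketch of a data-validation check; the validation logic and field
# names are hypothetical. The point is that locals like `path` and `seq_id`
# appear in the Sentry event when local-variable capture is on (the default).
import math
import os
import sentry_sdk

sentry_sdk.init(dsn=os.environ["SENTRY_DSN"], include_local_variables=True)

def validate_sequence(path: str, seq_id: int, values: list[float]) -> None:
    for v in values:
        if math.isnan(v) or math.isinf(v):
            # The raised exception carries `path` and `seq_id` in its locals,
            # so the offending file is visible directly in the Sentry issue.
            raise ValueError(f"poison data in {path} (sequence {seq_id})")
```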
2. Service Dependencies
With multiple environments and interconnected downstream services, tracing failures can be challenging. “Imagine you’ve got a couple of environments the model is learning in,” Nova explains.
Using Sentry’s querying capabilities, they can quickly identify the root cause and isolate the related errors. “Using Discover, we can isolate all errors related to a particular job and connect them to specific services,” says Nova.
“Being able to instantly query the last 90 days of all errors is extremely useful,” Nova explains. “When we think there’s a corruption bug, we can immediately check which jobs were affected by grouping exceptions by creator tags.”
3. GPU-Related Failures
Hardware issues present unique challenges, especially at this scale. When a single piece of hardware starts producing bad results, the problems can ripple outward and compound. Anthropic uses Sentry to find problematic hardware and take it offline faster.
“GPUs are extremely reliable pieces of hardware—they’re a modern miracle doing a thousand times more work than CPUs,” Nova notes. “But they can be flaky and sometimes return bad math. With the custom tags and error context Sentry provides we can isolate that to a particular host so we can remove it from the fleet.”
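As a purely illustrative example of that pattern, a known-answer check like the one below could flag a host returning bad math and tag the event with its hostname. The check itself, the tag name, and the use of PyTorch are assumptions made for the sketch.

```python
# Illustrative only: flag a host that returns bad math and tag the event
# with the hostname so the machine can be pulled from the fleet.
import socket
import sentry_sdk
import torch

def check_gpu_math() -> None:
    # A known-answer computation; a healthy GPU reproduces it exactly,
    # since every intermediate value is exactly representable in float32.
    a = torch.ones((1024, 1024), device="cuda")
    result = (a @ a).sum().item()
    expected = 1024.0 ** 3
    if result != expected:
        sentry_sdk.set_tag("gpu_host", socket.gethostname())
        sentry_sdk.capture_message(
            f"GPU returned bad math: got {result}, expected {expected}",
            level="error",
        )
```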
How Sentry supports Anthropic’s engineers
With many hundreds of developers across research and engineering, Sentry provides a single source of truth for errors across web, mobile, and a large Python monolith used by the research team.
- Before: Debugging could take days, delaying research and burning expensive GPU time.
- Now: Issues are fixed in hours, keeping experiments on track and reducing costs.
Sentry helps us get visibility into complex distributed systems and close the loop on broken jobs—giving developers everything they need to debug, in one place.
Nova, Systems Lead
Why It Works for Them:
- One place for all errors → No more chasing logs across multiple tools
- Faster debugging → Fixing incidents in hours, not days
- 10-15% productivity boost → More time spent building, less time debugging
- 20-30% faster incident resolution → Less downtime, fewer interruptions
We wouldn’t have scaled without Sentry. Most of our incidents are hardware-related—and we debug them all inside Sentry.
Nova, Systems Lead
Looking ahead
With AI training becoming even more complex, Anthropic is exploring Sentry’s Autofix, which is built on Sonnet, to further reduce debugging time. In AI research, where compute time is expensive, the right tooling doesn’t just fix problems; it keeps the momentum going.