- Reduced tech debt by lowering both server latency and database load
- Cut average total latency in half
- Made API response time in their largest bottleneck 13x faster, from 17.5 seconds to 1.2 seconds
- Identified critical bugs in order to continue meeting predefined customer SLAs
PHP, Go, Python, TypeScript, React
Error Monitoring, Performance Monitoring
How Intelligence Fusion made API response time 13x faster by finding performance bottlenecks with Sentry
Intelligence Fusion specializes in providing intelligence, risk assessment, and situational awareness solutions to businesses and government agencies. Their threat intelligence platform manages geographical information and relies on assurance and prediction to help keep people safe; they also provide a threat intelligence REST API. Intelligence Fusion gathers and analyzes various types of data, including open-source information, social media content, and news reports to create actionable insights and reports for their clients.
- 8 REST APIs written in PHP
- Background processes written using Go and Python
- Web platform written using TypeScript and React
- Has Sentry instrumented across 8 PHP services, 3 Go services, and 1 TypeScript service
“We started with Sentry 3 years ago when we were building infrastructure and didn’t have a lot of test coverage. Sentry became essential to us to know when things were going wrong.”
Thomas Hockaday, Lead Engineer @ Intelligence Fusion
As a long-time Sentry customer for error and exception monitoring, Intelligence Fusion was ready to invest in making their services more performant for customers. As their tech team grew from 7 to 16 engineers and started scaling its databases, performance became a higher priority.
With Sentry integrated into their services and dealing with a legacy API, the Intelligence Fusion engineers knew from experience that some of their endpoints were slower than others. They also knew of slow rendering issues on their threat data heatmap — the main feature users see upon logging into the Intelligence Fusion platform — from historical internal and client feedback. Thus, the team needed a solution to find app slowdowns in a swift and painless way — leading them to Sentry Performance.
Before adopting Sentry Performance, Lead Engineer Thomas Hockaday of Intelligence Fusion recalls the laborious process his team used to diagnose application performance issues. Typically, the engineering team needed to figure out:
- How urgent an issue was – which they often determined based on whether a customer was complaining
- What was causing it — a process that usually required their engineers to dig around in AWS logs for the culprit, as well as attempt to reproduce the error while communicating with the user
With Performance Monitoring from Sentry, Thomas’s team was able to troubleshoot their performance issues faster (which was reflected in existing metrics on overall uptime and service latency that the team tracked in Grafana). In Sentry, they could immediately:
- Know which of their issues were critical and demanded attention
- Find the root cause of issues using Sentry’s latest Profiling feature
Here’s the exact workflow Thomas’s engineering team follows to quickly diagnose a performance issue in Sentry. With Sentry’s Slack integration, the team gets alerted via Slack about any critical app performance issues in their project, like N+1 database query issues and slow database queries.
Then from the transaction summary in Sentry, the Intelligence Fusion engineers can identify exactly where the performance issue lies — whether it’s in the database, the response builder, or somewhere else. The engineers may click into the span waterfall associated with the transaction to see how long a query is running. Or, they may look at how many database queries are running in an endpoint to see if they can be reduced.
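The N+1 pattern the team watches for, and the "can these queries be reduced?" question, can be illustrated with a minimal, hypothetical sketch (the schema and names below are invented for illustration, not Intelligence Fusion's data model): one query per parent row versus a single batched query.

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE threats (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reports (id INTEGER PRIMARY KEY, threat_id INTEGER, body TEXT);
    INSERT INTO threats VALUES (1, 'actor-a'), (2, 'actor-b');
    INSERT INTO reports VALUES (1, 1, 'r1'), (2, 1, 'r2'), (3, 2, 'r3');
""")

def reports_n_plus_one(conn):
    # N+1 anti-pattern: one query for the threats, then one query PER threat.
    threats = conn.execute("SELECT id FROM threats").fetchall()
    out = {}
    for (tid,) in threats:
        rows = conn.execute(
            "SELECT body FROM reports WHERE threat_id = ?", (tid,)
        ).fetchall()
        out[tid] = [body for (body,) in rows]
    return out  # issued 1 + len(threats) queries

def reports_batched(conn):
    # Fix: one query total, grouped in application code.
    rows = conn.execute(
        "SELECT threat_id, body FROM reports ORDER BY threat_id"
    ).fetchall()
    out = {}
    for tid, body in rows:
        out.setdefault(tid, []).append(body)
    return out

assert reports_n_plus_one(conn) == reports_batched(conn)
```

In a span waterfall, the first version shows up as a run of near-identical `SELECT` spans inside one transaction, which is exactly the signal Sentry's N+1 detector alerts on.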
Using tracing with the Trace Navigator in Sentry helps you see what parts of the application are doing most of the heavy lifting. For PHP applications, tracing is particularly helpful to see which parts of the API call are slowest. The average API call goes through multiple stages:
- First, the request to fetch data passes through the server into the PHP application.
- The PHP application bootstraps relevant dependencies, then routes to identify which part of the application the request is trying to access.
- Next, middleware is executed to authenticate and sanitize data on the request to ensure application security.
- The validated data request moves through the main application layers to prepare a database query. Then, this data passes back into the application layers, is shaped into a response, and returns to the user.
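The stages above can be sketched as a minimal pipeline. Every name in this sketch is hypothetical and for illustration only; it mirrors the lifecycle described, not Intelligence Fusion's actual code.

```python
# Minimal sketch of the request lifecycle described above (hypothetical names).

def authenticate(request):
    # Middleware: reject unauthenticated requests before they reach the app.
    if request.get("token") != "valid-token":
        raise PermissionError("unauthenticated")
    return request

def sanitize(request):
    # Middleware: keep only well-formed parameter names.
    request["params"] = {
        k: v for k, v in request.get("params", {}).items() if k.isalnum()
    }
    return request

def query_database(request):
    # Application layer: prepare and run the database query (stubbed here).
    return ["row-1", "row-2"]

def route(path):
    # Routing: map the request path to a handler that shapes the response.
    handlers = {"/threats": lambda req: {"data": query_database(req)}}
    return handlers[path]

def handle(request):
    # Full pipeline: middleware -> routing -> application -> response shaping.
    request = sanitize(authenticate(request))
    handler = route(request["path"])
    return {"status": 200, "body": handler(request)}

response = handle({"path": "/threats", "token": "valid-token", "params": {}})
```

With tracing enabled, each of these stages becomes its own span, so a slow request points directly at the stage responsible.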
Given this complexity, tracing helps the team decide whether the performance issue is in the server, the main PHP code, a specific part of the PHP code, or the database. If more detail is needed, the team delves in deeper and looks at the profiles.
A few months ago, Intelligence Fusion implemented tracing on their main API and then steadily rolled it out to all of their other PHP services. Seeing all their tracing data in one place helped the team identify 1) areas that were particularly slow and 2) universal improvements they could make to reduce overall server latency.
Once tracing was enabled, the Intelligence Fusion team also easily set up Profiling to identify their largest performance bottlenecks. The main bottlenecks (apart from the issues sent to Slack) were transactions with the highest latency and user misery scores, clearly visible from the Performance dashboard.
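In Sentry's Python SDK, for example, enabling tracing and profiling comes down to two sample-rate options at init time. The DSN and rates below are placeholders; the team's actual configuration is not shown in the source.

```python
import sentry_sdk

# Placeholder DSN and sample rates for illustration; tune rates per service.
sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
    traces_sample_rate=0.2,    # fraction of transactions to trace
    profiles_sample_rate=0.2,  # fraction of traced transactions to profile
)
```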
From there, it was simply a matter of clicking into each profile to look at the breadcrumbs, the slowest functions widget, and the aggregated flame graph (see images below). These Profiling features helped reveal the functions with the longest execution times that needed optimization.
Within two weeks, the Intelligence Fusion team improved their overall application speed with several application-level changes, including compressing large responses with gzip, enabling OPcache and the JIT compiler in PHP, and chunking larger dataset queries to reduce PHP memory usage.
Using Sentry, the team monitored these improvements by comparing recent traces against historical ones as the optimizations were gradually deployed. The two screenshots below show how the slowest endpoint improved between May/June and today.
The Trends page also provided a helpful graph that showed how their endpoint performance had improved over time:
As a result of these optimizations, Intelligence Fusion reduced their average latency across all of their APIs (as tracked in Grafana). For example:
- API 1 (threat actors) average latency decreased from 1820ms to 23.8ms
- API 2 (static assets) average latency decreased from 60.7ms to 24.1ms
- API 3 (authorization) average latency decreased from 105ms to 24.9ms
The slowest endpoint, their country data endpoint, initially took 17.5 seconds to return. Having identified the database query as the biggest bottleneck from the flame graph, they optimized the endpoint in several ways (e.g. applied a GIN index to one of the columns and made sensible reductions to the accuracy of coordinate data). Sentry also helped them see where they were doing unnecessary column selection, further enhancing endpoint speed. Ultimately, average response time decreased from 17.5 seconds to 1.2 seconds, making their API response time 13x faster.
Joe Sweeny, VP of Engineering at Intelligence Fusion, speaks to the impact of these application performance improvements within the company:
“Sentry has always been our go-to tool for critical error handling, ensuring we continue to deliver reliable and robust software. As Sentry has evolved its offering, it has allowed us to scale our microservice architecture not only horizontally but vertically, as Profiling and performance metrics let us dive deeper into our applications to ensure we provide an optimal level of performance for our customers.”
“When we inherited a legacy API, Sentry was essential in helping us know exactly which parts of the tech needed the most care and attention. In the past 3 years, we’ve taken that codebase from 400 unit tests to 3400 — and still growing, thanks to the information we got from Sentry errors. I’m looking forward to expanding our insights now that we’re starting to really dig into the performance aspect of it too.”
Thomas Hockaday, Lead Engineer