Error Logging and Tooling for Disney+


Truth

As seasoned developers, we accept that all software has bugs. Therefore, error logs are a fundamental pillar of high-quality software. Error logs appear in many forms, including console output, text files, HTTP messages, charts, and mobile push notifications. All of these media share a single goal: answering the question, “What is the truth about this software?”

There is an alluring data puzzle within error logs. It’s the factual history of app behavior with no concern for the hand-wringing that went into the architecture, patterns, interfaces, trade-offs, and myriad of other decisions. Existence of the data proves it happened. You might extend the value of an error message, or improve the signal-to-noise ratio, but fundamentally, that error happened.

By definition, error logs are a defensive posture. Error logging logic often pushes the ideal limits of lines per function, and it can obscure the simple purpose of a given code block. I’ve come to realize when I see concise and elegant code, I immediately know that code is not production ready. Beautiful code is a means to convey an idea; it’s not suited to survive the stress of a high-quality app. In contrast, production-ready code includes validation, inspection, logging, telemetry, and other bloat that sullies the charm of elegant code.

I place a high value on fundamentals. I like using fewer patterns because the exotic pattern is untested, and it will fail in an upcoming scenario. Boring code means everyone understands it, and anyone can go on vacation. Reviewing error logs should be a fundamental skill too; it improves defensive coding, and brings the team into alignment with the truth about the software.

In this blog post, I’ll briefly describe how we use Sentry to support Disney+ on a variety of platforms along with some context to share what’s important to us.


About Disney+

Disney+ runs on many devices and in many countries. Like many Agile practitioners, we work in predictable cycles and release software frequently. Visibility into these release windows represents one important aspect of error log reviews.

Honing the discipline required to operate remotely and leaning on fundamentals has been well worth the effort. The depth and breadth of Disney+ content, features, subscribers, and quality contrasted with the number of employees at Disney Streaming is a well-earned point of pride.

Disney+ is available and operational on a wide range of mobile and connected TV devices, including all the major gaming consoles, streaming media players, and smart TVs.

All of the platforms must account for error logs within the constraints of the given device. That’s a lot of ways to do the same thing with diverse languages, hardware, and ecosystems.

Error Logging and Tooling

A long time ago (in Internet time), a team at Disney Streaming waded through the many options available for centralized error logging. At the end of the process, Sentry met our requirements better than other similar managed service providers.

To get hands-on experience with Sentry, we installed the OSS solution from https://github.com/getsentry on a couple of servers in the cloud and teams began to experiment. The portal was intuitive and helped build confidence in the capabilities of this service.

After our trial, an official managed service engagement began. We had left the initial server on a single version of Sentry, and we were delighted with new features that popped up on the managed server instance.

In addition to establishing the server environment, apps need to send messages to the Sentry server. Many client SDKs exist under that same GitHub umbrella to simplify app integration. From what I’ve seen, the OSS licenses for these vary between MIT, Apache, and BSD.

Some Disney+ platforms lacked a compatible Sentry OSS library; some languages are just obscure. So, those teams authored a custom Sentry SDK. The documentation for this task at https://develop.sentry.dev/sdk/overview was thorough, enabling the Disney Streaming teams to author a module that posts JSON to an API. The goal was a sufficient module; we didn’t need to achieve full parity.
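
To make "a module that posts JSON to an API" concrete, here is a minimal Python sketch of the request such a module might assemble. It is not the code our teams wrote. It assumes the DSN layout (scheme://public_key@host/project_id), the legacy store endpoint, and the X-Sentry-Auth header as I understand them from the SDK development docs, so check the current docs before relying on any of it. The names parse_dsn, build_request, and example-sdk/0.1 are hypothetical, and actually sending the request is left to the caller.

```python
import json
import uuid
from datetime import datetime, timezone
from urllib.parse import urlsplit


def parse_dsn(dsn):
    """Split a DSN of the assumed form scheme://public_key@host/project_id."""
    parts = urlsplit(dsn)
    # The project id is taken to be the last path segment (assumption).
    project_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    host = parts.hostname if parts.port is None else f"{parts.hostname}:{parts.port}"
    return {
        "public_key": parts.username,
        "endpoint": f"{parts.scheme}://{host}/api/{project_id}/store/",
    }


def build_request(dsn, message, level="error"):
    """Return (url, headers, body) for one event; no network I/O happens here."""
    target = parse_dsn(dsn)
    event = {
        "event_id": uuid.uuid4().hex,  # 32 hex characters, no dashes
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "platform": "other",
        "level": level,
        "message": message,
    }
    headers = {
        "Content-Type": "application/json",
        "X-Sentry-Auth": (
            "Sentry sentry_version=7, "
            f"sentry_key={target['public_key']}, "
            "sentry_client=example-sdk/0.1"  # hypothetical client name
        ),
    }
    return target["endpoint"], headers, json.dumps(event)
```

Keeping the request builder free of network calls makes it easy to unit test on a device where the HTTP stack itself is the unusual part.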

The Sentry SDKs encapsulate the tedious plumbing work of sending HTTP messages. The following code block is a C# example of capturing an unhandled exception. As a minimalist sample, it sends the exception to a Sentry API without additional context.

An example from https://docs.sentry.io/platforms/dotnet/guides/aspnet/:

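using System;
using Sentry;

public class MvcApplication : System.Web.HttpApplication
{
    // Handle returned by SentrySdk.Init; disposed when the app ends.
    private IDisposable _sentry;

    protected void Application_Start()
    {
        // Add this to Application_Start
        _sentry = SentrySdk.Init(o =>
        {
            // getKey() is a placeholder for however the app loads its DSN.
            var key = getKey();
            // Newer SDK versions may accept the DSN string directly.
            o.Dsn = new Dsn(key);
        });
    }

    protected void Application_Error()
    {
        // Forward the unhandled exception to Sentry.
        var exception = Server.GetLastError();
        SentrySdk.CaptureException(exception);
    }

    protected void Application_End()
    {
        // Close the SDK on shutdown.
        _sentry?.Dispose();
    }
}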

Shortly after switching to the managed service, we wrote some simple tools to put it through its paces by hammering the API with heavy traffic. Some engineers noticed a series of HTTP 429 responses sprinkled throughout the report.

429 Too Many Requests

Imagine a runaway app, distributed around the world, hammering telemetry services with noise. On a spectrum, this is farther from helpful and closer to a distributed denial of service attack. Using the 429 response, the Sentry API was telling us the message was rejected due to load, and the API politely hinted that someone needs to take a look at what’s causing that load on the service.
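
One way a client can respect that hint is to stop sending for a while after a 429. The Python sketch below is an illustration, not the behavior of any particular Sentry SDK. It assumes the response carries a Retry-After header expressed in seconds and falls back to a default delay otherwise. RateLimitGate and its method names are hypothetical, and the headers are assumed to arrive as a plain dict with that exact key casing.

```python
import time


class RateLimitGate:
    """Drops events locally while the server has asked the client to back off."""

    def __init__(self, default_backoff=60.0, clock=time.monotonic):
        self._default_backoff = default_backoff
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._blocked_until = 0.0

    def allow(self):
        """Return True when it is acceptable to send another event."""
        return self._clock() >= self._blocked_until

    def record_response(self, status, headers):
        """Inspect a response; on 429, remember how long to stay quiet."""
        if status != 429:
            return
        try:
            delay = float(headers.get("Retry-After", ""))
        except ValueError:
            # Header missing or not a number of seconds: use the default.
            delay = self._default_backoff
        self._blocked_until = self._clock() + max(delay, 0.0)
```

A caller would check allow() before each send and simply discard the event when it returns False, since error telemetry that arrives minutes late is rarely worth queuing on a constrained device.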

Error Types

The seams of a given language, hardware, and architecture will have a lot to say about the interesting places to track exceptions. First in line is the app crash scenario. Unfortunately, some platforms are unable to offer a foothold to capture this catastrophic event. Other platforms offer a portal provided by the device vendor with visibility into app crash details, but it’s case by case. You’ll spend considerable effort playing defense to avoid an app crash. Yet, it is a potential outcome when (not if) enough things go wrong. On the upside, if a platform can log app crashes, it’s just a few lines of boilerplate code.

Next in line is the unexpected API response. It would be trivial to wrap an HTTP module with logic that logs any response north of 399, but that would introduce too much noise. We need more context. An “Unauthorized” or “Not Found” API response can be a valid behavior. For this reason, reporting unexpected API responses will leak into every module capable of HTTP requests in order to squeeze more context into error messages.
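
A small Python sketch of that idea: each call site declares which error statuses are valid for it, and only the rest are reported. The function names, the tag keys, and the operation labels are all hypothetical; the point is only that the "expected" set lives with the caller, which is why this logic spreads into every module that makes HTTP requests.

```python
def should_report(status, expected=frozenset()):
    """Report 4xx/5xx responses unless the caller declared them expected."""
    return status >= 400 and status not in expected


def describe_failure(operation, status, expected=frozenset()):
    """Return event fields for an unexpected response, or None to stay quiet."""
    if not should_report(status, expected):
        return None
    return {
        "message": f"{operation} returned an unexpected HTTP status",
        "tags": {"http_status": str(status), "operation": operation},
    }
```

For example, a profile lookup might pass expected={404} because a missing profile is normal, while a content fetch passes nothing and reports every failure.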

The most subjective and nuanced is the third case: unexpected app states. Enumerating app states can be a challenge, as can identifying when they occur in an unexpected sequence. These states are easy to omit from the error log entirely, and just as easy to turn into noise when logged too verbosely. Regular error log reviews and periodic code reviews will help identify these hotspots.
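
One lightweight way to catch an unexpected sequence is a table of allowed transitions. The Python sketch below uses an invented set of playback states purely for illustration; the state names, the ALLOWED table, and check_transition are hypothetical and do not describe the Disney+ apps.

```python
# Hypothetical playback states and the transitions considered normal.
ALLOWED = {
    "idle": {"loading"},
    "loading": {"playing", "error"},
    "playing": {"paused", "idle", "error"},
    "paused": {"playing", "idle"},
    "error": {"idle"},
}


def check_transition(current, nxt):
    """Return None for an allowed transition, else a message worth logging."""
    if nxt in ALLOWED.get(current, set()):
        return None
    return f"Unexpected state transition: {current} -> {nxt}"
```

Only the transitions missing from the table produce a message, which keeps routine state changes out of the error log.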

Most teams review code during a Pull Request event in GitHub. I’ve found that over time, we will learn new things, observe patterns, and discover better ways well after the code change. So consider reviewing files on a regular cadence, well after that PR has been merged.

Message Parts

Context is an important component of an error message. The Sentry API offers a number of extension points for collecting context during an error condition. The Sentry SDK document linked below gives an overview of the individual parts of the JSON message sent over the wire. The client SDKs make this easy; custom libraries will need to construct the message carefully themselves.

Check out https://develop.sentry.dev/sdk/event-payloads/ for more details.

Message

The message is the essence of the HTTP POST to Sentry. Give some thought to the structure of this string value, as it forms a key part of distinguishing different errors. “Server Error” is too coarse, and “GetContent() failed for movie abc123 at 11/1/2020 10:00 +7:00” is too granular.
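
A middle ground is to keep the message stable for each kind of failure and move the variable details into a separate field. The Python sketch below shows the idea; error_fields is a hypothetical helper, and using the payload's extra field for the details is one reasonable choice rather than a rule.

```python
def error_fields(operation, **details):
    """Keep the message stable per failure kind; put variable data in extra."""
    return {
        "message": f"{operation} failed",
        "extra": {key: str(value) for key, value in details.items()},
    }


# Two failures of the same kind share one message but keep their own details.
first = error_fields("GetContent()", content_id="abc123", status=500)
second = error_fields("GetContent()", content_id="xyz789", status=503)
```

Both events carry the message "GetContent() failed", so they read as the same error, while the content id and status remain available when investigating a single occurrence.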

Breadcrumbs

Breadcrumbs offer a concise method for tracking the path that led to this unfortunate state. This feature can collect a list of views the app displayed leading up to the error as well as actions and other stateful metrics.
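
For a custom library, breadcrumbs can be as simple as a bounded list that forgets the oldest entries. The Python sketch below assumes the breadcrumb fields (timestamp, category, message, data) and the {"values": [...]} wrapper as I understand them from the event payload docs. The Breadcrumbs class and the limit of 20 are hypothetical choices.

```python
from collections import deque
from datetime import datetime, timezone


class Breadcrumbs:
    """Keeps only the most recent breadcrumbs so the payload stays small."""

    def __init__(self, limit=20):
        self._items = deque(maxlen=limit)  # oldest entries fall off automatically

    def add(self, category, message, **data):
        crumb = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "category": category,
            "message": message,
        }
        if data:
            crumb["data"] = data
        self._items.append(crumb)

    def snapshot(self):
        """Return the trail in the shape attached to an event, oldest first."""
        return {"values": list(self._items)}
```

An app would call add() on each view change or notable action and attach snapshot() to the event only when an error is actually reported.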

Extra Data

The extra data field holds an arbitrary collection of name/value pairs. This is a great place to hold contextual data where an error in one part of the app might insert four fields of related data, and an error in another part of the app inserts twenty fields into the collection.

Tags

The tags collection is similar to extra data, but contains a more uniform number of fields such as the OS name, component versions, and similar data. For example, you might want to query for a list of errors on a specific device model to deduce the root cause of an error.
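
Because tags are meant to be uniform, it can help to define the set once and attach it to every event. The Python sketch below is one way to do that. REQUIRED_TAGS, base_tags, and with_tags are hypothetical names, and the three tag keys are examples rather than a recommended schema.

```python
import platform

# Hypothetical uniform tag set that every event is expected to carry.
REQUIRED_TAGS = ("os_name", "app_version", "device_model")


def base_tags(app_version, device_model):
    """Collect the uniform tags once, for example at app start."""
    return {
        "os_name": platform.system() or "unknown",
        "app_version": app_version,
        "device_model": device_model,
    }


def with_tags(event, tags):
    """Return a copy of the event with string-valued tags attached."""
    missing = [key for key in REQUIRED_TAGS if key not in tags]
    if missing:
        raise ValueError(f"missing tags: {', '.join(missing)}")
    merged = dict(event)
    merged["tags"] = {key: str(value) for key, value in tags.items()}
    return merged
```

Rejecting an incomplete tag set at the source keeps later queries, such as "all errors on this device model", from silently missing events.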

Stacktrace

The stacktrace is that magical data point that describes the exact line of code that failed and the preceding lines that led to the error. Obtaining this value can provide enormous value in understanding the source of an error and resolving it. Unfortunately, access to this is a mixed bag, and it’s entirely dependent on the platform. Some platforms include a pristine stacktrace while others yield an unreadable mess of hex offsets. JavaScript-based platforms can configure a source map to chart a way back through the uglified code, which can be a non-trivial effort to enable.
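
On a platform with good introspection, turning an exception into stack frames takes little code. The Python sketch below uses the standard traceback module and emits frames oldest first, which is the ordering I understand the event payload docs to expect. The frame keys filename, function, and lineno follow the same docs as best I recall, and frames_from_exception is a hypothetical helper.

```python
import traceback


def frames_from_exception(exc):
    """Convert an exception's traceback into a list of simple frame dicts."""
    frames = []
    # extract_tb walks from the outermost call to the line that raised.
    for frame in traceback.extract_tb(exc.__traceback__):
        frames.append(
            {
                "filename": frame.filename,
                "function": frame.name,
                "lineno": frame.lineno,
            }
        )
    return {"frames": frames}
```

Platforms that only expose hex offsets need a symbolication step before frames like these can be produced, which is where most of the effort tends to go.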

Retrospective

We’ve learned that error log reviews must be part of the development routine. Initially, some teams were more verbose than others, and logged informational states in addition to handled and unhandled exceptions. At our scale, informational states need to remain at the console or a similar level. The primary purpose of our error logging is to help the engineering teams maintain a high-quality app. Hence, the individual platform teams have discretion to identify what messages best support their needs, given a particular language, device, and architecture.

Sentry is one of many tools we use to gather telemetry data in order to maintain a high level of quality. It’s common to start an investigation in one area, and then ask correlation questions in other tools. Gauging both the severity and prevalence goes a long way in understanding negative impacts. The “Dashboard” and “Discover” sections within the Sentry portal are fascinating areas for a puzzle solver, offering loads of charts, canned queries, and custom queries.

In the future, we are likely to explore more extension points within Sentry, such as the Slack and Jira integrations. I’m skeptical about integrating a potential firehose into Slack, but it has its place. The ability to create a Jira ticket with a single click within the Sentry portal is enticing.

We’ve dabbled with Alerts and there’s more to leverage there. The Sentry portal includes an intuitive alert system to send a message when a given threshold is exceeded. For example, there may be a tolerance for ten API failures in a rolling 60-minute window, but when that eleventh failure happens, a PagerDuty alert will be activated. All good stuff.
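
The arithmetic behind that example rule can be sketched in a few lines of Python. This is a local illustration of a rolling-window threshold, not how Sentry evaluates alerts. RollingThreshold is a hypothetical class, and timestamps are plain seconds supplied by the caller.

```python
from collections import deque


class RollingThreshold:
    """Fires once more than `tolerance` failures fall inside the window."""

    def __init__(self, tolerance=10, window_seconds=3600):
        self._tolerance = tolerance
        self._window = window_seconds
        self._events = deque()

    def record(self, now):
        """Record a failure at time `now`; return True if the alert should fire."""
        self._events.append(now)
        cutoff = now - self._window
        # Discard failures that have aged out of the rolling window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self._tolerance
```

With the defaults, ten failures inside an hour stay quiet and the eleventh returns True; failures older than sixty minutes no longer count toward the total.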

It’s been an incredible ride building Disney+ and supporting launches around the world. Monitoring the error log system and improving the value of the data it contains has been essential. Sentry’s high-quality tooling helps Disney+ maintain high-quality service to its tens of millions of global subscribers.

Check out the original post on Disney Streaming Services’ blog here.

© 2021 • Sentry is a registered trademark of Functional Software, Inc.