Kristján, thanks for taking the time to speak with us. First of all, why did you choose to use Sentry over traditional logging?
We needed to revisit client logging for our latest product, Dust 514. While games like EVE use a home-grown logging solution, the last ten years have seen tremendous progress in off-the-shelf infrastructure. We wanted to use a tried-and-tested stand-alone solution. We were impressed by the detailed stacktrace information available through Sentry and didn't want to waste manpower reinventing that wheel.
We also wanted it to be as independent of the rest of our network infrastructure as possible, so that for example errors could continue to be logged even if the regular client-server connection were to be interrupted. Having a hosted service helps with this. The hosted service also brings the huge benefit of not having to requisition server resources for logging, and frees our infrastructure department from worrying about that part.
CCP is about 600 people strong these days. The part chiefly involved in Dust 514 is about 110 people, of whom roughly 30 are engineers and 20 are QA staff. Most of QA knows Sentry, and a healthy portion of our engineers are aware of it as well.
All of our games are constantly being updated. This is true for EVE Online as well as Dust 514; in the latter case, we are currently pushing out client updates every few weeks. That makes efficient QA vitally important, and it is imperative that we continue to monitor the health of our game clients during actual use. Logging from clients is therefore an essential part of our business. The reality of the internet and the complexity of modern web-based server architecture also mean that some problems never occur within a testing environment, however realistic it may be. So we need to be constantly aware of any problems our users are experiencing and improve our software accordingly.
We primarily use Sentry to log and analyze Python tracebacks. Our QA teams watch the event stream closely and notice if any new problems crop up. Since each event is then available at a unique URL, we can augment our internal defect tracking with Sentry URLs, and our developers can then use the detailed traceback information to study the problems.
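As a rough illustration of what "detailed traceback information" means here, the sketch below (not CCP's actual pipeline — the field names mirror Sentry's stacktrace interface but are assembled by hand) shows the kind of structured frame data a Python exception yields using only the standard library:

```python
# Illustrative sketch: turning a Python exception into a Sentry-style
# event dict with structured stacktrace frames (stdlib only).
import sys
import traceback

def faulty():
    raise ValueError("boom")

def capture_event():
    """Build an event dict for the exception currently being handled."""
    exc_type, exc, tb = sys.exc_info()
    frames = [
        {"filename": f.filename, "function": f.name, "lineno": f.lineno}
        for f in traceback.extract_tb(tb)
    ]
    return {
        "exception": {"type": exc_type.__name__, "value": str(exc)},
        "stacktrace": {"frames": frames},
    }

try:
    faulty()
except ValueError:
    event = capture_event()
```

Each such event is what QA can link to from the defect tracker, sparing developers from having to reproduce the failure locally.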
It has reduced our reliance on log files considerably. And since we are using a hosted service, the server is always on. QA now monitors the Sentry server and defects in our defect tracking system cross-reference the events there. This means that often actual client problems don’t need to be reproduced, but can be diagnosed from the Sentry events directly. We can also use Sentry to validate code fixes that would otherwise be hard to verify through defect reproduction, since we can simply monitor the event feed and watch them disappear.
It’s clear that CCP uses Sentry in a unique and interesting way. Can you tell us a little about your implementation?
Perhaps the single most unusual bit is that in addition to our battle servers, we are logging from players' own client machines. There are hundreds of thousands of these scattered around the globe, so identifiers such as hostnames and IP addresses become meaningless. I'm guessing that Sentry was never really intended for that, but rather for server-side logging; we weren't aware of that initially. It hasn't been a problem, however, and Sentry is able to cope beautifully.
The client implementation is unusual as well. We are, somewhat uniquely, using Stackless Python to drive important parts of the game logic on a Sony PS3 console. On this generation of hardware we are very constrained by memory, so we have a custom version of Python tuned to use as little of this precious resource as possible. For this reason we could not take, for example, the Raven library and use it as-is. To save program memory we carved out its essential parts into something we call Krunk, the Icelandic name for the raven's call. Performance is also important, since error reporting cannot be allowed to noticeably affect frame rate, so we are using the ultrajson JSON encoder, which is written in C. This didn't allow the custom hooks required by Raven, e.g. for UUIDs and timestamps. Since we are using Stackless Python, we also use a Stackless transport that sends each event on a separate tasklet, so as not to block the main execution loop.
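To make the point about custom hooks concrete: Sentry events contain values like UUIDs and datetimes that a plain JSON encoder cannot serialize. A minimal sketch of such a hook, using the stdlib `json` module's `default` parameter (the function name and event fields here are illustrative, not Krunk's actual code), might look like this:

```python
# Sketch of the custom serialization hooks an event encoder needs.
# A fast C encoder without such hooks must have these conversions
# done separately before encoding.
import datetime
import json
import uuid

def event_default(obj):
    # Sentry event IDs are UUIDs, conventionally sent as 32-char hex.
    if isinstance(obj, uuid.UUID):
        return obj.hex
    # Timestamps go out as ISO-8601 strings.
    if isinstance(obj, datetime.datetime):
        return obj.isoformat()
    raise TypeError("cannot serialize %r" % (obj,))

event = {
    "event_id": uuid.uuid4(),
    "timestamp": datetime.datetime(2013, 1, 1, 12, 0, 0),
    "message": "example event",
}
payload = json.dumps(event, default=event_default)
```

With an encoder that lacks a `default`-style hook, the same conversions simply have to happen before the dict is handed to the encoder.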
Another interesting fact is that we can dynamically change the activation state of Sentry logging on each client. We can choose the percentage of the clients that have it enabled, and also which logging channels are connected and on what level. This happens via the game server and allows us to throttle logging at the source and focus logging on certain areas of the code.
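One common way to implement that kind of percentage throttle is a deterministic per-client decision: hash a stable client identifier into a bucket and compare it against the server-supplied percentage. This is a hedged sketch of the general technique, not CCP's actual mechanism; the names are made up:

```python
# Hypothetical sketch of server-driven sampling: the game server hands
# each client a percentage, and the client decides deterministically
# whether its Sentry logging is active.
import zlib

def logging_enabled(client_id, percentage):
    """Roughly `percentage`% of clients enable logging; each client
    always makes the same decision for a given percentage."""
    bucket = zlib.crc32(client_id.encode("utf-8")) % 100
    return bucket < percentage
```

Because the decision is a pure function of the client ID, flipping the percentage on the server smoothly grows or shrinks the logging population without any per-client bookkeeping.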
One last thing, could you summarize your experience with Sentry for developers who are considering giving it a try?
Using Sentry has been a simple and intuitive experience and support has been virtually instant for any questions that have come up.