Automated Chaos Testing on the Front-end

As front-end developers, we don’t often hear about Chaos testing. A project related to Twitch front-end availability led us to research the exciting field of Chaos engineering, pioneered by Netflix. Chaos engineering is the science of optimizing the resiliency of a software system by simulating failures and measuring their impact on the system. These simulations help anticipate real-world issues before they happen, and ensure our systems can degrade gracefully. The practice is commonly used for backend and distributed systems, and more and more tooling is being developed in this space.

On the front end, however, we haven’t observed much discussion or activity in the web or mobile communities around making sure clients are as resilient as they should be.

The ultimate goal for us was to be able to answer this question: “If this part of our overall system (e.g. backend service, third-party API) fails, how does our front-end behave and what do our end users see?”

Simulating System Failures

At Twitch, our data is served by an ever-growing collection of microservices (hundreds at the time of writing) that depend on each other. These services are abstracted away from our front-end clients (website, mobile apps, console apps, desktop) via a single GraphQL API.

The most common failure in a system like ours is that one of our microservices errors out and fails to serve its portion of the data. In this case, GraphQL does a great job of forwarding partial data to the client when configured properly. From there, it’s the client’s job to handle this partial data gracefully and provide the best possible degraded experience.

With that common use case in mind, the goal became to find a scalable way to test these scenarios and observe the behavior of each client.

Let the Chaos engineering begin! First, we need a way to simulate failures reliably.

The first option considered was intercepting and altering the network responses received. A chaos interceptor would add the ability to inject errors in the GraphQL responses. These errors could be predetermined, or randomly injected.
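As a rough illustration of this first option, here is a minimal sketch of a client-side chaos interceptor. Everything in it is an assumption made for the example: the function name `inject_chaos`, its parameters, and the choice to fail only top-level fields are hypothetical and do not describe Twitch’s code. It sets a field to null and appends an entry to the standard GraphQL `errors` list, which is how partial data normally reaches a client.

```python
import copy
import random


def inject_chaos(response, fields_to_fail=None, failure_rate=0.0, rng=None):
    """Return a copy of a GraphQL response with some top-level fields failed.

    fields_to_fail: predetermined top-level field names to fail.
    failure_rate:   probability of randomly failing each other field.
    (Hypothetical helper, for illustration only.)
    """
    rng = rng or random.Random()
    fields_to_fail = set(fields_to_fail or [])
    altered = copy.deepcopy(response)  # never mutate the real response
    data = altered.get("data") or {}
    errors = altered.setdefault("errors", [])
    for field in list(data):
        if field in fields_to_fail or rng.random() < failure_rate:
            # Partial data: the failed field becomes null and is reported
            # in the "errors" list with its path.
            data[field] = None
            errors.append({"message": "chaos: injected failure", "path": [field]})
    if not errors:
        del altered["errors"]  # keep untouched responses clean
    return altered
```

Passing `fields_to_fail` gives predetermined failures, while a non-zero `failure_rate` gives random ones; a seeded `random.Random` can be supplied as `rng` to make random runs repeatable.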

While exploring this option, we realized that since the vast majority of our API calls go through GraphQL, we could potentially pass a special instruction with our GraphQL calls to simulate certain failures at the GraphQL resolver side. Luckily, it turns out that this capability was already implemented during one of our Hackweeks, and aptly named Chaos Mode.

Chaos Mode adds the ability to pass an extra header to our staging GraphQL calls. Within that header, we can pass the name of one or more services that we wish to simulate failures for. Our staging GraphQL resolver reads this header and short circuits any internal call to those services.
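The mechanism can be pictured with a short sketch. The header name `X-Chaos-Services`, the comma-separated format, and the helper names below are all hypothetical, since the post does not give the real ones; the sketch only shows the general shape of a client naming services in a header and a resolver layer short-circuiting calls to them.

```python
CHAOS_HEADER = "X-Chaos-Services"  # hypothetical name; the real header is not given


class ServiceUnavailable(Exception):
    """Raised in place of a real call when a service is chaos-failed."""


def chaos_headers(services):
    """Client side: build the extra header listing the services to fail."""
    return {CHAOS_HEADER: ",".join(services)}


def parse_chaos_header(headers):
    """Resolver side: return the set of service names the request asks to fail."""
    raw = headers.get(CHAOS_HEADER, "")
    return {name.strip() for name in raw.split(",") if name.strip()}


def call_service(service, func, headers):
    """Short-circuit the internal call if the service is named in the header."""
    if service in parse_chaos_header(headers):
        raise ServiceUnavailable(f"chaos: simulated failure of {service}")
    return func()
```

In this sketch a request sent with `chaos_headers(["chat"])` would see every resolver call to the hypothetical `chat` service fail, while calls to other services proceed normally.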

Automated Chaos Testing

Once we had a way to simulate failures, the next step was to automate the process. At Twitch we make heavy use of automation to test our front end user flows. Those are scripted tests that simulate users manipulating our website or our apps, verifying that everything shows up as expected. We realized that it would be fairly easy to automate our failure simulations by running our core tests with the special Chaos Mode header set. This setup can tell us if a particular service failure breaks those core flows or not, which is getting very close to our goal!

Once we had a quick prototype running on Android, we found that a couple of things were missing to make this scalable:

Since features and services change over time, we need a way for our test suite to “discover” the services that we should be forcing failures for. A manual mapping is not maintainable and does not scale well.
Seeing the test results is a great start, but we should be able to extract more useful information from these runs: which service caused the failure, which particular API call was affected, and even a screenshot of the failure to visualize the user experience.

To address the first point, we needed a way to “trace” the GraphQL calls from the client and record which internal services get hit for a certain user flow. We solved this by using another debug header in our GraphQL calls, which enables tracing at the GraphQL resolver layer. The resolvers then record every method call made to their internal service dependencies, and send that information back to the client in the same GraphQL response. From there, the client can extract the names of the services involved in a set of GraphQL calls, and use them as the input for our Chaos Testing suite.
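To make the client-side half of this concrete, here is a minimal sketch of extracting service names from traced responses. The response shape is an assumption: the post does not say where the trace data lives, so this example imagines it under the response’s top-level `extensions` key as a list of `{"service", "method"}` records, and `services_from_responses` is a hypothetical helper name.

```python
def services_from_responses(responses):
    """Collect the distinct service names recorded in traced GraphQL responses.

    Assumes (hypothetically) that each response carries its trace as
    response["extensions"]["trace"] = [{"service": ..., "method": ...}, ...].
    """
    services = set()
    for response in responses:
        trace = response.get("extensions", {}).get("trace", [])
        for call in trace:
            services.add(call["service"])
    return sorted(services)  # sorted for a stable, reproducible test order
```

Responses without trace data simply contribute nothing, so the same helper can be run over every call captured during a user flow.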

We now have a two step process:

Run the end to end tests with tracing enabled, and record all the services involved with each test.
Using this list of services, run the Chaos Mode tests, forcing a failure for each service one at a time.
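The two steps above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the actual test harness: `run_chaos_suite` and the `run_test` callback, with its `trace` and `fail_service` parameters and its `(passed, services)` return value, are hypothetical names standing in for whatever each platform’s automation framework provides.

```python
def run_chaos_suite(tests, run_test):
    """Run each test once with tracing, then once per discovered service.

    run_test(name, trace=False, fail_service=None) is a hypothetical callback
    returning (passed, services), where services is the set of service names
    seen during a traced run (empty when tracing is off).
    """
    results = {}
    for name in tests:
        # Step 1: traced run to discover which services the flow depends on.
        _, services = run_test(name, trace=True)
        # Step 2: re-run the same flow, failing one service at a time.
        outcomes = {}
        for service in sorted(services):
            passed, _ = run_test(name, fail_service=service)
            outcomes[service] = passed
        results[name] = outcomes
    return results
```

The result maps each test to a per-service pass or fail outcome, which is the raw material for any later reporting.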

With this setup, we now have an automated way to discover services and then force them to fail in a consistent, reproducible manner. We can then scale this up to work with any number of automated tests.

Dashboard and visualization

With this system in place, we went on to extract and record interesting information that will describe the user experience during these forced failures.

We take the test results for a single user flow (e.g. navigating to a screen/page, watching a stream, sending a chat message) that runs n times - once for each service failure - and use them to calculate a resilience score for that particular test. The score is a simple percentage that represents how many times the user flow was successful despite the ongoing service failure. A global resilience score is then calculated for the client, giving a high-level overview of how sensitive each client is to failures.
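As a worked example of the arithmetic, the sketch below computes a per-flow score as the percentage of chaos runs that passed. The post does not say how the global score is aggregated, so averaging the per-flow scores here is an assumption, and the function names are hypothetical.

```python
def flow_score(outcomes):
    """Percentage of chaos runs in which the flow still passed.

    outcomes maps each failed service name to True (flow passed) or False.
    """
    if not outcomes:
        raise ValueError("no chaos runs recorded for this flow")
    return 100.0 * sum(outcomes.values()) / len(outcomes)


def client_score(flows):
    """Global score for a client.

    Assumption: a plain average of the per-flow scores; the real
    aggregation is not described in the post.
    """
    scores = [flow_score(outcomes) for outcomes in flows.values()]
    return sum(scores) / len(scores)
```

For instance, a flow that survives three out of four forced service failures scores 75%.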

We can use these scores to track the resilience of our user flows over time, measure the improvements we make to our code, and make sure we don’t accidentally regress. We package these scores with other useful information and upload them to a server so we can visualize everything in a web dashboard.

During the test, we capture API calls and extract useful information, such as which particular GraphQL query had errors and which field inside that query caused the error. We also capture screenshots of the expected state and the error state. All this information helps diagnose exactly what happens to the clients for a given failure.
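A minimal sketch of that extraction step follows. It relies on the standard GraphQL response format, where failures appear in a top-level `errors` list and each entry may carry a `message` and a `path` to the failing field. The helper name `summarize_errors`, the `operation_name` argument, and the record layout are assumptions made for illustration.

```python
def summarize_errors(operation_name, response):
    """Return one record per GraphQL error: the query and the failing field.

    operation_name is whatever name the client sent the query under
    (hypothetical argument; capturing it is up to the test harness).
    """
    records = []
    for error in response.get("errors") or []:
        path = error.get("path") or []
        records.append({
            "operation": operation_name,
            # Path segments are field names or list indices; join for display.
            "field": ".".join(str(part) for part in path),
            "message": error.get("message", ""),
        })
    return records
```

Records like these could be stored next to the screenshots so that a failed run points directly at the query and field involved.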

Using this dashboard, we identified a big gap between the Android and iOS resilience scores. Android was showing an overall score of 82% whereas iOS was only at 64%. Using the debug information provided, the iOS team identified the root cause: an over-defensive handling of GraphQL errors deep in the network stack. The team was able to test their fix by re-running the automated chaos tests, and verified that the fix significantly increased resilience, boosting the iOS resilience score to 84%.

Next steps

Right now we run this tool every night on our Android, iOS, and Web clients. This gives us insight into our current frontend resilience to backend service failures. We can now track how our resilience scores change over time, and easily test failure scenarios that were difficult to test in the past.

We plan to expand the coverage by adding more user flows to our automated tests, beyond our core flows. Adding a new test in the system is as easy as writing any automation test for the platform, and we’re currently working on adding mobile web as one of the target platforms.

We also plan to improve Chaos Mode and the ability to trace service calls to address the current limitations. The highest priority is the ability to trace and fail secondary service calls (service-to-service interactions). Other features we’re considering are the ability to fail individual service calls and the ability to fail multiple services at once.

The discipline of Chaos engineering might be established for backend distributed systems, but it feels like there’s still a lot to explore for the front end! It’s been amazing to see these tools bring engineers from different fields together to collaborate, and give them much deeper visibility into our system end-to-end.
