This is Part One (of two) of our story chronicling Twitch’s journey from monolithic architecture to microservices. In Part One, you’ll learn about our early days, from our rapid growth to the performance bottlenecks that pushed us to find better solutions. Part Two covers “Wexit,” our migration to microservices, and all the challenges and benefits that came with the switch.
“Almost all the successful microservice stories have started with a monolith that got too big and was broken up” - Martin Fowler
About the Author: My name is Mario Izquierdo, and I’ve been a Full Stack Engineer at Twitch since 2017. I’ve worked in a variety of systems in my time here, including the public API, infrastructure, and tools for services.
First, let’s define some of the terms we will be using:
Monoliths are a single build of unified code. Every service and feature in the code is built into one large system. They usually share the same server, file system, and database. Monoliths work well for smaller teams that need to move fast.
Microservices, on the other hand, are smaller modules of code, each with their own function and focus. Every module is a separate application that usually lives on its own server, communicating with each other through APIs. Microservices work well for large organizations that need to set boundaries between smaller groups.
Twitch.tv launched on June 6th, 2011, after its start as Justin.tv in 2006. It was built as a monolith using Ruby on Rails. Having everything together in one single codebase is a great way to launch a startup, and initially worked well for Twitch. It allowed rapid development while using state-of-the-art coding practices.
However, as organizations grow, there are multiple reasons to split the code into smaller parts:
- Coordination: Avoid having “too many cooks” in the same kitchen.
- Resilience: Reduce the blast radius when something goes wrong.
- Scalability: Overcome performance bottlenecks.
For Twitch, the most pressing issue was dealing with performance bottlenecks.
As the Twitch user base increased, different parts of the system started to reach their limits. API, video, database, chat, search, discovery, commerce, and other internal systems needed improvements. To illustrate the situation, let’s use our early chat system as an example. For those who don’t know, chat is a core part of the Twitch experience. The magic of Twitch happens when creators and viewers can talk in real time, turning the chat into an actual conversation. In other words, without chat the spirit of Twitch is gone. That kind of seamless interaction can only happen with a super low-latency connection.
At the time, our chat system ran on a set of 8 machines, with chat channels randomly distributed across those machines. That architecture worked well until gaming events started exploding in popularity. Big channels would reach the then unheard-of figure of 20,000 viewers! Back in the 2010s, those numbers were unbelievable. Nowadays, not so much.
Anyway, a big channel would slow down the machine hosting its chat. In some cases, it would take more than a minute to establish a connection. Moreover, the problem would spread to all other channels assigned to the same host. With only 8 machines, that meant 1/8th of all channels! Throwing money at the problem (adding more machines) would reduce the blast radius, but wouldn’t solve the core issue. Having chat working well for large streams was a priority for Twitch.
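To make the blast radius concrete, here is a minimal sketch in Go of the eight-machine layout described above. It is an illustration, not Twitch's actual code: the `hostFor` and `blastRadius` helpers, the channel names, and the channel count are all hypothetical, and a hash stands in for the random assignment so that the result is repeatable.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numHosts mirrors the 8 chat machines mentioned in the article.
const numHosts = 8

// hostFor maps a channel name to a host index. The article says channels
// were distributed randomly; a hash is an assumption made here only to keep
// the sketch deterministic.
func hostFor(channel string, hosts int) int {
	h := fnv.New32a()
	h.Write([]byte(channel))
	return int(h.Sum32() % uint32(hosts))
}

// blastRadius returns every other channel that lives on the same host as
// the hot channel, and would therefore slow down along with it.
func blastRadius(hot string, channels []string, hosts int) []string {
	target := hostFor(hot, hosts)
	var affected []string
	for _, c := range channels {
		if c != hot && hostFor(c, hosts) == target {
			affected = append(affected, c)
		}
	}
	return affected
}

func main() {
	// Hypothetical channel population, for illustration only.
	channels := make([]string, 0, 8000)
	for i := 0; i < 8000; i++ {
		channels = append(channels, fmt.Sprintf("channel-%d", i))
	}

	hot := channels[0]
	affected := blastRadius(hot, channels, numHosts)
	fmt.Printf("%d of %d channels share a host with %s (%.1f%%)\n",
		len(affected), len(channels), hot,
		100*float64(len(affected))/float64(len(channels)))
}
```

With an even spread, roughly one channel in eight ends up on the hot channel's host. Raising `numHosts` shrinks that share, but the hot channel's own machine is just as overloaded as before, which is why adding machines alone did not fix the core issue.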
Our first attempts to build microservices were done to break up bottlenecks like these. Engineers worked against the clock to find new solutions that could scale. There was a lot of experimentation, and most attempts didn’t even make it to production. For example, the team working on the chat system initially attempted to improve the existing architecture by using Sidekiq over a RabbitMQ backend. Of course, that implementation was inside Rails, and only made the monolith bigger. Next, they tried to extract the chat system into a Node.js service, but unfortunately, the proof of concept was not able to scale due to a new bug found in the Node.js core (it was still a young technology). Looking for other options, they found Python with Tornado, which seemed like it could work just as well, but with fewer surprises. Finally, the Python implementation became the new chat service. While that initial extraction would not qualify as a modern microservice, it was a step in the right direction.
In 2012, the Go language was gaining a lot of popularity, especially with the launch of Go 1.0. A number of Twitch engineers attended a GoSF meetup to determine one thing: was Go ready for production?
Emmett Shear, Twitch’s CEO (previously the CTO), was supportive of the choice, but there were still concerns about how well Go services could handle real-world traffic.
This led to an experiment that became the first Go program at Twitch. We wrote a tool in Go, a fake service, that saturated the browser with a huge number of chat messages. This tool helped identify a bug in the chat system, but it also did something more important: it safely validated assumptions about the language, setting the stage for all the upcoming changes.
The first Go code to run in production was a tiny Pubsub server (aka the message exchange). It was a small but important piece of the chat system. With only about 100 lines of code, it handled every message at Twitch through a single thread.
This was an excellent opportunity to try Go, and after an initial proof of concept, the team replaced the Pubsub server with a new Go implementation. It was an easy win. Per-thread performance doubled compared with the previous Python implementation, and the server could be modified to run across multiple threads using goroutines. Nice!
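For readers who have not seen a message exchange in Go, the following is a rough sketch of the idea, not the server Twitch actually ran. The `Broker` type, its methods, the topic name, and the drop-when-full policy are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Broker is a hypothetical in-memory message exchange.
type Broker struct {
	mu   sync.RWMutex
	subs map[string][]chan string // topic -> subscriber channels
}

// NewBroker returns an empty broker.
func NewBroker() *Broker {
	return &Broker{subs: make(map[string][]chan string)}
}

// Subscribe registers a subscriber on a topic and returns its receive-only
// channel. The buffer size is an arbitrary choice for this sketch.
func (b *Broker) Subscribe(topic string, buffer int) <-chan string {
	ch := make(chan string, buffer)
	b.mu.Lock()
	b.subs[topic] = append(b.subs[topic], ch)
	b.mu.Unlock()
	return ch
}

// Publish offers msg to every subscriber of topic and returns how many
// accepted it. A subscriber whose buffer is full is skipped, so one slow
// reader cannot block the publisher (an assumed policy, not Twitch's).
func (b *Broker) Publish(topic, msg string) int {
	b.mu.RLock()
	defer b.mu.RUnlock()
	delivered := 0
	for _, ch := range b.subs[topic] {
		select {
		case ch <- msg:
			delivered++
		default: // buffer full: drop the message for this subscriber
		}
	}
	return delivered
}

func main() {
	b := NewBroker()
	viewers := []<-chan string{
		b.Subscribe("chat.example", 16),
		b.Subscribe("chat.example", 16),
	}

	// Each viewer reads on its own goroutine, so the Go runtime can
	// schedule them across multiple OS threads.
	var wg sync.WaitGroup
	for i, v := range viewers {
		wg.Add(1)
		go func(id int, in <-chan string) {
			defer wg.Done()
			fmt.Printf("viewer %d got %q\n", id, <-in)
		}(i, v)
	}

	b.Publish("chat.example", "hello chat")
	wg.Wait()
}
```

The read lock in `Publish` lets several publishers fan out messages concurrently, while `Subscribe` takes the write lock only briefly. That is one simple way a single-threaded exchange can be opened up to multiple goroutines.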
The new Pubsub server added to the momentum of the Go programming language at Twitch. Other teams were also experimenting with it, and soon the next important Go microservice was created. The video team named the new service Jax.
Jax was designed to list the top livestreams for each category on the Twitch homepage. The challenge was indexing live data in near-real time, while keeping the source of data properly organized, so it could be owned by different teams. The video system knows about the number of live viewers, but it doesn’t care about category metadata. Building the new Jax service outside of Rails, and outside of the video system, allowed for a proper separation of concerns. After that, Jax went through multiple iterations until the initial effort could pay off: first using Postgres, then in-memory indexes, and finally Elasticsearch.
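The separation of concerns can be pictured with a small in-memory index, loosely in the spirit of the second iteration mentioned above. This is a sketch under assumptions rather than Jax itself: the `Index` and `Stream` types, their method names, and the sample channels are invented for the example, and the index is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"sort"
)

// Stream is the joined view returned to callers.
type Stream struct {
	Channel  string
	Category string
	Viewers  int
}

// Index is a hypothetical in-memory index that keeps the two data sources
// separate, so each can be fed by the team that owns it.
type Index struct {
	viewers  map[string]int    // channel -> live viewers (from the video system)
	category map[string]string // channel -> category (from a metadata source)
}

// NewIndex returns an empty index.
func NewIndex() *Index {
	return &Index{viewers: map[string]int{}, category: map[string]string{}}
}

// SetViewers records the live viewer count for a channel.
func (ix *Index) SetViewers(channel string, n int) { ix.viewers[channel] = n }

// SetCategory records which category a channel is streaming in.
func (ix *Index) SetCategory(channel, category string) { ix.category[channel] = category }

// Top returns up to n live streams in a category, most viewers first.
// Ties are broken by channel name so the order is stable.
func (ix *Index) Top(category string, n int) []Stream {
	var out []Stream
	for ch, cat := range ix.category {
		v, live := ix.viewers[ch]
		if cat == category && live {
			out = append(out, Stream{Channel: ch, Category: cat, Viewers: v})
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Viewers != out[j].Viewers {
			return out[i].Viewers > out[j].Viewers
		}
		return out[i].Channel < out[j].Channel
	})
	if n < 0 {
		n = 0
	}
	if len(out) > n {
		out = out[:n]
	}
	return out
}

func main() {
	ix := NewIndex()
	// Sample data, made up for the example.
	ix.SetCategory("alice", "Chess")
	ix.SetCategory("bob", "Chess")
	ix.SetCategory("carol", "Music")
	ix.SetViewers("alice", 1200)
	ix.SetViewers("bob", 20000)
	ix.SetViewers("carol", 300)

	for _, s := range ix.Top("Chess", 10) {
		fmt.Printf("%s: %d viewers\n", s.Channel, s.Viewers)
	}
}
```

Scanning and sorting on every query is fine for a toy example. A production index would keep the data pre-sorted or hand the ranking to a search engine, which is in line with the later move to Elasticsearch.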
Building the initial few microservices was messy and expensive, but having working examples was crucial.
A Quick Note About the Frontend
The evolution of the frontend is a long story by itself. It was rewritten again in React.js between 2017 and 2019. Curiously enough, unlike the backend, the frontend became a successful monolith (see Guiding A Monolith With A Gentle Touch).
Breaking the Monolith at Twitch: Part 2
This is just part of the story. Twitch’s transition to microservices took years to complete, and there were many challenges and benefits that came with the switch. Microservices aren’t always the best fit for every team.
Check out Part Two to learn about our journey, the lessons we learned, and whether your company should consider making the jump from monolith to microservices.
Thanks for reading! This blog post was developed with the help of many individuals, most of whom were directly involved in the migration. Thank you: Angelo Paxinos, Daniel Lin, Jordan Potter, Jos Kraaijeveld, Marc Cerqueira, Matt Rudder, Mike Gaffney, Nicholas Ngorok, Raymond Lin, Rhys Hiltner, Ross Engers, Ryan Szotak, Shiming Ren, Tony Ghita.