Productizing Data Science at Twitch
A key function of data science at Twitch is using behavioral data to build data products that improve our products and services.
Some examples of products that data science has helped to launch include the AutoMod chat moderation system, the similar channel recommendations used for Auto Hosting, and the recommendation system for VODs. This post discusses some of the tradeoffs involved when building data products and presents our approach for scaling predictive models to millions of users.
The decision to build a data product at Twitch is often the result of exploratory analysis performed by a data scientist. For example, an investigation of our user communities may result in findings about which types of channels different groups of users are likely to follow. We can use these insights to build predictive models, such as a recommendation system that identifies similar channels on our platform. A data scientist might build a prototype model using collaborative filtering in Python that shows potential value for one of our products. The next step is to productize the model, turning it into a data product that we can use in our live services.
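To make this concrete, here is a minimal sketch of the kind of prototype described above: item-item collaborative filtering over a user-channel follow matrix, scored with cosine similarity. The input file and column names are hypothetical placeholders, not a description of Twitch's actual pipeline.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical export of (user_id, channel_id) follow pairs from the data warehouse.
follows = pd.read_csv("follows.csv")

# Build a sparse user x channel matrix of follow indicators.
users = follows["user_id"].astype("category")
channels = follows["channel_id"].astype("category")
matrix = csr_matrix((np.ones(len(follows)), (users.cat.codes, channels.cat.codes)))

# Channel-channel cosine similarity; row/column i corresponds to channel i.
similarity = cosine_similarity(matrix.T)

def similar_channels(channel_id, top_n=5):
    """Return the channels most similar to the given channel."""
    idx = channels.cat.categories.get_loc(channel_id)
    ranked = similarity[idx].argsort()[::-1]
    ranked = [i for i in ranked if i != idx][:top_n]
    return [(channels.cat.categories[i], similarity[idx][i]) for i in ranked]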
One of the main challenges faced when building data products is determining how to take a prototype of a predictive model and scale it to support millions of users. The data scientist who prototypes a model may be unfamiliar with building a robust, live service. Generally, one of two approaches is taken to address this issue:
Data science owns the model.
Data science hands off the model.
In the first approach, the data science team is responsible for training and running the model. Model evaluation is usually a batch process and the result can be an output file or database table that is used as input for a live service. For example, a classification model could be used to predict which users are most likely to follow a channel and output a file to S3 that is used to send out daily emails to targeted users. In the second approach, the data science team prototypes a model and then hands off the specification of the model to an engineering team that is responsible for deploying the model. An example of this approach is a decision tree model that flags whispers (direct messages) as spam in real-time.
In the daily email example, the data scientist is responsible for prototyping the model and owning the process for outputting the results to S3. In the spam example, the data scientist is responsible for defining the model, but hands off the model definition to the engineering team for deployment. Often, a translation must be performed between the prototype model and the production system, such as converting a Python script to Go code.
Model Factors
Here are some of the factors to consider when selecting an approach:
Response Time: Is the model real-time or a batch process?
Reliability: Does it need high availability?
Scalability: Can the model run on a single machine?
Model Maintenance: Will the model be frequently updated?
Expressivity: How complex is the model? Which systems support it?
Response time is the amount of latency that the system can have when making a decision. It is common for systems to run in a batch mode, where decisions for all users are made at a regular frequency, such as daily, or in a streaming mode, where decisions need to incorporate the most recent data available. A recommendation system for identifying similar channels can run in both modes: in the batch case, a daily or hourly script builds a table with similarity scores, while in the streaming case, live data about channel concurrency is used to score the different channels. Both approaches can be used as input to a live service that provides recommendations, but the data used in the batch process will be more stale than in the stream-based process. It is much easier for the data science team to own the model when a batch process is used.
Reliability is the level of availability needed for the system to be successful. If the model is powering a live service, such as spam detection, then it needs to be able to quickly return a response for every chat message sent on Twitch. If the model goes down, it directly impacts our user experience. However, if the system is not being used by a live service, then high availability may not be a requirement for the system. For example, if a model used for targeted emails goes down, it does not directly impact the user experience but it may impact our business metrics. If a model at Twitch needs high availability, then we partner with our engineering teams to hand off the model.
Scalability is the amount of processing power and memory needed for a model to operate across our user base. If the model is a simple matrix operation, then a single machine may be capable of meeting the system demands. However, if the model requires a large amount of memory or processing power, then a distributed system may be necessary to scale to our demands. One of the considerations when deciding to scale up a model to run on a cluster is determining which team owns the infrastructure. Our approach is to partner with engineering teams when the model needs to run as a distributed system. However, this usually involves translating the model to a new language or library that supports distributed execution.
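As a hedged sketch of what that translation step might look like, the single-machine similarity prototype above could be re-expressed with Spark's ALS implementation so it can run on a cluster. The input path and column names are hypothetical, and this is an illustration rather than Twitch's actual production system.

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("channel-recommendations").getOrCreate()

# Hypothetical export of (user_id, channel_id) follow pairs; both columns are
# assumed to be integer IDs, which ALS requires.
follows = (
    spark.read.parquet("s3://example-bucket/follows/")
    .withColumn("weight", F.lit(1.0))  # implicit feedback: every follow counts once
)

als = ALS(
    userCol="user_id",
    itemCol="channel_id",
    ratingCol="weight",
    implicitPrefs=True,
    rank=32,
)
model = als.fit(follows)

# Per-user channel recommendations, computed across the cluster.
recommendations = model.recommendForAllUsers(5)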
Another factor to consider is how frequently changes need to be made to the model. Changes to a model can include updates to parameter weights, such as coefficients in a logistic regression model, changes to the structure of the model, such as new branches included in a decision tree, or incorporating new features (variables) in the model. If a handoff approach is being used for a model, then a process needs to be established for making updates to the model over time. This can become a bottleneck if a translation is performed when handing off the model from the data science team to an engineering team. For example, updates to a Python script might require changes to production Go code. Usually, the easiest type of update to make is changing the parameter weights, since the parameters from the prototype model can be used directly by the production system. Updating the structure of the model is usually more complicated, since the model implementation may be different across the prototype and production systems. One approach to address this issue is using a model store, which enables a trained model to be saved from a script and then loaded by a variety of languages. Adding new features to the model is usually the most challenging type of update to make, and our approach is to decouple the ETL and model evaluation steps as much as possible.
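For the parameter-weight case, the handoff can be as small as a file of coefficients. Below is a minimal sketch under that assumption: a logistic regression is retrained on synthetic data standing in for real behavioral features, and only its weights are exported for the production system to apply. The feature names and output path are hypothetical.

import json
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for per-user behavioral features and a follow/no-follow label.
features = ["minutes_watched", "chats_sent", "days_since_signup"]
rng = np.random.default_rng(0)
X = rng.random((1000, len(features)))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

model = LogisticRegression().fit(X, y)

# Export only the parameter weights; the production service re-applies them
# as a dot product plus a sigmoid, with no dependency on the Python code.
spec = {
    "features": features,
    "coefficients": model.coef_[0].tolist(),
    "intercept": float(model.intercept_[0]),
}
with open("follow_model_params.json", "w") as f:
    json.dump(spec, f, indent=2)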
Expressivity describes the types of models supported by a system, and the complexity of authoring different types of models for this environment. SQL has a low degree of expressivity, because it is challenging to write models beyond linear and logistic regression models in this language, while Spark has a high degree of expressivity and supports a variety of machine learning models through its libraries. One of the challenges when handing off a model is that the expressivity of the training environment is usually higher than that of the production environment. For example, a data scientist may want to use a random forest to implement a classifier, but the production system may not directly support this model. Managed services such as Amazon ML are making it easier for data science teams to hand off models to production systems, but do not support all of the models available in R and Python workflows.
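To make the expressivity contrast concrete: once trained, a logistic regression reduces to a single scoring formula that even a low-expressivity environment such as SQL can reproduce, while a random forest has no comparably compact form. This sketch applies the hypothetical exported weights from the previous example.

import json
import math

with open("follow_model_params.json") as f:
    spec = json.load(f)

def follow_probability(row):
    """Score one user from raw feature values keyed by feature name."""
    z = spec["intercept"] + sum(
        weight * row[name]
        for name, weight in zip(spec["features"], spec["coefficients"])
    )
    return 1.0 / (1.0 + math.exp(-z))

print(follow_probability({"minutes_watched": 0.4, "chats_sent": 0.7, "days_since_signup": 0.2}))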
There are different tradeoffs for these factors depending on whether data science owns or hands off the model. If data science owns the model, then model maintenance is usually easier and a more expressive model can be applied. However, this approach doesn’t work as well when the system needs to use near real-time data, be highly available, or run on a large computing infrastructure.
Twitch’s Approach
Our approach at Twitch is to first prototype and demonstrate the value of a model before specifying a process for productizing it. In this initial phase, the data science team has ownership of the model and is responsible for running the model and making its output available to a product team. We use a variety of tools for performing this step, but the most common approach is to set up a Python script that runs as a cron job on an EC2 instance and outputs the results to S3. Once a model has demonstrated value and we want its output to be provided as a service, we hand it off. After the handoff, the engineering team owns the infrastructure for running the model and the data science team serves as a consultant for updates to the model.
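In practice, that initial phase can be as small as the following sketch, assuming a daily cron entry on an EC2 instance invokes the script. The bucket, key, and scoring step are hypothetical placeholders rather than an actual Twitch job.

import io
import boto3
import pandas as pd

def score_users():
    # Placeholder for the model evaluation step; a real job would load the
    # latest behavioral data and apply the trained model here.
    return pd.DataFrame(
        {"user_id": [1, 2, 3], "recommended_channel": ["channel_a", "channel_b", "channel_c"]}
    )

def main():
    scores = score_users()
    buffer = io.StringIO()
    scores.to_csv(buffer, index=False)

    # Publish the batch output to S3, where a downstream service
    # (e.g., the daily email sender) picks it up.
    boto3.client("s3").put_object(
        Bucket="example-data-science-bucket",     # hypothetical bucket
        Key="recommendations/daily_scores.csv",   # hypothetical key
        Body=buffer.getvalue(),
    )

if __name__ == "__main__":
    main()

A crontab entry such as "0 6 * * * python daily_scores.py" (the script name is likewise hypothetical) is enough to keep the output fresh each day.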
There are a few steps we are taking to improve the hand off between the data science and engineering teams:
Decouple the ETL and model application steps.
Establish best practices for authoring model scripts.
Leverage managed services such as Amazon ML and Kinesis Analytics.
Decoupling the ETL work from the model application step enables us to leverage our Redshift instance for a huge amount of processing power during the prototyping stage. It also results in shorter scripts that are easier to translate to a production environment, which helps with model maintenance. Establishing best practices for authoring model scripts reduces the amount of time it takes to translate a model from the prototype to the production environment. These practices include using models with the same level of expressivity as the production environment and running the prototype on infrastructure similar to the production system. One of the easiest ways of doing hand offs is using managed services such as Kinesis Analytics, because authoring a model is similar to writing SQL queries and the production environment resembles the test environment. However, many managed services have limited expressivity and do not fit our use cases.
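As an illustration of the decoupling point, here is a hedged sketch of how it might look: the aggregation runs as SQL on Redshift, and the model script only scores the small result set. The table, columns, connection details, and coefficients are hypothetical.

import pandas as pd
import psycopg2

# ETL step: Redshift performs the scan and aggregation, so the model script
# never touches raw event data.
FEATURE_QUERY = """
    SELECT user_id,
           SUM(minutes_watched)       AS minutes_watched,
           COUNT(DISTINCT channel_id) AS channels_watched
    FROM   viewing_sessions
    WHERE  session_date >= DATEADD(day, -30, CURRENT_DATE)
    GROUP  BY user_id
"""

def load_features():
    conn = psycopg2.connect(
        host="example-redshift-host", port=5439, dbname="warehouse",
        user="analyst", password="example-password",
    )
    try:
        return pd.read_sql(FEATURE_QUERY, conn)
    finally:
        conn.close()

def apply_model(features):
    # Model application step: deliberately thin, so translating it to the
    # production environment stays cheap.
    return features.assign(
        score=0.1 * features["minutes_watched"] + 0.5 * features["channels_watched"]
    )

if __name__ == "__main__":
    print(apply_model(load_features()).head())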
We’re continuing to refine our process for putting data science into production. If you’re interested in helping Twitch build predictive models at scale, take a look at the openings on our science team page.