Inside the Tech – Solving for Avatar Facial Expressions

Inside the Tech is a blog series that accompanies our Tech Talks Podcast. In episode 20 of the podcast, Avatars & Self-Expression, Roblox CEO David Baszucki spoke with Senior Director of Engineering Kiran Bhat, Senior Director of Product Mahesh Ramasubramanian, and Principal Product Manager Effie Goenawan about the future of immersive communication through avatars and the technical challenges we’re solving to enable it. In this edition of Inside the Tech, we talked with Engineering Manager Ian Sachs to learn more about one of those technical challenges, enabling facial expressions for our avatars, and how the Avatar Creation team’s work (within the Engine group) is helping users express themselves on Roblox.

What are the biggest technical challenges your team is taking on?

When we think about how an avatar represents someone on Roblox, we typically consider two things: How it behaves and how it looks. So one major focus for my team is enabling avatars to mirror a person’s expressions. For example, when someone smiles, their avatar smiles in sync with them. 

One of the hard things about tracking facial expressions is tuning the efficiency of our model so that we can capture these expressions directly on the person’s device in real time. We’re committed to making this feature accessible to as many people on Roblox as possible, and we need to support a huge range of devices. The amount of compute power someone’s device has is a vital factor in that. We want everyone to be able to express themselves, not just people with powerful devices. So we’re deploying one of our first-ever deep learning models to make this possible.

The second key technical challenge we’re tackling is simplifying the process creators use to develop dynamic avatars people can personalize. Creating avatars like that is pretty complicated: you have to model the head, and if you want it to animate, you have to do very specific things to rig the model, like placing joints and setting weights for linear blend skinning. We want to make this process easier for creators, so we’re developing technology to simplify it. They should only have to focus on building the static model. When they do, we can automatically rig and cage it. Then, facial tracking and layered clothing should work right off the bat.

What are some of the innovative approaches and solutions we’re using to tackle these technical challenges?

We’ve done a couple of important things to ensure we capture the right information for facial expressions. That starts with using the industry-standard Facial Action Coding System (FACS). FACS controls are the key to everything because they’re what we use to drive an avatar’s facial expressions: how wide the mouth is open, how much each eye is open, and so on. We can use around 50 different FACS controls to describe a desired facial expression.
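To make that concrete, here’s a small Python sketch of what an expression described by FACS-style control weights might look like. The control names and the [0, 1] range are illustrative assumptions, not Roblox’s actual control set.

```python
# Minimal sketch: a facial expression as a set of named FACS-style control
# weights, each clamped to [0, 1]. Control names here are hypothetical.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class FacsExpression:
    # Roughly 50 controls can describe a full expression;
    # any control not listed defaults to 0 (neutral).
    weights: Dict[str, float] = field(default_factory=dict)

    def set(self, control: str, value: float) -> None:
        # Clamp to the valid activation range before storing.
        self.weights[control] = max(0.0, min(1.0, value))

# Example: a smile with slightly raised brows and a half-closed left eye.
smile = FacsExpression()
smile.set("MouthSmileLeft", 0.8)
smile.set("MouthSmileRight", 0.8)
smile.set("BrowRaiseOuterLeft", 0.3)
smile.set("EyeCloseLeft", 0.5)
```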

When you’re building a machine learning algorithm to estimate facial expressions from images or video, you train a model by showing it example images with known ground truth expressions (described with FACS). By showing the model many different images with different expressions, the model learns to estimate the facial expression of previously unseen faces.

Normally, when you’re working on facial tracking, these expressions are labeled by humans, and the easiest method is using landmarks—for example, placing dots on an image to mark the pixel locations of facial features like the corners of the eyes. 

But FACS weights are different because you can’t look at a picture and say, “The mouth is open 0.9 vs. 0.5.” To solve this, we use synthetic training data: 3D face models rendered in known FACS poses from different angles and under different lighting conditions, so the ground-truth FACS weights come directly from the way each image was generated.

Unfortunately, because the model needs to generalize to real faces, we can’t train solely on synthetic data. So we pre-train the model on a landmark prediction task using a combination of real and synthetic data; that pre-training lets the model then learn the FACS prediction task from purely synthetic data while still generalizing to real faces.
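As a rough illustration of that two-stage strategy, here’s a minimal PyTorch-style sketch: a shared backbone is first pre-trained on landmark prediction with mixed real and synthetic data, and a FACS head is then trained on purely synthetic data. The architecture, output sizes, and optimizers are assumptions for illustration, not the production model.

```python
# Sketch of two-stage training: landmark pre-training, then FACS regression.
# All shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class FaceBackbone(nn.Module):
    """Shared feature extractor reused by both training stages."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, x):
        return self.features(x)

backbone = FaceBackbone()
landmark_head = nn.Linear(64, 2 * 68)   # e.g. 68 (x, y) landmarks
facs_head = nn.Linear(64, 50)           # ~50 FACS control weights

def pretrain_on_landmarks(loader):
    # Stage 1: real + synthetic images with landmark labels only.
    opt = torch.optim.Adam(list(backbone.parameters()) + list(landmark_head.parameters()))
    for images, landmarks in loader:
        opt.zero_grad()
        loss = nn.functional.mse_loss(landmark_head(backbone(images)), landmarks)
        loss.backward()
        opt.step()

def train_facs(loader):
    # Stage 2: purely synthetic images whose exact FACS weights are known.
    opt = torch.optim.Adam(list(backbone.parameters()) + list(facs_head.parameters()))
    for images, facs_weights in loader:
        opt.zero_grad()
        pred = torch.sigmoid(facs_head(backbone(images)))  # weights in [0, 1]
        loss = nn.functional.mse_loss(pred, facs_weights)
        loss.backward()
        opt.step()
```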

We want face tracking to work for everyone, but some devices are more powerful than others. This means we needed to build a system capable of dynamically adapting itself to the processing power of any device. We accomplished this by splitting our model into a fast approximate FACS prediction phase called BaseNet and a more accurate FACS refinement phase called HiFiNet. During runtime, the system measures its performance, and under optimal conditions, we run both model phases. But if a slowdown is detected (for example, because of a lower-end device), the system runs only the first phase.
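Here’s a minimal sketch of that kind of adaptive scheme, assuming hypothetical `base_net` and `hifi_net` callables and an illustrative frame budget; the real system’s measurements and thresholds will differ.

```python
# Sketch: always run the fast approximate pass, and add the refinement pass
# only while recent frames fit within the real-time budget.
import time
from collections import deque

FRAME_BUDGET_MS = 16.0          # assumed real-time target, ~60 fps
recent_frame_ms = deque(maxlen=30)

def estimate_facs(frame, base_net, hifi_net):
    start = time.perf_counter()

    coarse = base_net(frame)     # fast approximate FACS prediction (BaseNet-style)

    # Only refine when the rolling average frame time fits the budget.
    avg_ms = sum(recent_frame_ms) / len(recent_frame_ms) if recent_frame_ms else 0.0
    if avg_ms <= FRAME_BUDGET_MS:
        result = hifi_net(frame, coarse)   # more accurate refinement (HiFiNet-style)
    else:
        result = coarse                    # slower device detected: skip refinement

    recent_frame_ms.append((time.perf_counter() - start) * 1000.0)
    return result
```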

What are some of the key things that you’ve learned from doing this technical work?

One is that getting a feature to work is such a small part of what it actually takes to release something successfully. A ton of the work is in the engineering and unit testing process. We need to make sure we have good ways of verifying that our data pipeline is sound. And we need to ask ourselves, “Hey, is this new model actually better than the old one?”

All the pipelines we put in place before we even start the core engineering go into making the model good enough: tracking experiments, ensuring our dataset represents the diversity of our users, evaluating results, and deploying new models and gathering feedback on them. But that’s a part of the process that doesn’t get talked about as much, even though it’s so critical.
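As a toy example of the “is this new model actually better than the old one?” question, a comparison like the sketch below, run against a fixed held-out evaluation set with a required margin, is the kind of check such a pipeline might automate. The metric and threshold here are illustrative assumptions.

```python
# Illustrative sketch: compare two models' mean absolute FACS error
# on the same held-out evaluation set before promoting the new one.

def mean_facs_error(model, eval_set):
    # eval_set: iterable of (image, ground_truth_weights) pairs.
    total, count = 0.0, 0
    for image, truth in eval_set:
        pred = model(image)
        total += sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)
        count += 1
    return total / count

def is_improvement(new_model, old_model, eval_set, min_gain=0.005):
    # Require a margin so measurement noise alone doesn't trigger a rollout.
    return mean_facs_error(new_model, eval_set) < mean_facs_error(old_model, eval_set) - min_gain
```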

Which Roblox value does your team most align with?

Understanding what phase a project is in is key. During innovation, taking the long view matters a lot, especially in research, when you’re trying to solve important problems. But respecting the community is also crucial when you’re identifying the problems worth innovating on, because we want to work on the problems with the most value to our broader community. For example, we specifically chose to work on “face tracking for all” rather than just “face tracking.” And as you reach the 90 percent mark of building something, turning a prototype into a working feature hinges on execution and on adapting to the project’s stage.

What excites you the most about where Roblox and your team are headed?

I’ve always gravitated toward working on tools that help people be creative. Creating something is special because you end up with something that’s uniquely yours. I’ve worked in visual effects and on various photo editing tools, using math, science, research, and engineering insights to empower people to do really interesting things. Now, at Roblox, I get to take that to a whole new level. Roblox is a creativity platform, not just a tool. And the scale at which we get to build tools that enable creativity is much bigger than anything I’ve worked on before, which is incredibly exciting.
