Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298

24 min

• May 13, 2026

Summary

Snap's engineering team accelerated their data processing pipeline handling 10+ petabytes daily by integrating NVIDIA Spark Rapids and Google Cloud, achieving a 76% reduction in job costs. The company creatively leveraged idle GPU capacity from their online inference infrastructure during off-peak hours to power batch data processing workloads on Kubernetes, requiring zero code changes to existing Spark jobs.
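As a rough illustration of what "zero code changes" can look like in practice, the sketch below builds a `spark-submit` command that turns on the RAPIDS Accelerator purely through configuration, leaving the job script untouched. The config keys are the plugin's standard ones; the function name, the job script name, and the GPU sharing ratio are assumptions made for this example, not Snap's actual settings. It also assumes the RAPIDS jar is already on the classpath (for example, baked into the container image).

```python
def rapids_submit_args(job_script, gpus_per_executor=1, tasks_per_gpu=4):
    """Build spark-submit arguments that enable GPU acceleration via config only.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    conf = {
        # Load the RAPIDS Accelerator SQL plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Allow supported SQL/DataFrame operators to run on the GPU.
        "spark.rapids.sql.enabled": "true",
        # Standard Spark resource scheduling: GPUs requested per executor.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fraction of a GPU each task claims, so several tasks share one GPU.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(job_script)  # the existing PySpark job, unmodified
    return args


if __name__ == "__main__":
    print(" ".join(rapids_submit_args("experiment_metrics.py")))
```

Because everything lives in submit-time configuration, the same job script can be launched with or without these flags, which is what makes a CPU fallback straightforward.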

Insights

GPU acceleration for data processing requires architectural rethinking beyond simple hardware swaps—Snap built an entire Kubernetes-based platform to manage preemption, fallback mechanisms, and resource sharing across inference and batch workloads
Zero-code migration paths (like Spark Rapids) dramatically reduce adoption friction and enable rapid production deployment; Snap moved from prototype to full production in 8-9 months
Temporal usage patterns in global consumer platforms create exploitable GPU capacity windows—Snap identified 1-5 a.m. idle inference GPU time as a viable batch processing window
Statistical rigor at scale requires continuous innovation in experimentation methodology—heterogeneous treatment effects, variance reduction, and sample size mismatch detection are table stakes for A/B testing at billion-user scale
Multi-vendor partnerships (NVIDIA, Google Cloud, Snap) with continuous knowledge sharing and collaborative problem-solving are essential for complex infrastructure migrations at enterprise scale
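The "sample size mismatch" check mentioned above is commonly known as a sample ratio mismatch (SRM) test. A minimal sketch follows, assuming a two-arm experiment and a chi-square goodness-of-fit test; the episode does not describe Snap's actual method, so the function name, the default 50/50 split, and the alert threshold are illustrative assumptions.

```python
import math


def srm_p_value(control_n, treatment_n, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split.

    A very small p-value suggests the observed arm sizes differ from the
    planned allocation by more than chance would explain.
    """
    if not 0.0 < expected_treatment_share < 1.0:
        raise ValueError("expected_treatment_share must be strictly between 0 and 1")
    total = control_n + treatment_n
    if total <= 0:
        raise ValueError("need at least one observed user")
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


if __name__ == "__main__":
    p = srm_p_value(control_n=5200, treatment_n=4800)
    # 0.001 is a commonly used SRM alert threshold; treat it as an assumption.
    print("possible sample ratio mismatch" if p < 0.001 else "split looks consistent")
```

A check like this runs before any metric is read: if users in one arm stopped showing up, the remaining populations are no longer comparable and the experiment results should not be trusted as-is.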

Trends

GPU capacity optimization through temporal load shifting and cross-functional resource sharing becoming standard practice
Kubernetes consolidation of batch and serving workloads to maximize hardware utilization and reduce infrastructure fragmentation
Zero-code acceleration frameworks enabling rapid adoption of GPU computing without engineering rewrites
Experimentation platforms evolving beyond basic A/B testing to include heterogeneous treatment effects and advanced statistical methods
Cost-per-compute becoming primary optimization metric, driving hardware efficiency over raw performance gains
Graceful degradation patterns (GPU→CPU→Dataproc fallback) as operational requirement for production data pipelines
Global consumer platforms leveraging geographic and temporal usage patterns for infrastructure optimization
NVIDIA libraries (Spark Rapids, Aether) becoming standard components of modern data stack architecture
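The temporal load-shifting idea can be sketched as a simple gate that a batch scheduler consults before requesting inference GPUs. This is an illustration only: the 1 a.m. to 5 a.m. window comes from the episode, while the fixed UTC offset, the function name, and the idea of keying the window to a single market's local time are assumptions.

```python
from datetime import datetime, time, timedelta, timezone


def in_idle_window(now_utc, utc_offset_hours=-8, start=time(1, 0), end=time(5, 0)):
    """Return True if the market's local time falls inside the idle window.

    utc_offset_hours is a hypothetical fixed offset for the largest market;
    a real scheduler would use proper time zones and live utilization data.
    """
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return start <= local.time() < end


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print("borrow inference GPUs" if in_idle_window(now) else "stay on regular capacity")
```

A clock-based gate is only a first filter; as the episode notes, serving traffic always takes priority, so borrowed capacity still has to be preemptible.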

Topics

GPU-Accelerated Apache Spark
Data Pipeline Optimization
Kubernetes-Based Batch Processing
A/B Testing at Scale
Experimentation Platforms
Cost Reduction in Data Infrastructure
GPU Capacity Management
Heterogeneous Treatment Effects Detection
Variance Reduction in Experiments
Sample Size Mismatch in A/B Tests
Google Cloud Dataproc
NVIDIA Spark Rapids
NVIDIA Aether Tuning
Preemption and Fallback Mechanisms
Augmented Reality and AI Infrastructure

Companies

Snap

Primary subject; 940M+ monthly active users, with an experimentation platform that processes 10+ petabytes daily

NVIDIA

Provided Spark Rapids and Aether solutions enabling GPU acceleration with zero code changes and automatic tuning

Google Cloud

Infrastructure partner providing Dataproc, GKE, and Kubernetes services for Snap's data platform

People

Prudhwi Vatala

Guest discussing Snap's GPU acceleration journey, experimentation platform, and data infrastructure optimization

Noah Kravitz

Podcast host conducting interview about Snap's data processing acceleration

Quotes

"We were able to cut almost about 76% of our job costs as a result of this migration."

Prudhwi Vatala•Opening and closing segments

"We didn't have to change a single thing about how we ran the jobs. That was the beauty of it. Zero code changes."

Prudhwi Vatala•Mid-episode

"We were able to cut down the number of cores required by like 62%. The memory footprint, we could drop it by like 80%."

Prudhwi Vatala•Opening segment

"Migrating a production pipeline with 10 plus petabytes from prototyping exploration to full production in a matter of about eight to nine months is phenomenal."

Prudhwi Vatala•Late episode

"Snap has had a massive impact on the planet. Having a direct role to play in it is a great feeling."

Prudhwi Vatala•Closing segment

Full Transcript

We were able to cut almost about 76% of our job costs as a result of this migration.

76?

76. It's phenomenal. I mean, for the engineers out there, we were able to cut down the number of cores required by like 62%. The memory footprint, we could drop it by like 80%. So phenomenal results. The results speak for themselves.

Welcome to the NVIDIA AI Podcast. I'm Noah Kravitz. I'm here with Prudhwi Vatala. Prudhwi is the head of engineering platforms at Snap, and we're here to talk about data processing, and in particular, how a social platform with more than 940 million active users accelerated their data pipeline. Prudhwi, welcome to the NVIDIA AI Podcast. Thanks so much for taking the time to join us.

Yeah, thanks for having me here, Noah.

So maybe we can start with the basics. Tell us a little bit about, well, about what Snap is now. I'm old, but I still think of it, you know, the Snapglasses and everything, but Snapchat, obviously a huge social platform. So maybe tell us a little bit about Snap and then your role there.

Absolutely. Yeah. I mean, Snapchat at this point is pretty much a household name. You know, it's Snap as a company. It's interesting that you bring up the Spectacles because Snap as a company believes that the camera is at the center of, you know, improving how people communicate and improve their lives, you know, in the digital world, so to speak. So we've been steadfast on that belief. And, you know, Snap right now is at the intersection of augmented reality, AI, and visual communication. Like you said, serving close to a billion monthly active users. I've been at Snap for a while now. And I lead a multifaceted organization. We do a little bit of it with big data infrastructure, a little bit of it with developer productivity, and a little bit of it with enterprise AI and whatnot. So, yeah.

And so when we talk about accelerating data processing, what does that mean to you? What does that mean for Snap and thinking about the scale that you operate on?
Can you just talk a little bit about what it means to accelerate data at that level?

Absolutely. That's a great question. As you can imagine, with as many users as we have, and Snapchat in particular is a very complex application. So you can imagine the scale at which we operate, especially on the data processing side. My team's experimentation platform is dealing with 10-plus petabytes each day. It's a massive scale, right?

It's a huge scale, yeah.

And then we have a strict SLA in the morning because experimentation results need to be ready for developers, product managers, data scientists to act on as early as possible so that they can take appropriate action. So for us, accelerating data processing basically means instead of throwing more and more CPUs at the problem, figuring out a way to flatten that scale curve. So in this particular scenario, it was about figuring out how to leverage GPUs for improving our workloads, making sure they run faster, cheaper, and scale linearly or sublinearly. Unlike right now, where it's definitely superlinear with feature areas. So that's what accelerating means.

So you mentioned experimentations. What does that mean when you're conducting experiments at Snap? What does that look like? And then maybe how does that fit into, is that where the 10 petabytes of data each morning comes from? Or we can talk about that.

Yeah, absolutely. So this 10 petabytes of data is only about the experimentation platform. The big data across Snap is far wider.

Sure.

So experimentation, it's a little bit about Snap's product philosophy. We believe that experimentation, safety, and privacy are core pillars for our product development and iteration.
Like when we are thinking about new product areas, when we are shipping new product features to our half a billion daily active users across the globe, we need to think about how the users are receiving it, how they're responding to it, how they're using it, whether or not this is adding value to their daily lives, and also guardrailing things. Like, is it regressing their performance? Is it causing their devices to slow down? Or, you know, we need to be very particular about protecting their experiences as well.

And so, Prudhwi, along those lines with the experimentation, can you talk a little bit about the importance of A/B testing?

So, A/B testing is, you know, the concept of randomized controlled trials has been around for a long time, you know, especially in the clinical fields and whatnot. But with the digital revolution, it has become the mode of bringing statistical rigor to decision making at scale, right? So, that's what A/B testing adds to us. When we are dealing with this massive user base that is diverse by nature, from all walks of life, across the globe, and we are trying to delight them, we are trying to bring experiences to them, we need to make sure what we are delivering is buttoned down. It's actually really adding value the way we think it is. And at this scale, a lot of things can happen. And that's where having the statistical rigor grounded in holdouts and, well, controls and statistical methods comes in. Over the years, my team has added a bunch of statistical methods to our platform. Heterogeneous treatment effects detection. For example, you may think that a feature is performing well for the global audience, but it may not perform so well for a subset.

Right.

So figuring out those heterogeneous effects is one thing that we focus on. And, you know, at this scale, no matter how you slice your experiments, you are still allowing some bias to seep in, as in, you know, some power users may end up on one side of the experiment rather than the other.
So how do we make sure the distributions are evened out when the experiment results are read? That's the variance reduction aspect. So that's something my team built over time. And then, you know, sometimes when we ship a feature, if people don't like it, they might even just stop showing up, you know?

Right, right, right.

So that's the sample size mismatch problem. So we also do a bunch of that rigorously. So that's what A/B testing brings to the table.

So with all of the data processing every day, what made you think that maybe some NVIDIA tech put into the stack might help things out? How did that process start? And maybe you can talk about what you've integrated and what you're using.

Absolutely. So I'm really proud of this. I'm really proud of my team because over the years that I've been overseeing our platform, the number of users grew, like Snap, you know, ballooned, right, in terms of footprint. The number of features we shipped, like, you know, Spotlight, you know, AR features, AR lenses, and all of the AI features we shipped in the recent past. So they've also been adding a lot of additional dimensions to the platform. And my team was hard at work making sure we are, you know, scaling appropriately, even as all of this scale grows. And they've done a very good job of it historically for years now, maintaining the cost flat and, you know, performance predictable, meeting the SLAs and whatnot. And one thing we came across, you know, we came across NVIDIA Spark Rapids on one of the blog posts, and we saw NVIDIA is shipping this, you know, solution to speed up our PySpark workloads by anywhere from 3.6x performance versus 50%, you know, runtime, you know.

Okay.

It was phenomenal on paper. So that's what drew us.

I'm waiting to hear the numbers sound good. I'm waiting to hear the rest.

So we read those and we got super excited. And then our stack was, it still is, entirely Google Cloud for the experimentation platform. We loved working with them.
The Google Cloud Dataproc was phenomenal. They've been a fantastic partner to us throughout the scaling journey.

It's great to hear. Yeah.

And then when this news came out with Spark Rapids, we wanted to try it out. We did a bunch of benchmarking. We tried, obviously, like I said, we do a lot of things. So there is a lot of complexity to the nature of the jobs we run. So we had to benchmark each kind of job as well, like taking jobs that are heavy with joins and repartitions and shuffling of data that moves data around versus jobs that are purely unioning data from various places versus jobs that are purely aggregating, like running sums and whatnot. So we had to benchmark across all of them. And we noticed that even on Google Dataproc with Spark Rapids, we got about, I want to say, 3x plus improvement for the join jobs and about close to 2x for the union jobs and a little over 1.5x for aggregations. That's largely because CPUs are already good at aggregation. And then the other thing is GPUs by nature support parallelism and high bandwidth memory on the hardware itself. So that made it like a very good candidate for us to pursue.

And so you're running your GPU accelerated pipelines on Google Kubernetes. Is that right?

Yes. Yes. That has been a very interesting journey from, you know, testing out our pipelines with Dataproc GPUs to today. And one other thing, like with Spark Rapids, I want to mention it, we didn't have to change a single thing about how we ran the jobs. That was the beauty of it. Zero code changes.

Oh, that's amazing.

Zero code changes. So I'm into developer productivity and developer enablement. So for me, that was music to my ears.

Sure, of course.

So that was very impressive. So with Dataproc, which abstracts out the Spark runtime for us, and Spark Rapids, which didn't require us to change the jobs, it was phenomenal.

Yeah, amazing.

So it went very well. So we wanted to productionize this.
At our scale, pipelines aren't just monolithic, right? We do a bunch of sharding and then batching of work. So we were able to migrate one shard to production on Google Dataproc using 300 GPUs. The results were phenomenal. And then in the next phase we wanted to migrate 10 shards, for a total 50-plus shard architecture, and then it needed about 3 GPUs, which was still doable with Dataproc on GPUs. Because GPU capacity is on everybody's minds these days. So that was well and good, but then we didn't have a path forward after that. So we kind of hit a roadblock with on-demand GPU capacity. So we had to get creative. So we started looking around. We were like, where at Snap do we have GPU capacity that we can borrow? And that's where the real insight came for us. Snap has a global audience, and the Snapchatters' behavior is cyclical during the day. People wake up, they use Snapchat, and they go to bed, they don't. So what that meant was when some of our biggest markets went to bed, a lot of our online inference GPU capacity was sitting idle, somewhere between 1 a.m. and 5 a.m. So that was our opening, our opportunity to go tackle. And that brought about its own set of complexity, right? Because the online serving stack is not built for batch data processing. They were considered fundamentally different worlds, right? So all the online GPUs were tied to Kubernetes and GKE, and we were already on Google Cloud. GKE wasn't an issue for us at all. It was actually very welcome. So we had to migrate our workloads to a Kubernetes-based Spark runtime and host it on GKE so that we can leverage what the online GPUs had to offer. And for that, we had to actually build a data platform ground up.

Okay.
Because it's one thing for my team to just use this idle capacity. But at Snap, we wanted to make sure even as the online need for GPUs increased, as our AI footprint increased, we should still have any team at Snap be able to leverage that capacity for any of their needs as available. And then we had to also acknowledge that if a user wanted to see fresh Spotlight content, it supersedes the GPU need for experimentation.

Preemption had to be built in.

Yes. Yeah. So if we had a sudden spike in traffic, we had to give up GPU capacity. So with all of that in mind, we built out a platform ground up.

OK.

And then we started migrating. And we had a lot of blockers along the way and the team got really creative. Yeah, it was a phenomenal journey.

Amazing. Yeah. And so you're also running an accelerated Apache Spark pipeline.

Yes, yes. So a lot of our pipelines, at a high level, our pipelines are split into daily and hourly cadence. So hourly is mostly for guardrailing, like I said. We don't want to break users' experience no matter what. And having that hourly feedback cycle goes a long way in doing that. And then we also have daily pipelines, which serve as the statistical authority for decision making. So our first migration to GKE plus NVIDIA Spark Rapids was the hourly pipeline. Because, you know, speed mattered there far more, right? So we migrated and operationalized it. And during that process, we ran into a few corner cases. You know, if the GPU capacity wasn't available at like 11 a.m. when everybody was active on Snap, right, what do we do? So we had to figure out how to gracefully fall back from GPUs to CPUs. And then if the shared GKE resources themselves were the constraint, then we had to gracefully fall back from CPUs to Dataproc clusters. So building all of that with operational reliability in mind was also great.

Yeah.
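The fallback order described here (GPUs on GKE, then CPUs on GKE, then Dataproc clusters) can be sketched as an ordered chain of submitters. This is a hedged illustration, not Snap's platform code: the exception type, the function names, the job name, and the stub submitters are all hypothetical stand-ins for real cluster clients.

```python
from typing import Callable, List, Sequence, Tuple


class CapacityUnavailable(Exception):
    """Raised by a (hypothetical) backend that cannot take the job right now."""


Backend = Tuple[str, Callable[[str], str]]


def run_with_fallback(job: str, backends: Sequence[Backend]) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, result) from the first that accepts."""
    skipped: List[str] = []
    for name, submit in backends:
        try:
            return name, submit(job)
        except CapacityUnavailable:
            skipped.append(name)  # preempted or full: move on to the next tier
    raise RuntimeError(f"no capacity for {job!r}; tried {skipped}")


# Stub submitters standing in for real cluster clients (assumptions for the demo).
def submit_gke_gpu(job: str) -> str:
    raise CapacityUnavailable("inference traffic has reclaimed the GPUs")


def submit_gke_cpu(job: str) -> str:
    return f"{job} ran on GKE CPU nodes"


def submit_dataproc(job: str) -> str:
    return f"{job} ran on a Dataproc cluster"


CHAIN: List[Backend] = [
    ("gke-gpu", submit_gke_gpu),
    ("gke-cpu", submit_gke_cpu),
    ("dataproc", submit_dataproc),
]

if __name__ == "__main__":
    print(run_with_fallback("hourly-shard-07", CHAIN))
```

Only the capacity error triggers a fallback; any other failure propagates, so a genuinely broken job is not silently retried on a more expensive tier.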
Looking back on it, what learnings would you, you know, if there's a listener out there who's embarking on a similar project or trying to figure out maybe there's a, you know, like you said, kind of a daily cycle of when the GPUs are in use for inference and when they're not, they're thinking about, you know, borrowing GPUs from other parts of the company. Learnings you would share from this whole process? Is there a big takeaway, something that surprised you?

Right. So the direction that NVIDIA is headed in is phenomenal for these kinds of needs. You know, NVIDIA Spark Rapids, like I said, zero code written.

Yeah.

Zero code changed to enable it. We had to figure out the image building and environment difference and whatnot. The testing cycles, obviously, any production workload needs to go through that rigorous rollout process. So everybody needs to pay attention to it. But this is a real possibility, you know, the NVIDIA direction. The other thing that NVIDIA offered that really helped us a lot was NVIDIA Aether. It's another solution that gives us Spark tuning out of the box. Because especially when we had this fallback mechanism in place, where we had to go from GPUs to CPUs to Dataproc, the environments are different, the Spark parameters had to be different. So something like NVIDIA Aether giving us a starting point and making sure the tuning stayed consistent across all of these versions was also very helpful.

So you've mentioned, obviously, the work with NVIDIA and Google Cloud as well. Kind of taking a step back, sort of bigger picture, what are these partnerships and working hand in hand so closely with Google Cloud, with NVIDIA, what is that doing to the way that you and Snap see your roadmaps for both data and AI kind of growing going forward?

Yeah, it's, I mean, huge props to the NVIDIA team and the Google Cloud team, honestly. It's been a phenomenal three-way partnership like I've never seen in my career before.

Oh, amazing, yeah.

It was phenomenal.
And the impact speaks for itself, right? Like we were able to cut almost about 76% of our job costs as a result of this migration.

76?

76.

Wow.

It's phenomenal.

Yeah.

Right? I mean, for the engineers out there, like we were able to cut down the number of cores required by like 62%.

Amazing.

The memory footprint, we could drop it by like 80%. I mean, for the Spark nerds out there, we were able to cut out almost 120 terabytes of disk spill, disk and memory spill from our pipelines.

Wow.

Just vanished once we started doing all of this.

Yeah.

So that is one of the biggest headaches any data pipeline at scale runs into. So phenomenal results. The results speak for themselves. So without the partnership, this would not have been possible in the timescale that it was possible.

Right.

Like migrating a production pipeline with 10 plus petabytes from prototyping exploration to full production in a matter of about eight to nine months is phenomenal. And without the continuous back and forth and knowledge sharing and partnership across these three companies, this wouldn't have been possible.

That's great.

Yeah, and in terms of the roadmap, it definitely had an impact. Like I said, my team built this bottom-up data platform to enable any team at Snap to leverage the GPU capacity and what NVIDIA libraries have to offer. And we're already seeing movement with it, right? Even my own team started migrating other things that we haven't even tried out so far, experimenting with them, trying them out. Because even if we don't have idle capacity to fit all of our workloads all the time, if we can schedule things creatively, if we can move things around, we can maximize the capacity as much as we can. And a lot of other teams are also picking this up.

Yeah, it's fantastic. So you've been at Snap for eight years, is that right? Seven?

Close to eight.

Okay. And Snap's been around for about 15 years, give or take?

Yes.
Working at a social media, a huge social media platform over this span of time where social media has just, you know, become such a core part of the fabric of so many people's lives. What's it been like to be at Snap and to see the changes both, you know, I said at the beginning, right, I remember the Spectacles. That's my first thought of Snap. And obviously now Snapchat, you know, same lineage, same philosophy, different product, obviously, right? But what's it like to just have seen the evolution of social media and then also so many technological changes that impact, you know, what you're able to do and how you do it, as you were just describing? What's it been like from the inside?

Yeah, it's been an unbelievable experience, Noah. Like, that's what gets me up in the morning every day, you know? Like Snap, I mean, in the visual communication, AR, AI landscape, Snap has had a massive impact on the planet.

Yes.

Honestly, and having a direct role to play in it is a great feeling.

Right.

I've seen the company grow from, you know, the camera messaging, you know, picture messaging to what it is today. AR stories, which is something we invented and the whole world, including some newspapers. So the stories as a format. And then to your point about Spectacles, we did it before anybody else was even thinking about it. So the company is innovative. We come up with so many new things and running platforms inside means that I have to figure out a way to enable all of this, even as the company evolves. And having a front row seat to that evolution and playing a big part in it has been very fulfilling.

Fantastic. Prudhwi, for listeners, viewers who, there are some out there who haven't used Snapchat before, for anyone who wants to get the experience, but also to learn more about Snap and maybe about some of the technical work that you're doing. Are there, obviously, the website, their social media? Is there a research blog? Where can people go?
Absolutely. So we have an engineering blog that's pretty active. We share a lot of phenomenal work that engineers in the company are working on. And, you know, we are also participating in events like this and sharing our knowledge with the world. So, you know, and Snapchat, if you haven't used it, you should definitely give it a try. It's different from social media.

This is a true story. I got a snap from my younger son maybe 45 minutes before we sat down to do this, and it made my day. So absolutely, if you haven't. Prudhwi Vatala, thank you so much. This has been a great conversation, and I'm sure the developers, the engineers, and the audience hopefully have taken a lot from it. But thank you so much for taking the time to join us and all the best to you and everybody at Snap to keep changing the world for the better.

Thank you so much. Thanks for having me. Appreciate it.