ML Platform teams with Prassanna Ravishankar from Papercup

Speaker 1:

Hi, and welcome to Fireside AI. My name is Catherine Breslin, and I'm here to talk about how companies build AI technology. Today, I'm joined by Prassanna Ravishankar from Papercup in London. Pras, welcome to the show.

Speaker 2:

Thank you. Thank you.

Speaker 1:

So for those who are meeting you for the first time today, who are you? Tell us a little bit about yourself.

Speaker 2:

Awesome. So I go by Pras. I've been in the ML space for roughly around fifteen years, give or take a year here and there. I started off my journey actually as a computer vision researcher, went deep into that thing, and then came out on the other side quite interested in the engineering behind machine learning. So then I became a machine learning engineer.

Speaker 2:

And post that, over the last maybe half a decade, I've been in the ops space. So, predominantly, my passion's been around MLOps and building platforms as tools and products. I currently work with Papercup. Papercup is an AI dubbing solution and predominantly works on converting videos from one language to another. Within that, we have multiple teams, and one of the teams is the machine learning team.

Speaker 2:

The other team is the product team, and I sit in between both of them, running a team and an initiative called the platform initiative, which is a combination of MLOps, data, and SRE, essentially.

Speaker 1:

Fantastic. So that's a great journey from sort of computer vision through MLOps and on into the dubbing and audio space as well. So very interesting to see how all of those have evolved over the past fifteen years. You mentioned the platform team. Like, you're looking at this platform initiative.

Speaker 1:

So platform is one of those words that gets thrown around a lot. What's your view on what a platform team is?

Speaker 2:

So this is quite a debatable thing usually, but, essentially, the roots of the platform team lie in the birth of the DevOps and SRE boom, and the spirit behind the DevOps boom. Right? So you had the developer workflow and you had the ops workflow, and these got combined into the DevOps skill set, essentially. And this helped you ship things faster by introducing a lot of automation into the shipping process. So CI/CD was a big part of it.

Speaker 2:

This happened in the early two thousands and evolved into SRE, which Google spearheaded. And as that became more and more popular, you had almost this explosion of internal tooling that started developing. And with the growth of internal tooling, and the maintenance required for that internal tooling, you had the birth of the platforms team. And pushing that paradigm onto AI companies especially, because you have a lot of specific tooling around your data, specific tooling around how you train your models, and specific tooling around how you deploy your models, it becomes almost necessary for a machine learning or an AI company to start having the skill set of a platforms team.

Speaker 2:

Whether it's formally called platforms or not is a cultural debate specific to that company.

Speaker 1:

And so the platform team is then building some of that internal tooling specifically to help with ML model building, AI model building?

Speaker 2:

Correct. So there's ML stuff and ML-adjacent stuff, like workflow orchestration. So when it comes to pure ML, it's about how can I scale my operations better, essentially? Right? And this could be model training.

Speaker 2:

This could be data processing. This could be model serving. And how can I make it cheaper and faster? And in the end, the new paradigm that's emerging is around developer experience as well. How can I make it easier for the developers in my team to use the tooling that I push?

Speaker 2:

And at least that's the anchor that I use. Like, hey, how can I anchor on the developer and make life easier for the developer?

Speaker 1:

So there's this real parallel here between DevOps and how software engineering has developed, and, I guess, MLOps and how machine learning engineering is developing. So how do you think about MLOps and its role at the moment?

Speaker 2:

Interesting. Interesting question. And this is, again, debated a lot by the community. Right? A lot of people think MLOps is just model operations or model deployment, that is, build a Docker image and deploy it somewhere.

Speaker 2:

The way at least I personally look at MLOps is how can I optimize the entire machine learning life cycle? And this goes all the way from data to model deployment, and then integration into your final product, and how it's consumed by users downstream. Right? Sometimes your model itself is a product like ChatGPT and things like that. And sometimes it's used downstream by other services.

Speaker 2:

And so your ownership roughly ends only over there. And across this life cycle, there's multiple tasks. It could be, how can I set up a distributed training infrastructure so I can scale my training more effectively? It could be, how can I scale to multiple aspects of my data? How can I train 10 models in the time of one model?

Speaker 2:

And how can I deploy effectively and cheaply depending on your final user? Right? So in certain cases, you have enough users. You probably have a million or a hundred million users, in which case you optimize for the scale of ten to a hundred. In some cases, you have the occasional user, and then you have to optimize for the scale of zero to one.

Speaker 2:

So it's about understanding where the user is in this ecosystem, and going all the way back to data.

Speaker 1:

Fantastic. So you have this platform team, which is really thinking about these questions of how you can scale up the model building and the model serving so that your developers, your machine learning engineers who are building the models, are actually able to do their job faster and more easily. And these are complex problems to be dealing with, and they're quite new. So what sort of skills do you look for on your ML platform team? What sort of skill sets are you trying to build up?

Speaker 2:

So there's the hard skills and the soft skills. Right? Or let's say less-hard skills. Let's not call them soft skills. It's almost a spectrum.

Speaker 2:

So on one level, we're looking to constantly be connected with our developer. So on one level, that is something like understanding Python, writing good Python, writing and understanding good PyTorch, especially if it's MLOps. And it's also about understanding where things get executed. So if it's training on the cloud, getting a good understanding of that particular cloud. Hey.

Speaker 2:

These are buckets. These are containers. Our containers are hosted in Kubernetes, and this is the networking between the containers. So, like, when I'm starting to talk about the cloud aspect, an important piece of that becomes system design. So this is the less technical skill that I'm talking about.

Speaker 2:

Essentially, we're working a lot with cloud-native infrastructure. So a lot of the work we do involves building systems from scratch, whether that is something for model serving, something to improve the network speeds in my distributed training infrastructure, or something like scaling my data pipelines to 200 GPUs. All of these require, essentially, system design skills. And at the end of the day, the super soft skill that I'm looking for is this hunger for actually pushing surgical changes to make massive impacts rather than minor ones.

Speaker 1:

This technology hasn't really been around that long. So you're not looking for people who have twenty, thirty, forty years experience in this technology, but more who have the the willingness to learn and the eagerness to adapt to what's going on because it's a pretty fast paced changing world, right, the world of MLOps and platforms.

Speaker 2:

Correct. And the world of ML itself is insanely fast changing. Right? So you have frameworks emerging every single day, and now you have this whole second layer of frameworks evolving around models, like foundation models, as well.

Speaker 1:

So with this sort of multi-skilled, fast-changing world, I guess one question: what's your top tip for keeping up with what's going on at the moment? Because this is a question everybody asks. There's so much coming your way. What's your one tip for how to keep up with this fast-paced world?

Speaker 2:

I don't have a one-stop solution. One is Twitter. Follow the right people, and keep retweeting. Retweeting is like your kind of bookmarking mechanism.

Speaker 2:

The other one is follow some famous people, like Andrej Karpathy. If you know of some particular ML models, follow their authors on Twitter. The other thing is also Discord. Like, if you are interested in a particular repository on GitHub or a particular initiative, see if they have a Discord server and join their Discord.

Speaker 2:

The one thing that I really do is use a lot of ChatGPT's search feature. So I just keep feeding my curiosity by periodically asking it the same question, like every three months. Hey, has there been a new framework to do this? Or has there been a new framework to do that? And that's one way I try to keep up with things.

Speaker 1:

Fantastic. Great advice there. I know everybody struggles with this, so it's great to have some tips. But back to platform teams, after that slight detour into keeping up with the world. A platform team, like you describe it, has obviously got lots of different people invested and interested in what it's doing. You've talked about machine learning engineers, product, business development, lots of different teams interested. So what have you found to be really good ways to organize yourself and work with the various stakeholders that your platform team has?

Speaker 2:

So I think, me personally, I was slightly lucky in my journey because I was exposed to the research side of things and the engineering, or product, side of things. So I generally got a little bit of both in my background. But at least the way I look at things is that it's easier to look at stakeholders as personas rather than teams, or rather than particular heads of teams. So at least for our team, we have a finance persona who's constantly tracking bills and making sure that our spend is not going above a certain amount. We have a product persona; any kind of downtime in our services immediately affects their experience.

Speaker 2:

We also have a machine learning researcher persona. Like, if we are introducing a new kind of tooling or a new kind of platform or a new kind of service, are we affecting their lives? And if we are, is it a good trade-off based on what they are doing currently? And I personally anchor myself to these personas as we are developing or ideating through various ideas. And that helps us, like, preempt stakeholder questions.

Speaker 2:

It helps us preempt interactions with stakeholders. And in a way, it also helps us make the right design decisions during the whole design process.

Speaker 1:

So this is about sort of putting yourself in the shoes of those different stakeholders and thinking through their perspective as you think about what you're doing.

Speaker 2:

Correct. Because these are your stakeholders at the end of the day. You have your head of finance, who is your stakeholder. You have the product team, that's your stakeholder. You have your machine learning engineering team, that's your stakeholder. Like, if I, for example, do a cloud switch and forget to do the right kind of security migration, and the entire product is down, which causes downtime for multiple users, was there a way I could have preempted that?

Speaker 2:

If there was, great. Let's do it from the next time onwards. And this just helps us take those questions and almost lead with those questions in any idea that we start thinking about. And in a way, it also helps us ask the question of whether a certain idea is useful or not. Are we in a premature engineering phase, or are we actually solving a real problem?

Speaker 1:

Fantastic. I love this way of sort of getting inside the minds of your stakeholders and bringing their perspective into your planning and thought process. So as you've grown the platform team and taken on this initiative at Papercup, what impact have you seen from separating out the platform and building up this platform initiative?

Speaker 2:

There's impact from a couple of angles. Right? There's an internal impact piece, and then there's an external impact piece, let's say. Internally, it's helped us bring certain team members, who were on the fringes of their own particular teams elsewhere, together as one common initiative. In terms of the actual impact on the workflows and life cycles that we are overseeing, it's caused quite a few changes on the machine learning side, where we've been able to scale to way larger amounts of compute.

Speaker 2:

We've been able to improve our model serving speed as well. So the kinds of models and the kinds of workflows that we've enabled for the machine learning engineer have also improved. So on one scale, we've been able to train much larger models. Like, we now train on the scale of several tens or twenties of GPUs. And on the other scale, we've been able to run our processing pipelines on massive amounts of compute, like 250 GPUs.

Speaker 2:

From the other side, we've unified the whole security aspect and the cloud observability aspect. And this mix of skill sets that we have, which is data, DevOps, and MLOps, has played an integral part. Because at the end of the day, in an AI company, there's a lot of machine learning internal tooling that gets developed. And having the skill set of DevOps or SREs to lean on while you're building that internal tooling is quite important, because it allows you not only to build for your machine learning engineer, but also to build in the right way for your machine learning engineer, keeping the product mindset so that it integrates really well downstream. And this has been quite pivotal in the changes that we've started making.

Speaker 1:

Brilliant. So bringing all those people together who had this focus in the backs of their minds, putting them on a single team, and getting them to focus on this question of how do you build ML models in a more scalable way? How can you build that up and scale up your capacity for doing that? What's not to love about that?

Speaker 2:

Correct. Correct. And there's multiple metrics we track internally, like how long it takes to train a model and how many models we can train within a given month. The other thing, on the data processing side, is how much data we can actually process. And then it's on the serving side.

Speaker 2:

We track a couple of things on the serving side: one, serving speed, how fast models basically respond with predictions, and two, how expensive they are. So all the way from product optimization, you've gone back to helping the machine learning engineer experiment and iterate incredibly fast. So this holistic view of the entire stakeholder set under one team, the platforms team, has helped us solve these problems holistically and push out solutions that don't just help one subsection of our stakeholders, but do so in a scalable and secure way, and also under constraints. Because at the end of the day, in a startup, you have constraints. So how well you can build under those constraints becomes a thing.
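The metrics mentioned here, training time, models per month, data processed, serving speed, and serving cost, could be gathered into a simple structure. This is purely an illustrative sketch; the field names and example values are assumptions, not anything Papercup has published:

```python
from dataclasses import dataclass

@dataclass
class PlatformMetrics:
    """Hypothetical container for the platform metrics discussed above."""
    avg_training_hours: float            # how long it takes to train a model
    models_trained_per_month: int        # training throughput
    data_processed_gb: float             # data-processing side
    p50_serving_latency_ms: float        # how fast models respond with predictions
    serving_cost_per_1k_requests: float  # how expensive serving is

# Example values are made up for illustration only.
m = PlatformMetrics(
    avg_training_hours=6.0,
    models_trained_per_month=12,
    data_processed_gb=5000.0,
    p50_serving_latency_ms=120.0,
    serving_cost_per_1k_requests=0.40,
)
print(m.models_trained_per_month)  # 12
```

The point of naming the fields at all is the one Pras makes: each metric maps to a different stakeholder persona (finance cares about cost, product about latency, researchers about training throughput).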

Speaker 1:

Yeah. I think this aspect you mentioned of enabling the ML team to experiment quickly, that's a really important aspect of being successful in AI. The quicker you can experiment, the quicker you can try stuff out, the quicker you get that signal about what is or isn't working, and you can move towards, you know, figuring out what you actually want to build and deploy. And that is so important in an AI company.

Speaker 2:

Yeah. And that's gone down from several weeks to now probably the order of at most an hour. So that's the shift. Like, everything that we want can be deployed. It's a matter of choosing whether to deploy it or not.

Speaker 1:

Fantastic. So if you had to talk to other startups who are in this situation thinking about how to organize their people and whether, you know, building ML platform team is the right thing for them, what do you think are the decision points or the signs that a company should be looking for that it's the right sort of time to be doing that?

Speaker 2:

This is an interesting one, because I don't think there's, like, a single correct answer. I can maybe talk about, like, my mentality or my framework for how to think about this, and then maybe, Catherine, you can correct me.

Speaker 1:

I can correct you.

Speaker 2:

The way I think about it is, premature optimization is probably worse than late optimization. So you don't wanna start off your company with a platforms team or a predominant MLOps team. You might want that MLOps skill set initially when you're starting off, when you have, like, fewer than 10 engineers on your team, but you want that MLOps skill set to be shared amongst the existing members of your team. So everyone does a little bit of deployment.

Speaker 2:

Everyone does a little bit of maintenance. That's what I would say when you're around 10 people. It's around the time you start looking at a Series A, if you're a VC-funded company, or when you start having on the order of about 15 to 20 engineers, that you need to start investing in your MLOps-specific skill set. And roughly, my thought process there is you want a three-to-one or four-to-one ratio. So your ML engineers to your MLOps engineers should be roughly around four to one.

Speaker 2:

Because as your team scales, one thing that you're gonna see is people are gonna be training more models. People are gonna be deploying more models. People are going to be integrating more stuff, whether it's your internal machine learning models or external ones, like maybe a Llama model or an OpenAI model. And all of these start stacking up as tasks that can be decoupled into a particular skill set, which is the MLOps skill set. On that aspect, I would say my framework is generally to look at maintenance load.

Speaker 2:

Maintenance load is a way to almost quantify the amount of time that is spent on maintaining certain things. And the more times you hit pain points, and the more you're going to keep paying for those pain points in the future, that's when you've got to start investing in the right kind of resources. So if that's happening early on in your journey, when you have eight or nine engineers, it's time to invest. Yeah. If it's happening later on, okay, you can wait till it starts affecting you.
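The heuristic described here could be sketched as a couple of rules of thumb. The four-to-one ratio and the roughly-15-engineer threshold come from the conversation; the 20% maintenance-load cutoff is an illustrative assumption, not a figure Pras gives:

```python
# Sketch of the staffing heuristic discussed above (assumptions noted inline).

def should_invest_in_mlops(num_engineers: int, maintenance_fraction: float) -> bool:
    """Dedicated MLOps pays off once the team hits roughly 15 engineers,
    or earlier if maintenance load already dominates the week.
    The 0.20 threshold is an assumed, illustrative value."""
    return num_engineers >= 15 or maintenance_fraction > 0.20

def recommended_mlops_engineers(ml_engineers: int) -> int:
    """Roughly a four-to-one ML-to-MLOps engineer ratio."""
    return round(ml_engineers / 4)

print(should_invest_in_mlops(8, 0.05))   # False: share the skill set instead
print(should_invest_in_mlops(9, 0.30))   # True: pain points are hitting early
print(recommended_mlops_engineers(16))   # 4
```

The second case is the one Pras calls out: even a small team should invest early if maintenance is already eating its time.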

Speaker 1:

So it's about paying attention to the sorts of work your engineering team and your machine learning engineers are doing, and seeing whether pain points are coming up and perhaps stopping them from progressing as much as they could if there were dedicated MLOps.

Speaker 2:

Correct.

Speaker 1:

Yeah. And I think it's a good point that you made about premature optimization and not rushing into doing these things until you can see the need and the benefit for them. Sometimes it's easy to want to rush into doing these things when you haven't really proven that they're needed yet.

Speaker 2:

Correct. And there's also this thing about building a platform. Right? And if you build a platform prematurely, you're kind of constraining your future development path. So I always want to be in the Goldilocks spectrum, but there is no Goldilocks spectrum.

Speaker 2:

So I'm always edging towards late optimization. And this is, again, the paved path versus the golden path approach, and again an opinionated debate. It's better, in my opinion, to watch where users and their developer workflows are going and build tooling to constantly make that better, rather than saying from the outset that this is what I'm gonna build, and this is the path you as a user or a developer are going to take.

Speaker 1:

Great. And so, just to finish off this thread, what are maybe some of the lessons that you've learned from building this platform team up at Papercup? What wisdom can you give us?

Speaker 2:

I think there's a couple of things that I've learned. One thing I would say, on the technical side, is that there is still no perfect machine learning stack. And things are evolving quite rapidly. You have the Hugging Face ecosystem. You have the PyTorch ecosystem. You have the combination of their ecosystems.

Speaker 2:

You have these services to do training and those services to do serving. And it's highly fragmented right now. So I would say there is value in keeping things nimble, because things are highly fragmented. And, again, always anchor on your developer, who is a machine learning engineer or a software engineer.

Speaker 2:

On the completely opposite, completely nontechnical side of things: as I've been building this team, I've realized that culture matters, and culture matters a big deal. It's important to define, or at least constantly reinforce, what the platforms team is about. It's important to call out wins by specific individuals when they go above and beyond and achieve certain things. And not just achieve things technically, but have the right mindset. Again, it comes back to the SRE mindset: being able to take something complex, make simple surgical changes, and create a lot of impact.

Speaker 2:

That's what I'm constantly looking for in the platforms team as well. So those are the two things I've learned during my journey as the lead of this team.

Speaker 1:

Fantastic. Stay agile, build a great culture. There's no perfect answer, I think, is what I got from that.

Speaker 2:

Everything is a trade off.

Speaker 1:

Everything is a trade off. That seems like a good note to end on. So tell us a bit more. Where where can we find out more about you?

Speaker 2:

I'm generally active on LinkedIn, so find me as Prassanna Ravishankar on LinkedIn. That's with a double s and a double n in Prassanna. Or you can also check out my podcast. It's called the Feed Forward Podcast.

Speaker 1:

I'll put the links for both of those in our show notes so you can go and find them there.

Speaker 2:

Perfect.

Speaker 1:

Thanks a lot, Pras. It's been lovely learning about your journey and the platform team at Papercup. So thanks so much for joining me today.

Speaker 2:

Thank you. The pleasure was mine.

Speaker 1:

That's it for today. Thanks for listening. I'm Catherine Breslin, and I hope you'll join me again next time for Fireside AI.
