Google Cloud Run embraces Nvidia GPUs for serverless AI inference
There are several costs associated with running AI; one of the most fundamental is providing the GPU power needed for inference.
To date, organizations that need to provide AI inference have had to run long-running cloud instances or provision hardware on-premises. Today, Google Cloud is previewing a new approach, and it could reshape the landscape of AI application deployment. The Google Cloud Run serverless offering now integrates Nvidia L4 GPUs, enabling organizations to run serverless inference.
The promise of serverless is that a service runs only when needed and users pay only for what they use. That's in contrast to a typical cloud instance, which runs as a persistent, always-available service. With a serverless service, in this case, the GPU for inference spins up and is used only when needed.
The serverless inference can be deployed with Nvidia NIM, as well as with other frameworks such as vLLM, PyTorch and Ollama. The addition of Nvidia L4 GPUs is currently in preview.
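To illustrate what this could look like from the application side, here is a minimal sketch of a client calling an inference service deployed on Cloud Run; the service URL and model name are hypothetical placeholders, and the sketch assumes the container runs vLLM's OpenAI-compatible server rather than reflecting any specific setup described in the article.

```python
# Minimal sketch: querying a GPU-backed Cloud Run service running vLLM's
# OpenAI-compatible API. The URL and model below are placeholder assumptions.
import requests

SERVICE_URL = "https://my-vllm-service-xyz-uc.a.run.app"  # hypothetical Cloud Run URL
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"             # hypothetical model name

def generate(prompt: str) -> str:
    """Send a chat completion request; the instance scales up from zero on demand."""
    resp = requests.post(
        f"{SERVICE_URL}/v1/chat/completions",
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 128,
        },
        timeout=120,  # generous timeout to allow for a cold start
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(generate("In one sentence, what is serverless inference?"))
```

Because the service is serverless, no GPU is billed while the endpoint sits idle; the first request after a quiet period simply pays a cold-start latency while an instance spins up.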
“As customers increasingly adopt AI, they are seeking to run AI workloads like inference on platforms they are familiar with and start up on,” Sagar Randive, Product Manager, Google Cloud Serverless, told VentureBeat. “Cloud Run users prefer the efficiency and flexibility of the platform and have been asking for Google to add GPU support.”