The promise of serverless is that a service runs only when needed and users pay only for what they use. That contrasts with a typical cloud instance, which runs as a persistent service for a set amount of time and is always available. With a serverless service, in this case a GPU for inference, the resource fires up only when it is needed.
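To make the pay-for-use trade-off concrete, here is a minimal sketch of the arithmetic in Python. Every number in it is a hypothetical placeholder, not a Google Cloud price, and the function names are invented for illustration. It also assumes the simplest possible billing model (a flat hourly rate, no cold-start overhead, no per-request fees, no minimum instances), which real platforms may not match.

```python
# Hypothetical cost comparison: always-on instance vs. pay-per-use serverless.
# All rates below are illustrative placeholders, NOT real cloud prices.

def always_on_cost(hourly_rate: float, hours_in_period: float) -> float:
    """A persistent instance is billed for every hour in the period."""
    return hourly_rate * hours_in_period

def on_demand_cost(hourly_rate: float, busy_hours: float) -> float:
    """A serverless service is billed only for the hours it is actually busy."""
    return hourly_rate * busy_hours

def break_even_busy_hours(always_on_rate: float, on_demand_rate: float,
                          hours_in_period: float) -> float:
    """Busy hours at which both billing models cost the same."""
    return always_on_rate * hours_in_period / on_demand_rate

if __name__ == "__main__":
    HOURS_IN_MONTH = 720      # 30 days x 24 hours
    ALWAYS_ON_RATE = 1.0      # hypothetical $/hour for a persistent GPU instance
    ON_DEMAND_RATE = 2.0      # hypothetical $/hour while a serverless GPU is active
    BUSY_HOURS = 100          # hypothetical hours of real inference traffic

    print("always-on:", always_on_cost(ALWAYS_ON_RATE, HOURS_IN_MONTH))
    print("serverless:", on_demand_cost(ON_DEMAND_RATE, BUSY_HOURS))
    print("break-even busy hours:",
          break_even_busy_hours(ALWAYS_ON_RATE, ON_DEMAND_RATE, HOURS_IN_MONTH))
```

Under these assumed rates, the serverless option stays cheaper as long as the GPU is busy for fewer hours than the break-even figure, even though its hourly rate is higher. The more sporadic the inference traffic, the larger the gap.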
Serverless inference can be deployed as an Nvidia NIM, or with other frameworks such as vLLM, PyTorch and Ollama. The addition of Nvidia L4 GPUs is currently in preview.
“As customers increasingly adopt AI, they are seeking to run AI workloads like inference on platforms they are familiar with and start up on,” Sagar Randive, Product Manager, Google Cloud Serverless, told VentureBeat. “Cloud Run users prefer the efficiency and flexibility of the platform and have been asking for Google to add GPU support.”