Deploying an ML App on GCP using L4 GPU-backed MIG

Last Edited:

Apr 14, 2025

Deploying an ML App on GCP using L4 GPU-backed MIG

Research

Deploying an ML App on GCP using L4 GPU-backed MIG

In this article, we will be exploring some steps for deploying a FastAPI application that runs a YOLOv11n (nano) model for image inference on Google Cloud, particularly on an NVIDIA L4 GPU. The idea came from a recent project where I needed to get a GPU-powered app up and running quickly — but ran into all the classic problems: sometimes the Docker container wouldn’t detect the GPU, other times the drivers were installed but the app just wouldn’t start. After a lot of trial and error (and some long evenings debugging startup scripts), I landed on a setup that works reliably using a Managed Instance Group (MIG) with G2 VMs on GCP. The app image is built and pushed to the Google Artifact Registry, and everything else is handled through a startup script and Cloud Build. If you’re in a similar spot — trying to deploy an L4 GPU-based ML service without losing your mind — this guide should help.

FastAPI Application

For the sake of this guide, we will be building a FastAPI application that takes as input an image bytes and runs a Yolov11 model on it. We will be using the pre-trained Yolov11n (nano) weights. The following script was written in the Cursor IDE using the Agent mode with model at ‘Auto’.

from fastapi import FastAPI, File, UploadFile, HTTPException
from typing import List, Optional
import torch
from ultralytics import YOLO
import numpy as np
from PIL import Image
import io
import logging
from pydantic import BaseModel
import time

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(
    title="YOLOv11 Inference API",
    description="API for running YOLOv11 object detection on images",
    version="1.0.0"
)


# Load model
try:
    model = YOLO("yolo11n.pt")
    if torch.cuda.is_available():
        model.to("cuda")
    elif torch.backends.mps.is_available():
        model.to("mps")
    else:
        model.to("cpu")
    logger.info("Model loaded successfully. Using device: %s", model.device)
except Exception as e:
    logger.error(f"Error loading model: {str(e)}")
    raise

class Detection(BaseModel):
    class_id: int
    class_name: str
    confidence: float
    x1: float
    y1: float
    x2: float
    y2: float

class DetectionResponse(BaseModel):
    detections: List
    processing_time: float

def process_image(image_bytes: list[bytes]) -> DetectionResponse:
    start_time = time.time()
    
    try:
        # Convert bytes to PIL Image
        images = [Image.open(io.BytesIO(image)) for image in image_bytes]
        
        # Run inference
        results = model(images)
        print(results)
        # # Process results
        detections = []

        
        for result in results:
            json_result = result.to_json()
            detections.append(json_result)
        processing_time = time.time() - start_time
        
        return DetectionResponse(
            detections=detections,
            processing_time=processing_time
        )
    except Exception as e:
        logger.error(f"Error processing image: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e))



@app.post("/detect", response_model=List[DetectionResponse])
async def detect(files: List[UploadFile] = File(...)):
    try:
        results = []
        image_bytes = [await file.read() for file in files]
        
        result = process_image(image_bytes)
        results.append(result)
        return results
    except Exception as e:
        logger.error(f"Error in detect_batch endpoint: {str(e)}")
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
async def health_check():
    return {"status": "healthy"}

This is a very simple application with three main endpoints:

Post /detect — endpoint that receives the list of images and returns the detection outputs.
GET /health — endpoint to check if the application is running or not.

Other than this, we need the requirements.txt file:

fastapi==0.104.1
uvicorn==0.24.0
python-multipart==0.0.6
torch==2.1.0
torchvision==0.16.0
numpy==1.24.3
Pillow==10.1.0
pydantic==2.4.2
opencv-python 
ultralytics

Cloud Build Configuration

In order to automate our deployment, we will use Cloud Build in our CI/CD pipeline. The following is a general cloudbuild.yaml file for building and pushing the Docker image to the Google Artifact Registry, do fill out the variables according to your needs.

steps:
  # Build the Docker image
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '--platform=linux/amd64'
      - '-t'    
      - '{REGION}-docker.pkg.dev/{PROJECT-ID}/{REPOSITORY-ID}/{IMAGE-NAME}:{TAG}'
      - 'app/'

  # Push the Docker image to Artifact Registry
  - name: 'gcr.io/cloud-builders/docker'
    args:
      - 'push'
      - '{REGION}-docker.pkg.dev/{PROJECT-ID}/{REPOSITORY-ID}/{IMAGE-NAME}:{TAG}'


images:
  - '{REGION}-docker.pkg.dev/{PROJECT-ID}/{REPOSITORY-ID}/{IMAGE-NAME}:{TAG}'

To run this, we run the following command:

Cloud Build Triggers (Optional)

In order to further optimize the CI/CD pipeline, you can integrate Google Cloud Build Triggers which automatically trigger the Cloud Build. This means that when you push new changes to your code, your docker image in the Artifact registry automatically updates. In order to implement this, you can follow the documentation from Google, it is quite straightforward.

https://cloud.google.com/build/docs/triggers

Managed Instance Group (MIG)

A Managed Instance Group consists of multiple VMs running as a single entity. This is one of the ways to deploy a GPU-dependent application on GCP, with the other being GKE, or Vertex AI.

The first step of creating a MIG, is to create an Instance Template, i.e. a blueprint/configuration of the individual VM in the MIG. To create this Instance Template, head to Compute Engine > Instace Templates > Create Instance Template:

Here, we can set the name for the template, the region, and the machine configuration.

In the Boot disk section, the default disk is set as such, and it gives a warning regarding the NVIDIA CUDA stack.

When working with a single VM (for development purposes), my preference is to use the Deep Learning VM Images for the boot disk. However, specifically for the G2 machine types, I need to perform an extra step to setup the NVIDIA drivers. That is,ssh into the instance from the GCP console. Once I do that, I am prompted with an option asking me to enter ‘Y’ for NVIDIA setup or ’n’ to skip. Once I enter ‘Y’, I can see the installation occur and the L4 GPU is then accessible, evidenced by the successful execution of the nvidia-smi command.

When configuring the MIG, we can set up autoscaling, which automatically scales instances up and down depending on certain established criteria, usually utilization targets. For the autoscaling to work properly, we need the newly started VM to automatically complete its setup, pull our docker image, and run it. The manual execution of driver installation cannot be done in this case and the best way to automate this process is through startup scripts.

Startups Script

Startup script is an optional script that run when a new VM is provisioned. The main focus of this guide is on this particular part since it handles the drivers installation, as well as the image fetching and starting.

There are a couple of points to note from parts of the relevant GCP documentation

“You can’t use Deep Learning VM Images as boot disks for your VMs that use G2 machine types.”
“The current default driver for Container-Optimized OS doesn’t support L4 GPUs running on G2 machine types. Container-Optimized OS also only support a select set of drivers.”

Firstly, to address the first point, we choose the Container-Optimized OS instead of the Deep Learning VM Images, particularly the stable version (very important to select one of the stable versions).

For the second point, we use startup scripts which follow these main steps:

The first step is to get the driver version compatible with the L4 GPU. From the relevant GCP documentation, this is 550.90.07.
Next, we need to install the driver and confirm its installation. The driver can be installed using the following script:

The following commands can be used to check if the drivers are successfully installed.

3. Now we connect to the Artifact Registry, pull the image and start the container using the following commands:

We use these commands and convert them into the following startup script (replace the variables with your particular values):

#! /bin/bash

sudo cos-extensions install gpu -- -version=550.90.07

sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia

echo "Waiting for GPU drivers to finish installing..."
max_retries=15
retries=0
until /var/lib/nvidia/bin/nvidia-smi; do
  if [ $retries -ge $max_retries ]

This startup script can be entered under the Management tab, in the Automation section:

This wraps up the Instance Template creation step. The next step is to use this template for your Managed Instance Group, here you set the minimum, maximum VMs, health checks, and other configurations. Since this guide’s focus is on the startup script, we will leave it here and you can complete the setup using the following documentation:

https://cloud.google.com/compute/docs/instance-groups/create-mig-with-gpu-vms?source=post_page-----57ca72e3b5cc---------------------------------------

Conclusion

The focus of this guide was to present a solution that solves the issues around driver installation and docker container setup for deploying a machine learning application on Google Cloud Platform. In particular, the setup deals with issues around VMs with the NVIDIA L4 GPU (G2 machine type).

Author

Muhammad Abdullah Mulkana