Serve a model behind an endpoint

A kind: service spec deploys a long-running endpoint instead of a one-shot job. Komputo places replicas on the greenest EU capacity within a latency budget, keeps the endpoint URL stable while compute moves, and scales the service to zero when it’s idle.

Describe the service

name: echo-llm
kind: service
image: registry.komputo.eu/examples/vllm:latest
command: python -m vllm.entrypoints.openai.api_server --model mistral-7b
resources:
  accelerators: H100:1
policy:
  optimize: greenest-eu
service:
  min_replicas: 0
  max_replicas: 3
  target_qps_per_replica: 10
  port: 8000

min_replicas: 0 enables scale-to-zero: when no traffic arrives, the service drops to zero replicas, and the next request cold-starts one. Set min_replicas: 1 to keep one replica warm, trading idle carbon for lower tail latency. The autoscaler adds replicas to keep each one near target_qps_per_replica, up to max_replicas.
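The scaling behaviour described above can be sketched as simple arithmetic. This is only an illustration, assuming the autoscaler divides the observed request rate by target_qps_per_replica, rounds up, and clamps the result between min_replicas and max_replicas. The function name desired_replicas and the rounding rule are assumptions, not Komputo's documented algorithm.

```python
import math


def desired_replicas(
    observed_qps: float,
    target_qps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Hypothetical sizing rule: enough replicas to keep each at or below the target QPS."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    # Round up so a fractional replica's worth of traffic still gets a replica.
    needed = math.ceil(observed_qps / target_qps_per_replica)
    # Clamp to the bounds declared in the service spec.
    return max(min_replicas, min(needed, max_replicas))
```

Under this rule, with the spec above (target_qps_per_replica: 10, max_replicas: 3), 25 requests per second would call for 3 replicas, and zero traffic with min_replicas: 0 would call for none.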

Deploy it

komputo serve echo-llm.komputo.yaml

service deployed: echo-llm  [RUNNING]
  endpoint (OpenAI base_url): https://api.komputo.eu/v1
  endpoint key (shown once):  kmpt_svc_…

The endpoint key is shown only once, so store it as a secret. It authorizes calls to this service and nothing else.
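One common way to keep the key out of source code is to read it from the environment at startup. The sketch below assumes the key has been exported in a variable named KOMPUTO_ENDPOINT_KEY; that name and the helper load_endpoint_key are hypothetical choices for illustration, not something Komputo defines.

```python
import os


def load_endpoint_key(env_var: str = "KOMPUTO_ENDPOINT_KEY") -> str:
    """Read the endpoint key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending unauthenticated requests.
        raise RuntimeError(f"{env_var} is not set; export the endpoint key before running")
    return key
```

The returned value can then be passed as api_key when constructing the OpenAI client.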

Call it from the OpenAI SDK

The endpoint speaks the OpenAI chat-completions API. Point an existing OpenAI client at the base_url and use the endpoint key as the API key; no other code changes are needed:

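from openai import OpenAI

# Point the standard OpenAI client at the Komputo endpoint.
client = OpenAI(
    base_url="https://api.komputo.eu/v1",
    api_key="kmpt_svc_…",  # the endpoint key printed at deploy time
)
# The model name is the service name from the spec.
reply = client.chat.completions.create(
    model="echo-llm",
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)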

Every response carries an X-Komputo-CO2-g header with the grams of CO2 attributed to that request, so per-call carbon is measurable from the client side.
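Reading that header from the OpenAI Python SDK requires the raw HTTP response, which the SDK exposes through with_raw_response. The sketch below is a minimal example, assuming the header value is a plain decimal number of grams; the helper names co2_grams and chat_with_co2 are hypothetical.

```python
from typing import Mapping, Optional

CO2_HEADER = "X-Komputo-CO2-g"  # header name as given in the text above


def co2_grams(headers: Mapping[str, str]) -> Optional[float]:
    """Return the per-request CO2 in grams, or None if the header is absent or malformed.

    Assumes the value is a plain decimal number; header names are matched
    case-insensitively so a plain dict works as well as an HTTP header object.
    """
    for name, value in headers.items():
        if name.lower() == CO2_HEADER.lower():
            try:
                return float(value)
            except ValueError:
                return None
    return None


def chat_with_co2(client, prompt: str, model: str = "echo-llm"):
    """Send one chat request and return (reply_text, co2_grams).

    `client` is an openai.OpenAI instance pointed at the service's base_url.
    """
    # with_raw_response returns the HTTP response wrapper instead of the parsed object.
    raw = client.chat.completions.with_raw_response.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    completion = raw.parse()  # the usual ChatCompletion object
    return completion.choices[0].message.content, co2_grams(raw.headers)
```

Calling chat_with_co2(client, "Hello") returns the reply text together with the carbon figure, which a caller could log or sum per user or per feature.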

Why live traffic isn’t time-shifted

Unlike a deferrable job, a serving request can’t wait for a cleaner grid window: it has to answer now. The carbon savings here come from greener region placement within the latency budget and from scaling to zero when idle, not from deferral.