Serve a model behind an endpoint

A kind: service spec deploys a long-running endpoint instead of a one-shot job. Komputo places replicas on the greenest EU capacity within a latency budget, keeps the endpoint URL stable while compute moves, and scales the service to zero when it’s idle.

name: echo-llm
kind: service
image: registry.komputo.eu/examples/vllm:latest
command: python -m vllm.entrypoints.openai.api_server --model mistral-7b
resources:
  accelerators: H100:1
policy:
  optimize: greenest-eu
service:
  min_replicas: 0
  max_replicas: 3
  target_qps_per_replica: 10
  port: 8000

min_replicas: 0 enables scale-to-zero: when no traffic arrives, the service drops to zero replicas, and the next request cold-starts a replica. Set min_replicas: 1 to keep one replica warm, accepting some idle carbon in exchange for lower tail latency. The service autoscales toward target_qps_per_replica, up to max_replicas.
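As a rough illustration of how these three settings interact, the sketch below applies a common target-tracking rule: divide the observed request rate by the per-replica target, round up, and clamp to the configured bounds. This is an assumption for illustration only. The function name desired_replicas is hypothetical, and Komputo's actual autoscaler may smooth, delay, or weight its decisions differently.

```python
import math


def desired_replicas(
    observed_qps: float,
    target_qps_per_replica: float = 10,
    min_replicas: int = 0,
    max_replicas: int = 3,
) -> int:
    """Illustrative target-tracking rule, not Komputo's documented algorithm."""
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    # Replicas needed so that each one serves at most the target rate.
    needed = math.ceil(max(observed_qps, 0.0) / target_qps_per_replica)
    # Clamp to the bounds from the service spec.
    return max(min_replicas, min(max_replicas, needed))


for qps in (0, 4, 25, 80):
    print(qps, "qps ->", desired_replicas(qps), "replica(s)")
```

With the defaults taken from the spec above, zero traffic yields zero replicas, and any load above 30 requests per second stays capped at three.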

komputo serve echo-llm.komputo.yaml
service deployed: echo-llm [RUNNING]
endpoint (OpenAI base_url): https://api.komputo.eu/v1
endpoint key (shown once): kmpt_svc_…

The endpoint key is shown once — store it as a secret. It authorizes calls to this service and nothing else.
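One way to keep the key out of source code is to read it from the process environment at startup. The sketch below assumes the key has been exported under the variable name KOMPUTO_ENDPOINT_KEY. That name and the helper load_endpoint_key are hypothetical choices for this example, not something Komputo prescribes.

```python
import os


def load_endpoint_key(var_name: str = "KOMPUTO_ENDPOINT_KEY") -> str:
    """Read the endpoint key from the environment (variable name is an assumption)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending unauthenticated calls.
        raise RuntimeError(f"{var_name} is not set; store the endpoint key as a secret")
    return key
```

The returned value can then be passed as api_key when constructing the client, so the literal key never appears in the repository.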

The endpoint speaks the OpenAI chat-completions API. Point an existing OpenAI client at the base_url and use the endpoint key as the API key — no other code changes:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.komputo.eu/v1",
    api_key="kmpt_svc_…",
)
reply = client.chat.completions.create(
    model="echo-llm",
    messages=[{"role": "user", "content": "Hello"}],
)

Every response carries an X-Komputo-CO2-g header with the grams of CO2 attributed to that request, so per-call carbon is measurable from the client side.
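A minimal sketch of reading that header on the client side follows. It assumes the header value is a plain decimal number of grams, since the exact format is not specified here. The helper co2_grams is a hypothetical name, and it accepts any mapping of header names to values.

```python
from typing import Mapping, Optional

CO2_HEADER = "X-Komputo-CO2-g"


def co2_grams(headers: Mapping[str, str]) -> Optional[float]:
    """Return the grams of CO2 reported for a request, or None if unavailable."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(CO2_HEADER.lower())
    if raw is None:
        return None
    try:
        # Assumption: the value is a bare decimal number such as "0.042".
        return float(raw.strip())
    except ValueError:
        return None


example_headers = {"content-type": "application/json", "x-komputo-co2-g": "0.042"}
print(co2_grams(example_headers))
```

With the openai Python client (version 1 or later), calling client.chat.completions.with_raw_response.create(...) returns the raw HTTP response. Its headers attribute can be passed to co2_grams, and its parse() method yields the usual completion object.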

Unlike a deferrable job, a serving request can’t wait for a cleaner grid window — it has to answer now. The carbon wins here come from greener region placement within the latency budget and from scaling to zero when idle, not from deferral.