Serve a model behind an endpoint
A kind: service spec deploys a long-running endpoint instead of a one-shot job.
Komputo places replicas on the greenest EU capacity within a latency budget, keeps
the endpoint URL stable while compute moves, and scales the service to zero when
it’s idle.
Describe the service
    name: echo-llm
    kind: service
    image: registry.komputo.eu/examples/vllm:latest
    command: python -m vllm.entrypoints.openai.api_server --model mistral-7b
    resources:
      accelerators: H100:1
    policy:
      optimize: greenest-eu
    service:
      min_replicas: 0
      max_replicas: 3
      target_qps_per_replica: 10
      port: 8000

min_replicas: 0 enables scale-to-zero: when no traffic arrives the service drops
to zero replicas and the first request after that cold-starts a replica. Set
min_replicas: 1 to keep one warm and trade idle carbon for lower tail latency.
The service autoscales toward target_qps_per_replica, up to max_replicas.
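
To make the scaling rule concrete, here is a minimal sketch of how a replica count could be derived from observed traffic. It assumes a simple proportional rule (enough replicas to keep each one at or below the target, clamped to the configured bounds). Komputo's actual autoscaler is not described in this guide, and the function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(observed_qps, target_qps_per_replica, min_replicas, max_replicas):
    """Replica count that keeps per-replica load at or below the target.

    Assumption: a plain proportional rule, not Komputo's documented algorithm.
    """
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    # Enough replicas so that each one sees at most the target QPS.
    needed = math.ceil(observed_qps / target_qps_per_replica)
    # Clamp to the configured bounds; min_replicas=0 permits scale-to-zero.
    return max(min_replicas, min(needed, max_replicas))


# Values from the spec above: target 10 QPS per replica, 0 to 3 replicas.
for qps in (0, 4, 25, 80):
    print(qps, "qps ->", desired_replicas(qps, 10, 0, 3), "replicas")
```

Under this rule, traffic beyond max_replicas × target_qps_per_replica (30 QPS with the spec above) is no longer absorbed by adding replicas, so each replica runs above its target.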
Deploy it
    komputo serve echo-llm.komputo.yaml

    service deployed: echo-llm [RUNNING]
      endpoint (OpenAI base_url): https://api.komputo.eu/v1
      endpoint key (shown once): kmpt_svc_…

The endpoint key is shown once, so store it as a secret. It authorizes calls to this service and nothing else.
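
One common way to keep the key out of source code is to read it from the environment at startup. The sketch below assumes an environment variable named KOMPUTO_ENDPOINT_KEY. That name is hypothetical, and this guide does not prescribe where the key should live.

```python
import os

# Hypothetical variable name; this guide does not prescribe one.
ENV_VAR = "KOMPUTO_ENDPOINT_KEY"


def load_endpoint_key(env=os.environ):
    """Return the endpoint key from the environment, failing loudly if unset."""
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"{ENV_VAR} is not set; export the key printed by 'komputo serve'"
        )
    return key
```

The returned value can then be passed as api_key when constructing the client, instead of a string literal.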
Call it from the OpenAI SDK
The endpoint speaks the OpenAI chat-completions API. Point an existing OpenAI
client at the base_url and use the endpoint key as the API key. No other code
changes are needed:
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.komputo.eu/v1",
        api_key="kmpt_svc_…",
    )
    reply = client.chat.completions.create(
        model="echo-llm",
        messages=[{"role": "user", "content": "Hello"}],
    )

Every response carries an X-Komputo-CO2-g header with the grams of CO2 attributed
to that request, so per-call carbon is measurable from the client side.
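
The standard create() call returns only the parsed completion, so reading the header requires the raw HTTP response. The sketch below uses the OpenAI Python SDK's with_raw_response accessor (available in the 1.x client). It assumes the header value is a plain decimal number, which this guide does not state explicitly. The helper names parse_co2_grams and chat_with_carbon are hypothetical.

```python
def parse_co2_grams(headers):
    """Return grams of CO2 from X-Komputo-CO2-g, or None if absent or unparseable.

    Assumption: the header value is a plain decimal number such as "0.42".
    """
    raw = headers.get("X-Komputo-CO2-g")
    if raw is None:
        return None
    try:
        return float(raw)
    except ValueError:
        return None


def chat_with_carbon(client, prompt):
    """Send one chat request; return (completion, grams of CO2 or None).

    `client` is an openai.OpenAI instance pointed at the Komputo base_url.
    """
    raw = client.chat.completions.with_raw_response.create(
        model="echo-llm",
        messages=[{"role": "user", "content": prompt}],
    )
    # raw.headers is case-insensitive; raw.parse() yields the usual completion.
    return raw.parse(), parse_co2_grams(raw.headers)
```

Calling chat_with_carbon(client, "Hello") with the client from the example above returns the usual completion object together with the carbon figure, which can then be logged or summed per session.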
Why live traffic isn’t time-shifted
Unlike a deferrable job, a serving request can’t wait for a cleaner grid window: it has to answer now. The carbon wins here come from greener region placement within the latency budget and from scaling to zero when idle, not from deferral.
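
The placement idea can be illustrated with a small sketch: discard regions that exceed the latency budget, then pick the one with the lowest grid carbon intensity. This is an assumption about the general shape of such a policy, not Komputo's actual scheduler. The region names and all numbers are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    latency_ms: float         # expected round-trip latency to callers
    grid_gco2_per_kwh: float  # current grid carbon intensity


def pick_region(regions, latency_budget_ms):
    """Return the lowest-carbon region within the latency budget, or None."""
    eligible = [r for r in regions if r.latency_ms <= latency_budget_ms]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.grid_gco2_per_kwh)


# Placeholder values for illustration only.
candidates = [
    Region("region-a", latency_ms=20, grid_gco2_per_kwh=300),
    Region("region-b", latency_ms=45, grid_gco2_per_kwh=80),
    Region("region-c", latency_ms=120, grid_gco2_per_kwh=30),
]

choice = pick_region(candidates, latency_budget_ms=60)
print(choice.name)  # region-b: greenest option that still meets the budget
```

Here region-c has the cleanest grid but is excluded by the 60 ms budget, which is the constraint that distinguishes serving from a deferrable job.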