LLM Inference API Service¶
The LLM Inference API service is in early access
The service is under development, and is available by request to users who want to help CSCS build the service.
During the beta we want to understand the following:
- Single-site deployments vs. geo-redundant deployments for higher availability.
- Usage patterns (e.g. scale-to-zero on inactivity, vs. keep warm with a minimum replica count).
- Which models should be offered, and under which conditions?
- The trade-off between existing capacity and future cost.
- What are the appropriate accounting metrics?
During the beta users should expect that:
- Capacity and availability are limited. Expect downtime and slowdowns.
- The models that are available can change over time.
- Access to the beta is by invitation and free of charge.
If you are interested in participating in the beta, please contact Pablo Fernandez at pablo.fernandez@cscs.ch. Describe your use case, the relevant project or organizational context, and an estimate of your expected requirements, including load, preferred models, and availability expectations.
The LLM Inference API service provides internet-accessible inference endpoints, compatible with the OpenAI and Anthropic APIs, backed by selected open-weight LLMs such as Apertus and other vetted models. Users draw on a shared pool of models, with requests multiplexed across shared serving capacity, so they do not need to deploy, patch, scale, or operate the underlying serving stack.
Private model deployment is not supported. If you want to deploy a model that is not available in this service, we encourage you to use the sml tool developed by the Swiss AI community.
Use of sensitive or personal data is not allowed. For privacy reasons, CSCS does not track user prompts or model responses. However, CSCS collects infrastructure metrics and telemetry, including prompt and response lengths, to monitor service quality.
Service at a glance¶
- Managed endpoints
Standard API access over HTTPS using familiar client libraries and tooling.
- Curated models
Selected models are made available and updated centrally.
- No infrastructure management
Let CSCS manage GPUs, containers, autoscaling, and model servers.
- Sovereign and private
Your data is yours and is processed entirely within CSCS in Switzerland. Prompts and responses are not tracked.
Quick Start¶
Before using the API, obtain an authentication token by following the access guide. Include this token in every API request.
Anthropic’s claude
Example environment configuration to be set before starting a claude session.
List available models
Chat completion request
$ curl -X POST "https://llm-proxy.svc.cscs.ch/chat/completions" \
    -H "Authorization: Bearer <AUTHENTICATION_TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "Apertus-70B-Instruct-2509",
      "messages": [
        {"role": "user", "content": "Explain gradient descent in one paragraph."}
      ],
      "temperature": 0.2
    }'
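The curl request above can also be issued from Python. The following is a minimal sketch using only the standard library; it reuses the URL, headers, and JSON body from the curl command. The environment variable name `CSCS_LLM_TOKEN` and the helper names are hypothetical, and the response layout is assumed to follow the OpenAI chat completions format.

```python
import json
import os
import urllib.request

# URL copied from the curl command above.
CHAT_URL = "https://llm-proxy.svc.cscs.ch/chat/completions"


def build_chat_request(token, prompt, model="Apertus-70B-Instruct-2509", temperature=0.2):
    """Build (but do not send) the POST request shown in the curl example."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# CSCS_LLM_TOKEN is a hypothetical variable name; the request is only sent if it is set.
if __name__ == "__main__" and os.environ.get("CSCS_LLM_TOKEN"):
    token = os.environ["CSCS_LLM_TOKEN"]
    reply = send(build_chat_request(token, "Explain gradient descent in one paragraph."))
    # Assumption: the reply follows the OpenAI chat completions layout.
    print(reply["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes it possible to inspect the payload without contacting the service.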
Access¶
Request access¶
Early access to this service requires an invitation.
If you would like to participate, please contact Pablo Fernandez (pablo.fernandez@cscs.ch) describing your use case, relevant project or organizational context, and an estimate of your expected requirements including load, preferred models, and availability expectations.
Obtain your authentication token¶
Approved projects receive an authentication token, which can be retrieved and managed through the project management portal. To access the token, select “Inference Service” under “Resources” in the left sidebar menu of the portal, as shown in the image below:

API¶
The service is accessed through the gateway base URL https://llm-proxy.svc.cscs.ch, and supports standard endpoints, such as:
| Path | Purpose |
|---|---|
| /v1/models | Query available models |
| /v1/chat/completions | Chat completions |
| /v1/embeddings | Get a vector representation of a given input |
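As an illustration of how the paths in the table combine with the gateway base URL, the sketch below builds requests for `/v1/models` and `/v1/embeddings` using the Python standard library. It assumes that these endpoints accept the same bearer token as the Quick Start example and that responses follow the OpenAI layout (`{"data": [{"id": ...}]}` for the model list); neither assumption is confirmed here. The helper names are hypothetical, and the embedding model name must be supplied by the caller.

```python
import json
import urllib.request

BASE_URL = "https://llm-proxy.svc.cscs.ch"


def endpoint(path):
    """Join the gateway base URL with one of the paths from the table."""
    return BASE_URL.rstrip("/") + "/" + path.lstrip("/")


def models_request(token):
    """Build a GET request for /v1/models."""
    return urllib.request.Request(
        endpoint("/v1/models"),
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def embeddings_request(token, model, text):
    """Build a POST request for /v1/embeddings (OpenAI-style body, assumed)."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint("/v1/embeddings"),
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def model_ids(payload):
    """Extract model identifiers from an OpenAI-style model list (assumed layout)."""
    return [entry["id"] for entry in payload.get("data", [])]
```

A request built this way can be sent with `urllib.request.urlopen(request)` and the body decoded with `json.load`.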
Todo
Describe API support. If we provide both OpenAI and Anthropic APIs, is it sufficient to provide links to external documentation for these APIs, with notes about any differences?
Guides¶
Reducing consumption¶
- Longer prompts increase cost and latency.
- Future pricing may differ between models according to their computational load.
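One way to keep prompts short in multi-turn conversations is to drop the oldest turns before each request. The sketch below is one possible approach, not a service feature: the function name `trim_history` is hypothetical, and it uses character count as a rough stand-in for token count, which varies by model and tokenizer.

```python
def trim_history(messages, max_chars):
    """Keep all system messages plus the newest turns that fit within max_chars.

    Character count is only an approximation of prompt size in tokens.
    """
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    budget = max_chars - sum(len(m["content"]) for m in system)

    kept = []
    for message in reversed(others):  # walk from newest to oldest
        cost = len(message["content"])
        if cost > budget:
            break  # stop at the first turn that no longer fits
        kept.append(message)
        budget -= cost

    # Restore chronological order for the turns that were kept.
    return system + kept[::-1]
```

The trimmed list can be passed as the `messages` field of a chat completion request. Dropping old turns discards context, so the budget needs to be chosen per application.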
Known issues and limitations¶
- Project key management is still evolving; currently one key is issued per project and rotation requires contacting the team.
- Detailed self-service telemetry is limited today.
- Documentation and model-specific configuration transparency are a work in progress.
- Load balancing and other quality-of-service (QoS) behavior are not yet well understood.
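Because availability is limited during early access, clients may want to tolerate transient failures. The following is a minimal sketch of retrying with exponential backoff around any function that performs a request with `urllib`. Treating HTTP 429 and 5xx responses as retryable is an assumption, not documented behavior of the service, and the function name and default values are illustrative only.

```python
import time
import urllib.error


def call_with_retries(call, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a transient error wait base_delay * 2**n seconds, then retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except urllib.error.HTTPError as error:
            # Assumption: only 429 and 5xx status codes are worth retrying.
            if error.code != 429 and error.code < 500:
                raise
            last_error = error
        except urllib.error.URLError as error:
            # Connection-level failure such as a timeout or refused connection.
            last_error = error
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise last_error
```

The `sleep` parameter exists so the delay can be replaced in tests. Client errors such as an invalid token are raised immediately, since retrying them would not help.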