Configure Inference Routing - NVIDIA OpenShell

This page covers the managed local inference endpoint (https://inference.local). External inference endpoints are controlled by sandbox network_policies — see Policies for details. The configuration requires two values:

Value	Description
Provider record	The credential backend OpenShell uses to authenticate with the upstream model host.
Model ID	The model to use for generation requests.

For a list of tested providers and their base URLs, see Providers.

Configure the inference backend

Create a provider

Create a provider record that holds the backend credentials OpenShell will use when forwarding requests from inference.local.

The command depends on the backend. Four options are covered: NVIDIA API Catalog, OpenAI-compatible, Local endpoint, and Anthropic.

NVIDIA API Catalog

openshell provider create --name nvidia-prod --type nvidia --from-existing

This reads NVIDIA_API_KEY from your environment.

OpenAI-compatible

Any cloud provider that exposes an OpenAI-compatible API works with the openai provider type. You need the base URL, an API key, and a model name from your provider.

openshell provider create \
    --name my-cloud-provider \
    --type openai \
    --credential OPENAI_API_KEY=<your_api_key> \
    --config OPENAI_BASE_URL=https://api.example.com/v1

Replace the base URL and API key with values from your provider. For other providers, refer to their documentation for the correct base URL, available models, and API key setup.
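If provider records are created from a script rather than by hand, the command above can be assembled programmatically. The following is a minimal sketch, not part of OpenShell: the helper name provider_create_argv is hypothetical, and it uses only the flags shown on this page (--name, --type, --credential, --config). Whether the CLI accepts several --credential or --config flags in one call is an assumption.

```python
import shlex


def provider_create_argv(name, provider_type, credentials=None, config=None):
    """Build the argument list for `openshell provider create`.

    Hypothetical helper: it only assembles arguments and does not run the CLI.
    """
    argv = ["openshell", "provider", "create", "--name", name, "--type", provider_type]
    # Each credential and config entry is passed as KEY=VALUE, as in the command above.
    for key, value in (credentials or {}).items():
        argv += ["--credential", f"{key}={value}"]
    for key, value in (config or {}).items():
        argv += ["--config", f"{key}={value}"]
    return argv


argv = provider_create_argv(
    "my-cloud-provider",
    "openai",
    credentials={"OPENAI_API_KEY": "<your_api_key>"},
    config={"OPENAI_BASE_URL": "https://api.example.com/v1"},
)
# Print a shell-quoted form for review before running it.
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would execute it. Building a list instead of a single string avoids shell quoting problems in API keys.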

Local endpoint

Use --config OPENAI_BASE_URL to point at any OpenAI-compatible server running on the same host as the gateway.

openshell provider create \
    --name my-local-model \
    --type openai \
    --credential OPENAI_API_KEY=empty-if-not-required \
    --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1

For host-backed local inference, use host.openshell.internal or the host’s LAN IP. Avoid 127.0.0.1 and localhost: requests originate from the gateway or sandbox runtime, not your shell. Set OPENAI_API_KEY to a placeholder value if the server does not require authentication.
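A base URL can be checked for loopback addresses before it is saved in a provider record. The following is a minimal sketch using only the Python standard library; the function name is_loopback_base_url is hypothetical and is not an OpenShell API.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_base_url(url):
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    host = host.lower()
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        # Covers 127.0.0.0/8 and the IPv6 loopback ::1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example host.openshell.internal).
        return False


for candidate in (
    "http://127.0.0.1:11434/v1",
    "http://host.openshell.internal:11434/v1",
):
    print(candidate, "-> loopback" if is_loopback_base_url(candidate) else "-> ok")
```

The check looks only at the literal host in the URL. It does not resolve DNS names, so a custom hostname that maps to 127.0.0.1 would not be caught.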

For a self-contained setup, the Ollama community sandbox bundles Ollama inside the sandbox itself — no host-level provider is needed. See the local inference tutorial for details.

Anthropic

openshell provider create --name anthropic-prod --type anthropic --from-existing

This reads ANTHROPIC_API_KEY from your environment.
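Because the --from-existing commands read an API key from the environment, a setup script may want to confirm the variable is present first. The following is a minimal sketch; the helper name missing_env_vars is hypothetical, and treating an empty value as missing is an assumption about what the CLI expects.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# The variable names come from this page; check only the ones your provider type needs.
missing = missing_env_vars(["NVIDIA_API_KEY", "ANTHROPIC_API_KEY"])
if missing:
    print("Not set in this shell:", ", ".join(missing))
else:
    print("All checked variables are set.")
```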

Set inference routing

Point inference.local at the provider you created and choose the model to use:

openshell inference set \
    --provider nvidia-prod \
    --model nvidia/nemotron-3-nano-30b-a3b

By default, openshell inference set probes the resolved upstream endpoint before saving. If the endpoint is not live yet, add --no-verify to persist the route without the probe.
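In automation, the choice between the verified and unverified form can be made a parameter. The following is a minimal sketch with a hypothetical helper, inference_set_argv, that uses only the flags described on this page; it assembles the arguments and leaves execution to the caller.

```python
import shlex


def inference_set_argv(provider, model, verify=True):
    """Build the argument list for `openshell inference set`.

    Hypothetical helper: with verify=False it appends --no-verify so the
    route is saved without probing the upstream endpoint.
    """
    argv = ["openshell", "inference", "set", "--provider", provider, "--model", model]
    if not verify:
        argv.append("--no-verify")
    return argv


# Example: save the route before the model server is up.
print(shlex.join(inference_set_argv("nvidia-prod", "nvidia/nemotron-3-nano-30b-a3b", verify=False)))
```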

Verify the configuration

Confirm the provider and model are set correctly:

openshell inference get

Example output:

Gateway inference:

  Provider: nvidia-prod
  Model: nvidia/nemotron-3-nano-30b-a3b
  Version: 1
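A script that needs the active provider or model can read them from this output. The following is a minimal sketch that assumes the output keeps the "Key: value" layout shown above; the format is not a documented contract, so treat the parser (and the name parse_inference_get) as illustrative only.

```python
def parse_inference_get(output):
    """Parse `Key: value` lines into a dict with lowercase keys.

    Lines without a value after the colon (such as the heading) are skipped.
    """
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so model IDs containing ':' survive.
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip().lower()] = value.strip()
    return fields


sample = """Gateway inference:

  Provider: nvidia-prod
  Model: nvidia/nemotron-3-nano-30b-a3b
  Version: 1
"""
print(parse_inference_get(sample))
```

All values are returned as strings, including the version.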

Update part of the config

Use openshell inference update when you want to change only one field without repeating the other:

# Change only the model
openshell inference update --model nvidia/nemotron-3-nano-30b-a3b

# Switch providers without changing the model (a provider named openai-prod must already exist)
openshell inference update --provider openai-prod

Use the local endpoint from a sandbox

After inference is configured, code inside any sandbox can call https://inference.local directly:

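from openai import OpenAI

# The api_key is a placeholder; the privacy router injects the real credentials.
client = OpenAI(base_url="https://inference.local/v1", api_key="unused")

response = client.chat.completions.create(
    model="anything",  # rewritten to the configured model before forwarding
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)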

The model and api_key values supplied by the client are not sent upstream. The privacy router injects the real credentials from the configured provider and rewrites the model before forwarding.

Some SDKs require a non-empty API key value even though inference.local does not use it. Pass any placeholder such as test or unused in those cases.

Verify end-to-end from a sandbox

To confirm connectivity from inside a sandbox:

curl https://inference.local/v1/responses \
    -H "Content-Type: application/json" \
    -d '{
      "instructions": "You are a helpful assistant.",
      "input": "Hello!"
    }'

A successful response confirms the privacy router can reach the configured backend and the model is serving requests.
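Where curl is not available in the sandbox image, the same check can be made from Python with the standard library. The following is a minimal sketch that mirrors the curl request above; the function names are hypothetical, and it assumes the sandbox's Python trusts the certificate presented for inference.local.

```python
import json
import urllib.request

INFERENCE_URL = "https://inference.local/v1/responses"


def build_responses_request(user_input, instructions="You are a helpful assistant."):
    """Build (but do not send) the same POST request as the curl example."""
    body = json.dumps({"instructions": instructions, "input": user_input}).encode("utf-8")
    return urllib.request.Request(
        INFERENCE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_inference(timeout_s=30):
    """Send the request and return (status, parsed JSON body).

    Only meaningful inside a sandbox, where inference.local is intercepted.
    Raises urllib.error.URLError or HTTPError on failure.
    """
    request = build_responses_request("Hello!")
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status, json.load(response)
```

Calling check_inference() from inside a sandbox sends the request. Building the request separately makes it possible to inspect the payload without network access.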

How gateway-level config applies

Gateway-scoped: The active provider and model apply to every sandbox using that gateway. All sandboxes see the same inference.local backend.
HTTPS only: inference.local is intercepted only for HTTPS traffic.
Hot-refresh: Provider and inference changes are picked up within about 5 seconds by default. Sandboxes do not need to be restarted.
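Because changes propagate after a short delay rather than instantly, a script that changes the route and then immediately tests it may need to wait. The following is a minimal polling sketch; wait_for is a hypothetical helper, and the check you pass in (for example, a test request from inside the sandbox) is your own.

```python
import time


def wait_for(check, timeout_s=10.0, interval_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout. check() is always called
    at least once, even when timeout_s is 0.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example with a stand-in check that succeeds on the third attempt.
attempts = []
ok = wait_for(lambda: attempts.append(1) or len(attempts) >= 3, interval_s=0.01)
print("ready" if ok else "timed out", "after", len(attempts), "attempts")
```

The default timeout of 10 seconds is an arbitrary margin over the roughly 5-second refresh window mentioned above; adjust it to your setup.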

Self-hosted NIM endpoint

To configure inference.local to forward to a self-hosted NVIDIA NIM instance:

# Create an OpenAI-compatible provider pointing at your NIM endpoint
openshell provider create \
    --name my-nim \
    --type openai \
    --credential OPENAI_API_KEY=<nim_api_key> \
    --config OPENAI_BASE_URL=https://your-nim-host.example.com/v1

# Set inference routing to use it
openshell inference set \
    --provider my-nim \
    --model meta/llama-3.1-8b-instruct

Replace the base URL, API key, and model ID with values from your NIM deployment.

Local inference with Ollama

To point inference.local at an Ollama instance running on the same host as the gateway:

openshell provider create \
    --name ollama-local \
    --type openai \
    --credential OPENAI_API_KEY=ollama \
    --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1

openshell inference set \
    --provider ollama-local \
    --model llama3.2

If the gateway runs on a remote host, host.openshell.internal points to that remote machine, not your laptop. A locally running Ollama process is not reachable from a remote gateway unless you add a tunnel or shared network path.

Ollama also supports cloud-hosted models using the :cloud tag suffix (for example, qwen3.5:cloud), which do not require local hardware. For a fully self-contained Ollama setup — with Ollama running inside the sandbox itself — see the local inference tutorial.

Troubleshooting

Endpoint probe fails on inference set

openshell inference set verifies the upstream endpoint before saving by default. If the model server is not running yet, use --no-verify to save the config first and retry verification later.

Requests fail with a connection error inside the sandbox

Check that the upstream server is bound to 0.0.0.0 rather than 127.0.0.1. Requests to inference.local originate from the gateway runtime, so loopback addresses are not reachable. Use host.openshell.internal or the host’s LAN IP in the provider’s OPENAI_BASE_URL.

SDK rejects an empty API key

Some SDKs validate that the API key field is non-empty before sending the request. Pass any non-empty placeholder; inference.local ignores whatever value the sandbox provides.

Changes not taking effect

Hot-refresh propagates within about 5 seconds. If a sandbox still uses the old config after that window, check openshell inference get to confirm the saved configuration is correct.

Next steps

Inference routing overview

Understand the two routing paths and supported API patterns.

Manage providers

View and update provider credential records.

Local inference tutorial

Complete walkthrough for local inference with Ollama and LM Studio.
