This page covers the managed local inference endpoint (https://inference.local). External inference endpoints are controlled by sandbox network_policies; see Policies for details.

The configuration requires two values:

Value            Description
Provider record  The credential backend OpenShell uses to authenticate with the upstream model host.
Model ID         The model to use for generation requests.

For a list of tested providers and their base URLs, see Providers.

Configure the inference backend

1. Create a provider

Create a provider record that holds the backend credentials OpenShell will use when forwarding requests from inference.local.
openshell provider create --name nvidia-prod --type nvidia --from-existing
With --from-existing, the command reads NVIDIA_API_KEY from your environment.
2. Set inference routing

Point inference.local at the provider you created and choose the model to use:
openshell inference set \
    --provider nvidia-prod \
    --model nvidia/nemotron-3-nano-30b-a3b
By default, openshell inference set probes the resolved upstream endpoint before saving. If the endpoint is not live yet, add --no-verify to persist the route without the probe.
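When the route is set from a provisioning script, the choice between probing and skipping the probe can be made explicit. The following is a minimal sketch that only assembles the command line; the helper name inference_set_argv is hypothetical, and the only flags used are the ones shown on this page (--provider, --model, --no-verify).

```python
import shlex


def inference_set_argv(provider: str, model: str, verify: bool = True) -> list[str]:
    """Build the argument list for `openshell inference set`.

    Appends --no-verify when verify is False, so the route is saved
    without probing the upstream endpoint.
    """
    argv = ["openshell", "inference", "set", "--provider", provider, "--model", model]
    if not verify:
        argv.append("--no-verify")
    return argv


argv = inference_set_argv("nvidia-prod", "nvidia/nemotron-3-nano-30b-a3b", verify=False)
print(shlex.join(argv))

# To actually run it (requires the openshell CLI on PATH):
#   import subprocess
#   subprocess.run(argv, check=True)
```

Building the argument list separately from running it keeps the verify decision visible and easy to log before anything is executed.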
3. Verify the configuration

Confirm the provider and model are set correctly:
openshell inference get
Gateway inference:

  Provider: nvidia-prod
  Model: nvidia/nemotron-3-nano-30b-a3b
  Version: 1
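For scripted checks, the output above can be turned into a dictionary. This is a minimal sketch that assumes the CLI prints "Key: value" lines exactly as in the sample; the helper name parse_inference_get is hypothetical, and the real output layout may differ between versions.

```python
def parse_inference_get(output: str) -> dict[str, str]:
    """Parse 'Key: value' lines from `openshell inference get` output.

    Assumes the layout of the sample output. Lines with nothing after the
    colon (such as the 'Gateway inference:' header) and blank lines are skipped.
    """
    fields: dict[str, str] = {}
    for line in output.splitlines():
        # Split on the first colon only, so values may contain colons or slashes.
        key, sep, value = line.strip().partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


# Sample text copied from the output shown above.
SAMPLE = """Gateway inference:

  Provider: nvidia-prod
  Model: nvidia/nemotron-3-nano-30b-a3b
  Version: 1
"""

config = parse_inference_get(SAMPLE)
print(config["Provider"], config["Model"], config["Version"])
```

In a real script, the text could come from subprocess.run(["openshell", "inference", "get"], capture_output=True, text=True).stdout instead of the hard-coded sample.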

Update part of the config

Use openshell inference update when you want to change only one field without repeating the other:
# Change only the model
openshell inference update --model nvidia/nemotron-3-nano-30b-a3b

# Switch providers without changing the model
openshell inference update --provider openai-prod

Use the local endpoint from a sandbox

After inference is configured, code inside any sandbox can call https://inference.local directly:
from openai import OpenAI

client = OpenAI(base_url="https://inference.local/v1", api_key="unused")

response = client.chat.completions.create(
    model="anything",
    messages=[{"role": "user", "content": "Hello"}],
)
The model and api_key values supplied by the client are not sent upstream. The privacy router injects the real credentials from the configured provider and rewrites the model before forwarding.
Some SDKs require a non-empty API key value even though inference.local does not use it. Pass any placeholder such as test or unused in those cases.

Verify end-to-end from a sandbox

To confirm connectivity from inside a sandbox:
curl https://inference.local/v1/responses \
    -H "Content-Type: application/json" \
    -d '{
      "instructions": "You are a helpful assistant.",
      "input": "Hello!"
    }'
A successful response confirms the privacy router can reach the configured backend and the model is serving requests.
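The same check can be made from Python using only the standard library. The sketch below mirrors the curl command: it posts the same JSON body to the same path. The function name build_responses_request is hypothetical, and the commented-out send step assumes the code runs inside a sandbox where https://inference.local resolves.

```python
import json
import urllib.request


def build_responses_request(
    user_input: str,
    base_url: str = "https://inference.local/v1",
) -> urllib.request.Request:
    """Build a POST request to <base_url>/responses with the curl example's body."""
    payload = {
        "instructions": "You are a helpful assistant.",
        "input": user_input,
    }
    return urllib.request.Request(
        url=f"{base_url}/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_responses_request("Hello!")
print(request.get_method(), request.full_url)

# Inside a sandbox, send the request and inspect the reply:
#   with urllib.request.urlopen(request, timeout=30) as resp:
#       print(resp.status, json.load(resp))
```

As in the curl example, no model or API key is supplied, since the page states that the privacy router injects credentials and rewrites the model before forwarding.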

How gateway-level config applies

  • Gateway-scoped: The active provider and model apply to every sandbox using that gateway. All sandboxes see the same inference.local backend.
  • HTTPS only: inference.local is intercepted only for HTTPS traffic.
  • Hot-refresh: Provider and inference changes are picked up within about 5 seconds by default. Sandboxes do not need to be restarted.
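Because changes propagate after a short delay rather than instantly, a script that updates the configuration and immediately tests it from a sandbox may still hit the previous backend. One way to handle this is to poll until a check passes. The sketch below is a generic helper, not part of OpenShell; wait_until and its check callable are hypothetical, and what the check inspects (for example, a test request succeeding) is up to you.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 10.0, interval: float = 1.0) -> bool:
    """Call check() repeatedly until it returns True or timeout seconds elapse.

    Returns True as soon as check() passes, False if the deadline is reached.
    check() is always called at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a stand-in check that passes on the third attempt.
attempts = iter([False, False, True])
print(wait_until(lambda: next(attempts), timeout=1.0, interval=0.01))
```

The default timeout of 10 seconds is an assumption chosen to leave headroom over the roughly 5-second refresh window described above.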

Self-hosted NIM endpoint

To configure inference.local to forward to a self-hosted NVIDIA NIM instance:
# Create an OpenAI-compatible provider pointing at your NIM endpoint
openshell provider create \
    --name my-nim \
    --type openai \
    --credential OPENAI_API_KEY=<nim_api_key> \
    --config OPENAI_BASE_URL=https://your-nim-host.example.com/v1

# Set inference routing to use it
openshell inference set \
    --provider my-nim \
    --model meta/llama-3.1-8b-instruct
Replace the base URL, API key, and model ID with values from your NIM deployment.

Local inference with Ollama

To point inference.local at an Ollama instance running on the same host as the gateway:
openshell provider create \
    --name ollama-local \
    --type openai \
    --credential OPENAI_API_KEY=ollama \
    --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1

openshell inference set \
    --provider ollama-local \
    --model llama3.2
If the gateway runs on a remote host, host.openshell.internal points to that remote machine, not your laptop. A locally running Ollama process is not reachable from a remote gateway unless you add a tunnel or shared network path.
Ollama also supports cloud-hosted models through the :cloud tag suffix (for example, qwen3.5:cloud); these do not require local hardware. For a fully self-contained setup, with Ollama running inside the sandbox itself, see the local inference tutorial.

Troubleshooting

  • Endpoint probe fails on inference set: openshell inference set verifies the upstream endpoint before saving by default. If the model server is not running yet, use --no-verify to save the config first and retry verification later.
  • Requests fail with a connection error inside the sandbox: Check that the upstream server is bound to 0.0.0.0 rather than 127.0.0.1. Requests to inference.local originate from the gateway runtime, so loopback addresses are not reachable. Use host.openshell.internal or the host’s LAN IP in the provider’s OPENAI_BASE_URL.
  • SDK rejects an empty API key: Some SDKs validate that the API key field is non-empty before sending the request. Pass any non-empty placeholder; inference.local ignores whatever value the sandbox provides.
  • Changes not taking effect: Hot-refresh propagates within about 5 seconds. If a sandbox still uses the old config after that window, check openshell inference get to confirm the saved configuration is correct.
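The loopback problem behind the connection-error case can be caught before a provider is created. The sketch below flags a base URL whose host is a loopback address; the helper name is_loopback_base_url is hypothetical, and the check looks only at the URL text without resolving DNS names.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_base_url(base_url: str) -> bool:
    """Return True if the URL's host is 'localhost' or a loopback IP literal."""
    host = urlsplit(base_url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example host.openshell.internal); no DNS lookup is done.
        return False


for url in (
    "http://127.0.0.1:11434/v1",
    "http://localhost:11434/v1",
    "http://host.openshell.internal:11434/v1",
):
    print(url, is_loopback_base_url(url))
```

A True result suggests the value should be replaced with host.openshell.internal or the host's LAN IP before it is passed as OPENAI_BASE_URL.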

Next steps

  • Inference routing overview: Understand the two routing paths and supported API patterns.
  • Manage providers: View and update provider credential records.
  • Local inference tutorial: Complete walkthrough for local inference with Ollama and LM Studio.