Run local inference in a sandbox

This tutorial covers two ways to run local inference with OpenShell: using Ollama or using LM Studio. Both approaches expose a local model backend through inference.local so that agents inside a sandbox can make inference requests without reaching external APIs.

Ollama

Ollama offers two approaches: a self-contained community sandbox with Ollama pre-installed, or routing sandbox inference to a host-level Ollama instance shared across multiple sandboxes.

Prerequisites

A working OpenShell installation. Complete the Quickstart before proceeding.

Option A: Ollama community sandbox (recommended)

The Ollama community sandbox bundles Ollama, Claude Code, OpenCode, and Codex into a single image. Ollama starts automatically when the sandbox launches.

Create the sandbox

$ openshell sandbox create --from ollama

This pulls the community sandbox image, applies the bundled policy, and drops you into a shell with Ollama running.

Run a model

Chat with a local model:

$ ollama run qwen3.5

Or run a cloud-hosted model (no local GPU required):

$ ollama run kimi-k2.5:cloud

To start a coding agent with Ollama as the model backend, use ollama launch:

$ ollama launch claude
$ ollama launch codex
$ ollama launch opencode

For CI/CD and automated workflows, ollama launch supports a headless mode:

$ ollama launch claude --yes --model qwen3.5

Model recommendations

Use case	Model	Notes
Smoke test	`qwen3.5:0.8b`	Fast and lightweight, good for verifying setup
Coding and reasoning	`qwen3.5`	Strong tool calling support for agentic workflows
Complex tasks	`nemotron-3-super`	122B parameter model, requires 48 GB+ VRAM
No local GPU	`qwen3.5:cloud`	Runs on Ollama’s cloud infrastructure, no `ollama pull` required

Cloud models use the :cloud tag suffix and do not require local hardware.
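
As a small illustration of the tag convention, the sketch below decides whether a model needs a local download before first use. The function names `is_cloud_model` and `prepare_model` are hypothetical helpers, not part of Ollama or OpenShell; the only real command they call is `ollama pull`.

```shell
# Hypothetical helper: succeeds when the model tag ends in ":cloud".
is_cloud_model() {
  case "$1" in
    *:cloud) return 0 ;;
    *)       return 1 ;;
  esac
}

# Hypothetical helper: download local models ahead of time, skip cloud tags.
prepare_model() {
  if is_cloud_model "$1"; then
    echo "$1 is a cloud model, nothing to pull"
  else
    ollama pull "$1"
  fi
}
```

For example, `prepare_model qwen3.5:cloud` only prints a message, while `prepare_model qwen3.5:0.8b` runs `ollama pull qwen3.5:0.8b`.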

Tool calling

Agentic workflows (Claude Code, Codex, OpenCode) rely on tool calling. The following models have reliable tool calling support: Qwen 3.5, Nemotron-3-Super, GLM-5, and Kimi-K2.5. Check the Ollama model library for the latest additions.

Updating Ollama

To update Ollama inside a running sandbox:

$ update-ollama

To auto-update on every sandbox start:

$ openshell sandbox create --from ollama -e OLLAMA_UPDATE=1

Option B: Host-level Ollama

Use this approach when you want a single Ollama instance on the gateway host, shared across multiple sandboxes through inference.local.

This approach uses Ollama because it is easy to install and run locally, but you can substitute other inference engines such as vLLM, SGLang, TRT-LLM, and NVIDIA NIM by changing the startup command, base URL, and model name.

Install and start Ollama

Install Ollama on the gateway host:

$ curl -fsSL https://ollama.com/install.sh | sh

Start Ollama on all interfaces so it is reachable from sandboxes:

$ OLLAMA_HOST=0.0.0.0:11434 ollama serve

If you see Error: listen tcp 0.0.0.0:11434: bind: address already in use, Ollama is already running as a system service. Stop it first:

$ sudo systemctl stop ollama
$ OLLAMA_HOST=0.0.0.0:11434 ollama serve

Pull a model

In a second terminal, download and start a model. `ollama run` pulls the model on first use:

$ ollama run qwen3.5:0.8b

Type /bye to exit the interactive session. The model stays loaded.

Create a provider

Create an OpenAI-compatible provider pointing at the host Ollama instance:

$ openshell provider create \
    --name ollama \
    --type openai \
    --credential OPENAI_API_KEY=empty \
    --config OPENAI_BASE_URL=http://host.openshell.internal:11434/v1

OpenShell injects host.openshell.internal so sandboxes and the gateway can reach the host machine. You can also use the host’s LAN IP.

Set inference routing

$ openshell inference set --provider ollama --model qwen3.5:0.8b

Confirm the saved config:

$ openshell inference get
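
For repeatable setup, the provider and routing steps above can be wrapped in one function. This is a sketch rather than an official script: `setup_ollama_route` is a hypothetical name, it uses only the `openshell` commands and flags shown above, and it assumes no provider named `ollama` has been created yet.

```shell
# Hypothetical wrapper: register a host-level Ollama provider and route
# inference.local to it.
# Usage: setup_ollama_route <model> [base_url]
setup_ollama_route() {
  local model="${1:-}"
  local base_url="${2:-http://host.openshell.internal:11434/v1}"

  if [ -z "$model" ]; then
    echo "usage: setup_ollama_route <model> [base_url]" >&2
    return 2
  fi

  # Same provider definition as in the tutorial, with the base URL overridable.
  openshell provider create \
      --name ollama \
      --type openai \
      --credential OPENAI_API_KEY=empty \
      --config "OPENAI_BASE_URL=$base_url" || return 1

  # Point the gateway-wide route at the new provider, then print the saved config.
  openshell inference set --provider ollama --model "$model" || return 1
  openshell inference get
}
```

Calling `setup_ollama_route qwen3.5:0.8b` reproduces the commands above. Pass a second argument to use the host's LAN IP instead of `host.openshell.internal`.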

Verify from a sandbox

$ openshell sandbox create -- \
    curl https://inference.local/v1/chat/completions \
    --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}'

The response should be JSON from the model.
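
To turn this manual check into a pass/fail test, a script can inspect the response body. The sketch below assumes the backend returns an OpenAI-style chat completion, which carries a top-level `choices` field. `check_chat_response` is a hypothetical helper, and it does a plain substring match rather than real JSON parsing.

```shell
# Hypothetical helper: reads a response body on stdin and succeeds only if it
# looks like an OpenAI-style chat completion (contains a "choices" field).
check_chat_response() {
  local body
  body="$(cat)"
  case "$body" in
    *'"choices"'*)
      echo "ok: model responded"
      ;;
    *)
      echo "unexpected response: $body" >&2
      return 1
      ;;
  esac
}
```

Inside a sandbox, pipe the same request into it: `curl -fsS https://inference.local/v1/chat/completions --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}' | check_chat_response`. The `--json` option requires curl 7.82.0 or newer.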

Troubleshooting

Problem	Fix
Ollama not reachable from sandbox	Ollama must be bound to `0.0.0.0`, not `127.0.0.1`. The community sandbox handles this automatically.
Wrong `OPENAI_BASE_URL`	Use `http://host.openshell.internal:11434/v1`, not `localhost` or `127.0.0.1`.
Model not found	Run `ollama ps` to confirm the model is loaded. Run `ollama pull <model>` if needed.
HTTPS vs HTTP	Code inside sandboxes must call `https://inference.local`, not `http://`.
AMD GPU driver issues	Ollama v0.18+ requires ROCm 7 drivers for AMD GPUs. Update your drivers if you see GPU detection failures.

Useful commands for inspecting the current setup:

$ openshell status
$ openshell inference get
$ openshell provider get ollama
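
The base URL row in the table above can be checked mechanically before creating a provider. The sketch below is a hypothetical pre-flight helper (`base_url_ok` is not an OpenShell command) that rejects the two loopback forms the table warns about and accepts everything else.

```shell
# Hypothetical pre-flight check: fail when a provider base URL points at
# localhost or 127.0.0.1, which the troubleshooting table says not to use.
base_url_ok() {
  case "$1" in
    *://localhost*|*://127.0.0.1*)
      echo "loopback base URL, use host.openshell.internal instead: $1" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}
```

For example, `base_url_ok http://host.openshell.internal:11434/v1` succeeds, while `base_url_ok http://localhost:11434/v1` prints a warning and fails.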

LM Studio

LM Studio provides an easy-to-set-up local inference server with both OpenAI-compatible and Anthropic-compatible endpoints.

Prerequisites

A working OpenShell installation. Complete the Quickstart before proceeding.
LM Studio installed and running in the same environment as your gateway.

If you prefer to work without keeping the LM Studio app open, install lms (headless LM Studio):

$ curl -fsSL https://lmstudio.ai/install.sh | bash

Then start the daemon:

$ lms daemon up

Start the LM Studio local server

Start the local server from the Developer tab and verify the OpenAI-compatible endpoint is enabled.

LM Studio listens on 127.0.0.1:1234 by default. For use with OpenShell, configure it to listen on all interfaces (0.0.0.0):

GUI: Go to the Developer tab, select Server Settings, then enable Serve on Local Network.
Headless: Run lms server start --bind 0.0.0.0.

Download and load a model

In the LM Studio app, go to the Model Search tab to download a small model such as Qwen3.5 2B.

Using the CLI:

$ lms get qwen/qwen3.5-2b
$ lms load qwen/qwen3.5-2b

Add LM Studio as a provider

Choose the provider type that matches the client protocol you want to route through inference.local.

OpenAI-compatible

$ openshell provider create \
    --name lmstudio \
    --type openai \
    --credential OPENAI_API_KEY=lmstudio \
    --config OPENAI_BASE_URL=http://host.openshell.internal:1234/v1

Use this provider for clients that send OpenAI-compatible requests such as POST /v1/chat/completions or POST /v1/responses.

Anthropic-compatible

$ openshell provider create \
    --name lmstudio-anthropic \
    --type anthropic \
    --credential ANTHROPIC_API_KEY=lmstudio \
    --config ANTHROPIC_BASE_URL=http://host.openshell.internal:1234

Use this provider for Anthropic-compatible POST /v1/messages requests.

Configure LM Studio as the inference provider

Set the managed inference route for the active gateway.

OpenAI-compatible

$ openshell inference set --provider lmstudio --model qwen/qwen3.5-2b

Anthropic-compatible

$ openshell inference set --provider lmstudio-anthropic --model qwen/qwen3.5-2b

The active inference.local route is gateway-scoped, so only one provider and model pair is active at a time. Re-run openshell inference set whenever you want to switch between OpenAI-compatible and Anthropic-compatible clients.

Confirm the saved config:

$ openshell inference get

You should see Provider: lmstudio (or Provider: lmstudio-anthropic) along with Model: qwen/qwen3.5-2b.
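
Because only one route is active at a time, switching protocols is a frequent operation. The sketch below wraps it in a hypothetical `use_lmstudio` function. It assumes the two providers were created with the names used in this tutorial (`lmstudio` and `lmstudio-anthropic`) and defaults to the model loaded earlier.

```shell
# Hypothetical helper: switch the gateway-wide inference.local route between
# the two LM Studio providers defined in this tutorial.
# Usage: use_lmstudio <openai|anthropic> [model]
use_lmstudio() {
  local protocol="${1:-}"
  local model="${2:-qwen/qwen3.5-2b}"
  local provider

  case "$protocol" in
    openai)    provider="lmstudio" ;;
    anthropic) provider="lmstudio-anthropic" ;;
    *)
      echo "usage: use_lmstudio <openai|anthropic> [model]" >&2
      return 2
      ;;
  esac

  # Set the route, then print the saved config so the switch is visible.
  openshell inference set --provider "$provider" --model "$model" \
    && openshell inference get
}
```

Run `use_lmstudio anthropic` before starting a client that sends POST /v1/messages, and `use_lmstudio openai` to switch back.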

Verify from inside a sandbox

Run a request through https://inference.local:

OpenAI-compatible

$ openshell sandbox create -- \
    curl https://inference.local/v1/chat/completions \
    --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}'

Or using the Responses API:

$ openshell sandbox create -- \
    curl https://inference.local/v1/responses \
    --json '{
      "instructions": "You are a helpful assistant.",
      "input": "hello",
      "max_output_tokens": 10
    }'

Anthropic-compatible

$ openshell sandbox create -- \
    curl https://inference.local/v1/messages \
    --json '{"messages":[{"role":"user","content":"hello"}],"max_tokens":10}'

Troubleshooting

Problem	Fix
LM Studio server not reachable	Confirm the local server is running and listening on `0.0.0.0`, not `127.0.0.1`.
Wrong `OPENAI_BASE_URL`	Use `http://host.openshell.internal:1234/v1` for OpenAI-compatible providers.
Wrong `ANTHROPIC_BASE_URL`	Use `http://host.openshell.internal:1234` (no `/v1`) for Anthropic-compatible providers.
Model name mismatch	The model name in `openshell inference set` must match the name exposed by LM Studio exactly.
Gateway and LM Studio on different machines	`host.openshell.internal` resolves to the gateway host. If LM Studio runs elsewhere, use its LAN IP instead.

Useful commands for inspecting the current setup:

$ openshell status
$ openshell inference get
$ openshell provider get lmstudio
$ openshell provider get lmstudio-anthropic
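
For the model name mismatch case, LM Studio's OpenAI-compatible server lists the identifiers it exposes at GET /v1/models. The sketch below checks such a response for an exact model ID. `model_listed` is a hypothetical helper that matches the quoted ID as a substring instead of parsing the JSON.

```shell
# Hypothetical helper: reads a /v1/models response on stdin and succeeds if
# the given model ID appears in it as a complete quoted string.
# Usage: <models-json> | model_listed <model-id>
model_listed() {
  grep -qF -- "\"$1\""
}
```

On the gateway host, assuming the default port 1234: `curl -fsS http://127.0.0.1:1234/v1/models | model_listed qwen/qwen3.5-2b && echo found`. If the check fails, print the response and copy the ID exactly as listed into `openshell inference set`.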

GPU support for local inference

Both Ollama and LM Studio can use local GPU resources:

NVIDIA GPUs: Both tools support CUDA automatically when the appropriate drivers are installed. No additional configuration is required in OpenShell.
AMD GPUs: Ollama v0.18+ requires ROCm 7 drivers. LM Studio uses ROCm automatically on supported hardware.
Apple Silicon: Both tools use Metal for hardware acceleration on M-series Macs.
CPU fallback: If no GPU is detected, inference runs on CPU. For most coding assistant workloads, a small quantized model (such as qwen3.5:0.8b) runs acceptably on CPU.

GPU resources are available to Ollama and LM Studio running on the gateway host. Sandboxes themselves do not have direct GPU access; inference is routed from the sandbox through inference.local to the host-side backend.

What’s next

Managed inference

Learn how OpenShell routes inference requests and manages provider configuration.

Configure inference backends

Configure vLLM, SGLang, TRT-LLM, NVIDIA NIM, or any other OpenAI-compatible backend.

Community sandboxes

Explore pre-built sandbox images for common development workflows.

LM Studio CLI docs

Learn more about the lms CLI for headless LM Studio usage.
