Infrastructure · 45 min read · Intermediate

Local LLM Guide

Run OpenClaw with local LLMs using Ollama and self-hosted models for privacy, lower long-term cost, and offline operation.

Published March 15, 2026 · Updated March 23, 2026
Tags: local LLM OpenClaw, Ollama OpenClaw, self-hosted AI assistant

Complete Guide to Local LLMs + OpenClaw



Introduction

Concerned about privacy? Don’t want to depend on OpenAI? This guide shows you how to run large language models locally and build a fully private AI assistant with OpenClaw.

What you will learn:

  • How to choose a local LLM
  • Ollama installation and configuration
  • OpenClaw integration
  • Performance optimization

Estimated time: 45 minutes
Difficulty: Intermediate


Why Use a Local LLM?

Advantages

| Category | Cloud API | Local LLM |
| --- | --- | --- |
| Privacy | Data is sent to a third party | Fully local, zero external leakage |
| Cost | Pay per token | One-time hardware investment |
| Latency | Depends on the network | No network round-trip; limited only by local hardware |
| Control | Limited | Full control |
| Offline use | Requires network access | Runs fully offline |

Good Fit Scenarios

Privacy-sensitive applications

  • Medical consultation
  • Legal document analysis
  • Internal enterprise knowledge bases

Cost-sensitive workloads

  • High-frequency calls
  • Large-text processing
  • Long-running services

Offline environments

  • Internal network deployment
  • Edge devices
  • Regions with unstable connectivity

Hardware Requirements

Model Size vs VRAM Requirements

| Model | Parameters | VRAM Requirement | Suitable Hardware |
| --- | --- | --- | --- |
| TinyLlama | 1.1B | 2GB | Laptop CPU |
| Llama 3.1 | 8B | 6GB | GTX 1060 |
| Llama 3.1 | 70B | 40GB+ | 2x RTX 4090 / A100 |
| Mixtral 8x7B | 47B | 28GB | RTX 3090 |
| Qwen 2.5 | 72B | 40GB+ | Multi-GPU / cloud instance |

Recommended Configurations

Entry level (7B-8B models):

  • GPU: GTX 1060 6GB / RTX 3060
  • RAM: 16GB
  • Storage: 50GB SSD

Advanced (13B-30B models):

  • GPU: RTX 3090 24GB / RTX 4090
  • RAM: 32GB
  • Storage: 100GB SSD

Professional (70B+ models):

  • GPU: 2x RTX 4090 / A100 80GB
  • RAM: 64GB+
  • Storage: 200GB NVMe
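
If you are not sure which tier you fall into, it is worth checking your actual VRAM, RAM, and disk space before picking a model. A quick sketch for a Linux machine with an NVIDIA card (commands differ on macOS and Windows):

# Total and free VRAM on an NVIDIA GPU
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv

# System RAM
free -h

# Free disk space for model files
df -h ~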

Step 1: Install Ollama

macOS

# Use Homebrew
brew install ollama

# Start the service
brew services start ollama

Linux

# One-command installation
curl -fsSL https://ollama.com/install.sh | sh

# Start the service
sudo systemctl start ollama
sudo systemctl enable ollama

Windows

# Download the installer
# https://ollama.com/download/windows

# Or use WSL2 (recommended)
wsl --install
# Then run the Linux installation command in WSL

Docker (Recommended for Servers)

# Run Ollama
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
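
Whichever install method you used, confirm the server is up before moving on. Ollama listens on port 11434 by default and answers a plain HTTP request on its root path:

# Check the CLI version
ollama --version

# Check that the server is listening (should print "Ollama is running")
curl http://localhost:11434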

Step 2: Download Models

Browse and List Models

# List models already downloaded to this machine
ollama list

# Browse available models in the official library:
# https://ollama.com/library

Download Recommended Models

# Llama 3.1 (8B) - Balanced performance and quality
ollama pull llama3.1

# Llama 3.1 (70B) - Best quality (requires a lot of VRAM)
ollama pull llama3.1:70b

# Mixtral - Excellent multilingual capability
ollama pull mixtral

# Qwen 2.5 - Excellent Chinese performance
ollama pull qwen2.5

# Phi-4 - Small Microsoft model, fast
ollama pull phi4

Verify the Installation

# Test the model
ollama run llama3.1

# Enter a test message
>>> Hello, please introduce yourself

# Exit
/bye
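
The interactive prompt is handy, but you can also test non-interactively through Ollama's HTTP API, which is what OpenClaw will talk to later. A minimal check, assuming the default port:

# One-shot generation request against the local API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Introduce yourself in one sentence.",
  "stream": false
}'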

Step 3: Configure OpenClaw

3.1 Install the Ollama Skill

openclaw skill install ollama

3.2 Configure the Connection

Edit the OpenClaw config:

openclaw config edit

Add the Ollama configuration:

# ~/.openclaw/config.yaml
models:
  local-llama:
    provider: ollama
    base_url: "http://localhost:11434"  # Ollama default port
    model: "llama3.1"
    temperature: 0.7
    
  local-mixtral:
    provider: ollama
    base_url: "http://localhost:11434"
    model: "mixtral"
    temperature: 0.8

# Set the default model
default_model: local-llama

3.3 Create a Local Agent

openclaw agent create local-assistant

Configuration:

# ~/.openclaw/agents/local-assistant.yaml
agent:
  name: "Local AI Assistant"
  description: "Fully private AI assistant"
  
  model: local-llama  # Use the model configured above
  
  system_prompt: |
    You are a helpful AI assistant running locally.
    You respect user privacy and never share data externally.
    
  memory:
    enabled: true
    type: local  # Store locally, not in the cloud

3.4 Test the Connection

# Test the local model
openclaw chat local-assistant "Hello, are you running locally?"

# Expected reply:
# "Yes, I'm running entirely on your local machine..."

Step 4: Advanced Configuration

4.1 Model Quantization (Save VRAM)

Quantization reduces model size so larger models can run on smaller GPUs:

# Download the quantized version (4-bit)
ollama pull llama3.1:70b-instruct-q4_0

# Comparison:
# 70b (FP16): ~140GB
# 70b-q4_0: ~40GB

Quantization levels:

  • q4_0: 4-bit, smallest, fastest, with slightly lower quality
  • q5_0: 5-bit, balanced choice
  • q8_0: 8-bit, close to original quality
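
If you are unsure which quantization a local tag actually uses, Ollama can print a model's metadata, including parameter count, context length, and quantization level; the exact fields shown may vary by Ollama version:

# Inspect a downloaded model's details (architecture, parameters, quantization)
ollama show llama3.1

# Show the on-disk size of every local model
ollama list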

4.2 Multi-Model Switching

# Switch models based on the task
routes:
  chat:
    - pattern: "*"
      agent: local-assistant
      model: local-llama  # Default
      
    - pattern: "*code*"
      agent: local-assistant
      model: local-code  # Model for code
      
    - pattern: "*translate*"
      agent: local-assistant
      model: local-mixtral  # Multilingual model

4.3 Context Length Tuning

models:
  local-llama:
    provider: ollama
    model: "llama3.1"
    context_window: 8192  # Default 4096
    max_tokens: 2048
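
The context_window setting above only tells OpenClaw how much text to send. On the Ollama side the equivalent knob is num_ctx, which can be passed per request or baked into a custom model. A sketch using the HTTP API:

# Ask Ollama for a larger context window on a single request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Summarize this long document...",
  "options": {"num_ctx": 8192}
}'

Larger contexts consume more VRAM, so raise num_ctx together with the hardware budget from the table above.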

Step 5: Performance Optimization

5.1 GPU Acceleration

NVIDIA GPU:

# Make sure CUDA is installed
docker run -d \
  --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama

AMD GPU:

# Use ROCm
docker run -d \
  --device /dev/kfd \
  --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama:rocm

Apple Silicon:

# Metal is enabled automatically
ollama serve
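
To confirm a model is actually running on the GPU rather than silently falling back to CPU, check the loaded models and their processor placement; on Linux with NVIDIA hardware you can also watch utilization live:

# Show loaded models; the PROCESSOR column reports the GPU/CPU split
ollama ps

# Watch GPU utilization while a prompt is running (Linux)
watch -n 1 nvidia-smi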

5.2 CPU Optimization

If you do not have a GPU, optimize CPU inference:

# Number of requests the server handles in parallel
export OLLAMA_NUM_PARALLEL=4

# Enable memory mapping (reduce RAM usage)
export OLLAMA_USE_MMAP=true

ollama serve

5.3 Concurrent Processing

# OpenClaw configuration
models:
  local-llama:
    provider: ollama
    model: "llama3.1"
    max_concurrent: 4  # Maximum concurrent requests
    timeout: 60  # Timeout in seconds
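
max_concurrent only controls how many requests OpenClaw sends at once. Recent Ollama versions also have their own server-side limits, set through environment variables before starting the server; a sketch, with values to tune for your hardware:

# How many requests a single loaded model serves in parallel
export OLLAMA_NUM_PARALLEL=4

# How many different models may stay loaded at the same time
export OLLAMA_MAX_LOADED_MODELS=2

ollama serve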

5.4 Caching Strategy

agent:
  cache:
    enabled: true
    ttl: 3600  # Cache for 1 hour
    similar_responses: true  # Reuse answers for similar questions

Step 6: Production Deployment

6.1 Internal Network Deployment

# Listen only on the internal network
server:
  host: 192.168.1.100  # Internal network IP
  port: 8080

models:
  local-llama:
    provider: ollama
    base_url: "http://192.168.1.100:11434"  # Same machine or another machine on the internal network

6.2 Multi-Machine Deployment

Architecture:

[OpenClaw Server] ←→ [Ollama Server 1]
                  ←→ [Ollama Server 2]
                  ←→ [Ollama Server 3]

Configure load balancing:

models:
  local-llama:
    provider: ollama
    base_urls:
      - "http://ollama-1:11434"
      - "http://ollama-2:11434"
      - "http://ollama-3:11434"
    load_balance: round_robin
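
Whatever load-balancing scheme you use, it is worth verifying that every backend is reachable and has the model pulled. A small sketch using the hostnames from the config above (adjust to your environment):

# Ping each Ollama backend and list its local models
for host in http://ollama-1:11434 http://ollama-2:11434 http://ollama-3:11434; do
  echo "== $host"
  curl -fsS "$host/api/tags" || echo "unreachable"
done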

6.3 Model Preloading

Avoid latency on the first request:

# Load the model automatically at startup
ollama run llama3.1 &

# Or keep the model resident via an environment variable
export OLLAMA_KEEP_ALIVE=24h  # Keep the model loaded for 24 hours
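
Another way to warm a model up, without keeping an interactive session open, is to send a request with no prompt: Ollama loads the model into memory, and the keep_alive field in the request controls how long it stays resident.

# Load llama3.1 into memory and keep it there for 24 hours
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "keep_alive": "24h"
}'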

Step 7: Troubleshooting

Issue 1: Out of VRAM (OOM)

Symptom: CUDA out of memory

Fix:

# 1. Use a smaller model
ollama pull llama3.1  # Use 8B instead of 70B

# 2. Use a quantized version
ollama pull llama3.1:70b-instruct-q4_0  # 4-bit 70B instead of full precision

# 3. Reduce the context length
# Set `context_window: 2048` in the configuration

Issue 2: Responses are too slow

Symptom: A single token takes several seconds to generate

Fix:

# 1. Confirm the GPU is being used
nvidia-smi  # Check GPU utilization

# 2. Use a faster quantization level
ollama pull llama3.1:8b-instruct-q4_0  # roughly 2-3x faster than q8_0

# 3. Enable streaming responses
# Add `stream: true` to the agent configuration

Issue 3: Model download failed

Symptom: The pull command hangs or fails

Fix:

# 1. Check the network
ping ollama.com

# 2. Use a mirror
export OLLAMA_REGISTRY="https://ollama.mirror.example.com"

# 3. Download manually
# Download the GGUF file from HuggingFace
# Then run: `ollama create mymodel -f Modelfile`
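
For the manual route, a Modelfile is just a small text file that points Ollama at the downloaded GGUF and optionally sets parameters. A minimal sketch; the GGUF filename is only a placeholder for whatever you downloaded:

# Modelfile
FROM ./my-model-q4_k_m.gguf
PARAMETER temperature 0.7
SYSTEM "You are a helpful local assistant."

# Build and test the custom model
ollama create mymodel -f Modelfile
ollama run mymodel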

Issue 4: Garbled Chinese output

Symptom: Chinese text is displayed as question marks or corrupted characters

Fix:

# Use a model optimized for Chinese
model: qwen2.5  # OR
model: llama3.1-chinese  # Community fine-tuned Chinese version

# Set a system prompt that tells the model to answer in Chinese
# ("Answer all questions in Chinese. Keep answers concise and clear.")
system_prompt: |
  请用中文回答所有问题。
  保持回答简洁明了。

Model Recommendations

Choose by Use Case

| Scenario | Recommended Model | Reason |
| --- | --- | --- |
| General conversation | Llama 3.1 8B | Balanced performance and quality |
| Code generation | CodeLlama 34B | Strong code understanding |
| Chinese tasks | Qwen 2.5 72B | Optimized for Chinese |
| Long documents | Mixtral 8x7B | 32K context |
| Edge devices | Phi-4 | Small model, fast |
| Creative writing | Llama 3.1 70B | High-quality generation |

Performance Comparison

| Model | VRAM | Speed (tok/s) | Quality |
| --- | --- | --- | --- |
| Llama 3.1 8B | 6GB | 50+ | ⭐⭐⭐⭐ |
| Llama 3.1 70B | 40GB+ | 10+ | ⭐⭐⭐⭐⭐ |
| Mixtral 8x7B | 28GB | 15+ | ⭐⭐⭐⭐⭐ |
| Qwen 2.5 72B | 40GB+ | 12+ | ⭐⭐⭐⭐⭐ |
| Phi-4 | 4GB | 80+ | ⭐⭐⭐ |

Security Best Practices

1. Network Isolation

# Allow only internal network access
server:
  host: 127.0.0.1  # Or an internal network IP
  
# Firewall rule
# sudo ufw allow from 192.168.1.0/24 to any port 11434

2. Access Control

# API key validation
security:
  api_key_required: true
  allowed_ips:
    - "192.168.1.0/24"
    - "10.0.0.0/8"

3. Data Protection

agent:
  memory:
    encryption: true  # Encrypt stored conversation history
    retention: 7d     # Automatically delete after 7 days

Cost Analysis

Self-Hosted vs Cloud API (Monthly)

Scenario: 1 million tokens per day

| Option | Hardware Cost | Electricity | Total | Equivalent API Cost |
| --- | --- | --- | --- | --- |
| Self-hosted RTX 4090 | $100/month (amortized) | $30 | $130 | ~$1,500 (GPT-4) |
| Cloud server (A100) | $500/month | - | $500 | ~$1,500 |
| Laptop CPU | $0 | $10 | $10 | ~$300 (GPT-3.5) |

ROI: Self-hosting typically pays off after 3-6 months





Last updated: March 2026