Complete Guide to Local LLMs + OpenClaw
Target keywords: local LLM OpenClaw, Ollama OpenClaw, self-hosted AI, local AI assistant
Introduction
Concerned about privacy? Don’t want to depend on OpenAI? This guide shows you how to run large language models locally and build a fully private AI assistant with OpenClaw.
What you will learn:
- How to choose a local LLM
- Ollama installation and configuration
- OpenClaw integration
- Performance optimization
Estimated time: 45 minutes
Difficulty: Intermediate
Why Use a Local LLM?
Advantages
| Category | Cloud API | Local LLM |
|---|---|---|
| Privacy | Data is sent to a third party | Fully local, zero external leakage |
| Cost | Pay per token | One-time hardware investment |
| Latency | Depends on the network | No network round trip; limited only by local hardware |
| Control | Limited | Full control |
| Offline use | Requires network access | Runs fully offline |
Good Fit Scenarios
✅ Privacy-sensitive applications
- Medical consultation
- Legal document analysis
- Internal enterprise knowledge bases
✅ Cost-sensitive workloads
- High-frequency calls
- Large-text processing
- Long-running services
✅ Offline environments
- Internal network deployment
- Edge devices
- Regions with unstable connectivity
Hardware Requirements
Model Size vs VRAM Requirements
| Model | Parameters | VRAM Requirement | Suitable Hardware |
|---|---|---|---|
| TinyLlama | 1.1B | 2GB | Laptop CPU |
| Llama 3.1 | 8B | 6GB | GTX 1060 |
| Llama 3.1 | 70B | 40GB+ | 2x RTX 4090 / A100 80GB |
| Mixtral 8x7B | 47B | 28GB | 2x RTX 3090 / A6000 48GB |
| Qwen 2.5 | 72B | 40GB+ | Multi-GPU / cloud instance |
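As a rule of thumb, the memory a model needs is roughly its parameter count times the bytes stored per weight, plus 10 to 20 percent for the KV cache and runtime. The sketch below encodes that back-of-the-envelope estimate; the helper name and the overhead figure are assumptions, not Ollama behavior.
# Rough VRAM estimate: parameters (billions) x bits per weight / 8 = weight size in GB
# Usage: vram_estimate <params_in_billions> <bits_per_weight>
vram_estimate() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "~%.0f GB of weights (+10-20%% for KV cache and runtime)\n", p * b / 8 }'
}
vram_estimate 8 4     # Llama 3.1 8B, 4-bit   -> ~4 GB
vram_estimate 70 4    # Llama 3.1 70B, 4-bit  -> ~35 GB
vram_estimate 70 16   # Llama 3.1 70B, FP16   -> ~140 GB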
Recommended Configurations
Entry level (7B-8B models):
- GPU: GTX 1060 6GB / RTX 3060
- RAM: 16GB
- Storage: 50GB SSD
Advanced (13B-30B models):
- GPU: RTX 3090 24GB / RTX 4090
- RAM: 32GB
- Storage: 100GB SSD
Professional (70B+ models):
- GPU: 2x RTX 4090 / A100 80GB
- RAM: 64GB+
- Storage: 200GB NVMe
Step 1: Install Ollama
macOS
# Use Homebrew
brew install ollama
# Start the service
brew services start ollama
Linux
# One-command installation
curl -fsSL https://ollama.com/install.sh | sh
# Start the service
sudo systemctl start ollama
sudo systemctl enable ollama
Windows
# Download the installer
# https://ollama.com/download/windows
# Or use WSL2 if you prefer a Linux environment
wsl --install
# Then run the Linux installation command in WSL
Docker (Recommended for Servers)
# Run Ollama
docker run -d \
-v ollama:/root/.ollama \
-p 11434:11434 \
--name ollama \
ollama/ollama
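Whichever install path you chose, it is worth confirming the server is reachable before moving on. This is a minimal check against Ollama's HTTP API on its default port; no OpenClaw configuration is involved yet.
# The root endpoint returns "Ollama is running" when the server is healthy
curl http://localhost:11434/
# List the models the server has locally (empty until you pull one)
curl http://localhost:11434/api/tags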
Step 2: Download Models
View Available Models
# List the models you have already downloaded
ollama list
# Browse the full catalog of available models at https://ollama.com/library
Download Recommended Models
# Llama 3.1 (8B) - Balanced performance and quality
ollama pull llama3.1
# Llama 3.1 (70B) - Best quality (requires a lot of VRAM)
ollama pull llama3.1:70b
# Mixtral - Excellent multilingual capability
ollama pull mixtral
# Qwen 2.5 - Excellent Chinese performance
ollama pull qwen2.5
# Phi-4 (14B) - Compact Microsoft model, fast
ollama pull phi4
Verify the Installation
# Test the model
ollama run llama3.1
# Enter a test message
>>> Hello, please introduce yourself
# Exit
/bye
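If you prefer a scripted check over the interactive prompt, you can hit the same REST API that OpenClaw will use later. A minimal sketch; the prompt text is just an example.
# One-shot generation over HTTP; stream:false returns a single JSON object
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Introduce yourself in one sentence.",
  "stream": false
}'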
Step 3: Configure OpenClaw
3.1 Install the Ollama Skill
openclaw skill install ollama
3.2 Configure the Connection
Edit the OpenClaw config:
openclaw config edit
Add the Ollama configuration:
# ~/.openclaw/config.yaml
models:
local-llama:
provider: ollama
base_url: "http://localhost:11434" # Ollama default port
model: "llama3.1"
temperature: 0.7
local-mixtral:
provider: ollama
base_url: "http://localhost:11434"
model: "mixtral"
temperature: 0.8
# Set the default model
default_model: local-llama
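A typo in a model name is the most common cause of connection errors at this point, so it helps to confirm that every `model:` value in config.yaml matches a model Ollama actually has. A quick check, assuming jq is installed:
# The names printed here must match the model: values in ~/.openclaw/config.yaml
curl -s http://localhost:11434/api/tags | jq -r '.models[].name'
# Expected output includes entries such as:
# llama3.1:latest
# mixtral:latest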
3.3 Create a Local Agent
openclaw agent create local-assistant
Configuration:
# ~/.openclaw/agents/local-assistant.yaml
agent:
name: "Local AI Assistant"
description: "Fully private AI assistant"
model: local-llama # Use the model configured above
system_prompt: |
You are a helpful AI assistant running locally.
You respect user privacy and never share data externally.
memory:
enabled: true
type: local # Store locally, not in the cloud
3.4 Test the Connection
# Test the local model
openclaw chat local-assistant "Hello, are you running locally?"
# Expected reply:
# "Yes, I'm running entirely on your local machine..."
Step 4: Advanced Configuration
4.1 Model Quantization (Save VRAM)
Quantization reduces model size so larger models can run on smaller GPUs:
# Download a quantized build (4-bit); see https://ollama.com/library/llama3.1/tags for the exact tag names
ollama pull llama3.1:70b-q4_0
# Comparison:
# 70b (FP16): ~140GB
# 70b-q4_0: ~40GB
Quantization levels:
- q4_0: 4-bit, smallest and fastest, with slightly lower quality
- q5_0: 5-bit, a balanced choice
- q8_0: 8-bit, close to the original quality
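To see what a given quantization actually costs on disk (and roughly in VRAM), compare the pulled tags directly with standard Ollama commands; exact tag names vary by model, so treat the examples as placeholders.
# Show the size of every locally downloaded model and tag
ollama list
# Inspect one model's details (parameter count, quantization, context length)
ollama show llama3.1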
4.2 Multi-Model Switching
# Switch models based on the task
routes:
chat:
- pattern: "*"
agent: local-assistant
model: local-llama # Default
- pattern: "*code*"
agent: local-assistant
model: local-code # Model for code
- pattern: "*translate*"
agent: local-assistant
model: local-mixtral # Multilingual model
4.3 Context Length Tuning
models:
local-llama:
provider: ollama
model: "llama3.1"
context_window: 8192 # Default 4096
max_tokens: 2048
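How the OpenClaw skill maps `context_window` onto Ollama is up to the skill itself, but you can verify that the model accepts a longer context by passing Ollama's own `num_ctx` option directly. A sketch:
# Ask Ollama to allocate an 8192-token context for this single request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Say OK.",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
Keep in mind that a larger context window increases VRAM usage, so combine this with the quantization advice above.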
Step 5: Performance Optimization
5.1 GPU Acceleration
NVIDIA GPU:
# Requires the NVIDIA driver and NVIDIA Container Toolkit on the host
docker run -d \
--gpus all \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama
AMD GPU:
# Use ROCm
docker run -d \
--device /dev/kfd \
--device /dev/dri \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama:rocm
Apple Silicon:
# Metal is enabled automatically
ollama serve
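After enabling any of these, confirm the model really runs on the GPU instead of silently falling back to CPU. `ollama ps` reports where a loaded model lives, and on NVIDIA hardware nvidia-smi should show VRAM in use while a request is in flight.
# Trigger one generation so the model gets loaded, then check where it runs
ollama run llama3.1 "hello" > /dev/null
ollama ps    # The PROCESSOR column should read "100% GPU" (or a GPU/CPU split)
# On NVIDIA hardware, watch utilization while requests are in flight
watch -n 1 nvidia-smi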
5.2 CPU Optimization
If you do not have a GPU, optimize CPU inference:
# Allow up to 4 requests to be processed in parallel
export OLLAMA_NUM_PARALLEL=4
# Enable memory mapping (reduce RAM usage)
export OLLAMA_USE_MMAP=true
ollama serve
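The number of inference threads is controlled per request rather than by the environment variables above; Ollama exposes it as the `num_thread` option. A sketch, with 8 as an example value that you should match to your physical core count:
# Explicitly request 8 CPU threads for this generation
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Say OK.",
  "stream": false,
  "options": { "num_thread": 8 }
}'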
5.3 Concurrent Processing
# OpenClaw configuration
models:
local-llama:
provider: ollama
model: "llama3.1"
max_concurrent: 4 # Maximum concurrent requests
timeout: 60 # Timeout in seconds
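`max_concurrent` only throttles OpenClaw's side; for requests to actually run in parallel, the Ollama server has to allow it too. These are the standard Ollama environment variables, set before starting the server:
# Number of requests each loaded model may serve simultaneously
export OLLAMA_NUM_PARALLEL=4
# Number of different models allowed in memory at the same time
export OLLAMA_MAX_LOADED_MODELS=2
ollama serve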
5.4 Caching Strategy
agent:
cache:
enabled: true
ttl: 3600 # Cache for 1 hour
similar_responses: true # Reuse answers for similar questions
Step 6: Production Deployment
6.1 Internal Network Deployment
# Listen only on the internal network
server:
host: 192.168.1.100 # Internal network IP
port: 8080
models:
local-llama:
provider: ollama
base_url: "http://192.168.1.100:11434" # Same machine or another machine on the internal network
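Note that Ollama itself binds to 127.0.0.1 by default, so other machines on the internal network cannot reach it until you change its listen address via the OLLAMA_HOST environment variable. The IP and subnet below are the example values used in this guide:
# Make Ollama listen on the internal interface instead of localhost only
export OLLAMA_HOST=192.168.1.100:11434
ollama serve
# Restrict access to the internal subnet at the firewall
sudo ufw allow from 192.168.1.0/24 to any port 11434
If Ollama was installed as a systemd service, set the variable in the service environment (for example via `systemctl edit ollama`) rather than in a shell export.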
6.2 Multi-Machine Deployment
Architecture:
[OpenClaw Server] ←→ [Ollama Server 1]
←→ [Ollama Server 2]
←→ [Ollama Server 3]
Configure load balancing:
models:
local-llama:
provider: ollama
base_urls:
- "http://ollama-1:11434"
- "http://ollama-2:11434"
- "http://ollama-3:11434"
load_balance: round_robin
6.3 Model Preloading
Avoid latency on the first request:
# Load the model automatically at startup
ollama run llama3.1 &
# Or keep the model resident by setting this in the server environment before `ollama serve`
export OLLAMA_KEEP_ALIVE=24h # Keep the model loaded for 24 hours
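An alternative that works well from a startup script or cron job is a single empty request through the API with a `keep_alive` value; Ollama loads the model without opening an interactive session. A minimal sketch:
# Load llama3.1 into memory and keep it resident for 24 hours
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "keep_alive": "24h"
}'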
Step 7: Troubleshooting
Issue 1: Out of VRAM (OOM)
Symptom: CUDA out of memory
Fix:
# 1. Use a smaller model
ollama pull llama3.1 # Use 8B instead of 70B
# 2. Use a quantized version
ollama pull llama3.1:q4_0
# 3. Reduce the context length
# Set `context_window: 2048` in the configuration
Issue 2: Responses are too slow
Symptom: A single token takes several seconds to generate
Fix:
# 1. Confirm the GPU is being used
nvidia-smi # Check GPU utilization
# 2. Use a faster quantization level
ollama pull llama3.1:q4_0 # 2-3x faster than q8_0
# 3. Enable streaming responses
# Add `stream: true` to the agent configuration
Issue 3: Model download failed
Symptom: The pull command hangs or fails
Fix:
# 1. Check the network
ping ollama.com
# 2. Use a mirror
export OLLAMA_REGISTRY="https://ollama.mirror.example.com"
# 3. Download manually
# Download the GGUF file from HuggingFace
# Then run: `ollama create mymodel -f Modelfile`
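For reference, the manual import mentioned above only needs a one-line Modelfile that points at the downloaded GGUF file. The file name below is a placeholder; substitute whatever you downloaded from HuggingFace.
# Create a Modelfile pointing at the local GGUF weights (placeholder file name)
cat > Modelfile <<'EOF'
FROM ./llama-3.1-8b-instruct.Q4_K_M.gguf
EOF
# Register it with Ollama under a name of your choice, then test it
ollama create mymodel -f Modelfile
ollama run mymodel "hello"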
Issue 4: Garbled Chinese output
Symptom: Chinese text is displayed as question marks or corrupted characters
Fix:
# Use a model optimized for Chinese
model: qwen2.5 # OR
model: llama3.1-chinese # Community fine-tuned Chinese version
# Set the system prompt (Chinese: "Please answer all questions in Chinese. Keep answers concise and clear.")
system_prompt: |
请用中文回答所有问题。
保持回答简洁明了。
Model Recommendations
Choose by Use Case
| Scenario | Recommended Model | Reason |
|---|---|---|
| General conversation | Llama 3.1 8B | Balanced performance and quality |
| Code generation | CodeLlama 34B | Strong code understanding |
| Chinese tasks | Qwen 2.5 72B | Optimized for Chinese |
| Long documents | Mixtral 8x7B | 32K context |
| Edge devices | Phi-4 (14B) | Compact model, fast |
| Creative writing | Llama 3.1 70B | High-quality generation |
Performance Comparison
| Model | VRAM | Speed (tok/s) | Quality |
|---|---|---|---|
| Llama 3.1 8B | 6GB | 50+ | ⭐⭐⭐⭐ |
| Llama 3.1 70B | 40GB+ | 10+ | ⭐⭐⭐⭐⭐ |
| Mixtral 8x7B | 28GB | 15+ | ⭐⭐⭐⭐⭐ |
| Qwen 2.5 72B | 40GB+ | 12+ | ⭐⭐⭐⭐⭐ |
| Phi-4 | ~9GB | 80+ | ⭐⭐⭐ |
Security Best Practices
1. Network Isolation
# Allow only internal network access
server:
host: 127.0.0.1 # Or an internal network IP
# Firewall rule
# sudo ufw allow from 192.168.1.0/24 to any port 11434
2. Access Control
# API key validation
security:
api_key_required: true
allowed_ips:
- "192.168.1.0/24"
- "10.0.0.0/8"
3. Data Protection
agent:
memory:
encryption: true # Encrypt stored conversation history
retention: 7d # Automatically delete after 7 days
Cost Analysis
Self-Hosted vs Cloud API (Monthly)
Scenario: 1 million tokens per day
| Option | Hardware Cost | Electricity | Total | Equivalent API Cost |
|---|---|---|---|---|
| Self-hosted RTX 4090 | $100/month (amortized) | $30 | $130 | ~$1,500 (GPT-4) |
| Cloud server (A100) | $500/month | - | $500 | ~$1,500 |
| Laptop CPU | $0 | $10 | $10 | ~$300 (GPT-3.5) |
ROI: Self-hosting typically pays off after 3-6 months
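The payback figure follows from simple arithmetic: divide the up-front hardware cost by the monthly saving versus the API. The purchase price below (a complete RTX 4090 workstation at roughly $5,000) is an assumption for illustration; the API and electricity figures come from the table above.
# Break-even (months) = hardware price / (monthly API cost - monthly running cost)
awk 'BEGIN {
  hardware = 5000   # one-time workstation purchase (assumed)
  api      = 1500   # equivalent monthly API spend from the table
  running  = 30     # monthly electricity
  printf "Break-even after ~%.1f months\n", hardware / (api - running)
}'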
Next Steps
Resources
Get more tutorials for free:
Last updated: March 2026