Self-Host Ollama on Your Homelab: Local LLM Inference Without the Cloud Bill

The $300/Month Problem

I hit my OpenAI API billing dashboard last month and stared at $312.47. That was just the latest monthly bill, three months into prototyping a RAG pipeline — and most of those tokens were wasted on testing prompts that didn’t work.

Meanwhile, my TrueNAS box sat in the closet pulling 85 watts, running Docker containers I hadn’t touched in weeks. That’s when I started looking at Ollama — a dead-simple way to run open-source LLMs locally. No API keys, no rate limits, no surprise invoices.

Three weeks in, I’ve moved about 80% of my development-time inference off the cloud. Here’s exactly how I set it up, what hardware actually matters, and the real performance numbers nobody talks about.

Why Ollama Over vLLM, LocalAI, or text-generation-webui

I tried all four. Here’s why I stuck with Ollama:

vLLM is built for production throughput — batched inference, PagedAttention, the works. It’s also a pain to configure if you just want to ask a model a question. Setup took me 45 minutes and required building from source to get GPU support working on my machine.

LocalAI supports more model formats (GGUF, GPTQ, AWQ) and has an OpenAI-compatible API out of the box. But the documentation is scattered, and I hit three different bugs in the Whisper integration before giving up.

text-generation-webui (oobabooga) is great if you want a chat UI. But I needed an API endpoint I could hit from scripts and other services, and the API felt bolted on.

Ollama won because: one binary, one command to pull a model, instant OpenAI-compatible API on port 11434. I had Llama 3.1 8B answering prompts in under 2 minutes from a cold start. That matters when you’re trying to build things, not babysit infrastructure.

Hardware: What Actually Moves the Needle

I’m running Ollama on a Mac Mini M2 with 16GB unified memory. Here’s what I learned about hardware that actually affects performance:

Memory is everything. LLMs need to fit entirely in RAM (or VRAM) to run at usable speeds. A 7B parameter model in Q4_K_M quantization needs about 4.5GB. A 13B model needs ~8GB. A 70B model needs ~40GB. If the model doesn’t fit, it pages to disk and you’re looking at 0.5 tokens/second — basically unusable.
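The sizing rule above is just arithmetic, and you can sketch it yourself. This is a rough estimator, not anything Ollama publishes — the ~4.5 effective bits per weight for Q4_K_M and the 10% runtime overhead are my own assumptions:

```python
def estimate_ram_gb(params_billions: float,
                    bits_per_weight: float = 4.5,  # ~Q4_K_M effective rate (assumption)
                    overhead: float = 1.10) -> float:
    """Rough RAM needed to hold a quantized model, in GB (10^9 bytes)."""
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

for size in (7, 13, 70):
    print(f"{size}B -> ~{estimate_ram_gb(size):.1f} GB")
```

The estimates land near the 4.5GB / 8GB / ~40GB figures above; real usage varies with context length and runtime buffers.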

GPU matters less than you think for models under 13B. Apple Silicon’s unified memory architecture means the M1/M2/M3 chips run these models surprisingly well — I get 35-42 tokens/second on Llama 3.1 8B with my M2. A dedicated NVIDIA GPU is faster (an RTX 3090 with 24GB VRAM will push 70+ tok/s on the same model), but the Mac Mini uses 15 watts doing it versus 350+ watts for the 3090.

CPU-only is viable for small models. On a 4-core Intel box with 32GB RAM, I was getting 8-12 tokens/second on 7B models. Not great for chat, but perfectly fine for batch processing, embeddings, or code review pipelines where latency doesn’t matter.

If you’re building a homelab inference box from scratch, here’s what I’d buy today:

  • Budget ($400-600): A used Mac Mini M2 with 16GB RAM runs 7B-13B models at very usable speeds. Power draw is laughable — 15-25 watts under inference load.
  • Mid-range ($800-1200): A Mac Mini M4 with 32GB lets you run 30B models and keeps two smaller models hot in memory. The M4 with 32GB unified memory is the sweet spot for most homelab setups.
  • GPU path ($500-900): If you already have a Linux box, grab a used RTX 3090 24GB — they’ve dropped to $600-800 and the 24GB VRAM handles 13B models at 70+ tok/s. Just make sure your PSU can handle the 350W draw.

The Setup: 5 Minutes, Not Kidding

On macOS or Linux:

curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull llama3.1:8b

That’s it. The model downloads (~4.7GB for the Q4_K_M quantized 8B), and you’ve got an API running on localhost:11434.

Test it:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain TCP three-way handshake in two sentences.",
  "stream": false
}'
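That example sets "stream": false. With streaming on (the default), the server instead sends one JSON object per line — each carrying a text fragment in its response field, with a final object marked "done": true that carries the timing stats. A sketch of reassembling that stream in Python; the sample lines are made up, but the field names match what /api/generate returns:

```python
import json

def assemble_stream(ndjson_lines):
    """Concatenate `response` fragments from Ollama's streaming NDJSON output."""
    text, final = [], None
    for line in ndjson_lines:
        chunk = json.loads(line)
        text.append(chunk.get("response", ""))
        if chunk.get("done"):
            final = chunk  # last object carries eval_count, eval_duration, etc.
    return "".join(text), final

# Hypothetical sample of what a streamed reply looks like:
sample = [
    '{"model":"llama3.1:8b","response":"SYN, ","done":false}',
    '{"model":"llama3.1:8b","response":"SYN-ACK, ACK.","done":false}',
    '{"model":"llama3.1:8b","response":"","done":true,"eval_count":12,"eval_duration":314000000}',
]
text, stats = assemble_stream(sample)
print(text)  # SYN, SYN-ACK, ACK.
```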

For Docker (which is what I use on TrueNAS):

docker run -d \
  --name ollama \
  -v ollama_data:/root/.ollama \
  -p 11434:11434 \
  --restart unless-stopped \
  ollama/ollama:latest

Then pull your model into the running container:

docker exec ollama ollama pull llama3.1:8b
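To confirm the pull worked from a script rather than eyeballing docker logs, GET /api/tags lists the installed models. A small Python check — the HTTP helper assumes the container is reachable on localhost:11434; the parsing runs against a canned sample of the response shape:

```python
import json
import urllib.request

def installed_models(payload: dict) -> list[str]:
    """Extract model names from Ollama's /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]

def fetch_tags(host: str = "http://localhost:11434") -> dict:
    """Query the running server (not exercised here)."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return json.load(resp)

# Canned sample matching the /api/tags shape:
sample = {"models": [{"name": "llama3.1:8b", "size": 4700000000}]}
print(installed_models(sample))  # ['llama3.1:8b']
```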

Real Benchmarks: What I Actually Measured

I ran each model 10 times with the same prompt (“Write a Python function to merge two sorted lists with O(n) complexity, with docstring and type hints”) and averaged the results. Mac Mini M2, 16GB, nothing else running:

Model                | Size (Q4_K_M) | Tokens/sec | Time to first token | RAM used
Llama 3.1 8B         | 4.7GB         | 38.2       | 0.4s                | 5.1GB
Mistral 7B v0.3      | 4.1GB         | 41.7       | 0.3s                | 4.6GB
CodeLlama 13B        | 7.4GB         | 22.1       | 0.8s                | 8.2GB
Llama 3.1 70B (Q2_K) | 26GB          | 3.8        | 4.2s                | 28GB*

*The 70B model technically ran on 16GB with aggressive quantization but spent half its time swapping. I wouldn’t recommend it without 32GB+ RAM.
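The tokens/sec figures come straight out of the response metadata: with "stream": false, /api/generate returns eval_count (tokens generated) and eval_duration (in nanoseconds), so the speed is one division. A minimal helper — the sample numbers here are illustrative, not from my runs:

```python
def tokens_per_sec(resp: dict) -> float:
    """Generation speed from Ollama's eval_count / eval_duration response fields."""
    return resp["eval_count"] / resp["eval_duration"] * 1e9  # duration is in ns

sample = {"eval_count": 191, "eval_duration": 5_000_000_000}  # illustrative values
print(round(tokens_per_sec(sample), 1))  # 38.2
```

Average this over a handful of identical runs (I used 10) to smooth out cache-warming effects.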

For context: GPT-4o through the API typically returns 50-80 tokens/second, but you’re paying per token and dealing with rate limits. 38 tokens/second from a local 8B model is fast enough that you barely notice the difference when coding.

Making It Useful: The OpenAI-Compatible API

This is the part that made Ollama actually practical for me. It exposes an OpenAI-compatible endpoint at /v1/chat/completions, which means you can point any tool that uses the OpenAI SDK at your local instance by just changing the base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.0.43:11434/v1",
    api_key="not-needed"  # Ollama doesn't require auth
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Review this PR diff..."}]
)
print(response.choices[0].message.content)

I use this for:

  • Automated code review — a git hook sends diffs to the local model before I push
  • Log analysis — pipe structured logs through a prompt that flags anomalies
  • Documentation generation — point it at a module and get decent first-draft docstrings
  • Embedding generation — ollama pull nomic-embed-text gives you a solid embedding model for RAG without paying per-token
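For the embedding case, Ollama's native endpoint is POST /api/embed. A sketch of the RAG-side plumbing — the request helper assumes nomic-embed-text is pulled and the server is local; the cosine-similarity function is plain stdlib:

```python
import json
import math
import urllib.request

def embed(text: str, host: str = "http://localhost:11434") -> list[float]:
    """Get an embedding vector from a local nomic-embed-text via /api/embed."""
    req = urllib.request.Request(
        f"{host}/api/embed",
        data=json.dumps({"model": "nomic-embed-text", "input": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"][0]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Ranking stored chunks by cosine(embed(query), embed(chunk)) is enough for a first-pass local retriever — no vector database required at homelab scale.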

None of these need GPT-4 quality. A well-prompted 8B model handles them at 90%+ accuracy, and the cost is literally zero per request.

Gotchas I Hit (So You Don’t Have To)

Memory pressure kills everything. When Ollama loads a model, it stays in memory until another model evicts it or you restart the service. If you’re running other containers on the same box, set OLLAMA_MAX_LOADED_MODELS=1 to prevent two 8GB models from eating all your RAM and triggering the OOM killer.

Network binding matters. By default Ollama only listens on 127.0.0.1:11434. If you want other machines on your LAN to use it (which is the whole point of a homelab setup), set OLLAMA_HOST=0.0.0.0. But don’t expose this to the internet — there’s no auth layer. Put it behind a reverse proxy with basic auth or Tailscale if you need remote access.

Quantization matters more than model size. A 13B model at Q4_K_M often beats a 7B at Q8. The sweet spot for most use cases is Q4_K_M — it’s roughly 4 bits per weight, which keeps quality surprisingly close to full precision while cutting memory by 4x.

Context length eats memory fast. The default context window is 2048 tokens. Bumping it to 8192 — /set parameter num_ctx 8192 inside an interactive ollama run session, or the num_ctx option on an API request — roughly doubles memory usage. Plan accordingly.

When to Stay on the Cloud

I still use GPT-4o and Claude for anything requiring deep reasoning, long context, or multi-step planning. Local 8B models are not good at complex architectural analysis or debugging subtle race conditions. They’re excellent at well-scoped, repetitive tasks with clear instructions.

The split I’ve landed on: cloud APIs for thinking, local models for doing. My API bill dropped from $312/month to about $45.

What I’d Do Next

If your homelab already runs Docker, adding Ollama takes 5 minutes and costs nothing. Start with llama3.1:8b for general tasks and nomic-embed-text for embeddings. If you find yourself using it daily (you will), consider dedicated hardware — a Mac Mini or a used GPU that stays on 24/7.

The models are improving fast. Llama 3.1 8B today is better than Llama 2 70B was a year ago. By the time you read this, there’s probably something even better on Ollama’s model library. Pull it and try it — that’s the beauty of running your own inference server.

Full disclosure: Hardware links above are affiliate links.

