5 Best LLM Models You Can Run in Docker on Low-Power Hardware

I have been really interested in running LLMs at home with the help of the great open source tools freely available out there. Large Language Models (LLMs) are no longer limited to large datacenters full of GPUs. It is now possible to run LLMs locally on your own hardware, even on low-power systems without a dedicated GPU. Granted, the experience will be better with a dedicated GPU. However, there are lightweight LLMs that can run on low-power hardware, including mini PCs. Let’s look at LLM models you can run in Docker on very modest hardware using tools like Ollama, LM Studio, and OpenWebUI.
Why lightweight LLMs are a good choice for the home lab
For most of us, running a full-scale LLM like GPT-4 or LLaMA 65B is just not possible due to the amount of video RAM those models require. However, smaller models are trained or distilled specifically to be efficient in a home lab or local environment.
These models include some of the following characteristics:
- They need less than 8GB of RAM
- Can run on CPUs (no GPU required)
- Use quantized formats like GGUF for low memory usage
- Support Docker-based deployment for portability
This means you can self-host your own chatbot, summarizer, or even a private AI assistant. All of this can be done in a fully air-gapped mini PC and without needing the cloud.
Tools for Running Local LLMs in Docker
Before diving into specific models, it’s helpful to know the ecosystem used to host them. The most common tooling includes:
Ollama
Ollama is a lightweight runtime that lets you run quantized LLMs locally using simple commands. It has a built-in model registry and works well with Docker and OpenWebUI. For an in-depth look at how to spin up Ollama and OpenWebUI, check out my post here: Self-Hosting LLMs with Docker and Proxmox: How to Run Your Own GPT.
Docker Example:
docker run -d --name ollama -p 11434:11434 ollama/ollama
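Once the container is up, you can pull and chat with a model from inside it. A minimal sketch, assuming the container is named ollama as above and you want the phi3 model from the Ollama registry:
docker exec -it ollama ollama pull phi3
docker exec -it ollama ollama run phi3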
OpenWebUI
OpenWebUI is a free and open source front-end for Ollama that looks very much like the OpenAI ChatGPT interface. It provides a clean interface, multi-model support, and chat history.
Docker Example:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -v open-webui:/app/backend/data ghcr.io/open-webui/open-webui:main
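Before wiring OpenWebUI to Ollama, it is worth confirming the Ollama API is reachable from the host. Assuming the Ollama container above is listening on the default port 11434, this should return a JSON list of the models you have pulled:
curl http://localhost:11434/api/tags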
LM Studio
A GUI for downloading and running GGUF models (no Docker needed, but a solid alternative for desktop users). You can read my write up on LM Studio here: Local LLM Model in Private AI server in WSL.
5 lightweight LLMs you can run on low-power hardware
Here are 5 lightweight LLMs you can run on very low-power hardware. Let’s take a look at these one-by-one.
1. Gemma3:4b (new)
One of the latest models out there is Gemma3:4b. Gemma is the lightweight model family from Google built on Gemini technology. The Gemma 3 models are multimodal and can do a lot, including processing both text and images. It features a 128K context window with support for over 140 languages. Gemma 3 comes in 1B, 4B, 12B, and 27B parameter sizes, and for many, the 4B will be the sweet spot. You can use it for question answering and reasoning, and its compact design suits low-power devices. A quick example of pulling and testing it in the Ollama container follows the list below.
Why It’s Great:
- It’s new
- Has really excellent features for size
- Good response times
- It can run on CPU
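As a quick example, assuming the Ollama container from earlier is named ollama, you can pull and test the 4B model like this (the one-shot prompt is just illustrative):
docker exec -it ollama ollama pull gemma3:4b
docker exec -it ollama ollama run gemma3:4b "Explain in two sentences why quantized models suit low-power hardware"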
2. Phi-3 by Microsoft
- Size: 3.8B parameters
- RAM Usage: ~3-4GB (4-bit quantized)
- Use Case: Reasoning, Q&A, coding help
- Runs With: Ollama (ollama run phi3)
Phi-3 is a small Microsoft LLM that performs really well on reasoning tasks and can rival larger models in some benchmarks. It’s designed for education and lightweight inference, which makes it ideal for edge use cases. An example of querying it through the Ollama API follows the list below.
Why It’s Great:
- Excellent reasoning for its size
- Good response times
- It can run on CPU
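OpenWebUI is not the only way to talk to the model. Here is a minimal sketch of hitting the Ollama REST API directly with Phi-3, assuming the model has already been pulled and Ollama is on the default port:
curl http://localhost:11434/api/generate -d '{
  "model": "phi3",
  "prompt": "A train leaves at 3pm travelling 60 mph. How far has it gone by 5pm?",
  "stream": false
}'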
3. TinyLlama 1.1B
- Size: 1.1B parameters
- RAM Usage: ~2-3GB
- Use Case: General language tasks, basic conversation
- Runs With: Ollama (ollama run tinyllama)
TinyLlama is a great model that was trained on over 1 trillion tokens, which is pretty impressive considering its size. It’s perfect for chat interfaces or small apps, and you can call it from other applications through the Ollama API, as shown after the list below.
Why It’s Great:
- Tiny footprint
- Surprisingly coherent replies
- Ideal for ultra-low-end devices
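If you want to use TinyLlama from a small app rather than a chat UI, the Ollama chat endpoint works well. A minimal sketch, assuming the model has been pulled and Ollama is on the default port (the prompt is just an example):
curl http://localhost:11434/api/chat -d '{
  "model": "tinyllama",
  "messages": [{ "role": "user", "content": "Write a one-line greeting for a home lab dashboard" }],
  "stream": false
}'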
4. Mistral 7B (Quantized)
- Size: 7B parameters
- RAM Usage: ~5-6GB (Q4_0)
- Use Case: Chatbots, code generation, general tasks
- Runs With: Ollama (ollama run mistral)
Mistral 7B has become a favorite open model because it strikes a good balance of speed and capability. It performs better than LLaMA 2 13B on many tasks, and its quantized versions are CPU-friendly. You can check which quantization you actually pulled with the example after the list below.
Why It’s Great:
- Strong performance
- Great for full conversations
- It supports instruct-tuned chats
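To confirm which quantization you pulled (and get a feel for how much RAM it will need), you can inspect the model details. A minimal sketch, assuming the Docker-based Ollama container from earlier:
docker exec -it ollama ollama show mistral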
5. LLaVA 7B: Vision + Language (Optional)
- Size: 7B parameters
- RAM Usage: ~6GB (Q4)
- Use Case: Multimodal (text + image)
- Runs With: Docker and custom containers (or Ollama for text-only use)
LLaVA (Large Language and Vision Assistant) is based on Vicuna and adds image understanding. It is not the friendliest choice for the smallest LLM setups, but it’s a fascinating lightweight model to run if you want to experiment with vision and language tasks. An example of sending it an image through the Ollama API follows the list below.
Why It’s Great:
- Brings vision into the mix
- Good Vicuna-style responses
- Works surprisingly well even without GPU (just slower)
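To send LLaVA an image through the Ollama API, the image goes in as a base64 string. A minimal sketch, assuming the llava model has been pulled and photo.jpg is a placeholder for your own image file:
base64 -w0 photo.jpg > photo.b64
curl http://localhost:11434/api/generate -d "{
  \"model\": \"llava\",
  \"prompt\": \"What is in this picture?\",
  \"images\": [\"$(cat photo.b64)\"],
  \"stream\": false
}"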
Tips for running LLMs on low-powered systems
1. Use 4-bit or 5-bit Quantized Models
These formats reduce memory requirements significantly. Stick with GGUF format and use the smallest quant level that gives usable results.
2. Allocate Swap Space
If you’re on a Linux host with 8GB RAM or less, ensure you have swap configured or Docker may crash when loading models.
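On Ubuntu or Debian, a quick way to add swap looks like this (the 8G size is just an example; size it to the models you plan to load):
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile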
3. Use CPU-only Inference Flags
Ollama and other runtimes default to CPU if no GPU is detected, but you can explicitly set it:
OLLAMA_NUM_THREADS=4 ollama run mistral
4. Keep Logs Trimmed
Running in Docker? Add --log-opt max-size=10m to your docker run command to avoid filling up the disk with logs over time.
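For example, adding log options to the Ollama container from earlier (the max-file value is an extra, optional limit on how many rotated log files to keep):
docker run -d --name ollama -p 11434:11434 --log-opt max-size=10m --log-opt max-file=3 ollama/ollama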
5. Use Proxmox Containers
Instead of full VMs, LXC containers in Proxmox let you host Ollama or OpenWebUI with far lower resource usage.
Docker compose for running both Ollama and OpenWebUI
We mentioned OpenWebUI earlier. However, here are a few more details on this really great web frontend for Ollama. It provides the following:
- Multi-model switching
- Prompt templates
- System prompts
- API-like usage from other apps
How do you set this up easily as a “stack” with Ollama? Note the following Docker compose code:
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
  webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - webui:/app/backend/data
volumes:
  ollama:
  webui:
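With that file saved as docker-compose.yml, a minimal workflow is to bring the stack up, pull a model into the Ollama service, and then browse to http://localhost:3000 (the gemma3:4b tag is just an example):
docker compose up -d
docker compose exec ollama ollama pull gemma3:4b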
Wrapping up
While the notion persists that running LLMs requires GPUs, the LLM models you can run in Docker that we have covered show just how effectively you can run AI at home even without dedicated hardware. Granted, these will still not be as fast as running on a GPU, but it shows just how far we have come with LLMs that can be run in the home lab.
Thanks to Docker, Ollama, OpenWebUI, and the rise of quantized models, it’s now super easy to experiment with AI right in your home lab. Whether you are running these on a Proxmox server, a mini PC, or even a Raspberry Pi 5 (with swap), these 5 lightweight models will give you powerful AI capabilities with minimal setup. Let me know in the comments what your favorite LLMs are and what type of setup you are running.
love the article, i’ll have to investigate some of these models and run them on my minipc!
Awesome David,
I have been experimenting with models for use with kubectl-ai. Pretty cool stuff!
Brandon
I’m on a Ryzen 9 5950X with 64GB of RAM (no usable GPU) and I can run models much bigger than these. My current list that I run with the MSTY app are:
deepcoder:14b 9.0 GB
qwen3:14b-q4_K_M 9.3 GB
mistralai_Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M-1745249036493:latest 14 GB
gemma3:12b-it-qat 8.9 GB
snowflake-arctic-embed2:latest 1.2 GB
Light-R1-14B-Q4-KM-1744563560764:latest 9.0 GB
deepseek-r1:14b 9.0 GB
mxbai-embed-large:latest 669 MB
nomic-embed-text:latest 274 MB
Some of the bigger ones, especially the mistral 24B, do run rather slowly. But in general, most of them run at an acceptable speed. The “thinking” models, of course, are much slower to give an actual answer.
The models listed in the article are older models. While they may still be very usable, they’re not the “latest and greatest.” The pace of new model releases has not slowed, and new models come out virtually every week. While “model chasing” is not a good idea because of that, it’s useful to keep up with at least models from the main suppliers (and that includes the Chinese companies.)
Richard,
This is a great list! I will add these to my list of models to try. Which ones are your favorites for most tasks?
Brandon