
5 Best LLM Models You Can Run in Docker on Low-Power Hardware

Learn about the top LLM models you can run in Docker, even on low-power systems that don't have a dedicated GPU.

I have been really interested in running LLMs at home with the help of the great open source tools that are freely available. Large Language Models (LLMs) are no longer limited to large datacenters full of GPUs. It is now possible to run LLMs locally, even on low-power systems without a dedicated GPU. Granted, the experience will be better with a dedicated GPU. However, there are lightweight LLMs that can run on low-power hardware, including mini PCs. Let’s look at LLM models you can run in Docker on very modest hardware using tools like Ollama, LM Studio, and OpenWebUI.

Why lightweight LLMs are a good choice for the home lab

For most of us, running a full-scale LLM like GPT-4 or LLaMA 65B is simply not possible due to the amount of video RAM those models require. However, smaller models are trained or distilled specifically to be efficient in a home lab or local environment.

Looking at the best LLM models you can run in Docker

These models include some of the following characteristics:

  • They need less than 8GB of RAM
  • Can run on CPUs (no GPU required)
  • Use quantized formats like GGUF for low memory usage
  • Support Docker-based deployment for portability

This means you can self-host your own chatbot, summarizer, or even a private AI assistant. All of this can be done on a fully air-gapped mini PC without needing the cloud.

Tools for Running Local LLMs in Docker

Before diving into specific models, it’s helpful to know the ecosystem used to host them. The most common tooling includes:

Ollama

Ollama is a lightweight runtime that lets you run quantized LLMs locally using simple commands. It has a built-in model registry and works well with Docker and OpenWebUI. For an in-depth look at how to spin up Ollama and OpenWebUI, check out my post here: Self-Hosting LLMs with Docker and Proxmox: How to Run Your Own GPT.

Docker Example:

docker run -d --name ollama -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

OpenWebUI

OpenWebUI is a free and open source front-end for Ollama that looks very much like the OpenAI ChatGPT interface. It provides a clean interface, multi-model support, and chat history.

Docker Example:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

LM Studio

A GUI for downloading and running GGUF models (no Docker needed, but a solid alternative for desktop users). You can read my write up on LM Studio here: Local LLM Model in Private AI server in WSL.

5 lightweight LLMs you can run on low-power hardware

Here are 5 lightweight LLMs you can run on very low-power hardware. Let’s take a look at these one-by-one.

1. Gemma3:4b (new)

This is one of the latest models out there: Gemma3:4b. Gemma is the lightweight model family from Google built on Gemini technology. The Gemma 3 models are multimodal and can do a lot, including processing both text and images. It features a 128K context window with support for over 140 languages. It comes in 1B, 4B, 12B, and 27B parameter sizes, and for many the 4B will be the sweet spot. You can use it for question answering and reasoning, and its compact design suits low-power devices.
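
To try it locally, you can pull and run the 4B variant with Ollama (this assumes the gemma3:4b tag is available in your version of the Ollama model library):

ollama run gemma3:4b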

Why It’s Great:

  • It’s new
  • Really excellent features for its size
  • Good response times
  • It can run on CPU

2. Phi-3 by Microsoft

  • Size: 3.8B parameters (Phi-3 Mini)
  • RAM Usage: ~3-4GB (4-bit quantized)
  • Use Case: Reasoning, Q&A, coding help
  • Runs With: Ollama (ollama run phi3)

Phi-3 is a tiny Microsoft LLM that performs really well on reasoning tasks. It can rival larger models in some benchmarks. It’s designed for education and lightweight inference. This makes it ideal for edge use cases.

Pulling the Microsoft Phi model (Phi-2) from the Ollama registry

Why It’s Great:

  • Excellent reasoning for its size
  • Good response times
  • It can run on CPU

3. TinyLlama 1.1B

  • Size: 1.1B parameters
  • RAM Usage: ~2-3GB
  • Use Case: General language tasks, basic conversation
  • Runs With: Ollama (ollama run tinyllama)

TinyLlama is a great model trained on over 1 trillion tokens, which is pretty impressive considering its size. It’s perfect for use in chat interfaces or small apps.

Why It’s Great:

  • Tiny footprint
  • Surprisingly coherent replies
  • Ideal for ultra-low-end devices

4. Mistral 7B (Quantized)

  • Size: 7B parameters
  • RAM Usage: ~5-6GB (Q4_0)
  • Use Case: Chatbots, code generation, general tasks
  • Runs With: Ollama (ollama run mistral)

Mistral 7B has become a favorite open model because it strikes a good balance between speed and capability. It performs better than LLaMA 2 13B on many tasks, and its quantized versions are CPU-friendly.

Why It’s Great:

  • Strong performance
  • Great for full conversations
  • It supports instruct-tuned chats

5. LLaVA 7B: Vision + Language (Optional)

  • Size: 7B parameters
  • RAM Usage: ~6GB (Q4)
  • Use Case: Multimodal (text + image)
  • Runs With: Docker and custom containers (or Ollama for text-only use)

LLaVA (Large Language and Vision Assistant) is based on Vicuna and adds image understanding. It is not the friendliest option for the smallest LLM setups. However, it’s a fascinating lightweight model to run if you want to experiment with vision and language tasks.
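
If you want to try the vision side, Ollama's llava model can take an image path along with the prompt from the CLI. A minimal sketch, assuming you have a local image at ./photo.jpg:

ollama run llava "What is in this image? ./photo.jpg"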

Why It’s Great:

  • Brings vision into the mix
  • Good Vicuna-style responses
  • Works surprisingly well even without GPU (just slower)

Tips for running LLMs on low-powered systems

1. Use 4-bit or 5-bit Quantized Models

These formats reduce memory requirements significantly. Stick with GGUF format and use the smallest quant level that gives usable results.
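
For example, with Ollama you can request a specific quantization by tag instead of the default. A quick sketch, assuming the mistral:7b-instruct-q4_0 tag exists in the Ollama library:

ollama run mistral:7b-instruct-q4_0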

2. Allocate Swap Space

If you’re on a Linux host with 8GB RAM or less, ensure you have swap configured or Docker may crash when loading models.
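
A minimal example of adding an 8GB swap file on a typical Linux host (adjust the size and path for your system):

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab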

3. Use CPU-only Inference Flags

Ollama and other runtimes fall back to CPU automatically when no GPU is detected. You can also control how many CPU threads are used per request with the num_thread option, for example through the Ollama API:

curl http://localhost:11434/api/generate -d '{"model": "mistral", "prompt": "Hello", "options": {"num_thread": 4}}'

4. Keep Logs Trimmed

Running in Docker? Add --log-opt max-size=10m to avoid filling up disk with logs over time.
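
For example, the Ollama container from earlier could be started with log rotation enabled:

docker run -d --name ollama --log-opt max-size=10m --log-opt max-file=3 -v ollama:/root/.ollama -p 11434:11434 ollama/ollama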

5. Use Proxmox Containers

Instead of full VMs, LXC containers in Proxmox let you host Ollama or OpenWebUI with far lower resource usage.
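
As a rough sketch, creating an unprivileged Debian container for this from the Proxmox host shell might look like the following (the VMID, template filename, and storage names are examples and will differ on your system; nesting=1 is only needed if you run Docker inside the container):

pct create 200 local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst \
  --hostname ollama --memory 8192 --cores 4 \
  --rootfs local-lvm:32 --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --unprivileged 1 --features nesting=1
pct start 200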

Docker compose for running both Ollama and OpenWebUI

We mentioned OpenWebUI earlier. However, here are a few more details on this really great web frontend for Ollama. It provides the following:

  • Multi-model switching
  • Prompt templates
  • System prompts
  • API-like usage from other apps

How do you set this up easily as a “stack” with Ollama? Note the following Docker compose code:

version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama

  webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    volumes:
      - webui:/app/backend/data

volumes:
  ollama:
  webui:
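
With that file saved as docker-compose.yml, you can bring the stack up and pull your first model like this:

docker compose up -d
docker compose exec ollama ollama pull phi3

Then browse to http://localhost:3000 to start chatting through OpenWebUI.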

Wrapping up

While the common notion is that running LLMs requires GPUs, the models we have covered show just how effectively you can run AI at home even without dedicated hardware. Although these will still not be as fast as running them on a GPU, it shows just how far we have come with LLMs that can be run in the home lab.

Thanks to Docker, Ollama, OpenWebUI, and the rise of quantized models, it’s now super easy to experiment with AI right in your home lab. Whether you are running these on a Proxmox server, a mini PC, or even a Raspberry Pi 5 (with swap), these 5 lightweight models will give you powerful AI capabilities with minimal setup. Let me know in the comments what your favorite LLMs are and what type of setup you are running.

Brandon Lee

Brandon Lee is the Senior Writer, Engineer, and owner at Virtualizationhowto.com, and a 7-time VMware vExpert with over two decades of experience in Information Technology. Having worked for numerous Fortune 500 companies as well as in various industries, he has extensive experience in various IT segments and is a strong advocate for open source technologies. Brandon holds many industry certifications, loves the outdoors, and enjoys spending time with family. He also goes through the effort of testing and troubleshooting issues, so you don't have to.


4 Comments

  1. love the article, i’ll have to investigate some of these models and run them on my minipc!

  2. I’m on a Ryzen 9 5950X with 64GB of RAM (no usable GPU) and I can run models much bigger than these. My current list that I run with the MSTY app are:

    deepcoder:14b 9.0 GB
    qwen3:14b-q4_K_M 9.3 GB
    mistralai_Mistral-Small-3.1-24B-Instruct-2503-Q4_K_M-1745249036493:latest 14 GB
    gemma3:12b-it-qat 8.9 GB
    snowflake-arctic-embed2:latest 1.2 GB
    Light-R1-14B-Q4-KM-1744563560764:latest 9.0 GB
    deepseek-r1:14b 9.0 GB
    mxbai-embed-large:latest 669 MB
    nomic-embed-text:latest 274 MB

    Some of the bigger ones, especially the mistral 24B, do run rather slowly. But in general, most of them run at an acceptable speed. The “thinking” models, of course, are much slower to give an actual answer.

    The models listed in the article are older models. While they may still be very usable, they’re not the “latest and greatest.” The pace of new model releases has not slowed, and new models come out virtually every week. While “model chasing” is not a good idea because of that, it’s useful to keep up with at least models from the main suppliers (and that includes the Chinese companies.)
