I Built a 5-Node Proxmox and Ceph Home Lab with 17TB and Dual 10Gb LACP

Proxmox and ceph home lab

Over the last few weeks I have been deeply involved in a Proxmox home lab project designed to reshape my home lab as I know it, as I am moving my production services off VMware entirely. For this, I have made a lot of choices that are different from my historical setup of full-size racks running VMware. I have chosen a 5-node Proxmox cluster running on Minisforum MS-01s, which fit perfectly in my TecMojo 12-U mini rack. This has been quite a deep dive for me, with some serious learning around Ceph, Proxmox networking, and NVMe OSDs. So sit back and enjoy this post, which is meant to be a walkthrough of my latest rendition of the mini rack as it has evolved, even from what it was just a few weeks back.

What I have been up to with the home lab

My home lab in 2026 has intentionally become smaller, not necessarily in compute or storage power but in terms of actual footprint. I am going away from my extensive full-size rack gear and collapsing things down to a mini rack this go around.

My new configuration with the mini cluster

This is due to a lot of things, including industry trends, physical space, noise, and power draw. It has coincided with my move to get totally away from VMware vSphere running my production workloads. Proxmox has become my de facto choice for this migration for many reasons.

First of all, refreshingly compared to the VMware and Broadcom shenanigans, it is free and open-source. I no longer have to rely on paying for a VMUG subscription, which no longer gets you licensing anyway. I also wanted to give Ceph a real go in the home lab.

Honestly I really liked vSAN once I started using it in the VMware-based lab a few years ago and have been itching to get back to HCI for home lab workloads. But Broadcom killed vSAN in my opinion with licensing and I seriously saw many businesses walk away from vSAN back to traditional storage just simply due to the licensing changes. Ceph gives us a real way forward for HCI that is powerful and very enterprise ready. Ceph can scale as large as you want it to and is natively integrated with Proxmox.

Why I rebuilt my cluster using uniform nodes

I had played around with Ceph before and even use MicroCeph with my Docker Swarm cluster, but I always had the itch to do Ceph erasure coding. To do erasure coding, I went with a 5 node Ceph cluster. However, starting out with this, if you read my previous post here, you know I had dissimilar hardware and other variables. While it worked shockingly well, I was mainly limited by 2.5 GbE networking. I wanted to get back to 10 GbE, which is where HCI in general and Ceph in particular really start performing.

To do that, I decided to replace my original 5-node Ceph cluster that used dissimilar hardware with a uniform 5-node cluster made up of Minisforum MS-01s. These are still great little machines and have native 2.5 and 10 gig networking built in, with (2) SFP+ connections. I already had (2) MS-01s that I had previously been using to run my VMware ESXi cluster connected to an all-flash NVMe Terramaster NAS.

Read my review of the MS-01 here: Minisforum MS-01 Review: Best Home Server Mini PC Early 2024.

Network ports on the minisforum ms 01 for my proxmox ceph home lab

What I decided to do was buy (3) more MS-01 barebones units and use the memory and storage from my other mini PCs that would be replaced. This resulted in a uniform 5 node Ceph cluster running 10 GbE with erasure coding. Awesome!

Minisforum ms 01 exploded view of drives and internals

So the reasons come down to the following for using uniform nodes:

  • 10 gig networking for ALL the nodes in the Ceph cluster
  • Uniform performance from all nodes
  • No inconsistencies for HCI that would require troubleshooting and chasing different hardware

Ceph is extremely forgiving, but HCI in general does MUCH better on uniform hardware with the fewest variables. Same machines, same networking, same storage, etc.

Storage configuration of the MS-01s with Ceph

Currently, I am running the following disks in the (3) NVMe slots of each MS-01. The M.2 slot closest to the U.2 switch is PCIe 4.0 x4 and is the fastest slot in the MS-01. So, I have a 2 TB Samsung 980 NVMe drive in this slot in each of the 5 MS-01 nodes as the first OSD for Ceph. The next slot is PCIe 3.0 x4, which I have also used for Ceph. Two of the nodes have a 2 TB drive and three of the nodes have a 1 TB drive here. Then, in the slot furthest away from the U.2 switch, I have the boot M.2 drive. This is the slowest slot in the MS-01, so I am reserving it for the Proxmox install itself. These are 1 TB PCIe 3.0 NVMe drives in each of my MS-01s.
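As a quick sanity check on where the 17 TB figure comes from, here is a minimal Python sketch that adds up the OSD drives described above. The list order (which two nodes hold the 2 TB second drive) is an assumption for illustration only, and the boot drives are left out because they are not part of Ceph.

```python
# Raw OSD capacity per node in TB, taken from the drive layout described above.
# Which two nodes carry the 2 TB second drive is assumed for illustration.
FAST_SLOT_TB = [2, 2, 2, 2, 2]    # fastest M.2 slot: one 2 TB drive per node
SECOND_SLOT_TB = [2, 2, 1, 1, 1]  # second M.2 slot: two nodes at 2 TB, three at 1 TB


def raw_capacity_tb(*slots):
    """Sum the capacity of every OSD drive across all slots and nodes."""
    return sum(sum(slot) for slot in slots)


total_tb = raw_capacity_tb(FAST_SLOT_TB, SECOND_SLOT_TB)
per_node_tb = [a + b for a, b in zip(FAST_SLOT_TB, SECOND_SLOT_TB)]

print(total_tb)     # 17
print(per_node_tb)  # [4, 4, 3, 3, 3]
```

This is raw capacity. Usable space is lower and depends on whether a pool is replicated or erasure coded.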

Viewing the drive configuration in one of my ms 01s

Careful migration to get to this

You may be wondering how I did the migration from all dissimilar mini PCs to the 5 MS-01s. This happened very carefully and one mini PC at a time. Believe it or not, what I was able to do was simply pull the drives out of the other mini PCs and install into the new MS-01s and Proxmox just picked right up and figured itself out on the new hardware.

The only thing I had to do was reconfigure the networking on the MS-01s Proxmox installation once things booted since the NIC naming changed with the different hardware.

This was a quick edit of the network config file in Proxmox:

/etc/network/interfaces

Here I just simply changed the network interfaces for the default Linux bridge to the names of the new adapters. Below is an example of one of my PVE host’s network configuration:

auto lo
iface lo inet loopback

iface enp2s0f0np0 inet manual
        mtu 9000

iface enp2s0f1np1 inet manual
        mtu 9000

iface wlo1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves enp2s0f0np0 enp2s0f1np1
        bond-miimon 100
        bond-mode 802.3ad
        bond-xmit-hash-policy layer2+3
        mtu 9000

auto vmbr0
iface vmbr0 inet static
        address 10.3.33.210/24
        gateway 10.3.33.1
        bridge-ports bond0
        bridge-stp off
        bridge-fd 0
        bridge-vlan-aware yes
        bridge-vids 2-4094
        mtu 9000

# VLAN 2 - Dedicated VM bridge (no VM tagging needed)
auto bond0.2
iface bond0.2 inet manual
        vlan-raw-device bond0
        mtu 9000

auto vmbr2
iface vmbr2 inet manual
        bridge-ports bond0.2
        bridge-stp off
        bridge-fd 0
        mtu 9000

# VLAN 10 - Dedicated VM bridge (no VM tagging needed)
auto bond0.10
iface bond0.10 inet manual
        vlan-raw-device bond0
        mtu 9000

auto vmbr10
iface vmbr10 inet manual
        bridge-ports bond0.10
        bridge-stp off
        bridge-fd 0
        mtu 9000

# VLAN 149 - Dedicated VM bridge (no VM tagging needed)
auto bond0.149
iface bond0.149 inet manual
        vlan-raw-device bond0
        mtu 9000

auto vmbr149
iface vmbr149 inet manual
        bridge-ports bond0.149
        bridge-stp off
        bridge-fd 0
        mtu 9000

# VLAN 222 - Dedicated VM bridge (no VM tagging needed)
auto bond0.222
iface bond0.222 inet manual
        vlan-raw-device bond0
        mtu 9000

auto vmbr222
iface vmbr222 inet manual
        bridge-ports bond0.222
        bridge-stp off
        bridge-fd 0
        mtu 9000

auto vmbr0.334
iface vmbr0.334 inet static
        address 10.3.34.210/24
        mtu 9000
#Ceph

auto vmbr0.335
iface vmbr0.335 inet static
        address 10.3.35.210/24
        mtu 9000
#Cluster

auto vmbr0.336
iface vmbr0.336 inet static
        address 10.3.36.210/24
        mtu 9000
#Migration

source /etc/network/interfaces.d/*

One of my mini PCs required more work, which I documented in another blog post about my journey getting from a SATA SSD over to an M.2 drive. I used Hiren’s Boot CD and Macrium Reflect to image the SATA SSD over to the M.2 drive. This worked flawlessly with zero issues.

The Ceph lessons that took the most time to understand

Ceph was one of the areas for me where the real learning and experience of “doing” really paid off. Ceph is incredibly forgiving to be honest. It will absolutely run when you think it may not and it will try its best to get your data back if things come back up.

One of the things I learned with my setup was how to provision Ceph with erasure coding, and also how to set it up for autoscaling (autogrow) placement groups. I configured it so that the placement groups grow on their own when more storage is added.

However, I also shot myself in the foot on one of my nodes and actually wiped and rebuilt the wrong drive! DOH! In looking at the output of lsblk, I misread the NVMe drive and zapped the wrong NVMe identifier. But all was not lost. To be safe, I simply removed the OSD that I had actually wiped from the Ceph config, then brought it back in as a new OSD and let Ceph do what it does and copy and move data back. Do yourself a favor and double/triple check that you have the right drive before you wipe anything, ever.
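One way to make that double check less error-prone is to match on the drive's serial number instead of reading device names by eye. The sketch below is only an illustration: it parses JSON in the shape produced by `lsblk -J -d -o NAME,SERIAL,MODEL,SIZE`, and the serials, models, and the `find_by_serial` helper are hypothetical, not something taken from my nodes.

```python
import json


def find_by_serial(lsblk_json, serial):
    """Return the /dev path of the disk with the given serial, or None."""
    for dev in json.loads(lsblk_json)["blockdevices"]:
        if dev.get("serial") == serial:
            return "/dev/" + dev["name"]
    return None


# Hypothetical sample in the shape of `lsblk -J -d -o NAME,SERIAL,MODEL,SIZE`.
SAMPLE = """
{"blockdevices": [
  {"name": "nvme0n1", "serial": "S1EXAMPLE0001", "model": "Example 2TB", "size": "1.8T"},
  {"name": "nvme1n1", "serial": "S1EXAMPLE0002", "model": "Example 1TB", "size": "931.5G"}
]}
"""

print(find_by_serial(SAMPLE, "S1EXAMPLE0002"))  # /dev/nvme1n1
```

The serial printed on the drive label does not change when NVMe device names get reordered, which is the reason to key on it here.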

My erasure coding setup

So here are the details of my erasure coding Ceph configuration.

Viewing the details of the erasure coding profile

Erasure Coding Scheme: 3+2 and what this means.

  • k=3 – Data is split into 3 data chunks
  • m=2 – There are 2 parity/coding chunks
  • Total: 5 chunks stored across your cluster

Storage Efficiency:

  • My cluster can tolerate 2 simultaneous failures (any 2 hosts/OSDs)
  • Storage efficiency: 60% (uses 5 chunks of raw space to store 3 chunks worth of data)
  • This is much better than 3x replication (33% efficiency)

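What this configuration means in real-world terms:

  • With 5 total chunks, any 3 chunks are enough to reconstruct the data
  • Can lose any 2 hosts without losing data (whether client I/O keeps flowing with only 3 chunks left depends on the pool's min_size, which Ceph sets to k+1 = 4 by default for erasure coded pools)
  • Uses roughly 44% less raw storage than triple replication (about 1.67x the data size instead of 3x)
  • Great balance of efficiency and reliability for my setup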
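The arithmetic behind an erasure coding profile is easy to check for yourself. This is a minimal Python sketch, not anything Ceph-specific: efficiency is k / (k + m), and the raw space needed for a given amount of data is the inverse of that. The function names are made up for illustration.

```python
from fractions import Fraction


def ec_efficiency(k, m):
    """Usable fraction of raw space for a k+m erasure coded pool."""
    return Fraction(k, k + m)


def replication_efficiency(size):
    """Usable fraction of raw space for a replicated pool with `size` copies."""
    return Fraction(1, size)


def raw_savings(eff_a, eff_b):
    """Fraction of raw space scheme A saves versus scheme B for the same data."""
    return 1 - (1 / eff_a) / (1 / eff_b)


ec = ec_efficiency(3, 2)
rep3 = replication_efficiency(3)

print(float(ec))                     # 0.6
print(float(raw_savings(ec, rep3)))  # about 0.444
```

So a 3+2 profile stores data at 60% efficiency and needs 4/9 (about 44%) less raw space than 3x replication for the same data.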

Networking took the most time in the cluster to get right

The main reason that I went with the MS-01s in the Ceph cluster was for 10 gig networking. Ceph really opens up in terms of performance when you have 10 gig in place. So, I had a plan of action in my cluster to get up to 10 gig and an order of operations.

  1. Replace the dissimilar mini PCs with MS-01s to get the 2 SFP+ ports in place
  2. Buy a 10 gig switch to serve as the mini rack's top-of-rack switch. You can read my review on that here:
  3. Migrate from the 2.5 gig ports on the MS-01s currently up to the 10 gig SFP+ ports
  4. Get a single 10 gig connection up and running with the MS-01s
  5. Get both SFP+ ports up and running with connectivity on the cluster with LACP in place

So, I wanted to take a measured approach to introducing 10 gig connectivity in the cluster. Bonding the ports with LACP seemed to be the obvious choice to get both “lanes” of 10 gig connectivity working together in a meaningful way. This was the plan, but I ran into some “pain” along the way, mostly due to some mistakes I made in implementing the config.

Jumbo frame issues

The first issue I had was self-inflicted jumbo frame issues. I missed the MTU setting in the Proxmox networking config on a couple of my MS-01s and spent a few hours troubleshooting some really weird behavior because of this.
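A missed MTU line like this is the kind of thing a small script can catch. Here is a rough Python sketch that scans the text of an /etc/network/interfaces file and lists every iface stanza that does not set the expected MTU. The helper name and the sample config are hypothetical, and the parser only handles the simple stanza layout shown earlier in this post.

```python
def stanzas_missing_mtu(config_text, expected_mtu=9000, ignore=("lo",)):
    """Return iface names whose stanza does not contain `mtu <expected_mtu>`."""
    missing = []
    current, has_mtu = None, False
    # A sentinel stanza at the end flushes the last real stanza.
    for line in config_text.splitlines() + ["iface __end__"]:
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == "iface":
            if current is not None and current not in ignore and not has_mtu:
                missing.append(current)
            current, has_mtu = tokens[1], False
        elif tokens[:2] == ["mtu", str(expected_mtu)]:
            has_mtu = True
    return missing


# Hypothetical, shortened config with one forgotten MTU line.
SAMPLE = """\
auto lo
iface lo inet loopback

iface enp2s0f0np0 inet manual
        mtu 9000

iface enp2s0f1np1 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves enp2s0f0np0 enp2s0f1np1
        mtu 9000
"""

print(stanzas_missing_mtu(SAMPLE))  # ['enp2s0f1np1']
```

Interfaces that are meant to stay at the default MTU, such as a wireless NIC, can be added to `ignore`.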

The dangerous thing with jumbo frames is they don’t always give you an obvious issue that you can easily pinpoint. Instead, you may see some services work just fine with a mismatched jumbo frame configuration but then see other apps and services have all kinds of issues. This is exactly what I saw.

Because:

  • Small control packets work
  • ARP works
  • ICMP 1500 works
  • SSH works
  • Web UI works

But:

  • Storage I/O stalls
  • Ceph flaps
  • VM migration fails
  • SMB metadata errors
  • Only large bursts fail

It is easy to forget, but you actually need to make sure you have jumbo frames enabled in the entire chain of communication:

VM → virtio → bridge → VLAN → NIC → Switch → NIC → bridge → VM. If even one hop is at 1500, large frames get dropped somewhere in that chain.
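A common way to test the whole chain is a ping with the don't-fragment flag set and a payload sized to fill the MTU exactly. The sketch below just does the arithmetic and builds the command for Linux iputils ping. The target address is a hypothetical peer on the Ceph VLAN, and the helper names are made up for illustration.

```python
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def max_ping_payload(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def df_ping_command(target, mtu, count=3):
    """Build a Linux iputils ping command that forbids fragmentation (-M do)."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(max_ping_payload(mtu)), target]


print(max_ping_payload(9000))  # 8972
print(max_ping_payload(1500))  # 1472
# 10.3.34.211 is a hypothetical neighbor node, not an address from this post.
print(" ".join(df_ping_command("10.3.34.211", 9000)))
```

If the 8972-byte ping fails while the 1472-byte one succeeds, some hop between the two hosts is still at 1500.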

Linux bridge, VLAN tagging, and Firewall flag issues when used in combination

I learned something very interesting about Proxmox and Linux bridges in general that was discovered after I started trying to enable LACP across my cluster hosts. After enabling the first host with LACP, everything was working fine. But after enabling LACP on the second host, I started to see very weird issues in my cluster. I couldn’t ping certain hosts on the network, or certain hosts even on other segments started going down. Very strange.

When I first looked at this, it looked like “LACP was being flaky,” or I was having issues with my new switch. But, I discovered the issue was not LACP itself, but how Proxmox handles VLAN tagging when the VM firewall is enabled.

When you configure a VM NIC with:

  • bridge=vmbr0
  • tag=<VLAN_ID>
  • and firewall=1

Proxmox dynamically creates additional bridge and VLAN subinterfaces behind the scenes. These are not defined in /etc/network/interfaces. They are generated at runtime to allow per-VM firewall filtering on tagged traffic.

For example, if a VM is configured with tag=149 and firewall enabled, Proxmox can create interfaces such as:

  • vmbr0v149
  • bond0.149
  • nicX.149
  • and related fwbr, fwpr, and tap devices

These interfaces can form separate VLAN processing paths specifically for firewall filtering of your VMs. Under a single physical NIC, this often works without obvious symptoms. However, once I configured an 802.3ad LACP bond, that is when I started to see problems. The dynamically created VLAN subinterfaces could exist on the bond and possibly on the bond slaves. This will cause multiple VLAN plumbing paths.

I don’t think this created a classic spanning tree loop, but it did create parallel forwarding paths that were not part of my intended traffic flow. So, it led to inconsistent pings, etc.
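To see which interfaces exist at runtime but were never declared in the config, a rough comparison like the following can help. It is only a sketch: the runtime names are passed in as a plain list (they could come from something like `ip -j link show`, which is an assumption about your tooling), and the sample names below are hypothetical.

```python
def declared_interfaces(config_text):
    """Collect interface names from `iface <name> ...` lines."""
    names = set()
    for line in config_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == "iface":
            names.add(tokens[1])
    return names


def undeclared(runtime_names, config_text):
    """Return runtime interfaces that the config file never declares."""
    declared = declared_interfaces(config_text)
    return sorted(name for name in runtime_names if name not in declared)


# Hypothetical, shortened config and runtime interface list.
CONFIG = """\
iface lo inet loopback
iface bond0 inet manual
iface vmbr0 inet static
"""
RUNTIME = ["lo", "bond0", "vmbr0", "vmbr0v149", "fwbr100i0", "tap100i0"]

print(undeclared(RUNTIME, CONFIG))  # ['fwbr100i0', 'tap100i0', 'vmbr0v149']
```

Tap and firewall bridge devices are expected for running VMs, so the interesting entries are the extra VLAN bridges and subinterfaces you did not plan for.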

Ceph and CephFS

In this cluster, I am running both Ceph and CephFS. CephFS is a file system that sits on top of Ceph. So you get the resiliency of Ceph storage underpinning normal file storage that you can use for all kinds of things.

One of the cool things that I do with CephFS is run shared storage for my Kubernetes cluster. With CephFS, you get easy to interact with storage that is also shared across your nodes. So you can work with the files with any tool that you want to use to connect to your CephFS Proxmox host and copy or migrate data you want to live in CephFS.

Running Talos Kubernetes on Top of CephFS

Once I got my networking lined out with the bridge, VM VLAN tagging, and firewall flag configuration, layering my Talos Kubernetes setup on top of that was pretty simple. In this configuration, I run Talos Linux as virtual machines stored directly on my Ceph RBD storage, and I have configured a storage provider to use CephFS as the shared persistent storage for my Kubernetes cluster.

Some people have asked me, and I have seen some criticism on my posts, about why I would choose to run Talos in Proxmox instead of on bare metal, especially with Sidero Omni to manage Talos. Well, I can tell you this setup works beautifully and, I think, gives you a lot of advantages. First of all, when you run Talos on top of Proxmox, it abstracts the Talos nodes from the underlying physical hardware.

Talos omni running on my proxmox and ceph home lab

This means if you want to take down a physical node for maintenance, you can still have ALL of your Talos Kubernetes nodes up and running. Also, it allows me to use the cluster and the hardware for more than just Kubernetes. I can use Proxmox to run standalone Docker nodes, Windows nodes, or really anything else I want to do with the resources. And, for me, options are king in the home lab.

All in all, I think this is a great combination of hypervisor, Kubernetes management, and immutable Kubernetes nodes with Talos that make this a killer setup.

Why five nodes was the right number

Could I have gotten away with three nodes? I could. That would have been the minimum. Four nodes would have added some margin in there as well. However, five nodes gives the cluster breathing room. With the 5 nodes, I can lose two nodes and still have quorum in the cluster. Ceph has more flexibility in data placement.

Viewing the proxmox cluster nodes with pvecm

Also, maintenance windows are less stressful because taking a node offline doesn’t push the cluster near its limit or failure tolerance threshold.
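The quorum math behind the three, four, and five node comparison is simple majority voting. Here is a minimal Python sketch, assuming one vote per node, which is the usual default for a Proxmox cluster.

```python
def quorum_votes(nodes):
    """Votes needed for a strict majority, assuming one vote per node."""
    return nodes // 2 + 1


def tolerable_failures(nodes):
    """How many nodes can be down while the rest still hold quorum."""
    return nodes - quorum_votes(nodes)


for n in (3, 4, 5):
    print(n, quorum_votes(n), tolerable_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Going from three to four nodes adds capacity but not failure tolerance for quorum. Five is the first size that survives two nodes being down at once.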

Seventeen terabytes in a mini rack

While 17 terabytes might not sound massive compared to enterprise storage arrays, for me, it is pretty awesome to have an all-NVMe distributed storage solution across five physical nodes, with a cluster that supports VM migrations, runs Kubernetes workloads, and tolerates node failures like a boss.

Viewing ceph storage in pulse

And, what is even more cool is that all of this lives inside a compact mini rack footprint that is powered by small mini PCs. Having a setup like this no longer requires having enterprise server gear with all the power draw they require and it fits into a MUCH smaller footprint and is much quieter.

Performance Testing!

Ok, so I know everyone will be curious about the performance of this little mini cluster. All benchmarks were captured while approximately 22 virtual machines were actively running on the cluster, so these numbers reflect real-world contention and distributed storage behavior.

The rbd_vm pool is replicated (size 2, min_size 1). The cephfs_ec_data pool uses 3+2 erasure coding. VM disks live on the replicated pool. CephFS data lives on the erasure coded pool.

Block-level tests were performed using fio against a mapped RBD image. Object-level tests were performed using rados bench with 4 MB object sizes.

RBD Performance (Virtual Machine Pool – Replicated, size 2)

4K Random Read (QD32, 4 jobs, 60 seconds)

  • ~185,000 IOPS
  • ~721 MiB/sec sustained throughput
  • ~0.69 ms average latency
  • ~1.8 ms 99th percentile latency

I thought these numbers showed excellent small-block read performance for distributed storage over 10Gb networking, with sub-millisecond average latency. This helps to confirm the cluster is balanced and not bottlenecked at the OSD or network layer. Reads scale efficiently because they can be served in parallel across nodes. So, blazing performance here.

Proxmox ceph cluster fio results for reads

4K Random Write (QD32, 4 jobs, 60 seconds)

  • ~7,700 IOPS
  • ~30 MiB/sec throughput
  • ~16–18 ms average latency
  • ~30 ms 99th percentile latency

Random writes are the most demanding workload for distributed storage. Each write must be replicated and committed before acknowledgment. With replication size 2 and 22 live VMs running in the cluster, this latency profile is very good. The important thing I observed is stability. There were no latency collapses or cluster health issues during the load test.
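The IOPS, throughput, and latency figures in the two 4K tests above hang together, and you can verify that with a little arithmetic. This Python sketch converts IOPS to MiB/sec and uses Little's law (outstanding I/Os = IOPS x latency) to estimate average latency. It assumes the queues stay full for the whole run, so treat the result as an approximation.

```python
def iops_to_mib_per_s(iops, block_bytes=4096):
    """Throughput in MiB/sec for a given IOPS rate and block size."""
    return iops * block_bytes / (1024 * 1024)


def mean_latency_ms(iops, queue_depth, jobs):
    """Little's law estimate, assuming queue_depth * jobs I/Os stay in flight."""
    return queue_depth * jobs / iops * 1000


# 4K random read: ~185,000 IOPS at QD32 with 4 jobs
print(round(iops_to_mib_per_s(185_000)))           # 723
print(round(mean_latency_ms(185_000, 32, 4), 2))   # 0.69

# 4K random write: ~7,700 IOPS at QD32 with 4 jobs
print(round(iops_to_mib_per_s(7_700)))             # 30
print(round(mean_latency_ms(7_700, 32, 4), 1))     # 16.6
```

Both estimates land on the measured values, which is a good sign that the fio runs were actually keeping 128 I/Os in flight.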

4K Mixed 70/30 Read/Write (QD32, 4 jobs, 90 seconds)

  • ~15.9K read IOPS
  • ~6.8K write IOPS
  • ~0.45 ms average read latency
  • ~17.7 ms average write latency

4k mixed read and write benchmark with fio in proxmox cluster

This mixed workload more closely resembles what you would see in a cluster like this. Reads are still very responsive, while writes carry the replication overhead that we expect. Latency was steady, even with active VMs on the cluster.

1M Sequential Write (QD16, 60 seconds)

  • ~542 MiB/sec sustained throughput

Sequential IO benefits from the parallel operations across OSDs and network links. Sustaining over 500 MiB/sec of replicated write traffic on a live cluster like this is great, I think. It demonstrates that the architecture can push high throughput without becoming unstable. Also, the large block throughput is getting to the practical limits of 10Gb networking.

Raw Ceph object performance (rados bench)

To validate performance at the object layer independently of RBD and Proxmox, I ran rados bench using 4 MB object sizes.

4 MB Object Write (60 seconds)

  • ~786 MB/sec sustained throughput
  • Peak bandwidth over 1.1 GB/sec
  • ~81 ms average latency per 4 MB object

Rados bench object write

You might think the 81 ms latency is sluggish. However, this is measured per 4 MB transfer, not per 4K IO. Large IO operations take longer in a distributed system due to replication and network transport. The key metric here is sustained throughput. At roughly 6.3 Gb/s sustained, with peaks of over 1.1 GB/sec approaching line rate on a single 10Gb link, the 10Gb network is clearly being put to good use.

4 MB Object Read

  • ~1.23 GB/sec sustained throughput
  • ~51 ms average latency per 4 MB object

Rados bench object read

Because the cluster uses dual 10Gb links bonded with LACP and distributes reads across multiple OSDs, aggregate bandwidth can exceed single-link limits. At ~1.23 GB/sec (about 9.8 Gb/s), reads are running right at the line rate of a single 10Gb link, which suggests that both the networking and OSD layers are scaling properly across all the nodes.
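To put the rados bench numbers next to the link speed, it helps to convert them to gigabits per second. This is a minimal Python sketch that assumes the reported figures are decimal megabytes per second and ignores protocol overhead, so the percentages are rough.

```python
LINK_GBIT = 10  # one 10Gb SFP+ link


def mb_per_s_to_gbit(mb_per_s):
    """Convert decimal MB/sec to Gb/sec."""
    return mb_per_s * 8 / 1000


def share_of_link(mb_per_s, link_gbit=LINK_GBIT):
    """Fraction of a single link's raw line rate that this throughput uses."""
    return mb_per_s_to_gbit(mb_per_s) / link_gbit


print(round(mb_per_s_to_gbit(786), 2))   # 6.29 (4 MB object write)
print(round(mb_per_s_to_gbit(1230), 2))  # 9.84 (4 MB object read)
print(round(share_of_link(1230), 2))     # 0.98
```

Keep in mind that a benchmark run from a cluster node can also be served partly by that node's own OSDs, so not every byte necessarily crosses the wire.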

Happy with the results

I am really happy with the performance here. A few important points help put these numbers in context:

  • The VM pool (rbd_vm) is replicated, not erasure coded. Erasure coding is used only for CephFS data
  • Small 4K random writes are latency-bound due to replication and durability guarantees with Ceph
  • Large sequential IO shows the true throughput capability of the cluster
  • Object-level tests also confirm that the underlying Ceph OSD and network layers are performing near expected limits
  • All benchmarks were captured under active workload, not on an idle system

The takeaway is this little mini rack Proxmox Ceph cluster is delivering strong read performance and predictable write behavior. It also is capable of high sequential throughput on dual 10Gb networking.

Wrapping up

I am really excited to see what I can accomplish with my new mini rack Proxmox-powered home lab that is running Ceph and CephFS, along with Talos Kubernetes. It can run basically anything I want to run at this point and I am really looking forward to the workloads and experimentation I can do in the lab in 2026. What do you guys think? Would you change anything that I have done here? I would like your honest feedback.

About The Author

Brandon Lee

Brandon Lee is the Senior Writer, Engineer and owner at Virtualizationhowto.com, and a 7-time VMware vExpert, with over two decades of experience in Information Technology. Having worked for numerous Fortune 500 companies as well as in various industries, he has extensive experience in various IT segments and is a strong advocate for open source technologies. Brandon holds many industry certifications, loves the outdoors and spending time with family. Also, he goes through the effort of testing and troubleshooting issues, so you don't have to.

jdodd14

So cool Brandon, you went all out! I am looking at building a mini rack this year at some point.

b101

I did not see any performance numbers against the pool. Erasure coding probably made those NVMes (consumer? no PLP? no DRAM cache?) slow as floppies in your pool. Please provide your #s:
rados bench -p 19 write --no-cleanup && rados bench -p 19 seq && rados cleanup -p

Felix

Cheers,
3+2 erasure coding actually means 3/5=60% efficiency. 😂
Do you know, how this setup handles losing one node and two individual disks in different nodes? Would 6+4 erasure coding make a difference in this case?
BR
Felix

Gai King

And the total power draw is?