My Daily Home Lab Checklist That Prevents Most Problems

In a home lab, heading off problems often comes down to the small things. Most problems are not random. They build up slowly and give little warning signs before turning into major outages or weird performance issues. Keeping these small problems from building up is not difficult. It takes a routine of small checks that need just a few minutes each day. This is the daily checklist I follow in my home lab. I think it is simple, practical, and based on real issues I have run into with Proxmox, Ceph, Docker, and Kubernetes environments.

Check cluster and node health first

I like to start at the foundation of the home lab: my Proxmox cluster and the physical nodes in the environment. I spend the first few minutes on everything I can physically touch. The reason is that if you start too high up in the stack, the problems you see there may just be noise from problems at a much lower level.

For Proxmox, this is pretty straightforward. I start by making sure that all nodes are online and responsive. A node that is slow or impaired can point to a hardware issue or maybe even a networking problem. If you are running Ceph as part of your Proxmox cluster, include it in the checks at the physical cluster level. Make sure all of your OSDs are up and in and that the cluster reports healthy.

Running the ceph status command

Also pay attention to quorum. If your cluster is on the edge of losing quorum, you are one failure away from a much bigger problem.

Daily checks as part of your checklist for your cluster and nodes:

Category | Check | How to check | What to look for
Physical nodes | Node availability | Proxmox UI or pvecm status | All nodes online and responsive
Performance | Node responsiveness | Proxmox UI, SSH into nodes | Slow UI, lag, high CPU or IO
Cluster health | Cluster status | pvecm status | No errors, stable cluster state
Ceph overall health | Ceph status | ceph -s | HEALTH_OK
Ceph data health | Object state | ceph -s | No degraded or misplaced objects
Ceph OSD status | OSD availability | ceph osd tree or ceph -s | All OSDs are IN and UP
Ceph warnings | Cluster warnings | ceph health detail | Any HEALTH_WARN messages
Quorum status | Cluster quorum | pvecm status | All expected nodes in quorum
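
If you want to script the first pass of these checks, here is a minimal Python sketch. It assumes you capture the output of ceph -s --format json and pvecm status yourself (for example with subprocess) and pass the text in. It also assumes the JSON reports health under health.status and that pvecm status prints a "Quorate:" line, which matches recent releases as far as I know but is worth confirming on yours. The function names and sample strings are made up for illustration.

```python
import json


def ceph_health(status_json: str) -> str:
    """Return the overall health string from `ceph -s --format json` text."""
    data = json.loads(status_json)
    # Assumption: health is reported under health.status (e.g. HEALTH_OK).
    return data.get("health", {}).get("status", "UNKNOWN")


def is_quorate(pvecm_output: str) -> bool:
    """Return True if `pvecm status` text contains a 'Quorate: Yes' line."""
    for line in pvecm_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Quorate":
            return value.strip().lower() == "yes"
    # No Quorate line found: treat it as a problem worth looking at.
    return False


# Hypothetical sample output, trimmed to the fields the sketch reads.
sample_ceph = '{"health": {"status": "HEALTH_WARN"}}'
sample_pvecm = "Quorum information\nNodes:            3\nQuorate:          Yes\n"

print(ceph_health(sample_ceph))
print(is_quorate(sample_pvecm))
```

Anything other than HEALTH_OK, or a False from the quorum check, is the cue to run ceph health detail and look closer by hand.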

Look for storage latency and weird performance

Storage issues can be some of the hardest problems to troubleshoot after the fact. The key is to catch them when they first start showing up. In a Ceph cluster that is part of your Proxmox cluster, there are several Ceph commands that give you visibility into Ceph performance. I look at:

  • ceph osd perf for latency across OSDs
  • Any outliers where one OSD is significantly slower than others
  • General responsiveness of the cluster, or workloads that may be hanging

Ceph osd perf helps spot outlier disks that may have issues

Even if everything is technically healthy, latency spikes can indicate underlying problems like:

  • A failing drive
  • Network congestion
  • Resource contention on a node

On the host side, a quick look at iostat or similar tools can help you spot anything that is out of the ordinary. If something feels slower than usual, I have learned to trust my instincts and dig a little deeper into the issue. Storage performance usually doesn’t fail all at once. It usually degrades over time. Catching it early can save you from a much larger issue later.
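
The outlier check can be sketched in a few lines of Python. This is only an illustration: it assumes you have already collected one latency number in milliseconds per OSD (for example the commit latency that ceph osd perf prints), and the 3x-median factor and 10 ms floor are arbitrary starting points I picked for the example, not values recommended by Ceph.

```python
from __future__ import annotations

from statistics import median


def slow_osds(
    latency_ms: dict[int, float],
    factor: float = 3.0,
    floor_ms: float = 10.0,
) -> list[int]:
    """Return OSD ids that are slower than floor_ms AND factor x the median."""
    if not latency_ms:
        return []
    typical = median(latency_ms.values())
    return sorted(
        osd
        for osd, ms in latency_ms.items()
        if ms > floor_ms and ms > factor * typical
    )


# Hypothetical readings: OSD 3 is far slower than its peers.
sample = {0: 2, 1: 3, 2: 2, 3: 48, 4: 4}
print(slow_osds(sample))  # prints [3]
```

Comparing against the median rather than a fixed number means the check adapts to whatever is normal for your disks, while the floor keeps very fast clusters from flagging harmless differences of a millisecond or two.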

Check containers and services for failures or issues

I run most of my self-hosted services inside containers or Kubernetes pods. Because containers are so small and restart so quickly, one can become unstable without you noticing. A restart every so often is easy to miss.

I use Pulse in the home lab to keep an eye on all my running containers and get a good high level view on resource usage on those. It tells me quickly things like uptime, CPU, memory, disk usage, containers running, last update, etc.

Checking docker container resource usage in the home lab

If you are not running something like Pulse, you can also run something simple from the command line like docker ps or podman ps. If you are running Portainer, Komodo, or something similar, these tools give you a quick visual overview. My usual quick options are:

  • docker ps or podman ps
  • Portainer dashboard if I want a quick visual overview

Here are some things to look for in a quick check of your containers:

Category | What to look for | Why it matters | Examples
Stability | Containers stuck in a restart loop | You have a bad config, missing data, bad permissions, etc. | Any containerized app
Availability | Stopped services | Other services may be impacted | Reverse proxies, APIs
Health | Health checks failing | Services may be degraded | Monitoring stacks, auth services
Dependencies | Failures across connected services | Problems may impact other services | Internal APIs, authentication services
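
Here is a rough Python sketch of how the stability, availability, and health rows could be checked from the command line output. It assumes you run docker ps -a --format '{{json .}}' yourself and pass in the text, which is one JSON object per line, and that each object carries Names, State, and Status fields as current Docker releases do. The container names in the sample are hypothetical.

```python
from __future__ import annotations

import json


def problem_containers(ps_json_lines: str) -> list[tuple[str, str]]:
    """Return (name, reason) for containers that are not running or unhealthy."""
    problems = []
    for line in ps_json_lines.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        name = row.get("Names", "?")
        state = row.get("State", "unknown")
        status = row.get("Status", "")
        if state != "running":
            # Covers restarting, exited, created, paused, and so on.
            problems.append((name, state))
        elif "(unhealthy)" in status:
            # Docker appends the health check result to the Status text.
            problems.append((name, "unhealthy"))
    return problems


# Hypothetical sample, trimmed to the fields the sketch reads.
sample = """
{"Names": "proxy", "State": "running", "Status": "Up 3 days"}
{"Names": "auth", "State": "restarting", "Status": "Restarting (1) 5 seconds ago"}
{"Names": "metrics", "State": "running", "Status": "Up 2 hours (unhealthy)"}
"""
print(problem_containers(sample))
```

An empty list means nothing needs attention. Anything else is a short list of containers to inspect with docker logs.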

Verify backup jobs ran

This one is super important. You want to make sure that your home lab backups have run and that they were successful. We all know that a backup can report success and still fail to restore. So, that is another check you need to make. But first, it is still a good idea to make sure you aren’t getting error messages on your backups.

You need to check that the backup ran successfully and that the data size is similar to what you expect. If you are backing up 20 GB worth of data and your backup file is only 5 MB, there is probably something wrong.

Checking backups for errors

Again, make sure you test that even successful backups can restore. That is something that I learned the hard way with Kubernetes backups recently. Everything looked fine until I needed to restore something. But I still think the quick spot checks are the first line of defense.

Here are the backup checks to do daily:

Category | Check | How to check | What to look for
Jobs | Job success | Backup UI or logs | Jobs completed without errors
Errors | Failure messages | Backup logs or alerts | Authentication failures, job errors
Storage capacity | Backup storage usage | Storage dashboard or CLI | Enough free space, no full volumes
Snapshots | Snapshot creation | Backup tool or storage system | Snapshots being created on schedule
Retention | Retention policies | Backup configuration and logs | No unexpected deletions
Backup size | Data size consistency | Look at backup sizes | Backup size is in line with expected data
Recoverable? | Test restores | Restore one of your backups as a test | Data can be successfully restored
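
The backup size check is easy to automate. The sketch below is one possible approach, not a feature of any particular backup tool: it assumes you can get the sizes of the last few backups in bytes from your tool or from the filesystem, and it flags the newest one if it differs from the typical size by more than a tolerance. The 50 percent tolerance is an arbitrary example value.

```python
from __future__ import annotations

from statistics import median


def size_looks_wrong(
    latest_bytes: int,
    previous_bytes: list[int],
    tolerance: float = 0.5,
) -> bool:
    """Return True if the latest backup is far from the typical recent size."""
    if not previous_bytes:
        # Nothing to compare against yet.
        return False
    typical = median(previous_bytes)
    return abs(latest_bytes - typical) > tolerance * typical


GB = 1024**3
MB = 1024**2

# Hypothetical history: recent backups were around 8 GB, the newest is 5 MB.
history = [8 * GB, 8 * GB, 9 * GB]
print(size_looks_wrong(5 * MB, history))  # prints True
print(size_looks_wrong(8 * GB, history))  # prints False
```

Comparing against recent backups instead of the raw data size sidesteps compression and deduplication, which make a backup of 20 GB of data legitimately much smaller than 20 GB.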

Check disk space and resource creep

When it comes to predictable problems in a home lab, disk space issues are among the most common. They almost always come from slow accumulation: old VM disks, unused container volumes, logs growing over time, and backup retention not being applied when or how you expect.

I make it a habit to check:

  • Ceph pool usage
  • Local disk usage on hosts
  • Any volumes that are unexpectedly large

Running disk space command on proxmox node

In container environments, this is especially important. Docker can accumulate unused images and volumes quickly if you are testing and iterating often. But, resource creep is not just about disk space. It also includes things like:

  • Memory use
  • CPU usage
  • VMs or containers that are no longer needed

Catching this early helps you avoid sudden issues where something stops working because a disk filled up or a node ran out of resources. Check out the following daily checks as part of your checklist:

Category | Check | How to check | What to look for
Ceph storage | Pool usage | ceph df or Proxmox UI | Pools are close to capacity or uneven
Host storage | Local disk usage | df -h on hosts | High usage or disks close to full
Container storage | Docker usage | docker system df | Unused images, volumes, or excessive usage
Log growth | Log file size | /var/log, journald, or logging tools | Logs consuming excessive space
Backup storage | Retention behavior | Backup UI or storage view | Old backups not being cleaned up
Memory usage | RAM consumption | Proxmox UI, free -h, or monitoring tools | High or steadily increasing usage
CPU usage | CPU load | Proxmox UI, top, or htop | Constant high CPU or spikes
Resource cleanup | Unused workloads | Proxmox UI or container tools | VMs or containers no longer needed
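
For the host storage row, a small Python sketch using only the standard library can stand in for eyeballing df -h. The 80 percent threshold and the choice of paths are assumptions for the example, so adjust both to your own mount points. The percentage can differ slightly from what df shows because df accounts for reserved blocks.

```python
from __future__ import annotations

import shutil


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def check_paths(paths: list[str], warn_at: float = 80.0) -> list[tuple[str, float]]:
    """Return (path, percent used) for every path at or above warn_at."""
    warnings = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        pct = percent_used(usage.total, usage.used)
        if pct >= warn_at:
            warnings.append((path, round(pct, 1)))
    return warnings


# Example: check the root filesystem. Add your own mount points here.
print(check_paths(["/"]))
```

Running this from cron or a systemd timer on each node and only alerting on a non-empty list keeps the daily check to a glance.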

Quick network checks

Networking problems are another issue, and they can be some of the worst to troubleshoot. Quick sanity checks every day can help you catch issues before they become difficult to find or diagnose. Most of us have uptime monitors that tell us when services go down unexpectedly, and that helps us be proactive. But I still perform a few quick sanity checks like:

  • Can I reach key services?
  • Are there any obvious latency or speed issues?
  • Does everything feel normal or sluggish?

With VLANs, bonds, or more advanced networking, this becomes even more important. Misconfigurations or partial failures can cause issues that are hard to trace later. Sometimes this step just means opening up your GitLab instance or your Home Assistant dashboard and making sure things come up and are populated.

Daily checklist of your network:

Category | Check | How to check | What to look for
Reachable? | Uptime Kuma or manually look at services | Browser, curl, or ping | Services respond as expected
Connectivity | Network reachability | ping, traceroute, or mtr | No packet loss or routing issues
Latency | Response times | ping or monitoring tools | Higher than normal latency
Throughput | Network speed | iperf, file transfers, or monitoring | Slower than expected speeds
User experience | General responsiveness | Access apps and dashboards | Sluggish or delayed responses
VLAN configuration | VLAN health | Switch config, Proxmox bridge settings | Correct tagging, no misconfigurations
Monitoring alerts | Uptime checks | Monitoring system dashboards | Any recent outages or alerts
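
The reachability and latency rows can be combined into one small Python sketch that opens a TCP connection to each service and times it. This is an illustration under a few assumptions: a successful TCP connect is treated as "reachable", which says nothing about whether the application behind the port is healthy, and the function names are made up for the example.

```python
from __future__ import annotations

import socket
import time


def tcp_check(host: str, port: int, timeout: float = 2.0) -> float | None:
    """Return the TCP connect time in milliseconds, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return None


def check_all(services: dict[str, tuple[str, int]]) -> dict[str, float | None]:
    """Run tcp_check for every named (host, port) pair."""
    return {name: tcp_check(host, port) for name, (host, port) in services.items()}
```

To use it, pass a dictionary of your own services, for example a name mapped to a host and port 443, and look for None values or connect times well above what you normally see.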

Wrapping up

Most problems in the home lab come from little issues that go unnoticed until they are bigger. I like having a daily checklist like this to go from being reactive to proactive. You will likely catch things while they are still small, before they grow into major problems that take services offline. If you are running a mix of Proxmox, Ceph, containers, or other technologies, having a routine is key. It doesn’t have to be complicated, just consistent. How about you? Do you have a routine checklist like this for your home lab environment? Let me know in the comments.

About The Author

Brandon Lee

Brandon Lee is the Senior Writer, Engineer and owner at Virtualizationhowto.com, and a 7-time VMware vExpert, with over two decades of experience in Information Technology. Having worked for numerous Fortune 500 companies as well as in various industries, he has extensive experience in various IT segments and is a strong advocate for open source technologies. Brandon holds many industry certifications, loves the outdoors and spending time with family. Also, he goes through the effort of testing and troubleshooting issues, so you don't have to.
