vSAN Daemon Liveness Check Failed – vSAN 7

The vSAN Daemon Liveness Check Failed error on vSAN 7, caused by the "EPD uses a ramdisk for the db file" error when starting the EPD service

***UPDATE*** Fix is detailed in the VMware KB below.

One of the great things about running a home lab is that you get to triage issues just like the ones you would see in production. I came across an issue that was new to me in the home lab running vSAN 7.0 on top of the latest build of vSphere ESXi 7.0b, build 16324942. It involves one of the Skyline health checks for vSAN, called vSAN daemon liveness. In this post, we will look at a failed vSAN daemon liveness check caused by a failure to start the epd process. Let’s dive into the issue and its symptoms.

vSphere vSAN daemon liveness check failed

I came into the lab this morning with a “red bang” on the vSAN cluster. After looking at the Skyline health of the cluster itself, I saw the vSAN daemon check was the culprit.

The VMware KB here describes it this way:

“vSAN daemons may still have issues, but this test does a very basic check to make sure that they are still running. If this reports an error, the state of the CLOMD, EPD, and CMMDSD service(s) is not working as expected and needs to be checked on the relevant ESXi host. A good way to further probe into CLOMD health is to perform a virtual machine creation test (Proactive tests), as this involves object creation that will exercise and test CLOMD thoroughly. For more information about this issue, refer to the following article: vSAN CLOMD daemon may fail when trying to repair objects with 0 byte components (2149968)”

Interestingly, in my case the problem was only with the EPD component of the vSAN daemon check.

vSphere vSAN daemon liveness check failed

The other two checks were normal. The liveness check covers the following daemons:

  • CLOMD
  • EPD
  • CMMDSD

From the KB:

Daemon | Data node of stretched cluster | Witness node of stretched cluster | Data node of metadata cluster | Metadata node of metadata cluster
CLOMD  | Yes                            | No                                | Yes                           | Yes
EPD    | Yes                            | No                                | Yes                           | No
CMMDSD | Yes                            | Yes                               | Yes                           | Yes

Troubleshooting

According to the KB, you can try a few things for troubleshooting purposes including the following:

/etc/init.d/cmmdsd status && /etc/init.d/epd status && /etc/init.d/clomd status


If a daemon is not running, try running the restart command on the ESXi host:


/etc/init.d/cmmdsd restart && /etc/init.d/epd restart && /etc/init.d/clomd restart
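
Because the KB commands are chained with &&, a non-zero exit from one init script stops the remaining ones from running. The sketch below queries each daemon on its own instead. The function name check_vsan_daemons and its optional directory argument are hypothetical, and whether the status action returns a non-zero exit code for a stopped daemon is an assumption, so read the printed status text rather than relying on the exit code alone.

```shell
#!/bin/sh
# Sketch: query each vSAN daemon separately so one failure does not hide the others.
# The optional first argument (init script directory) is an assumption added for
# illustration; it defaults to /etc/init.d, the directory used by the KB commands.

check_vsan_daemons() {
    init_dir="${1:-/etc/init.d}"
    failed=0
    for daemon in cmmdsd epd clomd; do
        script="$init_dir/$daemon"
        if [ ! -x "$script" ]; then
            echo "$daemon: init script not found at $script"
            failed=1
            continue
        fi
        # Capture the status text and exit code without interpreting the text.
        output=$("$script" status 2>&1) && rc=0 || rc=$?
        echo "$daemon: exit=$rc output=$output"
        [ "$rc" -eq 0 ] || failed=1
    done
    return "$failed"
}

# Usage on an ESXi host (not run here):
#   check_vsan_daemons
```

The function returns 1 if any script is missing or exits non-zero, which makes it easy to spot the one daemon that needs attention before deciding which restart command to run.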

In my case, only the EPD service had a problem. When I ran the relevant commands for just that service, I received the following:

Performing steps to bring up the EPD service for vSAN liveness checks

As you can see, manually attempting to start the service fails with the error:

  • EPD uses a ramdisk for the db file

On the left, I have the server that is “unhealthy” and on the right, I have a server that is “healthy” according to the vSAN daemon liveness checks. As you can see, the scratch directory on the server on the right contains two files:

  • epd-store.db
  • epd-store.db-journal
Comparing the scratch directories between unhealthy server and healthy server
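
The same comparison can be made from the shell instead of by eye. This is a minimal sketch that reports whether the two files seen on the healthy host exist in a given directory. The function name check_epd_store is hypothetical, and the directory is passed as an argument because the exact path of the EPD store is not stated in this post.

```shell
#!/bin/sh
# Sketch: report whether the two EPD store files seen on the healthy host
# are present in a directory. The directory argument is an assumption;
# pass the scratch location you are comparing on each host.

check_epd_store() {
    dir="$1"
    missing=0
    for f in epd-store.db epd-store.db-journal; do
        if [ -f "$dir/$f" ]; then
            echo "present: $dir/$f"
        else
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (not run here): check_epd_store <directory holding the EPD store>
```

Running it on both hosts gives a line-by-line result that can be compared directly.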

The server on the left, the one failing the vSAN daemon liveness check, contains a directory structure. Looking further at this host, I also saw what appear to be other issues on the drive: the productLocker link was dead because the underlying directory was not available.

ProductLocker sym link is dead
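
A dead symlink like this one can be confirmed with two shell tests: -L succeeds if the path is a symlink, and -e fails if the target behind it does not exist. The sketch below wraps those tests. The function name check_symlink is hypothetical, and /productLocker is used in the usage comment only as the example from this host.

```shell
#!/bin/sh
# Sketch: detect a dead (dangling) symlink.
# [ -L path ] is true for a symlink; [ -e path ] follows the link and is
# false when the target is missing.

check_symlink() {
    link="$1"
    if [ ! -L "$link" ]; then
        echo "$link: not a symlink"
        return 2
    fi
    target=$(readlink "$link")
    if [ -e "$link" ]; then
        echo "$link -> $target (ok)"
        return 0
    fi
    echo "$link -> $target (dead: target missing)"
    return 1
}

# Usage on the ESXi host (not run here):
#   check_symlink /productLocker
```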

So, with these findings, is it corruption? My ESXi 7.0 hosts boot from USB devices.

One odd thing: in Googling the error, I turned up a couple of other posts describing the same problem on vSAN 7.0 with ESXi 7.0b hosts. Take a look at those posts here:

As of today, I have not resolved the issue. However, I am definitely seeing a strange layout on my USB boot disk. With others seeing the exact same error, it makes me wonder whether this is a bug in 7.0b.

Another interesting note about this error: it does not prevent the vSAN host from running VMs. Each time I brought this host back up during the multiple reboots I did while troubleshooting, DRS was able to place virtual machines on the server, as you can see below. For now, I am going to leave the host in play and continue to tinker with the issue in search of a possible workaround.

Please leave a comment if you have run into this issue with vSAN 7.0 and the vSAN daemon liveness check error.

vSAN daemon liveness error host is still able to run VMs

VMware KB detailing the fix

A new VMware KB article entitled Bootbank loads in /tmp/ after reboot of ESXi 7.0 Update 1 host (2149444) has been posted which addresses the issue. This appears to be a timing issue with the boot of ESXi.

Cause

The storage-path-claim service claims the storage device ESXi is installed on during startup. This causes the bootbank/altbootbank image to become unavailable and the ESXi host reverts to ramdisk.
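
One way to see whether a host is in this state is to check where the bootbank links actually point. The sketch below resolves a link to its physical directory and reports whether that directory sits under /tmp, as the KB title describes. The function name resolves_under and its optional prefix argument are hypothetical, and the sketch assumes /bootbank and /altbootbank are symlinks to directories on the host.

```shell
#!/bin/sh
# Sketch: resolve a directory symlink and report whether it lands under a
# given prefix (default /tmp, per the KB title "Bootbank loads in /tmp/").
# Assumption: the path is a directory or a symlink to one.

resolves_under() {
    link="$1"
    prefix="${2:-/tmp}"
    # pwd -P prints the physical directory with all symlinks resolved.
    real=$(cd "$link" 2>/dev/null && pwd -P) || {
        echo "$link: cannot be resolved"
        return 2
    }
    case "$real" in
        "$prefix"|"$prefix"/*)
            echo "$link resolves to $real (under $prefix)"
            return 0 ;;
        *)
            echo "$link resolves to $real"
            return 1 ;;
    esac
}

# Usage on the ESXi host (not run here):
#   resolves_under /bootbank
#   resolves_under /altbootbank
```

A return code of 0 means the link resolves under the prefix, which matches the symptom in the KB. A return code of 1 means it resolves somewhere else.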

Wrapping Up

In my case, the vSAN Daemon Liveness Check Failed error with vSAN 7 was due to the EPD service failing to start with the "EPD uses a ramdisk for the db file" error message. I am still attempting to narrow this down to either corruption on the boot disk or a bug in the 7.0b release, since I found two other cases of this posted in the community forums and in blog posts. Let me know if you see this too.

Brandon Lee

Brandon Lee is the Senior Writer, Engineer and owner at Virtualizationhowto.com and has over two decades of experience in Information Technology. Having worked for numerous Fortune 500 companies as well as in various industries, Brandon has extensive experience in various IT segments and is a strong advocate for open source technologies. Brandon holds many industry certifications, loves the outdoors and spending time with family.
