Lost access to volume due to connectivity issues

Recently, I ran into an issue with a host where performance would tank from the client side.  VM consoles and RDP sessions would disconnect, and the host would become unresponsive to any commands issued from the vCenter console or WebUI.  Looking into the logs, I discovered many “lost access to volume due to connectivity issues” messages.  I wanted to detail a few of the findings on this particular host and where things stand now.

volume01 Lost access to volume due to connectivity issues

A few environmental notes about this particular host:

  • It was running as a development environment
  • It had tons of snapshots running as it was being abused for rinse/repeat operations
  • It is a whitebox Supermicro server, so it had some pieced-together hardware for RAID, etc.
  • It is running ESXi 6.0 U2 without any other patches

Things Tried

After some serious issues where the host finally nosedived and became totally unresponsive, I had to power slam the box.  At this point I drained the flea power and brought the host back up.  Upgrading and patching had to wait, as there was no maintenance window and the priority was getting things back up and running as quickly as possible.

The first thing I did was perform snapshot maintenance on the box.  Between the reboot and the deletion of numerous snapshots, several of them large, the host started performing much better, and over the course of a couple of days I only saw one or two isolated “lost access to volume due to connectivity issues” messages in the events log.

However, this is still not normal.  Heartbeats take place to make sure the datastore is still present and healthy.  Every 3 seconds, the heartbeat process issues a write operation to the datastore.  The host expects each of these writes to finish within 8 seconds.  If one times out, another heartbeat operation is initiated.  However, if the heartbeat operations don’t complete within a 16 second window, the datastore is offlined and you will see the “lost access to volume due to connectivity issues” message in the event log.
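To make the timing above easier to reason about, here is a minimal Python sketch of it.  This is only a toy model built from the numbers in the paragraph above, not how ESXi actually implements the heartbeat; the function name and the idea of feeding it a list of write latencies are my own assumptions for illustration.

```python
# Toy model of the heartbeat timing described above (not ESXi's real logic).
HEARTBEAT_INTERVAL_S = 3   # a heartbeat write is issued every 3 seconds
HEARTBEAT_TIMEOUT_S = 8    # each write is expected to finish within 8 seconds
OFFLINE_WINDOW_S = 16      # no completed heartbeat in 16 seconds => offlined


def heartbeat_outcome(attempt_latencies):
    """Classify one heartbeat cycle.

    attempt_latencies is a hypothetical list of how long (in seconds) each
    successive heartbeat write would take. A write slower than the timeout
    is abandoned at the timeout and the next attempt is issued.
    """
    elapsed = 0.0
    for latency in attempt_latencies:
        if latency <= HEARTBEAT_TIMEOUT_S:
            # The write completed in time, so the datastore stays online.
            return "online"
        # The write timed out; account for the wait and retry.
        elapsed += HEARTBEAT_TIMEOUT_S
        if elapsed >= OFFLINE_WINDOW_S:
            # Nothing completed inside the 16 second window.
            return "offline"
    # Ran out of sample data before reaching a verdict.
    return "retrying"


if __name__ == "__main__":
    for sample in ([1.0], [10.0, 2.0], [10.0, 10.0]):
        print(sample, "->", heartbeat_outcome(sample))
```

In this model, a single slow write followed by a fast one keeps the datastore online, while two timed-out writes in a row exhaust the 16 second window and produce the “lost access” condition.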

So, long story short, something is causing the datastore write operations to fail periodically.  This could be a hardware issue or something else.

Other possible issues

There is a good VMware KB covering issues with the VAAI ATS heartbeat, the newer heartbeat mechanism used with VMFS5, which is on by default.

If you see the following message in the vmkernel.log this KB may apply to you:

  • ATS Miscompare detected between test and set HB images at offset XXX on vol YYY

For this particular host, I didn’t find this message in the vmkernel.log, so I haven’t changed the ATS heartbeat setting yet.
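If you would rather not eyeball the log by hand, a small script can do the search.  The sketch below is just one way to do it in Python; the function names are made up for this post, and it assumes you are running it against a copy of vmkernel.log that you have pulled off the host.  It matches only on the “ATS Miscompare detected” portion of the message, since the offset and volume will vary.

```python
import re

# Match the fixed part of the message; offset and volume vary per event.
ATS_MISCOMPARE = re.compile(r"ATS Miscompare detected", re.IGNORECASE)


def find_ats_miscompares(lines):
    """Return every line that contains the ATS miscompare message."""
    return [line.rstrip("\n") for line in lines if ATS_MISCOMPARE.search(line)]


def scan_log_file(path):
    """Scan a log file on disk (path is wherever you saved your copy)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_ats_miscompares(handle)
```

For example, `scan_log_file("vmkernel.log")` returns a list of matching lines; an empty list means the message was not found in that file, which is what I saw on this host.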

Thoughts

So far, this is an open-ended issue that I am still dealing with.  I will post an update with any new findings on this “lost access to volume due to connectivity issues” problem.  Stay tuned!