Software-defined solutions such as software-defined networking and storage have revolutionized the world of virtualization. Abstracting these core areas of the architecture away from the physical hardware enables features and agility that a hardware-only solution cannot match. However, the same layer of abstraction often adds complexity when troubleshooting. Recently, VMware released a Troubleshooting vSAN Performance whitepaper that provides a framework for diagnosing and mitigating performance issues in a production vSAN environment. The whitepaper lays out a stepped, logical approach to troubleshooting a vSAN environment. In this post, we will take a look at troubleshooting vSAN performance in five steps and see how they can be used to identify, isolate, and troubleshoot performance issues with VMware vSAN.
Outlining the Five Steps in Troubleshooting vSAN Performance
The steps outlined include:
- Identify and Quantify – You first have to identify the real issue. Asking the right questions is key here: it helps separate real issues from perceived ones and keeps the focus on solving the actual performance problem.
- Discover/Review – Environment – This step involves reviewing the current configuration. It can help flush out any basic misconfiguration that may be causing performance issues.
- Discover/Review – Workload – A review of the applications and workflows. What is the application that is suffering a performance issue trying to do?
- Performance Metrics – Insight – In this step, the vSAN administrator reviews key performance indicators and checks whether the observations coincide with the performance problems perceived by the end user.
- Mitigation – After the problem has been determined, the administrator looks for any software or hardware changes that may be part of the resolution.
Identify and Quantify VMware vSAN Performance Issue
The first step in the approach seems obvious. However, I like that the whitepaper gives it focus, because when troubleshooting a wide variety of issues, not only vSAN, this step is often missed or not done properly. For vSAN, identifying and quantifying the issue involves a few things:
- Determine if a true performance issue exists
- What is the magnitude or scope of the issue?
- Is performance outside the original design scope?
Who reported the performance issue? Is it an end user or perhaps another administrator? What evidence or data do you have to support the performance issue claim? Is the issue recurring or reproducible? Have any changes been made in the vSAN environment that coincide with the reported performance issues? What about updates, firmware upgrades, etc.?
What about the scope of the environment? Is the issue related to the current workload of the vSAN cluster falling outside the original design? Have additional workloads or other demands been added to the cluster? Was the cluster originally sized and provisioned with affordability/density in mind rather than performance? Was the vSAN performance evaluation checklist followed prior to testing or placing the cluster into production? What about HCIBench? Were HCIBench synthetic tests run against the cluster?
Asking these questions is the first part of the vSAN performance troubleshooting framework, and the answers are crucial to understanding the problem and a potential solution.
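To make the "quantify" part of this step concrete, one option is to compare what is observed against what the cluster was designed to deliver. The following is a minimal Python sketch, not something taken from the whitepaper. The function name, the 20% tolerance, and the sample latency values are all hypothetical.

```python
def quantify_against_baseline(observed_ms, baseline_ms, tolerance=0.20):
    """Compare an observed latency with the design baseline.

    Returns (is_issue, ratio), where ratio is observed / baseline and
    is_issue is True when the ratio exceeds 1 + tolerance.
    All inputs are hypothetical values gathered by the administrator.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    ratio = observed_ms / baseline_ms
    return ratio > 1.0 + tolerance, ratio


if __name__ == "__main__":
    # Hypothetical numbers: the design target was 2 ms, users complain at 6 ms.
    is_issue, ratio = quantify_against_baseline(observed_ms=6.0, baseline_ms=2.0)
    print(f"issue={is_issue}, observed is {ratio:.1f}x the baseline")
```

A ratio like this gives you a number to attach to the report, which is easier to track over time than "it feels slow."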
Discover and Review the VMware vSAN Environment
The second step in the process is the discovery/review – environment step. This step involves taking a look at the vSAN environment as a whole and determining if there is an issue with the configuration that could potentially be impacting performance. There are two areas that need to be looked at closely:
- vSAN environment health checks
- Non-vSAN environment health checks
VMware vSAN provides a great way to check the state of the solution right from the vSphere Client. Via the built-in vSAN health checks, administrators can look at the current state of the vSAN environment and see whether there are any outstanding issues with it, as these could certainly affect performance.
There are also several non-vSAN environment health checks to take into consideration when gathering information about vSAN performance. These include the following:
- Infrastructure design/switch topology
- Disk group verification from a hardware perspective
- Host NIC verification
Look at the cluster topology as well. For example, is the vSAN cluster configured as a “stretched cluster”? If so, consider the impact of data being written across both sites, and make sure the storage policies are designed to take this into account.
Also, again from a design perspective, was the cluster designed for cost or for performance? If it was designed for cost, is the reported performance simply what the cluster was designed to deliver?
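As an illustration of the host NIC verification mentioned above, the sketch below flags hosts whose vSAN network settings differ from the rest of the cluster. It is only a sketch under assumptions: the inventory dictionary is hand-written, hypothetical data (it does not call any VMware API), and the choice of MTU and link speed as the fields to compare is my own.

```python
from collections import Counter


def find_nic_outliers(hosts):
    """Return hosts whose NIC settings differ from the cluster majority.

    hosts maps a host name to a dict with "mtu" and "speed_mbps" keys.
    The result maps each outlier host to {field: (actual, majority)}.
    """
    outliers = {}
    for field in ("mtu", "speed_mbps"):
        counts = Counter(nic[field] for nic in hosts.values())
        majority, _ = counts.most_common(1)[0]
        for name, nic in hosts.items():
            if nic[field] != majority:
                outliers.setdefault(name, {})[field] = (nic[field], majority)
    return outliers


if __name__ == "__main__":
    # Hypothetical inventory collected by hand or exported from your tooling.
    inventory = {
        "esx01": {"mtu": 9000, "speed_mbps": 10000},
        "esx02": {"mtu": 9000, "speed_mbps": 10000},
        "esx03": {"mtu": 1500, "speed_mbps": 10000},
    }
    print(find_nic_outliers(inventory))
```

Comparing against the majority value rather than a fixed target keeps the check useful whether or not the cluster uses jumbo frames.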
Discover and Review VMware vSAN Workloads
After the initial discovery phase from an environment perspective, the next step is a similar process from an application and workload perspective. The whitepaper splits this part of the discovery phase into two categories: application characteristics and workflow/process characteristics. The application characteristics cover the applications themselves, how many there are, guest VM configuration settings, storage policy settings, and anything else that may impact performance. The workflow/process characteristics cover the activity that is actually being demanded of the application. Together, the two make up the “fingerprint” of what your environment looks like from a workload perspective.
From an application perspective, ask questions like the following:
Which applications are not performing adequately? Is it SQL Server or something else? Does the application depend on other VMs? What other architectural considerations need to be made? What type of storage policy is associated with the VM? How is the VM configured (number of vCPUs, memory, reservations, etc.)?
From a workflow and process standpoint:
What type of tasks or workflow is being performed by the application? Have the workflow tasks been mapped? Is the workflow dependent on multiple VMs or a single VM? Is the workload highly transactional, such as SQL Server?
It is important to note that overlooking automated workflows and processes inside an organization can often undermine well-designed infrastructure. This includes often highly inefficient legacy tools, platforms, and applications.
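One way to put numbers on the workload "fingerprint" is to derive the read/write mix and the average I/O size from counters you have already collected. The sketch below assumes you have four averaged values for a VM (read and write IOPS, read and write throughput in KB/s). The function name and the sample values are hypothetical, and the calculation relies only on the fact that average I/O size equals throughput divided by IOPS.

```python
def workload_fingerprint(read_iops, write_iops, read_kb_per_s, write_kb_per_s):
    """Summarize a workload as read/write mix and average I/O size.

    Inputs are averaged counters for the same time window (hypothetical
    values supplied by the administrator).
    """
    total_iops = read_iops + write_iops
    if total_iops <= 0:
        raise ValueError("no I/O observed in this window")
    return {
        "read_pct": 100.0 * read_iops / total_iops,
        "write_pct": 100.0 * write_iops / total_iops,
        # Throughput (KB/s) divided by IOPS gives KB per I/O.
        "avg_io_kb": (read_kb_per_s + write_kb_per_s) / total_iops,
    }


if __name__ == "__main__":
    # Hypothetical sample: 3000 read IOPS, 1000 write IOPS, 8 KB I/Os.
    print(workload_fingerprint(3000, 1000, 24000, 8000))
```

A mostly-read workload with small I/Os and a mostly-write workload with large I/Os stress the cluster very differently, so it helps to know which one you are looking at before reading the performance graphs.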
Taking Performance Metrics in VMware vSAN
Looking at performance metrics for VMware vSAN is a bit different from looking at a traditional storage solution because of the distributed nature of the environment. Performance metrics can be the most helpful way to quantify the performance of a VMware vSAN environment. This includes gathering both point-in-time metrics and time-based metrics that cover a longer period.
What, how, and where you measure can dramatically influence your conclusions when gathering data. Below are samples of each area to consider.
Latency is often the killer when it comes to performance. For viewing normalized performance metrics for VMware vSAN, vCenter Server is a good place to spot possible performance bottlenecks and issues. There are several places where you can easily view this type of performance data from a vSAN perspective.
VMware vSAN Performance:
You can monitor vSAN performance at multiple levels, including cluster, host, and VM.
At the cluster, host, or VM level, navigate to Monitor > vSAN > Performance to view the vSAN performance metrics.
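To illustrate why point-in-time and time-based views can lead to different conclusions, here is a small sketch that summarizes a series of latency samples. It assumes you have exported the samples yourself into a plain list of milliseconds. The function names, the nearest-rank percentile method, and the sample data are my own choices, not something prescribed by the whitepaper.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latency(samples_ms):
    """Return point-in-time and time-based views of the same data."""
    return {
        "latest": samples_ms[-1],           # point-in-time view
        "mean": mean(samples_ms),           # time-based average
        "p95": percentile(samples_ms, 95),  # time-based tail latency
        "max": max(samples_ms),
    }


if __name__ == "__main__":
    # Hypothetical samples: one spike in an otherwise quiet window.
    print(summarize_latency([1.0, 1.0, 20.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]))
```

In the hypothetical data, the latest sample looks healthy while the tail of the window does not, which is the kind of discrepancy that explains why a user complaint and a quick glance at the current graph can disagree.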
Mitigating VMware vSAN Performance Issues
Mitigating issues is included in Step 5. Hopefully once you have gone through the information gathering phase of steps 1-4, you have been able to pinpoint issues that may need to be addressed and mitigated. Now, you are ready to mitigate issues that were found. These include:
VMware vSAN specific:
- Upgrade vSAN to latest version
- Resolve configuration issues
- Assess options for capacity and performance constraints
- Ensure firmware is at latest levels
- Switch gear configuration/class
- Use NIOC for uplinks that are not dedicated
- Jumbo frames
- BIOS settings
- VM storage policies
- Multiple SCSI adapters
- Use Paravirtual controllers
- Increase queue depth of PVSCSI controllers
- New application versions?
- Scale out the application
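For the queue depth and multiple SCSI adapter items in the list above, a rough way to reason about the effect is Little's Law: the number of outstanding I/Os equals throughput multiplied by latency, so the I/O rate a single queue can sustain is bounded by its depth divided by the per-I/O latency. The sketch below is a simplification that ignores every other bottleneck, and the queue depths and latencies in it are illustrative values only, not recommended settings.

```python
def max_iops(queue_depth, latency_ms):
    """Upper bound on IOPS for a given queue depth and per-I/O latency.

    Derived from Little's Law: outstanding I/Os = IOPS * latency.
    """
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return queue_depth * 1000.0 / latency_ms


if __name__ == "__main__":
    # Illustrative values only: compare two queue depths at 2 ms latency.
    for depth in (32, 64):
        print(f"depth {depth}: at most {max_iops(depth, 2.0):.0f} IOPS")
```

If the workload already sits well below this bound, a deeper queue or an additional adapter is unlikely to help, and the other items in the list deserve attention first.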
By following this procedural, step-based approach, identifying the underlying culprit for any performance issue in a vSAN environment becomes much easier.
Troubleshooting vSAN performance need not be difficult, even with the distributed and software-defined nature of the environment. The new whitepaper discussed here breaks troubleshooting vSAN performance into five steps and provides a very good writeup on how to approach it methodically and logically. This helps to take the complexity out of the process, or at least greatly reduce it. Check out the new whitepaper here: https://storagehub.vmware.com/section-assets/troubleshootingvsanperformance