Part 16 – VMware vSphere 8 Nested Home Lab – Configuring VMware vSphere HA

In this 16th part of the VMware vSphere 8 Nested Home Lab series, we will learn to configure VMware vSphere HA.

VMware vSphere HA is a clustering feature that ensures automatic failover of virtual machines (VMs) in case of host failure. It monitors host and VM health, and if a failure is detected, it restarts affected VMs on healthy hosts within the cluster—minimizing disruption and preserving uptime.

How HA Works Behind the Scenes

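Primary-Secondary Architecture: One host is elected as the primary (called the master in older releases) to monitor the cluster and coordinate failovers; the remaining hosts act as secondaries (formerly slaves).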
Heartbeat Monitoring: Hosts exchange heartbeats via management networks and datastores to detect failures.
Isolation Response: If a host loses network connectivity but remains powered on, HA can shut down or restart VMs based on configured policies.
Admission Control: Ensures enough resources are reserved to restart VMs in case of failure.
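
To make the election step in the list above a bit more concrete, here is a minimal Python sketch. It assumes the commonly described rule that the host with the most mounted datastores wins, with ties broken by the lexically highest managed object ID. The host names, MoIDs and the `elect_primary` helper are made up for illustration and are not part of any VMware API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostCandidate:
    name: str             # hypothetical host name
    moid: str             # managed object ID, e.g. "host-1012" (made-up values)
    datastore_count: int  # number of datastores the host has mounted


def elect_primary(candidates):
    """Pick the host with the most mounted datastores.

    Assumption: ties are broken by the lexically highest managed object ID.
    """
    if not candidates:
        raise ValueError("no hosts available for election")
    return max(candidates, key=lambda h: (h.datastore_count, h.moid))


hosts = [
    HostCandidate("esxi01", "host-1010", 2),
    HostCandidate("esxi02", "host-1012", 3),
    HostCandidate("esxi03", "host-1015", 3),
]
print(elect_primary(hosts).name)  # esxi03
```

In a nested lab where every host sees the same datastores, the tie-break is usually what decides the outcome.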

Core Features of vSphere HA

Automatic VM Restart

Detects host failures and restarts affected VMs on healthy hosts.
Ensures minimal disruption to services.

Host Monitoring

Uses heartbeats and network pings to monitor ESXi host health.
Declares a host failed if it stops responding for a set interval.

VM Monitoring

Tracks VM heartbeats via VMware Tools.
Restarts VMs if the guest OS becomes unresponsive.
Can be extended to Application Monitoring for deeper insight.

Admission Control

Reserves resources to guarantee failover capacity.
Policies include:
- Slot-based
- Dedicated failover hosts
- Cluster resource percentage

Proactive HA

Monitors hardware health (e.g., fans, memory, power supply, storage, network).
Migrates VMs off degraded hosts before failure occurs.
Integrates with hardware vendors for predictive alerts.

Isolation Response

Handles cases where a host loses network connectivity but isn’t fully down.
Options include:
- Power off and restart VMs
- Shut down and restart VMs
- Leave VMs powered on

Heartbeat Datastores

Adds redundancy to host monitoring using datastore heartbeats.
Helps distinguish between host failure and network partition.

Failure Detection & Response

Covers multiple failure types:
- Host failure
- Datastore APD (All Paths Down)
- Datastore PDL (Permanent Device Loss)
Configurable responses include issuing events or restarting VMs.

Restart Priorities & Dependencies

Define which VMs restart first after a failure.
Set dependencies so critical services boot before others.
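
As a rough illustration of how priorities and dependencies combine, the sketch below orders a handful of VMs so that every dependency restarts first and, among the VMs that are ready, the higher priority goes first. The VM names and the `restart_order` function are hypothetical, and real vSphere HA works with priority batches and VM group rules rather than this exact algorithm.

```python
import heapq

# Lower rank restarts earlier; the three levels mirror High/Medium/Low.
PRIORITY_RANK = {"High": 0, "Medium": 1, "Low": 2}


def restart_order(priorities, depends_on):
    """Return VM names in restart order.

    priorities: {vm: "High" | "Medium" | "Low"}
    depends_on: {vm: set of VMs that must be restarted before it}
    """
    remaining = {vm: set(depends_on.get(vm, ())) for vm in priorities}
    dependents = {vm: [] for vm in priorities}
    for vm, deps in remaining.items():
        for dep in deps:
            dependents[dep].append(vm)

    # Start with every VM that has no outstanding dependency.
    ready = [(PRIORITY_RANK[priorities[vm]], vm)
             for vm, deps in remaining.items() if not deps]
    heapq.heapify(ready)

    order = []
    while ready:
        _, vm = heapq.heappop(ready)
        order.append(vm)
        for child in dependents[vm]:
            remaining[child].discard(vm)
            if not remaining[child]:
                heapq.heappush(ready, (PRIORITY_RANK[priorities[child]], child))

    if len(order) != len(priorities):
        raise ValueError("circular dependency between VMs")
    return order


priorities = {"dc01": "High", "sql01": "High", "app01": "Medium",
              "web01": "Low", "test01": "Low"}
depends_on = {"sql01": {"dc01"}, "app01": {"sql01"}, "web01": {"app01"}}
print(restart_order(priorities, depends_on))
# ['dc01', 'sql01', 'app01', 'test01', 'web01']
```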

Advanced Options

Fine-tune HA behavior using custom configuration strings.
Useful for complex or non-standard environments.

Pro Tip: Use Proactive HA to detect hardware degradation before failure occurs, allowing for preemptive VM migration.

Best Practices for HA Configuration

Practice	Benefit
Use redundant management networks	Improves failure detection reliability
Enable datastore heartbeats	Helps distinguish between host and network failures
Configure restart priorities	Ensures critical VMs recover first
Monitor VM Tools heartbeats	Enables granular VM-level failover
Regularly test failover scenarios	Validates your HA setup under real conditions

Step-by-Step: Enabling HA in vSphere 8

Let’s walk through the process of activating HA in a vSphere 8 cluster. Log in to vCenter Server at https://vcenter01.virtshinobi.local/ui using admin credentials.

In the Hosts and Clusters view, select Cluster-01. In the right-hand pane, go to the Configure tab and select vSphere Availability under Services. Click Edit.

Toggle vSphere HA to ON and make sure the Enable Host Monitoring toggle is on.

Host Failure Response Options:

These settings determine what vSphere HA does when a host fails or becomes isolated from the cluster.

Option	Behavior
Restart	• Default and most common setting. • When a host fails, HA restarts affected VMs on other healthy hosts in the cluster. • VMs are restarted based on their restart priority (High, Medium, Low). • Ensures minimal downtime and automatic recovery.
Disabled	• Turns off host monitoring. • No VMs are restarted if a host fails. • Useful during maintenance or when HA is not required.

Keep the default option, Restart VMs, and move on to the next section.

Host Isolation Response Options:

This kicks in when a host loses management network connectivity but is still running.

Option	Behavior
Disabled	• VMs continue running on the isolated host. • No restart occurs unless the host completely fails. • Reduces false positives but risks split-brain scenarios if the host is truly isolated.
Power Off and Restart VMs	• VMs are forcibly powered off on the isolated host. • HA restarts them on another host. • Fast recovery, but may risk data loss if VMs weren’t gracefully shut down.
Shut Down and Restart VMs	• VMs are shut down gracefully using VMware Tools. • If shutdown isn’t completed within a timeout (default: 300 seconds), they’re powered off. • Preserves VM state and reduces risk of corruption.

Select the default option Disabled and move on to the next section.
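
The three isolation responses can be summed up in a few lines of Python. This is only a sketch of the behavior described in the table above: the `isolation_actions` function and its string values are invented for illustration, and the 300 second default is the shutdown timeout mentioned earlier.

```python
def isolation_actions(response, graceful_shutdown_secs=None, shutdown_timeout_secs=300):
    """Return the sequence of actions taken for one VM on an isolated host.

    graceful_shutdown_secs: how long the guest needs to shut down, or None
    if it never completes (for example, VMware Tools is not running).
    """
    if response == "disabled":
        return ["leave powered on"]
    if response == "power_off_and_restart":
        return ["power off", "restart on another host"]
    if response == "shut_down_and_restart":
        finished_in_time = (graceful_shutdown_secs is not None
                            and graceful_shutdown_secs <= shutdown_timeout_secs)
        if finished_in_time:
            return ["guest shutdown", "restart on another host"]
        # Graceful shutdown did not finish, so fall back to a hard power-off.
        return ["guest shutdown", "power off after timeout", "restart on another host"]
    raise ValueError(f"unknown isolation response: {response!r}")


print(isolation_actions("shut_down_and_restart", graceful_shutdown_secs=45))
# ['guest shutdown', 'restart on another host']
print(isolation_actions("shut_down_and_restart", graceful_shutdown_secs=None))
# ['guest shutdown', 'power off after timeout', 'restart on another host']
```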

Datastore with PDL Options:

Permanent Device Loss (PDL) occurs when a storage device (like a LUN) becomes permanently inaccessible to an ESXi host. This typically happens due to:

Hardware failure
Improper zoning or masking
Manual removal of a device

The storage array sends SCSI sense codes to indicate the device is gone. Once received, the ESXi host:

Stops all I/O to the device
Marks the device as lost
Closes VM I/O sessions

With VMCP (VM Component Protection) enabled, vSphere HA can detect PDL and take action based on the settings below:

Option	Behavior
Disabled	No action taken. VMs may hang or crash.
Issue Events	Alerts are generated, but VMs are not restarted.
Power Off and Restart VMs	Affected VMs are powered off and restarted on healthy hosts with access to the datastore.

Best Practice: Always set PDL response to Power Off and Restart VMs to ensure recovery and uptime continuity.

Datastore with APD Options:

All Paths Down (APD) occurs when an ESXi host loses all access paths to a storage device, but the device doesn’t report a permanent failure. It is treated as a transient condition, typically caused by network hiccups, SAN misconfigurations, or temporary outages.

Unlike Permanent Device Loss (PDL), APD might resolve itself. That’s why VMware gives you nuanced control over how HA responds.

To protect VMs during APD events, you enable VM Component Protection (VMCP) in your cluster settings.

APD Response Options:

Option	Behavior
Disabled	No action taken. VMs may hang or become unresponsive.
Issue Events	Alerts are generated, but VMs are not restarted.
Power Off and Restart VMs – Conservative	VMs are restarted only if another host with datastore access is available.
Power Off and Restart VMs – Aggressive	VMs are restarted even if HA can’t confirm another host has access. Riskier, but faster recovery.

Select the default option Power Off and Restart VMs – Conservative Restart Policy and move on to the next section.
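
To compare the PDL and APD responses side by side, here is a small decision function that follows the two option tables above. It is a simplified model, not VMCP itself: the `vmcp_action` name, the option strings, and the three-valued `other_host_has_access` flag (True, False, or None for "unknown") are assumptions made for this example.

```python
def vmcp_action(failure, response, other_host_has_access=None):
    """Return what happens to an affected VM for a storage failure.

    failure: "PDL" or "APD"
    response: "disabled", "issue_events", "restart" (PDL only),
              "restart_conservative" or "restart_aggressive" (APD only)
    other_host_has_access: True, False, or None when HA cannot tell.
    """
    if failure not in ("PDL", "APD"):
        raise ValueError(f"unknown failure type: {failure!r}")
    if response == "disabled":
        return "no action"
    if response == "issue_events":
        return "issue event"
    if failure == "PDL" and response == "restart":
        return "power off and restart"
    if failure == "APD" and response == "restart_conservative":
        # Conservative: act only when another host is known to have access.
        if other_host_has_access is True:
            return "power off and restart"
        return "leave running"
    if failure == "APD" and response == "restart_aggressive":
        # Aggressive: act even when access elsewhere cannot be confirmed.
        return "power off and restart"
    raise ValueError(f"response {response!r} is not valid for {failure}")


print(vmcp_action("APD", "restart_conservative", other_host_has_access=None))
# leave running
print(vmcp_action("APD", "restart_aggressive", other_host_has_access=None))
# power off and restart
```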

VM Monitoring Options:

In VMware vSphere HA, VM Monitoring is a powerful feature that goes beyond host-level protection by watching individual virtual machines for signs of failure.

How VM Monitoring Works

Heartbeat Detection: Uses VMware Tools to detect if the guest OS is responsive.
I/O Activity Check: If heartbeats fail, it checks for disk I/O to avoid false positives.
Reset Logic: If both heartbeat and I/O are absent, the VM is reset.
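
The three checks above boil down to a simple "both signals must be missing" rule. The sketch below shows that rule in Python. The function name is made up, and the 120 second I/O window is an assumption based on the commonly cited default of the `das.iostatsinterval` advanced option.

```python
def should_reset_vm(secs_since_heartbeat, secs_since_io,
                    failure_interval=30, io_window=120):
    """Decide whether VM Monitoring would reset a VM.

    The VM is reset only when the VMware Tools heartbeat has been missing
    for at least failure_interval seconds AND no disk or network I/O has
    been seen for io_window seconds (assumed default: 120).
    """
    heartbeat_lost = secs_since_heartbeat >= failure_interval
    io_idle = secs_since_io >= io_window
    return heartbeat_lost and io_idle


print(should_reset_vm(secs_since_heartbeat=45, secs_since_io=5))    # False
print(should_reset_vm(secs_since_heartbeat=45, secs_since_io=300))  # True
```

The first call shows the false-positive guard: the heartbeat is gone, but recent I/O suggests the guest is still doing work.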

Application Monitoring

Requires integration via SDK or supported apps.
Monitors app-specific heartbeats and restarts the VM if the app fails.

Option	Behavior
Disabled	No VM or application monitoring.
VM Monitoring Only	Monitors VM heartbeats via VMware Tools.
VM and Application Monitoring	Adds application-level heartbeat checks (requires SDK or supported apps)

You can configure this setting to your liking. We will keep the default value of Disabled and move on to the next section.

VM Monitoring Sensitivity:

In VMware vSphere HA, VM Monitoring sensitivity settings determine how aggressively the system responds to signs of VM failure. You can choose from preset levels or define custom thresholds to fine-tune behavior.

Preset Sensitivity Levels

These are quick options available via a slider in the vSphere Client:

Sensitivity	Failure Interval	Minimum Uptime	Max Resets	Reset Time Window
Low	120 sec	480 sec	3	7 days
Medium	60 sec	240 sec	3	24 hours
High	30 sec	120 sec	3	1 hour

Failure Interval: Time without heartbeat or I/O before VM is considered failed.
Minimum Uptime: Delay after VM boots before monitoring starts.
Max Resets: Limits how often a VM can be restarted.
Reset Time Window: Timeframe for counting resets.
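
Here is a small Python sketch of how the minimum uptime, max resets and reset time window from the table interact. The preset values are copied from the table above; the `Sensitivity` class and the `may_reset` helper are hypothetical and only model the counting logic.

```python
from dataclasses import dataclass

HOUR = 3600


@dataclass(frozen=True)
class Sensitivity:
    failure_interval: int  # seconds without heartbeat or I/O
    min_uptime: int        # seconds after boot before monitoring starts
    max_resets: int        # resets allowed inside the window
    reset_window: int      # seconds


PRESETS = {
    "Low": Sensitivity(120, 480, 3, 7 * 24 * HOUR),
    "Medium": Sensitivity(60, 240, 3, 24 * HOUR),
    "High": Sensitivity(30, 120, 3, 1 * HOUR),
}


def may_reset(preset, now, uptime, past_resets):
    """Return True if another reset is still allowed for this VM.

    now and past_resets are timestamps in seconds; uptime is in seconds.
    """
    s = PRESETS[preset]
    if uptime < s.min_uptime:
        return False  # monitoring has not started yet
    recent = [t for t in past_resets if now - t < s.reset_window]
    return len(recent) < s.max_resets


# Three resets in the last hour: the High preset stops resetting the VM.
print(may_reset("High", now=10_000, uptime=600, past_resets=[9_000, 9_500, 9_900]))  # False
# One of the resets is older than the window, so one more is allowed.
print(may_reset("High", now=10_000, uptime=600, past_resets=[1_000, 9_500, 9_900]))  # True
```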

Custom Sensitivity Settings:

If presets don’t suit your environment, you can manually configure:

Failure Interval (e.g., 45 seconds)
Minimum Uptime (e.g., 180 seconds)
Maximum per-VM resets (e.g., 5)
Maximum resets time window (e.g., 12 hours)

This is ideal for workloads with unique responsiveness or recovery needs.

Leave the default settings and click the Admission Control tab to move on to the next section.

Admission Control:

HA Admission Control ensures that the cluster reserves enough resources to restart VMs in case of host failure. It’s a safeguard against overcommitting resources and losing availability guarantees.

Admission Control Policies

Policy Name	Description
Cluster Resource Percentage (Default)	• Reserves a percentage of CPU and memory for failover. • Automatically adjusts based on the number of tolerated host failures. • You can override the calculated percentage if needed.
Slot Policy	• Calculates slot size based on VM reservations. • Determines how many VMs can be restarted based on available slots. • Best for clusters with uniform VM resource reservations.
Dedicated Failover Hosts	• Assigns specific hosts for failover only. • These hosts don’t run VMs during normal operation. • Rarely used due to inefficiency.
Disabled	• No resource reservation for failover. • VMs can power on even if availability constraints are violated. • Not recommended for production environments.
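
For the Slot Policy row, a worked example helps. The sketch below assumes the usual description of the calculation: the slot is sized by the largest CPU reservation (with a 32 MHz floor) and the largest memory reservation plus overhead, each host holds as many slots as its scarcer resource allows, and the hosts with the most slots are assumed to fail. All numbers and class names are made up for the lab.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    cpu_reservation_mhz: int
    mem_reservation_mb: int
    mem_overhead_mb: int  # assumed to be greater than zero


@dataclass(frozen=True)
class Host:
    cpu_mhz: int
    mem_mb: int


DEFAULT_CPU_SLOT_MHZ = 32  # assumed floor when no VM has a CPU reservation


def slot_size(vms):
    """Slot = largest CPU reservation and largest memory reservation + overhead."""
    cpu = max([vm.cpu_reservation_mhz for vm in vms] + [DEFAULT_CPU_SLOT_MHZ])
    mem = max(vm.mem_reservation_mb + vm.mem_overhead_mb for vm in vms)
    return cpu, mem


def slots_per_host(host, slot):
    cpu_slot, mem_slot = slot
    return min(host.cpu_mhz // cpu_slot, host.mem_mb // mem_slot)


def can_tolerate(hosts, vms, host_failures):
    """True if the surviving hosts still have one slot per powered-on VM."""
    slot = slot_size(vms)
    counts = sorted((slots_per_host(h, slot) for h in hosts), reverse=True)
    surviving = sum(counts[host_failures:])  # worst case: biggest hosts fail
    return surviving >= len(vms)


hosts = [Host(8000, 16384)] * 3
vms = [VM(500, 1024, 100), VM(1000, 2048, 120), VM(0, 0, 80)]
print(slot_size(vms))                 # (1000, 2168)
print(slots_per_host(hosts[0], slot_size(vms)))  # 7
print(can_tolerate(hosts, vms, 1))    # True
```

One large reservation inflates the slot for every VM, which is why this policy suits clusters with uniform reservations.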

Additional Settings:

Host Failures Cluster Tolerates: Defines how many host failures the cluster can recover from.
Performance Degradation Tolerance: Sets how much performance drop is acceptable during failover. A warning is raised when the expected degradation exceeds this value (e.g., 0% = warn on any degradation, 100% = never warn).
Override Calculated Failover Capacity: Lets you manually set CPU/memory reservation percentages.
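
To see where the calculated percentage comes from, here is a back-of-the-envelope sketch. It assumes the reserve is sized for the worst case, in which the largest hosts are the ones that fail; the exact calculation vSphere performs may differ, and `failover_reserve_percent` is a hypothetical helper.

```python
def failover_reserve_percent(host_capacities, failures_to_tolerate):
    """Share of total capacity (in percent) to reserve for failover.

    host_capacities: one number per host (e.g. GHz or GB), same unit for all.
    Assumption: the largest hosts are the ones that fail.
    """
    if not 0 < failures_to_tolerate < len(host_capacities):
        raise ValueError("failures_to_tolerate must be between 1 and hosts - 1")
    largest = sorted(host_capacities, reverse=True)[:failures_to_tolerate]
    return 100 * sum(largest) / sum(host_capacities)


# Four identical hosts, one failure tolerated.
print(failover_reserve_percent([100, 100, 100, 100], 1))  # 25.0
# Three identical hosts, as in a small nested lab: about one third.
print(round(failover_reserve_percent([100, 100, 100], 1), 1))  # 33.3
```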

Configure the settings as per the screenshots below and click on the Heartbeat Datastores tab to continue.

Heartbeat Datastores:

Heartbeat Datastores are a critical part of VMware HA’s ability to distinguish between a host that’s truly down and one that’s just network-isolated. When the management network fails, datastore heartbeats act as a backup signal to help HA make smarter decisions.

What Are Heartbeat Datastores?

Used when network heartbeats are lost.
Help the primary (master) host determine if a secondary (slave) host is still alive.
Prevent unnecessary VM restarts due to false host failure detection.
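
The decision described in the list above can be sketched as a small classifier. This is a simplified model of how the signals are combined; the state names follow the terms used in vSphere HA, but the `classify_host` function and its parameters are invented for illustration.

```python
def classify_host(network_heartbeat, datastore_heartbeat,
                  responds_to_ping, declared_isolated=False):
    """Classify a host from the point of view of the coordinating host.

    declared_isolated: the host itself reported that it cannot reach its
    isolation addresses (assumption: this is known via the datastore).
    """
    if network_heartbeat:
        return "live"
    if not datastore_heartbeat and not responds_to_ping:
        return "failed"            # no sign of life on any channel
    if datastore_heartbeat and declared_isolated:
        return "isolated"          # alive, but cut off from the network
    if datastore_heartbeat:
        return "partitioned"       # alive, but unreachable from this host
    return "agent unreachable"     # answers pings, HA agent is silent


print(classify_host(False, False, False))                         # failed
print(classify_host(False, True, False, declared_isolated=True))  # isolated
print(classify_host(False, True, False))                          # partitioned
```

Only the first result should trigger a VM restart; in the other two cases the datastore heartbeat shows that the host and its VMs are still running.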

Configuration Options

Option	Description
Automatically select datastores accessible from the host	vSphere HA picks shared datastores available to all hosts.
Use datastores only from the specified list	You manually select datastores. HA won’t use others even if these fail.
Use datastores from the specified list and complement automatically if needed	Preferred datastores are used, but HA can fall back to others if needed.

Best Practices

Ensure at least two shared datastores are accessible by all hosts.
Avoid using only one datastore—this can trigger HA warnings or errors.
If you must suppress warnings (not recommended), use the advanced option:
das.ignoreInsufficientHbDatastore = true
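
Before picking heartbeat datastores, it can help to check which ones every host can actually see. The snippet below is a plain Python sketch of that check. The host names and the local datastore names are made up; `iSCSI_DS01` and `iSCSI_DS02` are the two shared datastores used in this lab.

```python
def shared_datastores(host_datastores):
    """Return the datastores that are mounted on every host."""
    sets = [set(ds) for ds in host_datastores.values()]
    return set.intersection(*sets) if sets else set()


def heartbeat_check(host_datastores, minimum=2):
    """Return (ok, shared) where ok is True if enough shared datastores exist."""
    shared = sorted(shared_datastores(host_datastores))
    return len(shared) >= minimum, shared


inventory = {
    "esxi01": ["local-esxi01", "iSCSI_DS01", "iSCSI_DS02"],
    "esxi02": ["local-esxi02", "iSCSI_DS01", "iSCSI_DS02"],
    "esxi03": ["local-esxi03", "iSCSI_DS01", "iSCSI_DS02"],
}
print(heartbeat_check(inventory))
# (True, ['iSCSI_DS01', 'iSCSI_DS02'])
```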

Select the option Use datastores from the specified list and complement automatically if needed, select the two datastores iSCSI_DS01 and iSCSI_DS02 and click OK. Monitor the progress in the Recent Tasks pane at the bottom.

We have not configured any Advanced Options. Feel free to explore the Advanced Isolation Settings below and configure them to your liking.

Advanced Isolation Settings

You can fine-tune behavior using HA advanced options:

Option	Description
`das.isolationaddressX`	Specifies alternate IPs to test for isolation (e.g., gateway, DNS).
`das.isolationshutdowntimeout`	Timeout for graceful shutdown before forced power-off.
`das.config.fdm.isolationPolicyDelaySec`	Delay before executing isolation response.
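
If you want to prepare these options before typing them into the Advanced Options tab, a small helper like the one below can build the key/value pairs. It is a sketch only: the `isolation_options` function and the example IP addresses are hypothetical, and it assumes that `das.isolationaddressX` accepts indexes 0 to 9 and that `das.usedefaultisolationaddress` controls whether the default gateway is also tested.

```python
import ipaddress


def isolation_options(addresses, use_default_gateway=False, shutdown_timeout_secs=300):
    """Build a dict of HA advanced options for isolation detection."""
    if len(addresses) > 10:
        raise ValueError("at most 10 isolation addresses (indexes 0 to 9)")
    options = {}
    for index, addr in enumerate(addresses):
        ipaddress.ip_address(addr)  # raises ValueError on a malformed address
        options[f"das.isolationaddress{index}"] = addr
    options["das.usedefaultisolationaddress"] = str(use_default_gateway).lower()
    options["das.isolationshutdowntimeout"] = str(shutdown_timeout_secs)
    return options


for key, value in isolation_options(["192.168.1.1", "192.168.1.10"]).items():
    print(f"{key} = {value}")
# das.isolationaddress0 = 192.168.1.1
# das.isolationaddress1 = 192.168.1.10
# das.usedefaultisolationaddress = false
# das.isolationshutdowntimeout = 300
```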

Once the task is complete, you should see that vSphere HA is enabled, along with a summary of the configuration options we selected.

That’s it for VMware vSphere HA. In the next part of the series we will configure VMware Distributed Resource Scheduler or DRS. So stay tuned.

Discover more from VirtShinobi.blog
