Part 16 – VMware vSphere 8 Nested Home Lab – Configuring VMware vSphere HA

In this 16th part of the VMware vSphere 8 Nested Home Lab series, we will learn to configure VMware vSphere HA.

VMware vSphere HA is a clustering feature that ensures automatic failover of virtual machines (VMs) in case of host failure. It monitors host and VM health, and if a failure is detected, it restarts affected VMs on healthy hosts within the cluster—minimizing disruption and preserving uptime.

How HA Works Behind the Scenes

  • Primary-Secondary Architecture: One host is elected as the primary (called the master before vSphere 7) to monitor the cluster and coordinate failovers. The remaining hosts act as secondaries.
  • Heartbeat Monitoring: Hosts exchange heartbeats via management networks and datastores to detect failures.
  • Isolation Response: If a host loses network connectivity but remains powered on, HA can leave its VMs running, or shut them down or power them off and restart them on another host, depending on the configured policy.
  • Admission Control: Ensures enough resources are reserved to restart VMs in case of failure.

Core Features of vSphere HA

Automatic VM Restart

  • Detects host failures and restarts affected VMs on healthy hosts.
  • Ensures minimal disruption to services.

Host Monitoring

  • Uses heartbeats and network pings to monitor ESXi host health.
  • Declares a host failed if it does not respond within a set interval.

VM Monitoring

  • Tracks VM heartbeats via VMware Tools.
  • Restarts VMs if the guest OS becomes unresponsive.
  • Can be extended to Application Monitoring for deeper insight.

Admission Control

  • Reserves resources to guarantee failover capacity.
  • Policies include:
    • Slot-based
    • Dedicated failover hosts
    • Cluster resource percentage

Proactive HA

  • Monitors hardware health (e.g., CPU, memory, power supply).
  • Migrates VMs off degraded hosts before failure occurs.
  • Integrates with hardware vendors for predictive alerts.

Isolation Response

  • Handles cases where a host loses network connectivity but isn’t fully down.
  • Options include:
    • Power off and restart VMs
    • Shut down and restart VMs
    • Leave VMs powered on

Heartbeat Datastores

  • Adds redundancy to host monitoring using datastore heartbeats.
  • Helps distinguish between host failure and network partition.

Failure Detection & Response

  • Covers multiple failure types:
    • Host failure
    • Datastore APD (All Paths Down)
    • Datastore PDL (Permanent Device Loss)
  • Configurable responses include issuing events or restarting VMs.

Restart Priorities & Dependencies

  • Define which VMs restart first after a failure.
  • Set dependencies so critical services boot before others.
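
As a rough illustration of restart priorities, the sketch below sorts a list of VMs so that higher-priority VMs come first. It is only a model of the ordering idea, not how vSphere HA is implemented. The VM names are hypothetical, and the five priority labels are assumed to match the ones offered in the vSphere Client.

```python
from dataclasses import dataclass

# Priority labels, highest first (assumed to match the vSphere Client).
PRIORITY_ORDER = ["Highest", "High", "Medium", "Low", "Lowest"]


@dataclass
class VM:
    name: str
    priority: str = "Medium"


def restart_order(vms):
    """Return VM names sorted so that higher-priority VMs come first."""
    rank = {label: index for index, label in enumerate(PRIORITY_ORDER)}
    # sorted() is stable, so VMs with equal priority keep their input order.
    return [vm.name for vm in sorted(vms, key=lambda vm: rank[vm.priority])]


# Hypothetical VMs for illustration only.
vms = [VM("web01", "Low"), VM("db01", "Highest"), VM("app01", "Medium")]
print(restart_order(vms))
```

Dependencies between VMs would need an extra ordering step on top of this, which the sketch leaves out.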

Advanced Options

  • Fine-tune HA behavior using custom configuration strings.
  • Useful for complex or non-standard environments.

Pro Tip: Use Proactive HA to detect hardware degradation before failure occurs, allowing for preemptive VM migration.

Best Practices for HA Configuration

Practice | Benefit
Use redundant management networks | Improves failure detection reliability
Enable datastore heartbeats | Helps distinguish between host and network failures
Configure restart priorities | Ensures critical VMs recover first
Monitor VMware Tools heartbeats | Enables granular VM-level failover
Regularly test failover scenarios | Validates your HA setup under real conditions

Step-by-Step: Enabling HA in vSphere 8

Let’s walk through the process of activating HA in a vSphere 8 cluster. Log in to vCenter Server at https://vcenter01.virtshinobi.local/ui using admin credentials.

In the Hosts and Clusters view, select Cluster-01. In the right-hand pane, open the Configure tab and select vSphere Availability under Services. Click Edit.

Toggle vSphere HA to ON and make sure the Enable Host Monitoring toggle is switched on.

Host Failure Response Options:

These settings determine what vSphere HA does when a host fails or becomes isolated from the cluster.

Option | Behavior
Restart | Default and most common setting. When a host fails, HA restarts affected VMs on other healthy hosts in the cluster. VMs are restarted in order of their restart priority (Lowest, Low, Medium, High, Highest). Ensures minimal downtime and automatic recovery.
Disabled | Turns off host monitoring. No VMs are restarted if a host fails. Useful during maintenance or when HA is not required.

Select the default option, Restart VMs, and move on to the next section.

Host Isolation Response Options:

This kicks in when a host loses management network connectivity but is still running.

Option | Behavior
Disabled | VMs continue running on the isolated host. No restart occurs unless the host completely fails. Reduces false positives but risks split-brain scenarios if the host is truly isolated.
Power Off and Restart VMs | VMs are forcibly powered off on the isolated host. HA restarts them on another host. Fast recovery, but may risk data loss if VMs weren’t gracefully shut down.
Shut Down and Restart VMs | VMs are shut down gracefully using VMware Tools. If shutdown isn’t completed within a timeout (default: 300 seconds), they’re powered off. Preserves VM state and reduces risk of corruption.
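
To make the three isolation responses easier to compare, here is a minimal Python sketch of the decision they describe. The response labels and action strings are hypothetical names chosen for this example, and the 300-second default mirrors the timeout mentioned above. It models the behavior only and does not call any vSphere API.

```python
def isolation_actions(response, graceful_shutdown_secs=None, timeout_secs=300):
    """Return the ordered actions for VMs on an isolated host.

    response: "disabled", "power_off", or "shut_down" (hypothetical labels).
    graceful_shutdown_secs: time the guest needs to shut down, or None if the
    guest shutdown never completes.
    """
    if response == "disabled":
        return ["leave powered on"]
    if response == "power_off":
        return ["power off", "restart on another host"]
    if response == "shut_down":
        finished = (
            graceful_shutdown_secs is not None
            and graceful_shutdown_secs <= timeout_secs
        )
        if finished:
            return ["guest shutdown", "restart on another host"]
        # The guest did not shut down in time: fall back to a hard power-off.
        return ["guest shutdown", "power off", "restart on another host"]
    raise ValueError(f"unknown isolation response: {response}")


print(isolation_actions("shut_down", graceful_shutdown_secs=45))
print(isolation_actions("shut_down", graceful_shutdown_secs=None))
```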

Select the default option Disabled and move on to the next section.

Datastore with PDL Options:

Permanent Device Loss (PDL) occurs when a storage device (like a LUN) becomes permanently inaccessible to an ESXi host. This typically happens due to:

  • Hardware failure
  • Improper zoning or masking
  • Manual removal of a device

The storage array sends SCSI sense codes to indicate the device is gone. Once received, the ESXi host:

  • Stops all I/O to the device
  • Marks the device as lost
  • Closes VM I/O sessions

With VMCP (VM Component Protection) enabled, vSphere HA can detect PDL and take action based on the settings below:

Option | Behavior
Disabled | No action taken. VMs may hang or crash.
Issue Events | Alerts are generated, but VMs are not restarted.
Power Off and Restart VMs | Affected VMs are powered off and restarted on healthy hosts with access to the datastore.

Best Practice: Always set PDL response to Power Off and Restart VMs to ensure recovery and uptime continuity.

Datastore with APD Options:

All Paths Down (APD) occurs when an ESXi host loses all access paths to a storage device, but the device doesn’t report a permanent failure. It is usually a transient condition caused by, for example, network interruptions, SAN misconfigurations, or temporary outages.

Unlike Permanent Device Loss (PDL), APD might resolve itself. That’s why VMware gives you nuanced control over how HA responds.

To protect VMs during APD events, you enable VM Component Protection (VMCP) in your cluster settings.

APD Response Options:

Option | Behavior
Disabled | No action taken. VMs may hang or become unresponsive.
Issue Events | Alerts are generated, but VMs are not restarted.
Power Off and Restart VMs – Conservative | VMs are restarted only if another host with datastore access is available.
Power Off and Restart VMs – Aggressive | VMs are restarted even if HA can’t confirm another host has access. Riskier, but faster recovery.
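
The difference between the conservative and aggressive policies comes down to one condition, which the following sketch tries to capture. The policy labels and return values are hypothetical, and the single boolean is a simplification of whatever checks HA really performs.

```python
def apd_action(policy, access_confirmed):
    """Return the action for a VM hit by APD, per the options above.

    policy: "disabled", "issue_events", "conservative", or "aggressive"
            (hypothetical labels).
    access_confirmed: True if HA has confirmed that another host can reach
            the datastore.
    """
    if policy == "disabled":
        return "no action"
    if policy == "issue_events":
        return "issue event"
    if policy == "conservative":
        # Restart only when another host is known to have datastore access.
        return "power off and restart" if access_confirmed else "leave running"
    if policy == "aggressive":
        # Restart even without confirmation that another host has access.
        return "power off and restart"
    raise ValueError(f"unknown APD policy: {policy}")


for policy in ("conservative", "aggressive"):
    print(policy, "->", apd_action(policy, access_confirmed=False))
```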

Select the default option Power Off and Restart VMs – Conservative Restart Policy and move on to the next section.

VM Monitoring Options:

In VMware vSphere HA, VM Monitoring is a powerful feature that goes beyond host-level protection by watching individual virtual machines for signs of failure.

How VM Monitoring Works

  • Heartbeat Detection: Uses VMware Tools to detect if the guest OS is responsive.
  • I/O Activity Check: If heartbeats fail, it checks for disk I/O to avoid false positives.
  • Reset Logic: If both heartbeat and I/O are absent, the VM is reset.
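
The two-stage check above can be sketched as a small function. The 30-second failure interval matches the High sensitivity preset described later in this post. The 120-second I/O window is an assumption made for this example, so verify it against your own environment.

```python
def vm_needs_reset(secs_since_heartbeat, secs_since_io,
                   failure_interval=30, io_window=120):
    """Return True only when both the heartbeat and the I/O check fail.

    failure_interval: seconds without a VMware Tools heartbeat before the
        VM is suspected to have failed.
    io_window: seconds without I/O before the VM counts as idle
        (assumed value for this sketch).
    """
    heartbeat_lost = secs_since_heartbeat >= failure_interval
    io_idle = secs_since_io >= io_window
    # Recent I/O means the guest is probably still working, so no reset.
    return heartbeat_lost and io_idle


print(vm_needs_reset(secs_since_heartbeat=45, secs_since_io=300))
print(vm_needs_reset(secs_since_heartbeat=45, secs_since_io=10))
```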

Application Monitoring

  • Requires integration via SDK or supported apps.
  • Monitors app-specific heartbeats and restarts the VM if the app fails.

Option | Behavior
Disabled | No VM or application monitoring.
VM Monitoring Only | Monitors VM heartbeats via VMware Tools.
VM and Application Monitoring | Adds application-level heartbeat checks (requires SDK or supported apps).

You can configure this setting to your liking. We will keep the default value of Disabled and move on to the next section.

VM Monitoring Sensitivity:

In VMware vSphere HA, VM Monitoring sensitivity settings determine how aggressively the system responds to signs of VM failure. You can choose from preset levels or define custom thresholds to fine-tune behavior.

Preset Sensitivity Levels

These are quick options available via a slider in the vSphere Client:

Sensitivity | Failure Interval | Minimum Uptime | Max Resets | Reset Time Window
Low | 120 sec | 480 sec | 3 | 7 days
Medium | 60 sec | 240 sec | 3 | 24 hours
High | 30 sec | 120 sec | 3 | 1 hour

  • Failure Interval: Time without heartbeat or I/O before VM is considered failed.
  • Minimum Uptime: Delay after VM boots before monitoring starts.
  • Max Resets: Limits how often a VM can be restarted.
  • Reset Time Window: Timeframe for counting resets.
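
To show how Max Resets and Reset Time Window work together, the sketch below encodes the three presets and checks whether another reset would still be allowed. The data structure and function names are hypothetical. Only the numbers come from the preset table.

```python
from dataclasses import dataclass

HOUR = 3600  # seconds


@dataclass(frozen=True)
class Sensitivity:
    failure_interval: int  # seconds without heartbeat or I/O
    min_uptime: int        # seconds after boot before monitoring starts
    max_resets: int        # resets allowed inside the window
    reset_window: int      # window length in seconds


PRESETS = {
    "Low": Sensitivity(120, 480, 3, 7 * 24 * HOUR),
    "Medium": Sensitivity(60, 240, 3, 24 * HOUR),
    "High": Sensitivity(30, 120, 3, 1 * HOUR),
}


def reset_allowed(preset, past_reset_times, now):
    """Return True if one more reset at `now` stays within the preset limit.

    past_reset_times and now are timestamps in seconds.
    """
    settings = PRESETS[preset]
    recent = [t for t in past_reset_times if now - t < settings.reset_window]
    return len(recent) < settings.max_resets


# Three resets in the first 20 minutes (hypothetical timestamps).
resets = [0, 600, 1200]
print(reset_allowed("High", resets, now=1800))
print(reset_allowed("High", resets, now=3700))
```

With the High preset, the first call is inside the one-hour window for all three earlier resets, while the second call falls after the oldest reset has aged out of the window.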

Custom Sensitivity Settings:

If presets don’t suit your environment, you can manually configure:

  • Failure Interval (e.g., 45 seconds)
  • Minimum Uptime (e.g., 180 seconds)
  • Maximum per-VM resets (e.g., 5)
  • Maximum resets time window (e.g., 12 hours)

This is ideal for workloads with unique responsiveness or recovery needs.

Leave the default settings and click the Admission Control tab to move on to the next section.

Admission Control:

HA Admission Control ensures that the cluster reserves enough resources to restart VMs in case of host failure. It’s a safeguard against overcommitting resources and losing availability guarantees.

Admission Control Policies

Policy Name | Description
Cluster Resource Percentage (Default) | Reserves a percentage of CPU and memory for failover. Automatically adjusts based on the number of tolerated host failures. You can override the calculated percentage if needed.
Slot Policy | Calculates slot size based on VM reservations. Determines how many VMs can be restarted based on available slots. Best for clusters with uniform VM resource reservations.
Dedicated Failover Hosts | Assigns specific hosts for failover only. These hosts don’t run VMs during normal operation. Rarely used due to inefficiency.
Disabled | No resource reservation for failover. VMs can power on even if availability constraints are violated. Not recommended for production environments.
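
The Slot Policy is easier to follow with numbers, so here is a simplified sketch of the arithmetic. It assumes the slot size is the largest CPU and memory reservation among the VMs, ignores memory overhead, and uses hypothetical minimum values and host sizes. Real clusters will produce different figures.

```python
def slot_size(reservations, min_cpu_mhz=32, min_mem_mb=100):
    """Return (cpu_mhz, mem_mb) for one slot.

    reservations: list of (cpu_mhz, mem_mb) tuples, one per VM.
    The minimums are placeholder values used when no VM has a reservation.
    """
    cpu = max([r[0] for r in reservations] + [min_cpu_mhz])
    mem = max([r[1] for r in reservations] + [min_mem_mb])
    return cpu, mem


def slots_per_host(host_cpu_mhz, host_mem_mb, slot):
    """A host holds as many slots as its scarcer resource allows."""
    slot_cpu, slot_mem = slot
    return min(host_cpu_mhz // slot_cpu, host_mem_mb // slot_mem)


def usable_slots(hosts, slot, failures_to_tolerate=1):
    """Slots left if the hosts with the most slots are the ones that fail."""
    per_host = sorted(slots_per_host(cpu, mem, slot) for cpu, mem in hosts)
    surviving = per_host[:len(per_host) - failures_to_tolerate]
    return sum(surviving)


# Hypothetical reservations and three identical hosts.
slot = slot_size([(500, 1024), (1000, 2048), (0, 0)])
hosts = [(8000, 16384)] * 3
print(slot, usable_slots(hosts, slot))
```

A single VM with a large reservation inflates the slot size for every VM, which is one reason this policy suits clusters with uniform reservations.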

Additional Settings:

  • Host Failures Cluster Tolerates: Defines how many host failures the cluster can recover from.
  • Performance Degradation Tolerance: Sets how much performance drop is acceptable during failover. HA raises a warning when the expected drop exceeds this percentage (0% warns on any degradation, 100% turns the warning off).
  • Override Calculated Failover Capacity: Lets you manually set CPU/memory reservation percentages.
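
As a back-of-the-envelope check for the Cluster Resource Percentage policy, the sketch below assumes all hosts are the same size, so the reserved share is simply the tolerated failures divided by the host count. The host counts and MHz figures are hypothetical, and the real calculation in vSphere may differ for unevenly sized hosts.

```python
def reserved_percent(host_count, failures_to_tolerate):
    """Percentage of cluster capacity held back, assuming identical hosts."""
    if not 0 < failures_to_tolerate < host_count:
        raise ValueError("failures_to_tolerate must be between 1 and host_count - 1")
    return 100.0 * failures_to_tolerate / host_count


def admits(total_mhz, reserved_by_vms_mhz, new_vm_mhz, percent):
    """Return True if a new VM reservation fits outside the failover reserve."""
    usable = total_mhz * (1 - percent / 100.0)
    return reserved_by_vms_mhz + new_vm_mhz <= usable


percent = reserved_percent(host_count=4, failures_to_tolerate=1)
print(percent)
print(admits(total_mhz=32000, reserved_by_vms_mhz=20000, new_vm_mhz=3000, percent=percent))
print(admits(total_mhz=32000, reserved_by_vms_mhz=20000, new_vm_mhz=5000, percent=percent))
```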

Configure the settings as per the screenshots below and click on the Heartbeat Datastores tab to continue.

Heartbeat Datastores:

Heartbeat Datastores are a critical part of VMware HA’s ability to distinguish between a host that’s truly down and one that’s just network-isolated. When the management network fails, datastore heartbeats act as a backup signal to help HA make smarter decisions.

What Are Heartbeat Datastores?

  • Used when network heartbeats are lost.
  • Help the primary host (called the master before vSphere 7) determine if a secondary host is still alive.
  • Prevent unnecessary VM restarts due to false host failure detection.
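
The role of the two heartbeat types can be summarized in a tiny decision function. This is a simplified model with hypothetical state names. The real HA agent considers more signals than these two booleans.

```python
def classify_host(network_heartbeat, datastore_heartbeat):
    """Classify a host from the two heartbeat signals described above."""
    if network_heartbeat:
        return "running"
    if datastore_heartbeat:
        # Still updating the heartbeat datastore: the host is alive but
        # unreachable over the management network.
        return "isolated or partitioned"
    return "failed"


def should_restart_vms(state):
    """Only a host judged 'failed' triggers the host-failure restart path."""
    return state == "failed"


for signals in [(True, True), (False, True), (False, False)]:
    state = classify_host(*signals)
    print(signals, state, should_restart_vms(state))
```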

Configuration Options

Option | Description
Automatically select datastores accessible from the host | vSphere HA picks shared datastores available to all hosts.
Use datastores only from the specified list | You manually select datastores. HA won’t use others even if these fail.
Use datastores from the specified list and complement automatically if needed | Preferred datastores are used, but HA can fall back to others if needed.

Best Practices

  • Ensure at least two shared datastores are accessible by all hosts.
  • Avoid using only one datastore—this can trigger HA warnings or errors.
  • If you must suppress warnings (not recommended), use the advanced option:
    das.ignoreInsufficientHbDatastore = true

Select the option Use datastores from the specified list and complement automatically if needed, select the two datastores iSCSI_DS01 and iSCSI_DS02 and click OK. Monitor the progress in the Recent Tasks pane at the bottom.

We have not configured any Advanced Options. Feel free to explore the Advanced Isolation Settings below and configure them to your liking.

Advanced Isolation Settings

You can fine-tune behavior using HA advanced options:

Option | Description
das.isolationaddressX | Specifies alternate IPs to test for isolation (e.g., gateway, DNS).
das.isolationshutdowntimeout | Timeout for graceful shutdown before forced power-off.
das.config.fdm.isolationPolicyDelaySec | Delay before executing isolation response.
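
The X in das.isolationaddressX is a number, so a list of addresses turns into a series of numbered keys. The helper below is a hypothetical convenience for preparing those key/value pairs before typing them into the Advanced Options dialog. It assumes that the suffix starts at 0 and that up to ten addresses are accepted, and the IP addresses shown are placeholders.

```python
import ipaddress


def isolation_address_options(addresses, max_addresses=10):
    """Build das.isolationaddressN key/value pairs from a list of IPs.

    max_addresses is an assumed limit for this sketch.
    """
    if len(addresses) > max_addresses:
        raise ValueError(f"at most {max_addresses} isolation addresses are supported")
    options = {}
    for index, address in enumerate(addresses):
        # ip_address() raises ValueError if the string is not a valid IP.
        ipaddress.ip_address(address)
        options[f"das.isolationaddress{index}"] = address
    return options


# Placeholder addresses, for example a gateway and a DNS server.
options = isolation_address_options(["192.168.10.1", "192.168.10.2"])
for key, value in options.items():
    print(f"{key} = {value}")
```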

Once the task is complete, you should see that vSphere HA is enabled, along with a summary of the configuration options we selected.

That’s it for VMware vSphere HA. In the next part of the series we will configure VMware Distributed Resource Scheduler (DRS), so stay tuned.

