What is VMware Fault Tolerance and How to Use It?

Virtual machines need high uptime. VMware Fault Tolerance offers zero downtime by running a secondary VM in lockstep. This article explains basics, setup, testing, monitoring, and compares FT to HA so you can decide if FT fits your needs.

Updated by Iris Lee on 2025/07/01

Table of contents
  • What is VMware Fault Tolerance?
  • How VMware Fault Tolerance Works?
  • How to Set and Use Fault Tolerance in VMware?
  • VMware Fault Tolerance vs High Availability
  • Backup VMware VM with Vinchin
  • VMware Fault Tolerance FAQs
  • Conclusion

VMware Fault Tolerance (FT) provides continuous availability for virtual machines by maintaining a live secondary instance. It guards against host failures with no downtime. This feature uses synchronous record/replay to mirror VM execution. It needs careful setup for CPU, networking, and storage. This guide builds from basics to deep details. It covers prerequisites, workings, setup, testing, monitoring, and trade-offs versus High Availability (HA).

What is VMware Fault Tolerance?

VMware Fault Tolerance delivers zero downtime by running a secondary VM in lockstep with the primary. It captures non-deterministic events on the primary and applies them to the secondary before execution. The secondary stays passive for external I/O yet active in replay. If the primary host fails, the secondary takes over instantly. FT uses vLockstep technology for instruction-level replication of CPU and memory states. This ensures zero data loss and uninterrupted service.

How VMware Fault Tolerance Works?

FT relies on synchronous state replication at the instruction level. On the primary, it records non-deterministic inputs such as interrupts and incoming network packets, together with the point in the instruction stream where each one arrives. These records flow over a dedicated FT logging network to the secondary, which replays them in lockstep and suppresses its own external I/O until failover. Only the primary issues storage writes or network sends, so there is always a single active I/O source. To prevent split-brain, FT uses atomic file locking on shared storage to coordinate failover, so only one copy runs as primary after a failure.

FT logging traffic travels over a VMkernel adapter with the FT logging service enabled. The primary captures each event and sends it to the secondary before the affected instruction executes, and the secondary replays the events in the same order. Network and storage I/O complete on the primary while the secondary's output is held back. If the primary host fails (power loss, PSOD, a vmx process crash, or management network isolation beyond the timeout), the heartbeat stops and the secondary immediately assumes the primary role using its replayed state. A new secondary then starts automatically on another compatible host. Atomic storage locks and synchronous state replication keep the takeover free of data gaps.

⚫vLockstep and Non-Deterministic Events

vLockstep captures events that can change execution paths: interrupts, I/O completions, and time-based instructions (e.g., RDTSC). It logs these events on the primary, sends them over the FT logging network, and injects them into the secondary's execution at the same point in the instruction stream. This ensures both VMs see identical inputs. Recording only non-deterministic inputs keeps the log small compared to full state dumps. The secondary keeps its CPU and memory state in sync through replay, but its external I/O is held back until takeover.
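
The record/replay idea can be illustrated with a toy model. This is a minimal sketch, not VMware's implementation: the `step` function stands in for deterministic instruction execution, and a seeded random generator stands in for non-deterministic inputs such as interrupts or RDTSC reads. All names here are hypothetical.

```python
import random

def step(state, event):
    """Deterministic transition: same state + same event -> same next state."""
    return (state * 31 + event) % 1_000_003

def run_primary(n_steps, seed):
    """Primary: sample non-deterministic inputs, log each one, then apply it."""
    rng = random.Random(seed)        # stand-in for interrupts, I/O completions, RDTSC
    state, log = 0, []
    for _ in range(n_steps):
        event = rng.randrange(256)   # non-deterministic input
        log.append(event)            # record it before it is consumed
        state = step(state, event)
    return state, log

def run_secondary(log):
    """Secondary: never samples its own inputs; it only replays the log in order."""
    state = 0
    for event in log:
        state = step(state, event)
    return state

primary_state, event_log = run_primary(n_steps=1000, seed=42)
secondary_state = run_secondary(event_log)
```

Because the secondary consumes exactly the inputs the primary logged, both end in the same state, and only the event log (not the full state) has to cross the network.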

⚫Synchronous Replication of State

FT keeps CPU registers and memory state in sync at instruction granularity. This is not block-level replication. It ensures the secondary's internal state matches the primary's at each instruction point. The FT logging network must deliver records with minimal latency. If records are delayed and the log buffers fill, the primary is stalled (stunned) until the secondary catches up. Dedicated bandwidth and low-latency paths are therefore critical. FT logging traffic can reach hundreds of Mbps for CPU-heavy VMs.
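
To see why a slow logging link leads to stuns, consider a simple fill-rate model. This is only a sketch under stated assumptions: the buffer size and rates below are hypothetical illustration values, not documented FT internals.

```python
def seconds_until_stun(produce_mbps, drain_mbps, buffer_mbit):
    """Time until the log buffer fills, or None if the link keeps up.

    produce_mbps: rate at which the primary generates log records (assumed)
    drain_mbps:   rate at which the FT logging link delivers them (assumed)
    buffer_mbit:  hypothetical buffer capacity in megabits
    """
    if produce_mbps <= drain_mbps:
        return None                      # link keeps up; buffer never fills
    return buffer_mbit / (produce_mbps - drain_mbps)

healthy = seconds_until_stun(produce_mbps=400, drain_mbps=1000, buffer_mbit=50)
congested = seconds_until_stun(produce_mbps=400, drain_mbps=300, buffer_mbit=50)
```

In the congested case the backlog grows by 100 Mbit every second, so a 50 Mbit buffer fills in half a second. That is the intuition behind dedicating bandwidth to FT logging.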

⚫I/O Handling and Split-Brain Prevention

Only the primary VM performs external I/O, such as disk writes and network sends. The secondary is passive for I/O until failover, which avoids two active I/O sources. Shared storage provides atomic file locking so that both VMs cannot act as primary after a failure. When failover happens, the secondary claims the lock and continues. If the original host comes back, its copy does not resume as primary; that host can only be used to run a new secondary.
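
The tie-break can be pictured as an atomic "create if absent" operation on shared storage. The following is a minimal sketch of that pattern using an exclusive file create on a local temporary directory. It does not reproduce VMware's on-disk lock format, and the file name is hypothetical.

```python
import os
import tempfile

def try_claim_primary(lock_path, owner):
    """Atomically create the lock file; only one caller can ever succeed."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, in a single atomic step.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(owner)
    return True

with tempfile.TemporaryDirectory() as shared:
    lock = os.path.join(shared, "ft-primary.lock")          # hypothetical lock file
    secondary_wins = try_claim_primary(lock, "secondary")    # claims first
    old_primary_wins = try_claim_primary(lock, "old-primary")  # arrives late
```

Whichever copy claims the lock first becomes primary. The late claimant sees the lock already held and must not run as primary.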

⚫Failover Triggers

FT failover triggers when the primary host becomes unreachable or fails. Conditions include host power loss, PSOD, vmx process crash (e.g., killed via esxcli), or management network isolation exceeding HA timeout. FT monitors heartbeat over the FT logging channel. When heartbeat stops, the secondary immediately assumes primary role. vCenter logs the event. A new secondary is placed on a compatible host automatically.
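
Heartbeat-based detection boils down to a timeout check. This is a hedged sketch of that logic: the class name and the timeout value are hypothetical, and the actual FT heartbeat interval is not stated in this article.

```python
class HeartbeatMonitor:
    """Declares the primary lost when no heartbeat arrives within timeout_s."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None

    def beat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        self.last_seen = now

    def primary_lost(self, now):
        """True once the gap since the last heartbeat exceeds the timeout."""
        if self.last_seen is None:
            return False                 # nothing received yet; no verdict
        return (now - self.last_seen) > self.timeout_s

monitor = HeartbeatMonitor(timeout_s=2.0)   # assumed timeout for illustration
monitor.beat(now=10.0)
still_ok = monitor.primary_lost(now=11.5)
failed = monitor.primary_lost(now=12.5)
```

Passing the clock in as an argument keeps the logic easy to test without real waiting.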

How to Set and Use Fault Tolerance in VMware?

This section walks through prerequisites, enabling FT, testing failover, and maintenance. It assumes familiarity with vSphere concepts like EVC, DRS, and HA.

Prerequisites and Configuration

Enable EVC on the cluster before turning on FT so that all hosts present a common CPU feature baseline. That baseline must cover the CPU instructions your VMs use. If you need to raise the EVC mode later, turn off FT on the affected VMs first. Use CPUs that support hardware MMU virtualization (Intel EPT or AMD RVI), such as Intel Sandy Bridge or later and AMD Bulldozer or later.

Configure networking with low latency. Use a dedicated FT logging network, ideally 10GbE or higher. VMware recommends RTT below 10ms, ideally under 1ms, to avoid replay delays and stuns. Use separate physical NICs or VLANs to isolate FT traffic. Enable Jumbo Frames (MTU 9000) end-to-end if supported. Dedicate bandwidth to prevent logging channel saturation.

Ensure shared storage meets latency needs. Sustained storage I/O latency should stay below about 15ms for FT sync to keep pace. Use Fibre Channel, iSCSI, or vSAN with consistent performance. Avoid peaks that can delay I/O acknowledgement on the primary. Low storage latency reduces divergence risk. Monitor datastore latency metrics to detect issues.

Configure the vMotion network separately. vMotion handles the initial placement of the secondary VM and migrations during maintenance, so its paths need adequate bandwidth and low latency. FT logging does not replace vMotion traffic; both need reliable networks.

Use DRS to place the secondary on a suitable host. Resource pools must not starve FT VMs: avoid limits or competing reservations that could throttle replay or logging, and reserve CPU and memory on hosts for FT workloads.

Set up VMkernel adapters: one for management, one for vMotion, and one dedicated to FT logging. Assign FT logging VMkernel adapter to a physical NIC with minimal contention. Verify network paths between hosts use minimal hops. Configure HA settings to handle network partitions carefully; avoid isolating hosts that run FT VMs.
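
The thresholds above can be collected into a simple pre-flight check. This is a minimal sketch: the `HostMeasurements` fields are hypothetical names, and the values are assumed to come from your own monitoring rather than from any VMware API.

```python
from dataclasses import dataclass

@dataclass
class HostMeasurements:
    ft_rtt_ms: float            # round-trip time on the FT logging network
    ft_nic_gbps: float          # link speed of the FT logging NIC
    storage_latency_ms: float   # sustained datastore latency
    has_dedicated_ft_vmk: bool  # dedicated FT logging VMkernel adapter present

def check_ft_prereqs(m):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if not m.has_dedicated_ft_vmk:
        problems.append("no dedicated FT logging VMkernel adapter")
    if m.ft_nic_gbps < 10:
        problems.append("FT logging NIC below 10GbE")
    if m.ft_rtt_ms >= 10:
        problems.append("FT logging RTT at or above 10 ms")
    elif m.ft_rtt_ms >= 1:
        problems.append("FT logging RTT above the 1 ms target")
    if m.storage_latency_ms >= 15:
        problems.append("storage latency at or above 15 ms")
    return problems

good_host = HostMeasurements(0.4, 10, 5, True)
bad_host = HostMeasurements(12, 1, 20, False)
```

Running `check_ft_prereqs(bad_host)` lists every failed prerequisite at once, which is handy when validating a new host before it joins an FT cluster.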

1. Enabling Fault Tolerance on a Virtual Machine

Before enabling FT, confirm sufficient resources on both the primary and the potential secondary hosts. Check CPU, RAM, and network bandwidth. In the vSphere Client, right-click the VM and select Fault Tolerance > Turn On Fault Tolerance. vSphere then creates a secondary VM that matches the primary's CPU, memory, and disk configuration, and FT logging begins between the two. Watch the status indicator: it should show Protected. If it does not, check network, EVC, or resource constraints. DRS behavior for FT VMs can be restricted; plan accordingly.

Ensure the guest OS and virtual hardware version are supported. Remove features and devices that FT does not support: the usual list includes existing snapshots, NPIV, and RDMA passthrough, and some releases also restrict paravirtual SCSI or network adapters. Check the VMware Compatibility Guide for your release. Confirm that your license supports FT for the desired vCPU count. In vSphere 6.7 and 7.0, for example, Standard allows up to 2 vCPUs per FT VM and Enterprise Plus allows up to 8; check current documentation for exact values in your version.

2. Testing Fault Tolerance

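Test FT to build confidence. Prefer controlled methods over simply powering off the host. The vSphere Client offers Fault Tolerance > Test Failover for a protected VM. For a harsher test, kill the vmx process on the primary: find the World ID with esxcli vm process list, then run esxcli vm process kill --type=force --world-id=<world-id>. You can also simulate a network partition to isolate the host. Observe that the secondary continues without a service break. Note that FT protects against host-level failures only: a process killed or crashed inside the guest is mirrored on the secondary, so that kind of test exercises application recovery, not FT failover.

Verify failover via PowerCLI. The FT state is exposed on the VM's runtime info, for example: Get-VM | Select-Object Name, @{N='FTState';E={$_.ExtensionData.Runtime.FaultToleranceState}}. Look for state changes that indicate takeover. Inspect vCenter events for FT failover entries. Use application logging to confirm session persistence and service continuity. After failover, verify that vSphere starts a new secondary and that the Protected status returns. Consider the test complete only when the new secondary is in sync.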
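
Waiting for protection to return after a failover test can be automated with a polling loop. This is a sketch only: `get_state` is a hypothetical callable you would supply (for example, a wrapper around your own vCenter query), and the loop assumes the vSphere API's `faultToleranceState` value `running` means the VM is protected again.

```python
import time

def wait_until_protected(get_state, timeout_s=300, interval_s=5,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns 'running' or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if get_state() == "running":     # assumed to mean: secondary is in sync
            return True
        if clock() >= deadline:
            return False                 # new secondary never became ready
        sleep(interval_s)

# Example with a canned sequence of states instead of a live vCenter.
states = iter(["needSecondary", "starting", "running"])
protected = wait_until_protected(lambda: next(states), sleep=lambda s: None)
```

The `sleep` and `clock` parameters are injectable so the loop can be exercised in tests without real delays.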

Check network connections: ensure TCP sessions remain intact if possible. Some applications may require session-aware failover. Review application-specific health checks. Document the test results. Use scheduled maintenance windows if testing in production.

3. Monitoring and Maintenance

Monitor FT health continuously. Check FT logging traffic volume (MBps), latency, and packet drops on the FT VMkernel ports. Use vSphere performance charts, or query each VM's Runtime.FaultToleranceState through PowerCLI's ExtensionData, to spot VMs that have lost protection or are still starting a secondary. Watch for repeated stuns or buffer overflows, which indicate network issues.
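
One way to avoid alert noise is to fire only when a metric stays above its threshold for several consecutive samples. The sketch below shows that idea; the class name, threshold, and window size are hypothetical, and the samples are assumed to come from whatever monitoring source you already use.

```python
from collections import deque

class SustainedThresholdAlarm:
    """Fires only when the last `window` samples all exceed `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)   # keeps only the newest samples

    def add(self, value):
        """Record a sample; return True if the alarm condition now holds."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(v > self.threshold for v in self.samples)

# FT logging RTT samples in ms, checked against an assumed 1 ms target.
alarm = SustainedThresholdAlarm(threshold=1.0, window=3)
results = [alarm.add(v) for v in [0.5, 2.0, 2.5, 3.0, 0.4]]
```

A single spike does not trigger the alarm; three high readings in a row do, and the alarm clears as soon as a healthy sample arrives.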

Set alarms for FT-related events. Alert on FT disablement or repeated failover triggers. Review host compatibility changes: when adding hosts or updating firmware, ensure they meet FT EVC and hardware requirements.

When patching hosts, follow this procedure: place the host that runs FT-protected VMs into Maintenance Mode. DRS migrates the other VMs away, and the FT primary or secondary on that host moves via vMotion to another compatible host before the host enters maintenance. Patch and reboot the host, then exit Maintenance Mode. vSphere may migrate the secondary back or start a new secondary automatically. This relies on DRS and HA being enabled. Confirm that synchronization resumes.

Maintain consistent firmware and driver levels across hosts. Align CPU microcode versions to avoid EVC drift. Keep storage multipathing and network paths uniform. Test changes in a lab when possible. Document all FT configurations.

VMware Fault Tolerance vs High Availability

Fault Tolerance (FT) and High Availability (HA) both aim to reduce downtime but differ in RPO, RTO, overhead, and complexity. FT replicates VM state continuously, so the secondary holds the same CPU and memory state as the primary at the moment of failure (an RPO of zero). HA restarts the VM from its disks on another host: data already written to disk survives, but in-memory state and in-flight work are lost, much like after a hard power-off. RTO differs as well: FT offers near-zero RTO because the secondary takes over almost instantly, while HA requires a VM restart, causing minutes of downtime.

FT incurs more overhead: it runs a secondary VM in replay mode consuming CPU cycles equal to the primary. It effectively doubles CPU reservations. FT logging traffic can hit hundreds of Mbps for CPU/memory-intensive VMs. Network latency must stay low. Storage I/O runs only on the primary, but logging adds overhead. FT is thus best for small VMs (1-2 vCPUs) or very critical workloads. HA uses fewer resources: it restarts VMs on another host, causing a brief reboot. Use HA for larger or less critical VMs where downtime of a few minutes is acceptable.
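
The resource cost can be estimated with simple arithmetic. This is a rough sketch that assumes each FT-protected VM needs a full second copy of its vCPUs and memory; the inventory tuples are made-up example values.

```python
def cluster_demand(vms):
    """Sum vCPU and memory demand; each FT-protected VM is counted twice.

    vms: iterable of (vcpus, memory_gb, ft_protected) tuples.
    """
    total_vcpus = 0
    total_mem_gb = 0
    for vcpus, mem_gb, ft_protected in vms:
        copies = 2 if ft_protected else 1   # primary plus secondary
        total_vcpus += vcpus * copies
        total_mem_gb += mem_gb * copies
    return total_vcpus, total_mem_gb

inventory = [(2, 8, True), (4, 16, False)]   # one FT VM, one HA-only VM
demand = cluster_demand(inventory)
```

Here the 2-vCPU FT VM consumes as much as a 4-vCPU VM would, which is why FT is usually reserved for a small set of critical workloads.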

Consider complexity: FT demands strict requirements and careful monitoring. HA requires shared storage and HA cluster setup but is simpler. Plan FT only when zero downtime is mandatory and resource cost is justified. Ask: Can your application tolerate a short reboot? If yes, HA may suffice. If no, FT may be worth the extra cost.
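
The questions above can be turned into a small decision helper. This is only a sketch of the reasoning: the default HA restart time of 180 seconds is an assumed placeholder, and `ft_vcpu_limit` must come from your own license and version.

```python
def choose_protection(max_downtime_s, vcpus, ft_vcpu_limit, ha_restart_s=180):
    """Suggest 'HA', 'FT', or 'review' for a workload.

    max_downtime_s: downtime the application can tolerate
    vcpus:          vCPU count of the VM
    ft_vcpu_limit:  FT vCPU limit for your license and version (you supply it)
    ha_restart_s:   assumed typical HA restart time, for illustration only
    """
    if max_downtime_s >= ha_restart_s:
        return "HA"        # a short reboot is acceptable; take the cheaper option
    if vcpus <= ft_vcpu_limit:
        return "FT"        # needs near-zero downtime and fits within FT limits
    return "review"        # too large for FT; consider application-level clustering

batch_job = choose_protection(max_downtime_s=600, vcpus=4, ft_vcpu_limit=8)
payment_gw = choose_protection(max_downtime_s=0, vcpus=2, ft_vcpu_limit=8)
big_db = choose_protection(max_downtime_s=0, vcpus=16, ft_vcpu_limit=8)
```

The helper deliberately returns "review" rather than forcing a choice when a VM cannot tolerate a restart but exceeds the FT limits.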

Backup VMware VM with Vinchin

Fault Tolerance guards against host failures. But backups protect data from corruption, human error, or site disasters. Vinchin offers enterprise-grade VM backup tailored for VMware environments. It integrates smoothly with vSphere. It ensures your VMs remain restorable beyond FT protection.

Vinchin Backup & Recovery is a professional, enterprise-grade VM backup solution supporting VMware and over 15 other platforms like Hyper-V, Proxmox, oVirt, OLVM, RHV, XCP-ng, XenServer, OpenStack, ZStack, and more. It offers a rich feature set.

Vinchin provides forever incremental backup to save time and storage. It applies data deduplication and compression to reduce backup size. V2V migration helps move VMs between hosts or platforms. It supports CBT to capture only changed blocks. It offers instant recovery for rapid VM restoration. Plus, it includes data encryption, multi-thread transmission, backup verification, granular restore, cloud/tape archiving, throttling policies, and GFS retention. These are just a few of many features Vinchin delivers.

The web console is intuitive. To back up a VM, follow four steps:

1. Select the VMware VM to back up.

2. Choose backup storage.

3. Configure backup strategies.

4. Submit the job.

This simple flow helps admins protect their VMware workloads efficiently. Vinchin's global customer base and high ratings reflect trust in its performance. A 60-day full-featured free trial lets you test all features in your environment: download the installer and deploy it to start protecting your VMs.

VMware Fault Tolerance FAQs

Q1: What limits exist for VMware Fault Tolerance? 

A1: The vCPU limit per FT VM depends on vSphere version and license. In vSphere 6.7 and 7.0, for example, Standard allows up to 2 vCPUs and Enterprise Plus up to 8; check current documentation for your release. FT also disallows snapshots, Storage vMotion, NPIV, and RDMA passthrough, and some releases restrict paravirtual devices.

Q2: How do I add FT logging network? 

A2: In the vSphere Client, select the host > Configure > Networking > VMkernel adapters and click Add Networking. Create a VMkernel network adapter, enable the Fault Tolerance logging service on it, assign it to a port group, and bind it to a dedicated physical NIC (10GbE or better). Enable Jumbo Frames only if the whole path supports MTU 9000.

Q3: How to handle maintenance without disrupting FT? 

A3: Use DRS: migrate non-FT VMs away, allow FT secondary to vMotion to another host, patch the host, exit Maintenance Mode, verify new secondary sync; ensure DRS and HA are enabled for automation.

Q4: How does FT impact VM performance? 

A4: FT adds overhead from logging traffic and secondary replay; expect 5–20% penalty on primary based on workload and latency; test under load to verify impact before production rollout.

Conclusion

VMware Fault Tolerance offers zero-downtime protection by running a passive secondary VM in lockstep with the primary. It demands precise setup: CPU compatibility via EVC, dedicated low-latency networks, and low storage latency. Testing and monitoring ensure reliability, while FT vs HA trade-offs hinge on RTO needs and resource cost. 

Combining FT with Vinchin backups covers both host failures and data-level risks. Vinchin’s advanced features like forever incremental backup and deduplication add resilience. Test FT regularly, schedule backups, and review metrics to maintain a robust VMware environment. Trust Vinchin for comprehensive VM protection.

Categories: VM Tips