VMware HA Important Points
- HA maintains the high availability of virtual machines: when a host failure or isolation occurs, the affected VMs are powered on on the remaining running hosts.
- Every host in the cluster exchanges heartbeats with the other hosts to notify them that it is alive.
- A host is declared as isolated when its heartbeat is not received within 12 seconds.
- A host is declared as dead when its heartbeat is not received within 15 seconds. This duration can be increased to avoid false positives by defining the advanced setting das.failuredetectiontime (specified in milliseconds) in vCenter.
- If we set das.failuredetectiontime to 60000 (60 seconds), false isolations can be avoided: if an isolated host comes back within 60 seconds, its VMs continue to run on the same host and HA never interferes.
- When a host is declared as isolated after the defined interval, the configured isolation response is executed on that host.
- If the isolation response is set to “Leave powered on”, the VMs continue to run on the isolated host. Another host cannot power on the same VMs, because of the underlying file-locking mechanism provided by VMFS.
- If the isolation response is set to “Shutdown”, the VMs undergo a clean shutdown on the isolated host and are then powered on on other hosts.
- If the isolation response is set to “Power off”, the VMs are powered off immediately by the host and then powered on on other hosts. VMs that were already shut down gracefully are never powered on on a new host; only abnormally powered-off VMs are restarted on other hosts, to avoid service interruptions.
- When a host is declared as dead, all the VMs that were running on it are immediately powered on on other hosts, following the admission control policy and restart priorities.
- Admission control policies ensure that sufficient resources are available for virtual machines when a host failure or isolation occurs; in other words, HA reserves some resources to provide room for virtual machines in the worst case. There are three policies available for making these reservations.
- Host failures a cluster can tolerate
- If we select this option, HA reserves resources based on a concept called slots. “A slot is a logical representation of the largest CPU and memory reservations that satisfy the requirements of any powered-on VM in the cluster.” When failover capacity is calculated, the host with the most slots is removed from the equation; this ensures resources remain for all VMs even if the largest host in the cluster fails.
- The slot size is taken from the largest CPU and memory reservations in the cluster: if one VM reserves 4 GHz CPU and 1 GB RAM and another reserves 2 GHz CPU and 4 GB RAM, the slot is defined as 4 GHz and 4 GB.
- The number of slots on a host is the more restrictive of the two counts. For example, with a 4 GHz / 4 GB slot, a host with 256 GB of RAM and 16 GHz of CPU has 64 memory slots but only 4 CPU slots, so it is considered to have 4 slots for failover, i.e. it can accommodate 4 VMs in the event of a failure.
- A custom slot size can also be defined using advanced options (das.slotCpuInMHz and das.slotMemInMB) to increase the number of slots and avoid resource wastage when the cluster has a VM with very large reservations.
- An example slot calculation, Example 1:
- Host A -> 12 GHz + 12 GB
- Host B -> 8 GHz + 8 GB
- Host C -> 12 GHz + 12 GB
- VM 1 -> 2 GHz + 4 GB
- VM 2 -> 1 GHz + 2 GB
- VM 3 -> 4 GHz + 2 GB
- VM 4 -> 4 GHz + 1 GB
- VM 5 -> 2 GHz + 3 GB
- VM 6 -> 3 GHz + 3 GB
So this cluster has 3 hosts with different sets of resources: 32 GHz of CPU and 32 GB of RAM in total. As discussed earlier, the slot size is taken from the largest CPU and memory reservations among the VMs in the cluster; here it is 4 GHz (VM 3 / VM 4) and 4 GB (VM 1), which satisfies the requirements of any powered-on VM in this cluster.
So Host A has 3 slots, Host B has 2 slots, and Host C also has 3 slots: a total of 8 slots in the cluster.
If the host with the most slots (Host A or Host C) fails, only 5 slots remain; this is the current capacity. That is not enough to power on all 6 VMs, so HA admission control prevents power-on operations that would violate this constraint.
- Another example, Example 2:
- Host A -> 9 GHz + 9 GB
- Host B -> 9 GHz + 6 GB
- Host C -> 6 GHz + 6 GB
- VM 1 -> 2 GHz + 1 GB
- VM 2 -> 2 GHz + 1 GB
- VM 3 -> 1 GHz + 2 GB
- VM 4 -> 1 GHz + 1 GB
- VM 5 -> 1 GHz + 1 GB
So this cluster has 3 hosts with different sets of resources: 24 GHz of CPU and 21 GB of RAM in total. As discussed earlier, the slot size is taken from the largest CPU and memory reservations among the VMs in the cluster; here it is 2 GHz and 2 GB, which satisfies the requirements of any powered-on VM in this cluster.
So Host A has 4 slots, Host B has 3 slots, and Host C also has 3 slots: a total of 10 slots in the cluster.
If Host B fails, the cluster still has 7 slots available (the current capacity), and only 5 VMs need to be powered on, so all of them can restart without violating any resource constraints.
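The slot arithmetic from these examples can be sketched in a few lines of Python (an illustrative model only, not VMware's implementation; real HA also accounts for virtualization overhead and for defaults when VMs have no reservations):

```python
# Illustrative model of HA slot-size math (not VMware's implementation).
def slot_size(vms):
    """Slot = largest CPU and largest memory reservation among powered-on VMs."""
    return max(cpu for cpu, mem in vms), max(mem for cpu, mem in vms)

def host_slots(host, slot):
    """A host's slot count is the more restrictive of its CPU and memory counts."""
    (cpu, mem), (slot_cpu, slot_mem) = host, slot
    return min(cpu // slot_cpu, mem // slot_mem)

# Example 2 from the notes, as (GHz, GB) pairs.
hosts = [(9, 9), (9, 6), (6, 6)]
vms = [(2, 1), (2, 1), (1, 2), (1, 1), (1, 1)]

slot = slot_size(vms)                         # (2, 2)
slots = [host_slots(h, slot) for h in hosts]  # [4, 3, 3] -> 10 slots total
```

Swapping in the Example 1 numbers reproduces the 4 GHz / 4 GB slot and the 3 + 2 + 3 = 8 slot total above.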
- Percentage of resources to be reserved
- This is the most flexible mechanism for reserving resources for HA failovers.
- It does not use the slot concept; the current failover capacity is calculated with the following formula:
- (Total available resources – Total reserved resources) / Total available resources
- With this option, resources are reserved at the cluster level, not per host.
- No advanced options need to be configured.
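The percentage-based calculation can be sketched as follows (a simplified model: HA evaluates the formula separately for CPU and for memory, and the 24 GHz total / 9 GHz reserved figures here are hypothetical):

```python
# Simplified model of percentage-based admission control.
def failover_capacity(total, reserved):
    """(total available - total reserved) / total available."""
    return (total - reserved) / total

# Hypothetical cluster: 24 GHz total CPU, 9 GHz reserved by powered-on VMs.
capacity = failover_capacity(24, 9)   # 0.625 -> 62.5% still free
# A power-on is blocked once capacity would drop below the configured
# reservation percentage (e.g. 25% reserved for failover).
allowed = capacity >= 0.25
```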
- Dedicating a host for failover
- A dedicated host is designated purely for failover. This wastes resources, since the dedicated host sits idle all the time.
- Restart priorities are followed in the order below:
- Agent VMs
- FT-enabled VMs
- VMs with the High priority option defined
- VMs with the Medium priority option defined
- VMs with the Low priority option defined
- When a host is declared as dead, all the VMs that were running on it are powered on on other hosts in the cluster according to the restart priority order above.
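The restart ordering can be sketched as a simple sort (illustrative only; the VM names and priority labels are hypothetical):

```python
# Illustrative restart ordering by the priority classes listed above.
RESTART_ORDER = ["agent", "ft", "high", "medium", "low"]

def restart_order(vms):
    """Sort (name, priority_class) pairs into HA restart order."""
    return sorted(vms, key=lambda v: RESTART_ORDER.index(v[1]))

# Hypothetical failed-host inventory.
vms = [("web1", "medium"), ("mgmt-agent", "agent"),
       ("db1", "ft"), ("app1", "high"), ("test1", "low")]
ordered = [name for name, _ in restart_order(vms)]
# -> ['mgmt-agent', 'db1', 'app1', 'web1', 'test1']
```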
HA in vSphere 4.1
- HA is called the Automated Availability Manager (AAM) in this version.
- When we configure HA on a vSphere 4.1 cluster, the first 5 hosts are designated as primary nodes; out of these 5, one node acts as the “Master Primary” and handles VM restarts in the event of a host failure.
- All the remaining hosts join as secondary nodes.
- Primary nodes maintain information about cluster settings and secondary node states.
- All these nodes exchange heartbeats so that each knows the health status of the others:
- Primary nodes send their heartbeats to all other primary and secondary nodes.
- Secondary nodes send their heartbeats to primaries only.
- Heartbeats are exchanged between nodes every second.
- In case of a primary failure, another primary node takes over responsibility for restarts.
- If all primaries go down at the same time, no restarts are initiated; in other words, at least one primary is required to initiate restarts.
- Election of a primary happens only in the following scenarios:
- When a host is disconnected
- When a host enters maintenance mode
- When a host stops responding
- When the cluster is reconfigured for HA
HA in vSphere 5.0
- HA is called the Fault Domain Manager (FDM) in this version.
- HA was completely redesigned in this version: the HA agent communicates directly with hostd instead of using vpxa as a translator. Every host therefore holds information about the other hosts' resources, which helps in the event of a vCenter failure.
- The DNS dependency has been completely removed.
- When we configure HA on a vSphere 5.0 cluster, one node is elected as the master and all other nodes are configured as slaves.
- The master is elected based on the number of datastores a host is connected to; if all hosts in the cluster are connected to the same number of datastores, the host's managed ID is used as a tie-breaker, and the host with the highest managed ID is elected master.
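The election rule can be sketched as follows (illustrative; the host names and managed IDs are hypothetical, and this sketch compares managed IDs as plain strings):

```python
# Illustrative model of the FDM master-election rule: most datastores
# wins; ties are broken by the highest managed ID.
def elect_master(hosts):
    """hosts: list of (name, datastore_count, managed_id) tuples."""
    return max(hosts, key=lambda h: (h[1], h[2]))[0]

# Hypothetical cluster inventory.
hosts = [("esx01", 4, "host-22"),
         ("esx02", 6, "host-35"),
         ("esx03", 6, "host-18")]
master = elect_master(hosts)
# esx02 and esx03 tie on datastores; "host-35" > "host-18", so esx02 wins.
```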
- Each slave exchanges network heartbeats with the master (rather than with every other host), allowing the master to track the health of all hosts in the cluster.
- Host isolation handling has been enhanced in this version by introducing datastore heartbeating: every host keeps a hostname-hb file updated at a specific interval on the chosen datastores. Two datastores are selected for this purpose.
- To see which host is the master and which are slaves, go to vCenter and click “Cluster Status” in the HA area of the cluster's summary page.