In this exercise you will explore and configure AHV’s High Availability (HA) and Acropolis Dynamic Scheduling (ADS) features. You will also fail a node with active connections to your virtual desktops and observe the environment’s behavior during failure.
Note
If you’re interested in additional real-world system testing, Nutanix has produced an automated system test suite, X-Ray. X-Ray is designed to evaluate hyperconverged infrastructure platforms in a variety of scenarios, including database co-location (workload interference), snapshot impact, rolling upgrades, node failures, and workload simulations.
Learn more and get started with X-Ray at http://www.nutanix.com/xray/
In order to access the internal Curator diagnostic page from outside of the Controller VM subnet, we’ll need to open TCP port 2010 in the CVM firewall.
Using an SSH client, execute the following:
> ssh nutanix@<NUTANIX-CLUSTER-IP>
> allssh "sudo iptables -t filter -A WORLDLIST -p tcp -m tcp --dport 2010 -j ACCEPT"
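To confirm the rule was applied on every CVM, you can list the WORLDLIST chain and look for port 2010 (a quick sketch; the chain name matches the rule added above):

```shell
# Verify the ACCEPT rule for TCP 2010 is now present on each CVM
allssh "sudo iptables -t filter -L WORLDLIST -n | grep 2010"
```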
In Prism > VM > Table, select DC. In the Host column, observe the host the DC VM is currently running on. Click Update.
Under VM Host Affinity, click Set Affinity.
Select Node 1 and Node 2 and click Save.
Click Save.
If your VM wasn’t already running on Node 1 or 2, observe that it was automatically live migrated after updating the affinity policy.
Repeat these steps for XD and any other applicable non-AFS or non-desktop VMs (NSVPX, etc.).
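The same host affinity policy can also be applied from the CVM command line. A sketch using acli, assuming hypothetical host names node-1 and node-2 (substitute the host names shown in Prism > Hardware):

```shell
# Pin the DC VM to two hosts; repeat for XD and other infrastructure VMs
acli vm.affinity_set vm_list=DC host_list=node-1,node-2
```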
By default, Nutanix AHV will protect VMs in the event of a node failure on a best effort basis, presuming there is adequate memory availability to restart VMs from the failed host. Enabling HA forces a dynamic memory reservation to ensure memory availability in the event of a node failure. It will also validate there are no affinity rules that would prevent HA from being enabled, for instance a VM with a host affinity policy only containing a single host.
In Prism, click the Settings icon and select Manage VM High Availability.
Select Enable HA and click Save > OK.
Note
A higher number of failures to tolerate (for example, tolerating 2 node failures on an RF3 cluster) can be defined via acli.
> acli ha.update num_host_failures_to_tolerate=2
Restart priority for individual VMs can be defined via acli. A negative value disables HA restart for that VM.
> acli vm.update <VM NAME> ha_priority=1000
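After enabling HA, you can confirm the cluster’s current HA configuration from the CVM command line:

```shell
# Show the HA mode, failures to tolerate, and any reservations
acli ha.get
```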
Log in to Citrix StoreFront as USER2 and launch both your Pooled and Personal desktops.
In Citrix Studio > Search > Sessions, note the VM Names of USER2’s sessions.
In Prism > VM > Table, search for each of USER2’s VMs. Check whether the VM is currently running on Node 3; if it is not, select the VM and click Migrate.
Select Node 3 from the Host drop down menu and click Migrate.
Repeat these steps to ensure the second VM is also on Node 3.
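The same live migration can be performed with acli if you prefer the command line (a sketch; substitute the actual VM and host names from your environment):

```shell
# Live migrate a desktop VM to Node 3
acli vm.migrate <VM NAME> host=<NODE-3-NAME>
```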
Note
If you’re using a non-NX Nutanix platform you will need to consult manufacturer documentation for your hardware platform for instruction on accessing the out-of-band management and powering off the node.
In Prism > Hardware > Table, select Node 3. In the Host Details table, click the IPMI IP link.
Log in with the default credentials (ADMIN/ADMIN).
Select Power Control from the Remote Control drop down menu.
Select Power Off Server - Immediate and click Perform Action.
You will immediately observe that both of your Citrix Receiver sessions have been interrupted. Close both sessions.
Log in to Citrix StoreFront again as USER2 and launch a Pooled desktop. You will be able to connect to another desktop immediately, complete with your profile and user data if configured.
In Citrix Studio, verify that the desktop to which you’re now connected is not the same VM to which you were previously connected.
In Prism > Tasks, observe that the node failure has been detected and that VMs have already begun powering on across the remaining nodes in the cluster. In the screenshot below we can see our Personal Windows 10 Desktop has already been powered on on Node 2.
In Citrix Studio > Search > Desktop OS Machines, verify your Personal Windows 10 Desktop VM now appears as Registered with the Delivery Controller.
Return to Citrix StoreFront and launch your Personal Windows 10 Desktop. Verify that the desktop logs in successfully.
In Prism > Home, verify that the cluster is in Critical Status and that a rebuild is in progress.
Open http://<NUTANIX-CLUSTER-IP>:2010 in your browser and click the Curator Master link.
Verify that Node 3 is down and that a Partial Scan due to a Node Failure has generated many background tasks. Click the Execution ID link associated with this job for more details.
The majority of the jobs associated with the scan are to replicate missing extents.
In your browser, return to the out-of-band management (IPMI) console > Remote Control > Power Control. Select Power On Server and click Perform Action.
After several minutes, allowing time for the host and CVM to boot, verify in Prism > Home that Data Resiliency Status has returned to OK.
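Data resiliency can also be confirmed from the CVM command line; a sketch using ncli:

```shell
# Node-level fault tolerance should report 1 failure tolerable (for an RF2
# cluster) once the rebuild has completed
ncli cluster get-domain-fault-tolerance-status type=node
```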
Health may still appear as Critical; this is normal following a CVM reboot, as an unexpected CVM restart could be indicative of an issue with the cluster. After a short period of time the Health status will update itself.
In Prism > VM > Table, filter by the Node 3 hostname and note that the majority of VMs that had previously been running on Node 3 have returned to running on this node.
Restore CVM firewall to default configuration:
> ssh nutanix@<NUTANIX-CLUSTER-IP>
> allssh "sudo service iptables restart"
Verify you’re no longer able to access the Curator page from your browser.
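From a machine outside the CVM subnet, you can also check the port from the command line (using the same cluster IP placeholder as above):

```shell
# Expect the connection to time out or be refused now that the rule is gone
curl -m 5 http://<NUTANIX-CLUSTER-IP>:2010/ || echo "Curator page unreachable, as expected"
```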