Resolving a Nutanix CE cluster crash due to disk failure during node removal

Warning: deleting any file from a Nutanix CVM is dangerous and can lead to the complete loss of a cluster and all the VMs and data on it. If you have Nutanix support, CALL THEM! Only attempt this sort of thing on a cluster that you can afford to rebuild from scratch and that doesn’t contain any data you care about!

The Nutanix CE (Community Edition) cluster that I run on a collection of old HPE DL360 servers is a little finicky, and of course, I don’t have official Nutanix support for it. The CE forums can be a little hit or miss, particularly for obscure problems, so when I have problems I often find myself doing a lot of googling for error messages that apparently have never been documented, and as often as not, I simply end up rebuilding the cluster from scratch and restoring the VMs via Data Protection.

My cluster had been having a lot of Cassandra crashes, leading to nodes being tagged as “degraded,” which often led to VMs panicking and rebooting. While trying to clear things up, I had been removing nodes one by one (a multi-hour process as the cluster migrated hundreds of GB of data and metadata to the other nodes) and adding them back in, which seemed to stabilize them.

A smaller issue I had was that the storage in the cluster was very imbalanced; four of the nodes have 4 disks, and one has 8, and because of the way disk failures have occurred (leading to me replacing older disks with newer, larger ones), the node with 8 disks has about 4TB of available storage, while the others have between 1.5 and 2.5TB. I hadn’t removed the 4TB node yet, so I decided to detach it from the cluster, pull two 1TB disks out of it, add it back in, and then replace some of the old 146GB disks in the other nodes with the bigger ones to get everything within a few hundred GB of 2TB. I initiated the removal of the large node, which required the system to move nearly 1.5TB of data to the other nodes in the cluster. A few minutes afterwards, Prism reported that a disk had failed on another node, which worried me, but the node detach tasks were continuing and the used storage on the removed node was shrinking, so I left it alone. I watched for a while, but knowing it would take 6-8 hours to complete, I eventually wandered off and allowed Prism to log me out.

When I came back, any attempt to log into Prism returned “Server Error.” Hm. I ssh’d into one of the CVMs and noticed that none of the ncli, acli, or ecli tools worked, all returning “Error: Connection refused (Connection refused)”. They rely on Prism for communication with the cluster, so it was clear that Prism was wonked. I tried restarting it:

nutanix@NTNX-28af434a-A-CVM:~$ allssh "genesis stop prism"
nutanix@NTNX-28af434a-A-CVM:~$ cluster start

The cluster would restart Prism services, but couldn’t communicate with them. I checked the prism_gateway.log in data/logs and saw a lot of failed connections, but couldn’t determine what was going on. An old blogspot page I found turned up the possibility that the node removal task and the failed disk removal had come into conflict, and everything was stuck until those tasks were either completed (which couldn’t happen while Prism was down) or cancelled. Luckily, the progress_monitor_cli command still worked.

nutanix@NTNX-28af434a-A-CVM:~$ progress_monitor_cli --fetchall | egrep "entity_id|entity_type|operation"
<snip>
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461664"
  operation: kRemove
  entity_type: kNode
  entity_id: "26461552"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461665"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461662"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461663"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461661"
  operation: kRemove
  entity_type: kDisk
  entity_id: "114231585"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461668"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461666"
  operation: kRemove
  entity_type: kDisk
  entity_id: "26461667"

Annoyingly, the example on that website wasn’t accurate with respect to the “entity_type” and “operation” strings, which needed to have the leading “k” removed and to be lowercase. The actual command I needed to run was:

progress_monitor_cli --entity_id="26461667" --entity_type="disk" --operation="remove" -delete
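With that many stuck tasks, a quick loop saves some typing; this is just a sketch using the disk entity IDs from my output above (yours will differ), followed by the node task itself:

for id in 26461661 26461662 26461663 26461664 26461665 26461666 26461667 26461668 114231585; do
  progress_monitor_cli --entity_id="$id" --entity_type="disk" --operation="remove" -delete
done
# ...and then the stuck node removal task:
progress_monitor_cli --entity_id="26461552" --entity_type="node" --operation="remove" -delete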

Once I had run that on all the disks, I was able to run it on the node removal task, which got everything cleared up, but still, Prism wouldn’t respond. I did a complete cluster restart, but no dice; I rebooted all the nodes and started the cluster again, still no Prism. I tried powering off the detached node; nope. I powered that back on and powered off the node that had reported the failed disk; still no luck. I powered them both off, but while the cluster would start, Prism was not available. So I brought everything back online and started digging into the logs.

One thing I noticed was that things really seemed to be having difficulty talking to Zookeeper. Genesis reported that Zookeeper was online, but when I dug into the logs on the node I had attempted to detach, I noticed it seemed to keep trying to replay logs and failing with I/O errors. When I looked at the contents of /home/nutanix/data/zookeeper/version-2, I noticed that there were “epoch” files with a number different from the other nodes in the cluster, and also that the files in that directory had not been touched in several days. When I tried to use “tar” to back them up, I got odd I/O errors. On the assumption that at least two ZK nodes were still online and happy, I stopped the service on the wonky node, simply deleted all the “log” and “snapshot” files in that directory, and rebooted the CVM to bring Zookeeper back up. It happily came online and began recovering its data from the other two nodes. That seemed to eliminate the errors I was seeing related to ZK. Prism still wouldn’t come online. What the ham?!?
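For the record, the Zookeeper surgery amounted to roughly the sketch below. I’m deliberately leaving out the exact service-stop command (genesis manages Zookeeper and the mechanism may vary by CE version), and it assumes at least two healthy ZK nodes remain:

# DANGER: only on the wonky CVM, and only with at least two healthy ZK nodes remaining.
# After stopping Zookeeper there, clear the local transaction logs and snapshots
# so ZK has to resync from its peers:
rm -f /home/nutanix/data/zookeeper/version-2/log.* \
      /home/nutanix/data/zookeeper/version-2/snapshot.*
# Reboot the CVM; Zookeeper comes back up and recovers its data from the other nodes:
sudo reboot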

The other service that seemed to be complaining a lot was Cassandra, and the nodetool command was giving me frustrating messages. On the node with the reported failed disk, any nodetool command returned a lengthy Java exception trace. The other nodes reported that node as “down” (and showed the one I’d tried to detach in forwarding mode, which wasn’t unexpected):

nutanix@NTNX-fe8d4a5e-A-CVM:~$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          sFUzzzzz0000000000000000000000000000000000000000000000000000
baddisknode     Down   Normal     ? bytes         26.37%  8b3ODYsE0000000000000000000000000000000000000000000000000000
goodnode1       Up     Normal     8.62 GB         26.37%  OwbmR7kU0000000000000000000000000000000000000000000000000000
goodnode2       Up     Normal     8.64 GB         23.63%  db3ODYsE0000000000000000000000000000000000000000000000000000
removednode     Up     Forwarding 7.04 GB         23.63%  sFUzzzzz0000000000000000000000000000000000000000000000000000

Restarting Cassandra didn’t do much to make the down node happy. The logs mentioned a lot of issues replaying logs after restarting, throwing input/output errors similar to Zookeeper’s, so I guessed maybe the commit logs were corrupted. Figuring that if I couldn’t get the cluster back online I was going to have to rebuild it and recover the VMs via Data Protection (from a “remote site” cluster in the same rack), I planned to tar up the Cassandra commit logs in /home/nutanix/data/stargate-storage/disks/6005*/metadata/cassandra/commitlog/ and delete them. However, when I attempted the tar, I got:

tar: /home/nutanix/data/stargate-storage/disks/600508b1001c1f6d/metadata/cassandra/commitlog/CommitLog-3-1683565787792.log: File shrank by 53925107 bytes; padding with zeros

Well, that’s got to be the bad guy, right? I stopped Cassandra on the node, deleted that file, and ran “cluster start” to get Cassandra started. After that, the logs looked far healthier, and I could run nodetool on the wonky node and it reported as being up:

nutanix@NTNX-8f724455-A-CVM:~$ nodetool -h 0 ring
Address         Status State      Load            Owns    Token
                                                          sFUzzzzz0000000000000000000000000000000000000000000000000000
fixednode       Up     Normal     6.02 GB         26.37%  8b3ODYsE0000000000000000000000000000000000000000000000000000
goodnode1       Up     Normal     9.31 GB         26.37%  OwbmR7kU0000000000000000000000000000000000000000000000000000
goodnode2       Up     Normal     8.88 GB         23.63%  db3ODYsE0000000000000000000000000000000000000000000000000000
detachednode    Up     Forwarding 7.04 GB         23.63%  sFUzzzzz0000000000000000000000000000000000000000000000000000

After that, I restarted Prism on all the nodes once more:

nutanix@NTNX-8f724455-A-CVM:~$ allssh "genesis stop prism"
nutanix@NTNX-8f724455-A-CVM:~$ cluster start

After that, Prism became available, and the ncli, acli, and ecli commands got back to work! Interestingly, despite my having deleted the removal tasks earlier, Prism seems to be trying to detach disks and the node, so I’m going to leave it be and hope it finishes on its own.

Update, 14 hours later: the node finally detached! And another disk failed on another node, which is wild, though not totally unexpected with hardware that’s close to 15 years old.

Recovering the “install” user on a Nutanix CE node

I run a four-node Nutanix Community Edition cluster at home, on a batch of old HP DL360s scavenged from the office when they were throwing things out some years ago. It’s generally reliable; occasionally there’s a network glitch that takes out a bunch of VMs, but most of the time it stays up, happy and hale.

Hardware failures happen, however, particularly to disks. The cluster has 28 disks, so it’s inevitable that they wear out, particularly since my dusty basement is hardly a pristine data center environment. The cluster also relies heavily on SSDs, which take most of the I/O from the VMs and also host the Controller VMs, so on occasion the SSDs wear out and fail too. It’s a huge hassle to repair those, because it essentially requires you to reinstall the CVM and hope that the cluster recognizes and readmits it.

It’s doubly hard if the “install” user has disappeared, which apparently happens during the cleanup process that occurs after a cluster software upgrade. I had installed the cluster using a 2017 image, and then upgraded it in mid-2019; when I went back to try and repair a failed SSD, none of the install scripts were there, and even the install user had been removed from all four hosts, making it impossible to rebuild the CVM on the new drive.

To get around this, I went and found the full install image (which normally you stick on a USB or SD card to use as a boot disk), which contained the necessary files, but figuring out how to mount it and get the files off without actually burning a USB stick was a bit of a trial.

Mounting on Windows proved nearly impossible, so I copied the file to where one of my Linux VMs could get at it. The issue is that the disk image is partitioned, just like a regular physical or virtual disk would be, but of course the operating system doesn’t recognize a file as a disk device. Luckily, the “mount” command has an “offset” feature that allows you to tell it where to start looking for an actual filesystem. You can figure out that offset value using fdisk:

[root@mybox Nutanix]# fdisk -l ce-2019.02.11-stable.img

Disk ce-2019.02.11-stable.img: 7444 MB, 7444889600 bytes, 14540800 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000d3ba3

                   Device Boot      Start         End      Blocks   Id  System
ce-2019.02.11-stable.img1   *        2048    14540799     7269376   83  Linux

It even sticks a “1” on the end of the image name, just like a disk partition, though if you tried “mount ce-2019.02.11-stable.img1 /mnt” it would have no idea what you were talking about.

To figure out the offset number, note that the partition starts at sector 2048, and each sector is 512 bytes; 2048 * 512 = 1048576, so that’s the offset. You can then mount it read-only:

 mount -o ro,loop,offset=1048576 ce-2019.02.11-stable.img /mnt/loop
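If you’d rather not do the multiplication yourself, the shell can do it inline; this is the same mount, just with the offset computed from the sector number and sector size:

mount -o ro,loop,offset=$((2048 * 512)) ce-2019.02.11-stable.img /mnt/loop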

After that I just used rsync to copy the data over to the affected host, recreated the install user and group (both using uid/gid 1000), and was able to rebuild my CVM. Unfortunately I ran ~install/cleanup.sh first, which seemed to wipe all the configuration from the host, and made the cluster forget all about it, after which I was unable to add the host back in, possibly because of the failing SD card on another host…my cluster is a bit FUBAR at the moment. But at least I got the disk replaced!
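For reference, the user recreation and copy amounted to roughly the sketch below. The source and destination paths are my assumptions (mirror whatever a working node has under the install user’s home directory):

# On the affected host: recreate the install group and user with uid/gid 1000
groupadd -g 1000 install
useradd -u 1000 -g 1000 -m -d /home/install install

# From the Linux box with the image mounted at /mnt/loop: copy the install
# files over (paths and hostname here are illustrative)
rsync -av /mnt/loop/home/install/ root@affected-host:/home/install/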

Over-subscription of thin-provisioned virtual disks leading to Nutanix cluster issues

An issue with both physical and virtual environments is unused resources, particularly memory and storage. If you allocate 250GB to a VM and it’s only using 50GB of it, that’s a lot of wasted disk. Multiply that by dozens of VMs and you’ve got a lot of storage not being used. The good news is that modern hypervisors offer “thin-provisioned” virtual disks: even if you set the disk size to 250GB, if the OS is only using 50GB of it, you’re truly only using 50GB at the physical layer. That means you can “over-subscribe” your storage by allocating more of it in virtual disks than you actually have; the VMs can grab some of that additional 200GB if they need it for something, and when they’re done with it, the hypervisor takes it back. You don’t need to tightly monitor each VM’s utilization; you just give each VM as much as it’s likely to need for regular operations and monitor the overall physical storage utilization of your environment.

The problem is that, if you over-subscribe your storage (i.e., allocate 15TB of “thin-provisioned” storage to VMs in a physical environment containing only 10TB of storage), and your VMs do use it all up at once, you’re in trouble. So generally you don’t want to oversubscribe by more than, say, 25%, depending on how much your VM storage requirements fluctuate.
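To put rough numbers on that rule of thumb using the figures above (nothing more than back-of-the-envelope arithmetic):

# 15TB provisioned against 10TB of physical storage:
echo "$(( 15 * 100 / 10 ))% subscribed"    # prints "150% subscribed", i.e. 50% over-subscribed
# A ~25% over-subscription ceiling on that same 10TB would mean provisioning
# no more than about 12.5TB of thin disks.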

Another issue that I discovered recently relates to how VMware recovers unused space from the VMs, and how Red Hat Enterprise Linux communicates to VMware that virtual storage is no longer needed, particularly in an environment with Nutanix infrastructure.

A short primer on Nutanix, if you aren’t familiar with it: you take a number of physical hosts and join them into a cluster, combining their storage into one large pool. The VMs can live on any host in the cluster, using the CPU and memory resources available there, and their data is spread throughout the cluster and replicated for reliability. If a host fails, the VMs get restarted on other hosts, and any data blocks that were living on that host are restored from the copies available on other hosts. You can use a number of different hypervisors; if, for example, you want to use VMware, the storage containers created at the Nutanix level are presented to all the cluster hypervisors as regular NFS storage.

I support a Nutanix cluster that is dedicated to a large application with a lot of data constantly being written and cleaned up, several hundred GB every day. The cluster has a total of about 290TB of storage; with a “replication factor” of 2 (each block has a copy on another host in the cluster), that gives us about 145TB of “logical” space. We have about 15 very large VMs, each with a bit over 8TB of disk attached, plus some other smaller VMs, totaling about 114TB of storage, all thin-provisioned. We also use “protection domains” to create and replicate snapshots of the VMs to send to a “backup” cluster in another data center, and the delta disks for those snapshots take up several additional TB of storage.

I started getting alerts that we were using more than 75% of the available storage in the cluster, and it continued to grow, approaching 90%. Nutanix requires a certain amount of free space in order to operate, and in order to survive a host failure, since a host failure obviously reduces the overall available storage. If we went much beyond 90%, the cluster would begin to have lots of difficult-to-resolve issues (VMs freezing, filesystems becoming “read-only”, etc.). But wait…the VMs are only using about 35% of their virtual disks! Why is Nutanix saying we’re using almost 90%?

I took a look at our vSphere client, and it reported similar numbers to Nutanix: each VM was using 7+ TB of its 8TB storage. How was this possible?

We initially thought that the issue was with the protection domain snapshots. The way that snapshots work in Nutanix is similar to the way they work in VMware; when a snapshot is created, new data being written goes to “delta” disk files, and the original disk files are left quiesced so that they can be safely copied somewhere else, or restored in the event of a failure. Eventually, when the snapshot is deleted, the delta disk files get merged back into the main disk files. We had recently enabled protection domains for all the VMs, and the storage had been growing ever since, so it seemed a likely cause. Was Nutanix simply failing to clean up delta disks, leaving them to grow and grow ever since the initial snapshots were created? The fact that the “.snapshots” directory within the datastore appeared to be using a remarkable 126TB of storage seemed to bear this out.

We called Nutanix, who explained that no, the .snapshot directory wasn’t actually using that much storage, or at least, not exclusive of the actual data disks. The data disks were stored there, and the files that appeared to be data disks in the usual VM directories were just pointing to them. So that turned out to be a red herring.

The actual issue seems to be that the operating system on the VMs, in this case Red Hat Enterprise Linux 6 (RHEL6), and VMware don’t communicate very well about what space is actually reclaimable. When a file gets deleted in RHEL6, the data doesn’t actually get cleaned up right away; what gets removed is the “inode,” essentially a pointer that describes where on the disk to find the data. The data itself doesn’t get removed; it just gets overwritten later on by new data as needed. Over time, as the application wrote many GB a day to its filesystems, that data continued to sit on the disks, even though the inodes pointing to it had been deleted and the OS reported the space as available. VMware saw all those ones and zeroes and assumed that this was live data that could not be cleaned up by the hypervisor processes that maintain the thin provisioning.

In short, in an environment like this with lots and lots of data coming in and getting “deleted,” you can’t actually oversubscribe the storage because the OS and VMware don’t talk to each other very well about what data is actually needed.

The fun part was the trick I used in the short term to clean things up until I could replace the virtual disks with smaller ones and “de-over-subscribe” the storage. It turns out that if you write a very large file containing nothing but zeroes to the filesystem, and then delete it, VMware will see that space as reclaimable. It took a few days to run on all the filesystems in the cluster, but I used the “dd” command to write from /dev/zero to a file in the directory and then removed it; VMware would eventually knock back the reported storage utilization, and Nutanix’s “Curator” scans would reclaim the space as well.

dd if=/dev/zero of=/datastore01/file_full_of_zeroes bs=1M count=1000000
rm /datastore01/file_full_of_zeroes

The first command creates a 1TB file containing nothing but zeroes; the second removes it. Doing that to each of the 4 datastores on each VM got the reported utilization from 7TB down to 4TB per VM, and the Nutanix-reported utilization back under 60%. Some information I googled up seemed to indicate that operations on the virtual disks from within the hypervisor, or possibly a storage migration, might be necessary to get VMware to reclaim the space, but in our case it happened automatically (we’re running ESX 6.0.0, which may have different space reclamation techniques than previous versions).
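If you have a pile of filesystems to churn through, a loop like this sketch does the same thing; the mount points are illustrative, and sizing the file from df’s free-space figure (minus some headroom) keeps the filesystem from briefly hitting 100% full:

# Mount points are illustrative -- substitute your own filesystems.
for fs in /datastore01 /datastore02 /datastore03 /datastore04; do
    # Free space in MB, less ~5GB of headroom so the filesystem never quite fills
    free_mb=$(df -Pm "$fs" | awk 'NR==2 {print $4 - 5120}')
    dd if=/dev/zero of="$fs/file_full_of_zeroes" bs=1M count="$free_mb"
    rm -f "$fs/file_full_of_zeroes"
done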