A few days ago, I learned that Proxmox automated backups can actually cost uptime.
For this post to make any sense, I should start by explaining how my servers are set up. I will only talk about my Proxmox servers, because the other servers are irrelevant to this story. I have a Proxmox cluster which consists of two hosts. One of the hosts contains an HBA, which is connected to about 20TB worth of disks. The HBA is passed through to a VM, which runs an NFS server, among other things. Both of my Proxmox hosts connect to the NFS server, which is used only for backups. It sounds pretty janky, but it's honestly not that bad.
Every night at midnight, a scheduled backup runs for most VMs. The way it works in Proxmox, each host backs up one VM at a time to the NFS server. For me, that means two VMs are backed up at a time. Unfortunately, on Friday night the disk used for backups in the NFS server stopped responding. This shouldn't be a big deal - the backup would fail, of course, but it's a non-mission-critical disk and should absolutely not affect the VMs.
Except, the backup didn't fail. It stopped transferring data, but it hung. At that point in time, it was in the middle of backing up the NFS server on host A and my production Matrix server on host B. I noticed the disk I/O on both of these VMs dropped to 0 immediately. I believe this is because the backup process needs to keep track of I/O during the backup so that it can update the backup image at the very end. Because their I/O dropped to 0, both VMs stopped responding. All other VMs were working fine, but I couldn't get the Matrix server to respond.
My first thought was that I would just kill the backup. I always have 3 previous days of backups on my system, so it wouldn't be the end of the world to miss one. I found the command vzdump -stop, which just gave me an unhelpful and misspelled error message: "stoping backup process <pid> failed". I tried kill -9 <pid> as well, which didn't do anything either. I've seen this happen before - the process is waiting for I/O of some sort, and until it gets a response it can't be killed. I tried force unmounting the NFS share, which also didn't help.
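For anyone who wants to confirm that a stuck process really is blocked on I/O before reaching for the reboot button, here is a minimal sketch in Python. It assumes a Linux host, where /proc/<pid>/stat reports a one-letter process state and "D" means uninterruptible sleep. The function names are my own for illustration and are not part of Proxmox or vzdump.

```python
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split after the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def is_uninterruptible(pid: int) -> bool:
    """True if the process is in uninterruptible sleep (state 'D').

    Assumes a Linux /proc filesystem; raises FileNotFoundError if the
    process does not exist.
    """
    stat_line = Path(f"/proc/{pid}/stat").read_text()
    return parse_state(stat_line) == "D"
```

Calling is_uninterruptible with the pid of the hung backup process would return True if it is stuck in state "D". As far as I understand, signals sent to a process in that state (including the one from kill -9) stay pending until the blocked system call returns, which would match what I saw.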
At this point, I also didn't have access to the NFS server (since it was also unresponsive) and therefore couldn't see what was wrong. I opted to reboot it. However, the lack of disk I/O meant I couldn't even connect to the console, which I needed to get past a warning screen that's a result of the PCIe passthrough. I admit, that part is my fault. I'm skeptical that rebooting the VM would have helped anyway, since if the console couldn't load, the VM probably still wasn't getting any disk throughput.
I opted to reboot the host. This freed it from its backup, and allowed the NFS server to start back up. The disk that was causing problems was working again, for some reason. The second host reconnected to the NFS share, and the backup finally failed. Immediately, the VM started working again - no reboot was required.
This is, in my opinion, less than ideal behaviour. When auxiliary infrastructure goes down (e.g. storage for backups), it should not kill VMs so readily. However, this seems to be more of a QEMU quirk than anything, so I cannot blame Proxmox. Hopefully this experience will serve as a warning for someone else!
