If you’ve been a systems administrator for awhile, then you know it’s best practice to have security updates to install automatically—and you also know that this breaks things from time to time. This happened to use when EL 7.3 came out a few months ago and caused unexpected issues with systems running KVM, libvirt, and LVM2 with large quantities of snapshots (4,480 and counting!).
The first issue that we discovered was virtual machine lockup during live migration. This is related to an MSR_TSC_AUX update that Redhat pushed into 7.3, but for which the Linux 4.1.y stable branches had not yet merged the kernel update to support this. While I’ve not yet tested 4.1.39, it appears to have those patches. Most users will not experience this particular bug if they are using the vendor provided EL7 kernel—but if you are using 4.1 in order to have stable bcache support, then you might run into this. You can read more details on the patches here: https://patchwork.kernel.org/patch/9538171/
Shortly after we discovered the first issue (but before we had time to fix it), we discovered that LUKS passthrough crashes libvirt unless you are using libvirt’s keystore. Since we pass encrypted volumes directly into the virtual machine and let the virtual machine unlock the volume, this was causing endless segmentation faults of libvirtd as systemd restarted it after failure. After much troubleshooting and inspection with GDB to figure out where the problem actually was, we discovered that libvirt was assuming that all LUKS volumes have a key in their keystore. This has been fixed in the latest version, and more information about this is available here: https://bugzilla.redhat.com/show_bug.cgi?id=1411394
Not to be outdone, the 7.2 to 7.3 upgrade was also causing segmentation faults of dmeventd. At the time, we did not know that it was a bug in LVM2—but having a third issue compounded with the two above, it was time for more drastic measures: Revert the packages! After installing the EL7.2 version of libvirt, KVM, and LVM2 (and their dependencies), we were back up and running.
Feeling brave, we decided to try the 7.3 upgrade again today since the first two issues were fixed. At the time, we didn’t really know the third issue was an issue independent of the others, so this was our first opportunity to investigate. This issue is still outstanding, and the actual problem is unclear. We have found the first bad commit (9156c5d dmeventd rework locking code) in LVM2 and posted to the lvm-devel list, so hopefully this will be fixed soon. For the moment we are holding back LVM2 updates which seems to be working fine with the rest of the system packages upgraded to 7.3. You can read more about the beginning of this fix here: https://lvm-devel.redhat.narkive.com/xxKNaNG6/bisect-regression-segv-9156c5d-dmeventd-rework-locking-code
So is it time to 7.3 from 7.2? Yes! But only if you hold back LVM2. The easiest way to do this is to add the following to your /etc/yum.repo.d/CentOS-Base.repo in the [base] and [updates] sections:
Update: Tue Apr 4 16:25:39 PDT 2017
The LVM problem was related to the reserved_stack value in /etc/lvm/lvm.conf being too high on our system. Somehow this introduced a regression in LVM2 since it certainly worked before in EL7.2 .
So, if you get an error like this, shrink your reserved_stack and see if it fixes the problem:
kernel: dmeventd: segfault at 7f9477240ea8 ip 00007f9473f24617 sp 00007f9477240eb0 error 6 in liblvm2cmd.so.2.02[7f9473e83000+191000]