Issues Upgrading CentOS/RHEL/Scientific Linux 7.2 to 7.3

If you’ve been a systems administrator for awhile, then you know it’s best practice to have security updates to install automatically—and you also know that this breaks things from time to time. This happened to use when EL 7.3 came out a few months ago and caused unexpected issues with systems running KVM, libvirt, and LVM2 with large quantities of snapshots (4,480 and counting!).

The first issue that we discovered was virtual machine lockup during live migration. This is related to an MSR_TSC_AUX update that Redhat pushed into 7.3, but for which the Linux 4.1.y stable branches had not yet merged the kernel update to support this. While I’ve not yet tested 4.1.39, it appears to have those patches. Most users will not experience this particular bug if they are using the vendor provided EL7 kernel—but if you are using 4.1 in order to have stable bcache support, then you might run into this. You can read more details on the patches here: https://patchwork.kernel.org/patch/9538171/

Shortly after we discovered the first issue (but before we had time to fix it), we discovered that LUKS passthrough crashes libvirt unless you are using libvirt’s keystore. Since we pass encrypted volumes directly into the virtual machine and let the virtual machine unlock the volume, this was causing endless segmentation faults of libvirtd as systemd restarted it after failure. After much troubleshooting and inspection with GDB to figure out where the problem actually was, we discovered that libvirt was assuming that all LUKS volumes have a key in their keystore. This has been fixed in the latest version, and more information about this is available here: https://bugzilla.redhat.com/show_bug.cgi?id=1411394

Not to be outdone, the 7.2 to 7.3 upgrade was also causing segmentation faults of dmeventd. At the time, we did not know that it was a bug in LVM2—but having a third issue compounded with the two above, it was time for more drastic measures: Revert the packages! After installing the EL7.2 version of libvirt, KVM, and LVM2 (and their dependencies), we were back up and running.

Feeling brave, we decided to try the 7.3 upgrade again today since the first two issues were fixed. At the time, we didn’t really know the third issue was an issue independent of the others, so this was our first opportunity to investigate. This issue is still outstanding, and the actual problem is unclear. We have found the first bad commit (9156c5d dmeventd rework locking code) in LVM2 and posted to the lvm-devel list, so hopefully this will be fixed soon. For the moment we are holding back LVM2 updates which seems to be working fine with the rest of the system packages upgraded to 7.3. You can read more about the beginning of this fix here: https://www.redhat.com/archives/lvm-devel/2017-March/msg00354.html

So is it time to 7.3 from 7.2? Yes! But only if you hold back LVM2. The easiest way to do this is to add the following to your /etc/yum.repo.d/CentOS-Base.repo in the [base] and [updates] sections:

exclude=lvm2* device-mapper*

Update: Tue Apr 4 16:25:39 PDT 2017

The LVM problem was related to the reserved_stack value in /etc/lvm/lvm.conf being too high on our system. Somehow this introduced a regression in LVM2 since it certainly worked before in EL7.2 .

So, if you get an error like this, shrink your reserved_stack and see if it fixes the problem:

kernel: dmeventd[28383]: segfault at 7f9477240ea8 ip 00007f9473f24617 sp 00007f9477240eb0 error 6 in liblvm2cmd.so.2.02[7f9473e83000+191000]

-Eric

CentOS6 initrd says “already mounted or /sysroot busy”

If you are booting a CentOS 6 system after having migrated its root filesystem to a new volume, you might get the following errors if /proc or /sys is missing:

EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: 
mount: /dev/mapper/vg0-root already mounted or /sysroot busy
mount: according to mtab, /dev/mapper/vg0-root is already mounted on /sysroot
dracut: Remounting /dev/mapper/vg0-root with -o relatime,ro
EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: 
mount: /dev/mapper/vg0-root already mounted or /sysroot busy
mount: according to mtab, /dev/mapper/vg0-root is already mounted on /sysroot
dracut: Remounting /dev/mapper/vg0-root with -o relatime,ro
EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: 
dracut Warning: Can't mount root filesystem

To fix this, all you need to do is mount the root filesystem and “mkdir proc/ sys/”. You can even do this from inside of dracut if you add the “rdshell” argument to the end of your kernel command line:

dracut:/# mount -o remount,rw /sysroot
dracut:/# mkdir /sysroot/proc /sysroot/sys
dracut:/# mount -o remount,ro /sysroot
dracut:/# exit
(You may need to reboot the server)

-Eric

 

 

Forcing insserv to start sshd early

Many distributions are using the `insserv` based dependency following at boot time.  After a bit of searching, I found very little actual documentation on the subject.  Here’s the process:

  1. Add override files to /etc/insserv/override/
  2. The files must contain ‘### BEGIN INIT INFO’ and ‘### END INIT INFO’, else insserv will ignore them.
  3. Some have indicated that you can override missing LSB fields with this method, however, it does require the Default-Start and Default-Start options even though you wouldn’t expect to need to override those.
  4. The name of the file in /etc/insserv/override must be equal to the name in /etc/init.d *not* the name it “Provides:”.  In an ideal world, the name would be the same as provides—but in this case that isn’t always so.

For my purpose, I created overrides for all of my services in rc2.d with this script.  Note that the overrides are just copies of the content  from the /etc/init.d/ scripts:

cd /etc/rc2.d
# This is one long line; $f is filename, $p is the Provides value.
grep Provides * | cut -f1,3 -d: | tr -d : | while read f p; do perl -lne '$a++ if /BEGIN INIT INFO/; print if $a; $a-- if /END INIT INFO/' $f > /etc/insserv/overrides/$p;done

Note that the script writes the filename from the “Provides” field so you may need to change the filename if you have initscripts where /etc/init.d/script doesn’t match the Provides field.  Notably, Debian Wheezy does not follow this for ssh.  Provides is sshd, but the script is named ssh.

Next, I append sshd to the Require-Start line of all of my overrides:

cd /etc/insserv/overrides/
perl -i -lne 's/(Required-Start.*)$/$1 sshd/; print' *

This of course creates a cyclic dependency for ssh, so fix that one up by hand.  Feel free to make any other boot-order preferences while you’re in the overrides directory.  For this case, ssh  was made dependent on netplug.

Finally, run `insserv` and double-check that it did what you expected:

# cat /etc/init.d/.depend.start
TARGETS = rsyslog munin-node killprocs motd sysfsutils sudo netplug rsync ssh mysql openvpn ntp wd_keepalive apache2 bootlogs cron stop-readahead-fedora watchdog single rc.local rmnologin
INTERACTIVE =
netplug: rsyslog
rsync: rsyslog
ssh: rsyslog netplug
mysql: rsyslog ssh
ntp: rsyslog ssh
[...snip...]

Viola!  Now I can ssh to the host far earlier, and before services that can take a long time to start to troubleshoot in case of a problem.  In my opinion, ssh should always run directly after the network starts.

-Eric

 

Linux Kernel bug from 2002?

Really Old Bugs

Apparently there is a bug from kernels as old as 2.5.44 that pop up every so often causing hours of work for developers to hunt down.  Hopefully it can be fixed upstream, or maybe this is a “won’t fix” for some very good reason that I am unaware of:  http://osdir.com/ml/linux.enbd.general/2002-10/msg00176.html .  In my opinion, an issue like this should give some meaningful error rather than causing deadlock.

 

The fix

Basically add_disk (and therefore register_disk() where the problem actually resides) must be called *before* set_capacity() in Linux block device drivers.  This is backwards of the way I would think, as I would configure the device parameters before publishing it into userspace—but that is backwards in the Linux kernel and can (will?) cause deadlock.

Upstream

Recently I encountered this issue/bug in a zfs-git (zfsonlinux) build.  I’ve resolved the kernel hang and I’m working on a minimal patch for ZFS.  For now follow this ZFS ZVOL issue on github: https://github.com/zfsonlinux/zfs/issues/1488 .

Update: a pull request is pending here: https://github.com/zfsonlinux/zfs/pull/1491 and a patch has been listed on the issues page.

 

-Eric

Linux Software RAID, disk-0 failed. Will my server still boot?

First, it is my opinion that you shouldst use hardware RAID of some form.  Software RAID, in my opinion, is best used to stripe volumes between multiple hardware RAID controllers which do not support spanning.

My opinion aside, will the server still boot?  Yes!  … if it is configured correctly.

The Multiple Disk (md) infrastructure in Linux is quite flexible, and there are many articles available for its use.  When configuring a server to recover from a failed disk-0 in a RAID mirror, your boot partition should be mirrored using metadata version 1.0.  Version 1.0 places metadata at the end of the device, whereas 1.1 metadata is at the front of the device.  Since metadata is at the end of the disk, GRUB (or whatever bootloader you prefer) can still read your boot images.

Lets say you have a server with (at least) three disk bays.  Disk-0 in bay1 fails,  so you add a 3rd disk in bay3 and rebuild the volume.  The process might look something like thos:

## Transfer the boot sector
# dd if=/dev/sdb count=1 of=/dev/sdc 

## Reread the partition table
# blockdev --rereadpt /dev/sdc

## Add the hot spare
# mdadm --add /dev/md0 /dev/sdc1
# mdadm --add /dev/md1 /dev/sdc2

## Fail the bad disk:
# mdadm --fail /dev/sda1
# mdadm --fail /dev/sda2

Now disk-0 (sda) is failed, and the mirror is rebuilding on sdb/sdc.  If you boot, what will happen?  Will it load the mirror correctly?  Will the kernel respect which disk is in the mirror?  We recently had a real-life scenario where a CentOS 6 server was in production and could not be rebooted, but we needed to know if the server would come up if there was a reboot.  Disk-0 was dying (but not completely dead yet).

Test to make certain

  1. If disk1 may not contain the right boot sector, so when disk0 is removed, will the server boot?
  2. If disk0 isn’t removed and the server is rebooted, will it boot?  If it does come up, will the kernel respect that disk0 is, indeed, failed?

The answer to both of these questions, at least in theory, is yes.

To be sure, I simulated the two failure scenarios above and everything worked without intervention.  This was the order of things, disks are named disk0, disk1, disk2:

  1. Install CentOS 6 on mirrored boot and lvm partitions across two disks.
  2. Add disk2, copy the bootsector over, and add as a hot spare.
  3. Fail disk0, let the hot spare rebuild.
  4. reboot!
  5. The system loads the bootsector from disk0 because it is the first physical disk serviced by BIOS.
  6. The kernel boots and auto-detects the RAID1 mirrors on disk1 and disk2, ignoring disk0 which we failed (good!)  This verifies question #1.
  7. Physically remove disk0; BIOS will see disk1 as the first BIOS drive.
  8. reboot.
  9. The system loads the bootsector from disk1 because it is the first physical disk serviced by BIOS.
  10. The kernel boots and initrd auto-configures the RAID1 mirrors on disk1 and disk2.  Thus validates question #2.

Of course you would expect the above to work—but its always best to test and understand exactly how your disk-volume software will act in various failure scenarios when working in a production environment.  “I think so” isn’t good enough to go on—you must know.

So, if you’re booting from software RAID, you can usually trust that your data is safe.   Sometimes a failed disk will hang IOs to the device.  I have seen servers completely freeze when this happens while it attempts to retry the IO over-and-over-and-over.  This is where hardware RAID can really save you; the hardware controller would have timed-out the RAID member disk, failed it, and continued with very little (if any) interruption.

Linux Raid controller tips

  • Be careful of “softraid” chipsets out there, not all RAID is real-RAID.  My favorite controllers in order or preference are 3ware, Areca, and LSI.  The PERC 7xx series are ok too, but I wouldn’t trust a PERC 2xx.
  • If you use LSI go with a higher-end controller for better performance and less fuss.  Generally speaking, I’ve had great success with LSI controllers that have onboard cache memory (even if you don’t use it in write-back mode).  Cacheless LSI controllers have created problems more times than I care to recall.
  • Check the RAID-levels that the card supports.  If the controller supports RAID-5 or -6, it is probably a better controller even if you only use the RAID-10 functionality.
  • Also of note, LSI now owns 3ware and uses LSI chips in 3ware’s hardware.  I have since used LSI-built 3ware cards and they still have the simple and robust 3ware feel.  I have a feeling that LSI will keep the 3ware brand for some time to come.

-Eric

 

Recovering an overflowed LVM volume configured with –virtualsize

/dev/vg/somevolume: read failed after 0 of 4096 at nnnnn: Input/output error

If you’ve ever seen the above error, this usually means you have run out of disk space on the CoW-volume of a snapshot volume.

…but there is another uses for snapshots, and that is thin provisioning for sparse data use.  If you create an LVM volume using the –virtualsize option, you can provide a logical size that is much larger than the actual underlying volume.  If you exceed the space for such a volume, you will get the same error above—and all data on the volume will be invalidated and inaccessible.

LVM silently uses the ‘zero’ devicemapper target as the underlying volume.  Thus, even though the data is invalidated nothing is lost.  By overlaying the lost data over the top of a zero device, we can resurrect the data.

We have prepared our example file with the following:

lvcreate -L 100m --virtualsize 200m -n virtual_test vg
mkfs.ext4 /dev/vg/virtual_test
 [...]
mount /dev/vg/virtual_test /mnt/tmp/

And now we fill the disk:

dd if=/dev/zero of=/mnt/tmp/overflow-file
dd: writing to `/mnt/tmp/overflow-file': Input/output error

Message from syslogd@backup at Aug 27 15:17:27 ...
 kernel:journal commit I/O error
272729+0 records in
272728+0 records out
139636736 bytes (140 MB) copied
[I had to reboot here.  The kernel still thought
 the filesystem was mounted and I could not continue.
 Obviously we are working near the kernel's limits on
 this CentOS 6.2 2.6.32-based kernel]

Now we have a 200MB volume with 100MB allocated to it, which is now full.  LVM has marked the volume as invalid and the data is no longer available.

First, resize the volume so we have room after resizing.  Otherwise, the first byte written to the volume would, again, invalidate the disk:

lvresize -L +100m /dev/vg/virtual_test
 [errors, possibly, just ignore them]
  Extending logical volume virtual_test to 200.00 MiB
  Logical volume virtual_test successfully resized

Now we edit the -cow file directly with a short perl script.  The 5th byte is the ‘valid’ flag (see http://www.redhat.com/archives/linux-lvm/2006-September/msg00132.html) so all we need to is set it to ‘1’:

 perl -e 'open(F, ">>", "/dev/mapper/vg-virtual_test-cow"); seek(F, 4, SEEK_SET); syswrite(F,"\x01",1); close(F);'

Now have lvm re-read the CoW metadata and you’re in business:

lvchange -an /dev/backup/virtual_test
  [ignore errors]
lvchange -ay /dev/backup/virtual_test
  [shouldn't have any errors]
lvs
  LV                    VG       Attr     LSize   Pool Origin               Data% 
  virtual_test          vg   swi-a-s- 200.00m      [virtual_test_vorigin]   33.63

At this point you should probably fsck your filesystem, it may be damaged—or at least nead a journal-replay since it stopped abruptly at the end of its allocated space.  And as you can see, the “overflow” file is there up until the point of filling the disk.

[root@backup mapper]# e2fsck /dev/vg/virtual_test
e2fsck 1.41.12 (17-May-2010)
/dev/vg/virtual_test: recovering journal
/dev/vg/virtual_test: clean, 12/51200 files, 66398/204800 blocks
[root@backup mapper]# mount /dev/vg/virtual_test /mnt/tmp/
[root@backup mapper]# ls -lh /mnt/tmp/
total 54M
drwx------. 2 root root 12K Aug 27 15:16 lost+found
-rw-r--r--. 1 root root 54M Aug 27 15:17 overflow-file

-Eric