Recover Deleted MegaRAID Volume with Linux

Recently a customer with a 4-disk RAID5 array backed by a MegaRAID controller came to us because their RAID volume was missing. The virtual disk (VD) exported by the RAID controller had disappeared!

The logs indicated that the volume had been deleted, but as far as we could tell no one was on the system at the time. Maybe it was a hardware/firmware error, maybe user error, but either way the information that defines the RAID volume was no longer available.

We were able to use the Linux RAID 5 module to recover the missing data!  Read on:

Hardware RAID Volumes

When a volume is deleted, the RAID controller removes the volume descriptor on the physical disks but it does not destroy the data itself; the data is still there. There are ways to recover the data using the controller itself: the MegaRAID volume recovery documentation suggests that you re-create the volume using the same parameters and disk order as the original RAID. For example, if you know that it was a 256k stripe, then you can recreate the array with the same disk ordering and the same stripe size. However, for a RAID volume that has been in service for years, how can this possibly be known unless someone wrote it down?

The controller would then stamp the disks with a new header and leave the data alone, so in theory there would be no data loss and the array would continue as it had originally. This procedure comes with quite a bit of risk, however: if the parameters are off, then you can introduce data corruption. Thus, you should first back up each disk individually as a raw full-disk image. Unfortunately this takes a long time and is far from convenient. If you get the parameters wrong, then the data should be restored before trying again to guarantee consistency, and that takes even longer.
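
If you do have the time and spare storage, a raw image of each member disk can be captured with something as simple as the following (the device and destination paths here are only illustrative):

dd if=/dev/sdX of=/backup/sdX.img bs=1M status=progress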

Guarantee Recovery Without Data Loss

Recovery in place was the only option because it would take too long to do full backups during each iteration. However, we also had to guarantee that the recovery process would not introduce failures that could lead to data corruption. For example, if we had chosen a 64k stripe size but the array was formatted with a 256k stripe size, then the reassembled data would be incorrect. While it is probably safe to try multiple stripe sizes without initialization by the RAID controller, there is the risk of technician error causing data loss during this process.

You should always avoid single-shot processes that risk corrupting data. In this case terabytes of scientific data were at risk, and I certainly did not trust the opaque behavior of a hardware RAID controller card that might try to initialize an array with invalid data when I did not want it to. Using a procedure provided by the RAID controller and making several attempts, each with an additional risk of losing data, is unnerving to say the least!

I would much prefer a method that is more likely to succeed, and certainly one with the guarantee that it is impossible to lose data. When working with customer data it is imperative that every test we make along the way is guaranteed not to make things worse.

How to use Linux for RAID Recovery

This is where Linux comes in: using the Linux loopback driver we can enforce read-only access to the disks, and using the RAID 5 target of the Linux dm-raid module we can attempt to reconstruct the array by guessing its parameters. This gives us a limitless number of tries and the guarantee that the attempt to recover the data will never cause corruption, whether or not it succeeds.

First, we exported the disks on the RAID controller as JBOD volumes. This allows Linux to see the raw disks as they are, without modification by the RAID controller firmware:

storcli64 /c0 set jbod=on

Once the drives were available to the operating system as raw disks, we used the Linux loopback driver to configure them as read-only volumes. For example:

losetup --read-only /dev/loop0 /dev/sdX
losetup --read-only /dev/loop1 /dev/sdY
losetup --read-only /dev/loop2 /dev/sdZ
losetup --read-only /dev/loop3 /dev/sdW

Now that the volumes are read-only, we can attempt to reconstruct the array using the Linux RAID 5 device mapper target. There are several major problems:

  1. We do not know the stripe size
  2. We do not know the on-disk format used by the RAID controller.
  3. Worse than that, we do not know the disk ordering that the RAID controller selected when it built the array.

This makes for quite a few unknown variables, and only one combination will be correct. In our case there were only 4 disks, so there are 24 possible disk orderings (4! = 4*3*2*1 = 24).

Now it is a matter of trial and error. We can use the computer to minimize the amount of typing we have to do, but manual inspection is still needed to confirm that the ordering it finds is useful and correct:

[0,1,2,3], [0,1,3,2], [0,2,1,3], [0,2,3,1], [0,3,1,2], [0,3,2,1],
[1,0,2,3], [1,0,3,2], [1,2,0,3], [1,2,3,0], [1,3,0,2], [1,3,2,0],
[2,0,1,3], [2,0,3,1], [2,1,0,3], [2,1,3,0], [2,3,0,1], [2,3,1,0],
[3,0,1,2], [3,0,2,1], [3,1,0,2], [3,1,2,0], [3,2,0,1], [3,2,1,0]

We permuted all possible disk orderings and passed them through to the `dm-raid` module to create the device mapper target:

dmsetup create foo --table '0 23438817282 raid raid5_la 1 128 4 - 7:0 - 7:1 - 7:2 - 7:3'
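
For reference, the fields in that table are roughly <start> <length> raid <raid_type> <#params> <chunk_size> <#devices> followed by a metadata/data device pair for each member: 23438817282 is the volume length in sectors, 128 is the chunk (stripe) size in 512-byte sectors (64k), 4 is the number of member devices, each "-" means there is no separate metadata device, and 7:N addresses the loop devices by major:minor.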

For each possible permutation, we used `gdisk` to determine whether the partition table was valid, and found that only disk 3 and disk 1 presented a valid partition table when they were the first drive in the list; thus, we were able to rule out half of the permutations in a short amount of time:

gdisk -l /dev/foo
...
Number Start (sector) End (sector) Size Code Name
1 2048 23438817279 10.9 TiB 0700 primary

We did not know the exact sector count of the original volume, so we had to estimate based on the ending sector reported by gdisk. We knew that there were 3 data disks, so the sector count had to be a multiple of three.
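
As a worked example (presumably how the 23438817282 figure in the dmsetup table above was arrived at): gdisk reported an ending sector of 23438817279, so the volume holds at least 23438817280 sectors, and rounding that up to the next multiple of three gives the length we used:

echo $(( (23438817280 + 2) / 3 * 3 ))    # prints 23438817282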

The next step was to use the file system checker (e2fsck / fsck.ext4) to determine which permutation produced the fewest errors. Many of the permutations that we tested failed immediately because the file system checker did not recognize the data at all. However, for a few of the permutations the file system checker understood the file system enough to spew thousands of lines of errors on the screen. We knew we were getting closer, but none of the file system checks completed with a reasonable number of errors.

This caused us to speculate that our initial guess of a 64k stripe size was incorrect. The next stripe size we tried was 256k, and we began to see better results. Again, many of the file system checks failed altogether, but the file system checker did better on some of the permutations; still, it was not quite right. We had only been trying the default raid5_la format, but the dm-raid module has the following possible formats:

  • raid5_la RAID5 left asymmetric – rotating parity 0 with data continuation
  • raid5_ra RAID5 right asymmetric – rotating parity N with data continuation
  • raid5_ls RAID5 left symmetric – rotating parity 0 with data restart
  • raid5_rs RAID5 right symmetric – rotating parity N with data restart

We added a second loop to test every RAID format for every permutation, and when it reached raid5_ls on the 23rd permutation, the file system checker became silent and took a very long time. Only rarely did it spit out a benign warning about some structure problem that it found, which was probably already present in the original array. We had found the correct configuration to recover this RAID volume!

While we initially had to figure this out by trial and error using a simple Perl script, we now know that this MegaRAID controller uses the raid5_ls RAID type. This was the correct configuration for our array:
  • RAID disk ordering: 3,2,0,1
  • RAID Stripe size: 256k
  • RAID on-disk format: raid5_ls – left symmetric: rotating parity 0 with data restart
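
Putting those parameters back into the dm table from earlier (256k = 512 sectors, disk order 3,2,0,1), the working mapping looked roughly like this:

dmsetup create foo --table '0 23438817282 raid raid5_ls 1 512 4 - 7:3 - 7:2 - 7:0 - 7:1'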

Now that the RAID volume was constructed, we wanted to test that it would mount and see if we could access the original data. Modern file systems have journals that replay at mount time, and we needed to keep this read-only because a journal replay of invalid data could cause corruption. Thus, we used the “noload” option while mounting to prevent replay:

mount -o ro,noload /dev/foo /mnt/tmp

The volume was already read-only because we used read-only loopback devices, so it was safe, but when we tried to mount without the noload option the mount was refused because the journal replay failed.

Automating the Process

Whenever there is a lot of work to do, we always do our best to automate the process to save time and minimize user error. Below you can see the Perl script that we used to inspect the disks.

The script does not have any specific intelligence; it just prints the result of each test for human inspection, and of course it needs to be tuned to the specific environment. All tests are done in read-only mode, and the loopback devices were configured before running the script.

When the file system checker for a particular permutation displayed thousands of lines of errors, we would kill that process from another window so the script would proceed to the next permutation. In these cases there was so much text displayed on the screen that we would pipe the output through less or redirect it into a file to inspect after the run.

This script is for informational use only; use it at your own risk! If you are in need of RAID volume recovery on a Linux system, then we may be able to help with that. Let us know if we can be of service!

#!/usr/bin/perl

#   This program is free software; you can redistribute it and/or modify
#   it under the terms of the GNU General Public License as published by
#   the Free Software Foundation; either version 2 of the License, or
#   (at your option) any later version.
# 
#   This program is distributed in the hope that it will be useful,
#   but WITHOUT ANY WARRANTY; without even the implied warranty of
#   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#   GNU Library General Public License for more details.
# 
#   You should have received a copy of the GNU General Public License
#   along with this program; if not, write to the Free Software
#   Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#
#   Copyright (c) 2023 Linux Global, all rights reserved.

use strict;

my @p = (
	[0,1,2,3], [0,1,3,2], [0,2,1,3], [0,2,3,1], [0,3,1,2], [0,3,2,1],
	[1,0,2,3], [1,0,3,2], [1,2,0,3], [1,2,3,0], [1,3,0,2], [1,3,2,0],
	[2,0,1,3], [2,0,3,1], [2,1,0,3], [2,1,3,0], [2,3,0,1], [2,3,1,0],
	[3,0,1,2], [3,0,2,1], [3,1,0,2], [3,1,2,0], [3,2,0,1], [3,2,1,0]);

my $stripe = 256*1024/512;
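# 256 KiB chunk size expressed in 512-byte sectors (256*1024/512 = 512)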
my $n = 0;

foreach my $p (@p)
{
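        # gdisk showed a valid partition table only when disk 3 or disk 1 came first,
        # so skip permutations that start with any other disk.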
        next unless $p->[0] =~ /3|1/;

        for my $fmt (qw/raid5_la raid5_ra raid5_ls raid5_rs raid5_zr raid5_nr raid5_nc/)
        {
                activate($p, $fmt);
        }
        $n++
}

sub activate
{
        my ($p, $fmt) = @_;

        system("losetup -d /dev/loop7; dmsetup remove foo");
        print "\n\n========= $n $fmt: @$p\n";

        my $dmsetup = "dmsetup create foo --table '0 23438817282 raid $fmt 1 $stripe "
               . "4 - 7:$p->[0] - 7:$p->[1] - 7:$p->[2] - 7:$p->[3]'";
        
        print "$dmsetup\n";
        system($dmsetup);
        system("gdisk -l /dev/mapper/foo |grep -A1 ^Num");
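        # Partition 1 starts at sector 2048 (1 MiB in), so expose it read-only at that offset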
        system("losetup -r -o 1048576 /dev/loop7 /dev/mapper/foo");
        system("file -s /dev/loop7");
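        # e2fsck -n answers "no" to every question, so the check never modifies anything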
        system("e2fsck -fn /dev/loop7 2>&1");
}

 

RedHat/CentOS/RHEL 7 does not copy mdadm.conf into Dracut

Force MD and LUKS Auto-Detection

There is a bug in RedHat 7 releases on some systems that prevents booting when md is used: for some reason dracut does not copy mdadm.conf into the initrd it generates. The fix recommended on the bug page (https://bugzilla.redhat.com/show_bug.cgi?id=1015204) is to add rd.md.uuid=<UUID>, but that can be a lot of work if you have many volumes. In addition, if you cannot paste the UUID, then it is hard to type.

To automatically enable MD and LUKS detection, add “rd.auto=1” to the kernel command line. You can see other command line options in the dracut documentation here: https://www.man7.org/linux/man-pages/man7/dracut.cmdline.7.html
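
On RHEL 7 style systems, one way to make this permanent is with grubby, or by adding it to GRUB_CMDLINE_LINUX in /etc/default/grub and regenerating the configuration (a generic grub2 recipe, not something specific to this bug):

grubby --update-kernel=ALL --args="rd.auto=1"
# or edit /etc/default/grub, then:
grub2-mkconfig -o /boot/grub2/grub.cfg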

LSI Megaraid Storage Manager Does Nothing

Installing Broadcom MSM for LSI Megaraid Cards

On a minimal CentOS install I found that MSM would refuse to load when I ran “/usr/local/MegaRAID\ Storage\ Manager/startupui.sh”; it would just exit without an error. If you cat the script you will notice java’s output being redirected into /dev/null, hiding any useful errors, so remove the redirect! At least then we can see the error.
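
For example (only a sketch; the exact redirect and its position vary by MSM version), you can back up the launcher and strip the /dev/null redirects so the Java errors reach your terminal:

cd "/usr/local/MegaRAID Storage Manager"
cp startupui.sh startupui.sh.orig
sed -i 's|>[[:space:]]*/dev/null 2>&1||g' startupui.sh
./startupui.sh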

Since this was a minimal install, I was missing some of the X libraries that MSM wanted.  This fixed it:

yum install libXrender libXtst

-Eric

 

librsync error: “RS_DEFAULT_STRONG_LEN” undeclared

We needed to compile an old version of rdiff-backup on CentOS 7 but got the following error:

_librsyncmodule.c: In function '_librsync_new_sigmaker':
_librsyncmodule.c:63:17: error: 'RS_DEFAULT_STRONG_LEN' undeclared (first use in this function)
         (size_t)RS_DEFAULT_STRONG_LEN);
                 ^
_librsyncmodule.c:63:17: note: each undeclared identifier is reported only once for each function it appears in
_librsyncmodule.c:63:9: error: too few arguments to function 'rs_sig_begin'
         (size_t)RS_DEFAULT_STRONG_LEN);
         ^
In file included from _librsyncmodule.c:25:0:
/usr/include/librsync.h:370:11: note: declared here
 rs_job_t *rs_sig_begin(size_t new_block_len,
           ^
error: command 'gcc' failed with exit status 1

The librsync library changed the calling convention of `rs_sig_begin`, so if you get an error like that, then a patch like this might help:

]# diff -uw _librsyncmodule.c.ORIG _librsyncmodule.c
--- _librsyncmodule.c.ORIG 2006-11-11 23:32:01.000000000 -0800
+++ _librsyncmodule.c 2018-02-20 11:22:06.529111816 -0800
@@ -59,8 +59,8 @@
 if (sm == NULL) return NULL;
 sm->x_attr = NULL;

- sm->sig_job = rs_sig_begin((size_t)blocklen,
- (size_t)RS_DEFAULT_STRONG_LEN);
+ sm->sig_job = rs_sig_begin((size_t)blocklen, 8,
+ (size_t)RS_MD4_SIG_MAGIC);
 return (PyObject*)sm;
 }

-Eric

PDFtk works on CentOS 7 and RHEL 8!

Installing PDFtk on CentOS/RHEL/Scientific Linux 7

Update: This procedure works in RHEL/Rocky/Alma/Oracle Linux 8 and Amazon Linux 2023!

In the transition to CentOS 7, libgcj, the GNU compiler for the Java programming language, was discontinued; this is partially due to it being dropped by the GCC suite. As it turns out, the libgcj.so.10 shared library from CentOS 6 is binary compatible with PDFtk. We use PDFtk in our office for collating documents that have been scanned, and it works great!
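
For example, collating a double-sided document scanned in two single-sided passes looks like this (the file names are just placeholders):

# interleave the odd-page pass with the even-page pass
pdftk A=odd-pages.pdf B=even-pages.pdf shuffle A B output collated.pdf

# or simply concatenate several scans
pdftk scan1.pdf scan2.pdf cat output combined.pdf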

After searching online and finding lots of links with varying levels of success, and knowing that server admins have used PDFtk for over a decade, we decided to package it and provide it to the community.

Installation is simple, depending on your architecture:

x86_64

yum localinstall https://www.linuxglobal.com/static/blog/pdftk-2.02-1.el7.x86_64.rpm

i686

yum localinstall https://www.linuxglobal.com/static/blog/pdftk-2.02-1.el7.i686.rpm

After Updating

If you are reading this article, then you probably just upgraded your system. Now would be a great time to consider security for your application and server infrastructure. There are many services that we offer including support, security, maintenance, monitoring, backups, and live SQL backup replication. We even offer a security hardened hosting environment!

Please let us know if you have any issues with these packages or if we may be of service!

Update: 2017-01-26

Some have asked for the .spec that we are using. Really all we are doing is repacking libgcj.so.10*, which we pulled out of the CentOS 6 libgcj-4.4.7-17.el6 package. PDFtk was downloaded as an RPM from their site, unmodified except that we converted it to a tar and added libgcj. You may need to edit the spec to make it build on your system, but it works in our build environment: https://www.linuxglobal.com/static/blog/pdftk.spec

-Eric

Block Device Replication with rdiff

I’ve written a few articles on rdiff-backup, and if you need an increment history to go back in time, rdiff-backup is your tool. 

But what if you just want to replicate a large block device over the Internet? Well, then we turn to the utility that inspired rdiff-backup: rdiff

For our example, we will assume you are using LVM to create device snapshots—but really, this could be any snapshot or SAN flash implementation. I’ve just written it for Linux’s LVM.

  • /dev/remote-vg0/source will be the device we are replicating from
  • /dev/local-vg0/dest will be the device we are replicating to
  • remotehost is the system that hosts /dev/remote-vg0/source
  • This script is being executed on the destination system.
# Define our source and destination
# (Note: spaces in these paths could break the script)
SOURCE=/dev/remote-vg0/source
DEST=/dev/local-vg0/dest
SSHUSER=root@remotehost

# Choose a size large enough for the remote write-activity during
# replication
SOURCE_SNAPSHOT_SIZE=4G

# Must be the same size as $DEST, because rdiff writes in sequential
# order (it thinks the destination is an empty file, so it re-writes
# everything.)
#
# See Feb 18, 2011 update notes below.  This can be much smaller now if you 
# use the patch below, since writes are avoided unless necessary.
#DEST_SNAPSHOT_SIZE=50G

# This is probably safe with the librsync patch discussed below
DEST_SNAPSHOT_SIZE=$SOURCE_SNAPSHOT_SIZE

# Enable compression
SSHOPTS='-C'

# 32k I/O buffers, and 16k blocksize.
RDIFF_OPT='-I 32768 -O 32768 -b 16384 -s'

SOURCE_NAME=`basename "$SOURCE"`
SOURCE_SNAP="`dirname $SOURCE`/$SOURCE_NAME-snap"
SOURCE_SNAP_NAME="$SOURCE_NAME-snap"

DEST_NAME=`basename "$DEST"`
DEST_SNAP="`dirname $DEST`/$DEST_NAME-snap"
DEST_SNAP_NAME="$DEST_NAME-snap"

# remove the previous snapshots, if any
ssh $SSHOPTS "$SSHUSER" "lvremove -f '$SOURCE_SNAP'"
lvremove -f "$DEST_SNAP"

# Snapshot the remote host:
ssh $SSHOPTS "$SSHUSER" "lvcreate -s -n '$SOURCE_SNAP_NAME' -L $SOURCE_SNAPSHOT_SIZE '$SOURCE'"

# Snapshot the local destination host:
lvcreate -s -n "$DEST_SNAP_NAME" -L $DEST_SNAPSHOT_SIZE "$DEST"

rdiff $RDIFF_OPT -- signature "$DEST_SNAP" - | \
  ssh $SSHOPTS "$SSHUSER" "rdiff $RDIFF_OPT -- delta - '$SOURCE_SNAP' -" | \
  rdiff $RDIFF_OPT -- patch "$DEST_SNAP" - "$DEST"

# Compare the volumes, if you like
ssh $SSHOPTS "$SSHUSER" "md5sum '$SOURCE_SNAP'"
md5sum $DEST

# cleanup, remove the snapshots.
ssh $SSHOPTS $SSHUSER "lvremove -f '$SOURCE_SNAP'"
lvremove -f "$DEST_SNAP"

This is a convenient single-pipe process for replication, and because it uses the librsync rolling-checksum algorithm, it consumes minimal bandwidth on the network link that ssh traverses.

Executing this script yields something like this on a 50GB volume; note that the md5sums match perfectly.

  Logical volume "source-snap" created
  Logical volume "dest-snap" created
rdiff: signature statistics: signature[3276800 blocks, 16384 bytes per block]
rdiff: loadsig statistics: signature[3276800 blocks, 16384 bytes per block]
rdiff: delta statistics: literal[27842 cmds, 805715968 bytes, 83492 cmdbytes] copy[1462034 cmds, 52881375232 bytes, 372798 false, 10746386 cmdbytes]
rdiff: patch statistics: literal[27842 cmds, 805715968 bytes, 83492 cmdbytes] copy[1462034 cmds, 52881375232 bytes, 0 false, 10746386 cmdbytes]
7fddc578cdbf5f4e30b7f815e72acebd  /dev/local-vg0/dest
7fddc578cdbf5f4e30b7f815e72acebd  /dev/remote-vg0/source-snap
  Logical volume "source-snap" successfully removed
  Logical volume "dest-snap" successfully removed

Since rdiff does not know the destination is a snapshot of the basis-file, it rewrites the whole thing. Indeed, it could simply seek instead of copying from the basis-file, but the stock rdiff tool does not support this. If I write a patch, it will get posted here—and if you write a patch, please let me know! (see update below!)

Until then, keep double the space free in your volume group that you need to run a snapshot and it should work great!

Update: Fri Feb 18 12:39:51 PST 2011: I just wrote a patch for rdiff (within the librsync package) that updates in place by patching the file-stream-sink code in buf.c. Basically, it reads before writing: if the data read in is the same as what it would have written, it skips the write and advances the write pointer; otherwise, it writes as normal. Since this avoids writing to the snapshotted device except where necessary, much less snapshot backing store is required. This code passes all of the ‘make check’ tests that come with librsync, and I believe it to be stable. On my system, rdiff syncs are about 2x faster than the original implementation due to the much-reduced write overhead.

  • The patch is here
  • and the patched code, ready to compile, is here

-Eric

BlockFuse to the Rescue: rdiff-backup of LVM Snapshots and Block Devices

Over the years I have used rdiff-backup as an incremental backup solution. It works really well on many platforms, supports files >4GB, ACLs, and much much more.

Unfortunately, rdiff-backup does not support backing up block device content; instead, it replicates the block device inode’s major/minor numbers on the destination system without backing up the content inside the device. If you are backing up your entire root filesystem (/), this is probably what you want. But what if you are backing up large virtual machine LVM snapshots?

Not finding a solution on the web, I wrote my own using the Linux FUSE filesystem.

BlockFuse takes two arguments:

	# ./block-fuse
	usage: ./block-fuse /dev/directory /mnt/point

For example:

	# # Take an LVM snapshot:
	# lvcreate -s /dev/vgBoot/asterisk -n _snap-asterisk -L 1G
	   ...
	# # Mount /dev/mapper as /mnt/block-devices:
	# ./block-fuse /dev/mapper /mnt/block-devices
	# ls -l /mnt/block-devices
	-r-------- 1 root root  10G 2010-12-21 16:07 vgBoot-_snap--asterisk
	-r-------- 1 root root 1.0G 2010-12-21 16:07 vgBoot-_snap--asterisk-cow
	   ...
	# # Perform your backup:
	# rdiff-backup --include '/mnt/block-devices/*_snap*' --exclude '*' \
		/mnt/block-devices \
		/mnt/backup/lvm-snapshots
	# #
	# ls -l /mnt/backup/lvm-snapshots/
	drwx------ 3 root root         4096 2010-12-21 16:19 rdiff-backup-data
	-r-------- 1 root root  10737418240 1969-12-31 16:00 vgBoot-_snap--asterisk
	-r-------- 1 root root   1073741824 1969-12-31 16:00 vgBoot-_snap--asterisk-cow

Thus, rdiff-backup is able to back up block-device content, including LVM snapshots, using BlockFuse. BlockFuse is quite simple: it enumerates the contents of the mount-source directory and exports every block device with non-zero size as a file with 0400 permissions, owned by your FUSE user (probably root for this).

Notes:

  • BlockFuse does not support writing, so your data is read-only-safe. In a catastrophic recovery where you cannot restore a snapshot and must recover from rdiff-backup, just use rdiff-backup’s --restore-as-of argument and ‘dd’ the recovered “file” back onto the original block device (see the sketch after these notes).
  • BlockFuse uses the mount-time as the modification time (st_mtime) for the mounted filesystem. This will force rdiff-backup to scan the block devices for changes. Therefore you must unmount and re-mount your BlockFuse filesystem after updating your snapshots. If you do not, rdiff-backup will skip the “files” because their modification timestamp had not changed since the last backup. (It would be easy to write a SIGHUP handler for this, so send me a patch if you do!)
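
As a sketch of the recovery mentioned in the first note (the time argument and the temporary image path are only illustrative):

	# rdiff-backup --restore-as-of now \
		/mnt/backup/lvm-snapshots/vgBoot-_snap--asterisk /tmp/vgBoot-asterisk.img
	# dd if=/tmp/vgBoot-asterisk.img of=/dev/vgBoot/asterisk bs=1M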

Incidentally, I have this working in production, backing up snapshots as large as 350GB, so this is well tested. Still, this software is TO BE USED AT YOUR OWN RISK! Patches are welcome if you have a novel idea or change to add to BlockFuse.

Wed Dec 22 15:19:06 PST 2010: BlockFuse v0.01 initial release
Tue Dec 21 16:39:41 PST 2010: BlockFuse v0.02 now uses mmap’ed IO!
Tue Jan 14 10:53:54 PST 2014: BlockFuse v0.03 now follows symlinks and supports i386 architectures.

Download BlockFuse v0.03.

2014-01-14: Thank you for your patience waiting for the current version to be uploaded.  If someone would like to maintain BlockFuse and open a public git repo to maintain the package I would greatly appreciate it.

Cheers,

-Eric

Perl script-fu: Downloading UPS Invoices

UPS provides their shipping invoices and tracking history via their website, and you can even download .CSV and .XML files. Unfortunately, they do not have an easy way to automate this process. After scouring the web for something someone else had written, I decided to write my own:

pull-ups-invoices.pl

It’s a terse 100-line program, and your config options are at the top of the script. It depends on WWW::Mechanize, so make sure it’s installed. Oh, and be forewarned: this script does not validate UPS’s SSL certificate, so I hope you trust your link to UPS 😉
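
If WWW::Mechanize is not already installed, it is usually a one-liner (the package name below is the EPEL name; adjust for your distribution):

yum install perl-WWW-Mechanize
# or, straight from CPAN:
cpan WWW::Mechanize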

-Eric