Recovering an overflowed LVM volume configured with --virtualsize

/dev/vg/somevolume: read failed after 0 of 4096 at nnnnn: Input/output error

If you’ve ever seen the error above, it usually means you have run out of space on the CoW volume of a snapshot.

…but there is another use for snapshots: thin provisioning for sparse data.  If you create an LVM volume using the --virtualsize option, you can give it a logical size that is much larger than the space actually allocated to it.  If you write more than the allocated space can hold, you will get the same error as above, and all data on the volume will be invalidated and inaccessible.

LVM silently builds such a volume as a snapshot whose origin is a device-mapper ‘zero’ target, so everything written to the volume lives in the CoW store.  Thus, even though the volume is marked invalid, nothing is lost.  By making the CoW store valid again, we overlay the data back on top of the zero device and resurrect it.

We have prepared our example volume with the following:

lvcreate -L 100m --virtualsize 200m -n virtual_test vg
mkfs.ext4 /dev/vg/virtual_test
 [...]
mount /dev/vg/virtual_test /mnt/tmp/

And now we fill the disk:

dd if=/dev/zero of=/mnt/tmp/overflow-file
dd: writing to `/mnt/tmp/overflow-file': Input/output error

Message from syslogd@backup at Aug 27 15:17:27 ...
 kernel:journal commit I/O error
272729+0 records in
272728+0 records out
139636736 bytes (140 MB) copied
[I had to reboot here.  The kernel still thought
 the filesystem was mounted and I could not continue.
 Obviously we are working near the kernel's limits on
 this CentOS 6.2 2.6.32-based kernel]

Now we have a 200MB volume with 100MB allocated to it, which is now full.  LVM has marked the volume as invalid and the data is no longer available.

First, extend the volume so there is free space once it is valid again.  Otherwise, the first byte written to the volume would invalidate it again:

lvresize -L +100m /dev/vg/virtual_test
 [errors, possibly, just ignore them]
  Extending logical volume virtual_test to 200.00 MiB
  Logical volume virtual_test successfully resized

Now we edit the -cow device directly with a short perl script.  The 5th byte (offset 4) is the ‘valid’ flag (see http://www.redhat.com/archives/linux-lvm/2006-September/msg00132.html), so all we need to do is set it to ‘1’:

perl -MFcntl=:seek -e 'open(F, "+<", "/dev/mapper/vg-virtual_test-cow") or die $!; sysseek(F, 4, SEEK_SET); syswrite(F, "\x01", 1); close(F);'
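
Before touching the real -cow device, it may be worth rehearsing the edit on a scratch file.  The following is only a sketch and not part of the recipe above: it uses dd with conv=notrunc instead of perl, the function name set_valid_flag is made up for illustration, and the scratch file merely imitates an invalidated header (the 4-byte "SnAp" magic followed by a zeroed flag).  The point is that the byte at offset 4 is overwritten in place rather than appended.

```shell
# Hypothetical helper: overwrite the byte at offset 4 (the 5th byte) with 0x01.
# bs=1 seek=4 skips the first 4 bytes; conv=notrunc leaves the rest untouched.
set_valid_flag() {
    printf '\001' | dd of="$1" bs=1 seek=4 count=1 conv=notrunc 2>/dev/null
}

# Rehearsal on a scratch file that imitates an invalidated header.
scratch=$(mktemp)
printf 'SnAp\000\000\000\000' > "$scratch"
set_valid_flag "$scratch"
od -An -tu1 -j4 -N1 "$scratch"   # shows the flag byte after the edit
rm -f "$scratch"
```

Copying the first sector of the real device aside beforehand (for example with dd, bs=512 count=1) leaves a way back if the edit goes wrong.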

Now have lvm re-read the CoW metadata and you’re in business:

lvchange -an /dev/vg/virtual_test
  [ignore errors]
lvchange -ay /dev/vg/virtual_test
  [shouldn't have any errors]
lvs
  LV                    VG       Attr     LSize   Pool Origin               Data% 
  virtual_test          vg   swi-a-s- 200.00m      [virtual_test_vorigin]   33.63
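
The fifth character of the Attr column reports the volume state.  Going by the lvs man page, "a" means active and "I" marks an invalid snapshot, so a script can tell whether the recovery took.  A rough sketch, with a made-up function name lv_state:

```shell
# Hypothetical helper: classify an lvs Attr string by its 5th character.
lv_state() {
    case $(printf '%s' "$1" | cut -c5) in
        a) echo active ;;
        I) echo invalid-snapshot ;;
        S) echo invalid-suspended-snapshot ;;
        *) echo other ;;
    esac
}

lv_state "swi-a-s-"   # Attr value from the lvs output above; prints: active
```

On a live system the string could come from something like lvs --noheadings -o lv_attr vg/virtual_test, with the leading whitespace trimmed first.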

At this point you should probably fsck your filesystem.  It may be damaged, or at least need a journal replay, since it stopped abruptly at the end of its allocated space.  And as you can see, the “overflow” file is there up until the point of filling the disk.

[root@backup mapper]# e2fsck /dev/vg/virtual_test
e2fsck 1.41.12 (17-May-2010)
/dev/vg/virtual_test: recovering journal
/dev/vg/virtual_test: clean, 12/51200 files, 66398/204800 blocks
[root@backup mapper]# mount /dev/vg/virtual_test /mnt/tmp/
[root@backup mapper]# ls -lh /mnt/tmp/
total 54M
drwx------. 2 root root 12K Aug 27 15:16 lost+found
-rw-r--r--. 1 root root 54M Aug 27 15:17 overflow-file

-Eric

5 thoughts on “Recovering an overflowed LVM volume configured with --virtualsize”

  1. Oh well – the bad news is that this recipe apparently fails with regular lvm snapshots. I have been using one for experimental purposes (OS upgrade) and accidentally overflowed it. I can’t do ANYTHING on the respective volume now, it is completely unreadable and unwritable. Lvextend also does not work. I’d be REALLY thankful for any hints on what I could do to get that data back.

    • I have to correct myself.
      Faults:

      1) I tried lvresize before rebooting and after that on a read-only recovery shell. Both tries failed. You have to reboot after the overflow AND / must be writable for lvresize to work.

      2) lvchange did not work as advertised. But after another reboot the “fixed” snapshot was accessible again and could be fscked.

      Kudos for saving me several hours of work!

  2. I found this after a desperate search of how to fix a full snapshot and can confirm that this worked for me.
    Thanks to Eric and Jorg, especially Jorg for posting his update that he was successful after his first fail.

    Just a short summary of my setup: a full 50 GB snapshot, after I forgot to merge the snapshot following an upgrade test of our Jenkins installation. I would have lost a couple of months of updates and jobs in Jenkins if I couldn’t access the data of the snapshot.

    My only addition to this “solution” is that I was scared to run the perl command without testing first. So I copied the first 512 bytes of the cow device to a file and was wondering why the perl command APPENDS the byte to the end of this test file instead of replacing it.
    Because I didn’t understand the perl behavior I used dd to frobnicate this byte

    dd if=/tmp/1.txt count=1 bs=1 seek=5 of=vgxen-jenkins--disk--new--snap-cow conv=notrunc
    1.txt contains only binary 1

    Funny side note: “file test.dump” identifies the snapshot valid flag
    file /tmp/test.dump
    /tmp/test.dump: LVM Snapshot (CopyOnWrite store) - valid, version 1, chunk_size 8
    For a full snapshot, file will state invalid instead

    cheers
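
The file output quoted in the comment above can be reproduced by hand, which is a handy way to double-check a header dump.  The sketch below assumes the header starts with four little-endian 32-bit fields in the order magic, valid, version, chunk_size; that layout is my reading of the kernel's persistent snapshot store and is consistent with the file output, but treat it as an assumption.  The function names le32 and cow_header are invented for illustration.

```shell
# Hypothetical helper: read a little-endian 32-bit field at a byte offset.
le32() {
    # od prints the four bytes as decimals; combine them low byte first.
    set -- $(od -An -tu1 -j"$2" -N4 "$1")
    echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Hypothetical helper: summarise the first 16 bytes of a header dump.
cow_header() {
    echo "magic=$(dd if="$1" bs=1 count=4 2>/dev/null)"
    echo "valid=$(le32 "$1" 4) version=$(le32 "$1" 8) chunk_size=$(le32 "$1" 12)"
}
```

Running cow_header /tmp/test.dump on a 512-byte copy of the -cow device should report a non-zero valid field once the flag has been set.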
