DRBD and Nested LVM

DRBD is an open-source, block-level replicated storage system designed to provide distributed storage in cluster environments. Nodes in a cluster write changes locally, and DRBD replicates those changes to the other nodes. It supports both Primary/Secondary (master/slave) and Primary/Primary (master/master) configurations. I’m mostly familiar with DRBD through our virtualization infrastructure, where I implemented it in a two-node cluster to provide persistent storage to virtual appliances.

DRBD can be used on top of a variety of backing devices, such as single disks, mdadm arrays, hardware RAID volumes, and LVM logical volumes. While configuring DRBD directly on any of these devices is relatively straightforward, using an LVM volume as the backing device has two primary advantages that make the added complexity worthwhile: the familiar LVM tools can be used to manage the backing device, and the backing device can be expanded online. This means we can easily grow a DRBD resource later if we add disks to our array.
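To give a sense of what that looks like, here is a rough sketch of growing a resource later, using the volume and resource names we’ll create below; treat it as an outline rather than a tested procedure:

# On BOTH nodes: grow the logical volume backing the resource
lvextend -L +100G /dev/n1_array0/r0    # use n2_array0 on node 2

# On ONE node, while the resource is connected: grow the DRBD device to match
drbdadm resize r0

# Then grow whatever sits on top of /dev/drbd0 (PV, LV, filesystem)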

I should mention that a great deal of this article is based on this excellent guide to two-node clusters; it’s a great crash course in clustering and incredibly useful. Check it out!

Let’s walk through the configuration process for a simple two-node, dual-primary cluster. I’ll assume your nodes already have identical local storage configured. I’ll also assume you have DRBD installed and the ‘drbd’ kernel module loaded on both nodes; LINBIT, the company that develops DRBD, offers a commercial repository for subscribers, while the ELRepo repository carries packages for RHEL/CentOS users.
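For reference, on CentOS 7 that usually amounts to something like the following; the package names are the ones ELRepo provides, so adjust them for your distribution and DRBD version:

# Install the DRBD 8.4 userland tools and kernel module (ELRepo package names)
yum install -y drbd84-utils kmod-drbd84

# Load the module and confirm it registered
modprobe drbd
lsmod | grep drbd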

Backing LVM Configuration

First, on each node, we create an LVM physical volume and volume group on our local storage device.

# The backing volume groups are strictly local, so clustering is disabled.

# On Node 1
pvcreate /dev/sda
vgcreate --clustered n n1_array0 /dev/sda

# On Node 2
pvcreate /dev/sda
vgcreate --clustered n n2_array0 /dev/sda

Next, we need to create volumes for each DRBD resource. A resource is a chunk of storage managed by DRBD. In a two-node cluster, we want a minimum of two resources, each designated to operate primarily on one node or the other. This layout minimizes data loss in a split-brain scenario. If only one resource is configured and the nodes lose contact with one another while fencing fails, both nodes will continue writing to the same resource independently of one another; we now have a split-brain cluster. To resolve it, we have to choose one node’s data over the other’s, discarding everything the losing node wrote in the meantime. Using a dedicated resource for each node keeps that potential loss as small as possible.
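For reference, manually resolving a DRBD split-brain looks roughly like the following under DRBD 8.4; the resource name is illustrative, and you should double-check against the DRBD documentation before running this against real data:

# On the node whose changes will be discarded (the split-brain "victim"):
drbdadm disconnect r1
drbdadm secondary r1
drbdadm connect --discard-my-data r1

# On the surviving node (only needed if it also shows StandAlone):
drbdadm connect r1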

In my cluster, I’ve configured three resources: one resource provides shared storage to each node for configuration files and installation images while the other resources provide the block-level storage for cluster appliances.

# On Node 1
lvcreate --name r0 --size 20G n1_array0
lvcreate --name r1 --size 30G n1_array0
lvcreate --name r2 --size 30G n1_array0

# On Node 2
lvcreate --name r0 --size 20G n2_array0
lvcreate --name r1 --size 30G n2_array0
lvcreate --name r2 --size 30G n2_array0

DRBD Configuration

With backing storage partitioned, we can start configuring DRBD itself. First, we perform the global configuration by editing global_common.conf in /etc/drbd.d:

global {
  # Don't report stats to LinBit
  usage-count no;
}

common {
  handlers {
    # Reboot on emergencies
    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    # Action to take to fence a peer.
    # "Fencing" means isolating problematic nodes in the cluster to maintain cluster consistency.
    fence-peer	"/usr/lib/drbd/crm-fence-peer.sh";
  }
  # The syncer section is deprecated in DRBD 8.4 (these options now belong to
  # the disk section), but drbdadm still accepts the old syntax and translates
  # it, as the 'drbdadm dump' output below shows.
  syncer {
    # Rate limit resynchronization. A resync is performed when one node is
    # found to have out-of-date data (e.g. after rejoining the cluster) and is
    # uncommon during normal operation; its counterpart, replication, happens
    # on every write.
    rate 20M;
    al-extents 3389;
  }

  disk {
    # Fencing policy. "resource-and-stonith" suspends I/O on the resource when
    # a node loses contact with its peer and calls the fence-peer handler
    # above, which is expected to fence (STONITH) the peer before I/O resumes.
    fencing resource-and-stonith;
    # Don't force flushes of "cached" data to disk; this is only safe because
    # our hardware RAID controller has a battery-backed write cache.
    disk-flushes no;
    # Likewise, disable flushing for the DRBD metadata area (with internal
    # metadata, this lives at the end of the backing device).
    md-flushes no;
    # Disable disk barriers; the barrier method is not supported by recent kernels.
    disk-barrier no;
  }

  net {
    # Use TCP send buffer auto-tuning
    sndbuf-size 0;

    # Protocol "C" tells DRBD not to tell the operating system that
    # the write is complete until the data has reach persistent
    # storage on both nodes. This is the slowest option, but it is
    # also the only one that guarantees consistency between the
    # nodes. It is also required for dual-primary, which we will 
    # be using.
    protocol C;

    # Tell DRBD to allow dual-primary meaning both nodes may actively
    # use DRBD resources simultaneously.
    allow-two-primaries yes;

    # This tells DRBD what to do in the case of a split-brain when
    # neither node was primary, when one node was primary and when
    # both nodes are primary. In our case, we'll be running
    # dual-primary, so we can not safely recover automatically. The
    # only safe option is for the nodes to disconnect from one
    # another and let a human decide which node to invalidate. Of 
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
}

Next, we define our resources. Again, I’m configuring three resources:

  • r0 for sharing files between nodes
  • r1 for storage on node 1
  • r2 for storage on node 2

resource r0 {
  # This sets the device name of this DRBD resource.
  device /dev/drbd0;

  # This tells DRBD which backing device to use for this resource.
  # The value changes from host to host:
  # use this line on node1;
  disk /dev/n1_array0/r0;
  # on node2, comment out the line above and uncomment this one instead.
  # disk /dev/n2_array0/r0;

  # This controls the location of the metadata. When "internal" is used,
  # as we use here, a little space at the end of the backing devices is
  # set aside (roughly 32 MB per 1 TB of raw storage). External metadata
  # can be used to put the metadata on another partition when converting
  # existing file systems to be DRBD backed, when there is no extra space
  # available for the metadata.
  meta-disk internal;

  net {
    # Algorithm used by online verification ('drbdadm verify') to check that
    # replicated blocks match on both nodes.
    verify-alg md5;

    # LEAVE COMMENTED OUT IN PRODUCTION.
    # Enabling this makes DRBD checksum every block as it is received instead
    # of relying on TCP's integrity checks. It adds significant overhead and
    # can report spurious errors when buffers change in flight, so use it
    # only for testing.
    #data-integrity-alg md5;
  }

  # Tell DRBD where other nodes are
  # The name used here must match what is returned by "uname -n".
  on node1 {
    address 10.255.254.1:7788;
  }
  on node2 {
    address 10.255.254.2:7788;
  }
}

resource r1 {
  device /dev/drbd1;
  # Uncomment on node1
  disk /dev/n1_array0/r1;
  # Uncomment on node2
  # disk /dev/n2_array0/r1;
  meta-disk internal;

  net {
    verify-alg md5;
    #data-integrity-alg md5;
  }

  on node1 {
    address 10.255.254.1:7789;
  }
  on node2 {
    address 10.255.254.2:7789;
  }
}

resource r2 {
  device /dev/drbd2;
  # Uncomment on node1
  disk /dev/n1_array0/r2;
  # Uncomment on node2
  # disk /dev/n2_array0/r2;
  meta-disk internal;

  net {
    verify-alg md5;
    #data-integrity-alg md5;
  }

  on node1 {
    address 10.255.254.1:7790;
  }
  on node2 {
    address 10.255.254.2:7790;
  }
}

We can use the ‘drbdadm dump’ command to verify our configuration:

# From node1
# Result of 'drbdadm dump'

# /etc/drbd.conf
global {
  usage-count no;
  cmd-timeout-medium 600;
  cmd-timeout-long 0;
}

common {
  net {
    sndbuf-size        0;
    protocol           C;
    allow-two-primaries yes;
    after-sb-0pri    discard-zero-changes;
    after-sb-1pri    discard-secondary;
    after-sb-2pri    disconnect;
  }
  disk {
    resync-rate      20M;
    al-extents       3389;
    fencing          resource-and-stonith;
    disk-flushes      no;
    md-flushes        no;
    disk-barrier      no;
  }
  handlers {
    pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
    local-io-error   "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
    fence-peer       /usr/lib/drbd/crm-fence-peer.sh;
  }
}

# resource r0 on node1: not ignored, not stacked
# defined at /etc/drbd.d/r0.res:3
resource r0 {
  on node1 {
    device           /dev/drbd0 minor 0;
    disk             /dev/n1_array0/r0;
    meta-disk        internal;
    address          ipv4 10.255.254.1:7788;
  }
  on node2 {
    device           /dev/drbd0 minor 0;
    disk             /dev/n1_array0/r0;
    meta-disk        internal;
    address          ipv4 10.255.254.2:7788;
  }
  net {
    verify-alg       md5;
  }
}

# resource r1 on node1: not ignored, not stacked
# defined at /etc/drbd.d/r1.res:2
resource r1 {
  on node1 {
    device           /dev/drbd1 minor 1;
    disk             /dev/n1_array0/r1;
    meta-disk        internal;
    address          ipv4 10.255.254.1:7789;
  }
  on node2 {
    device           /dev/drbd1 minor 1;
    disk             /dev/n1_array0/r1;
    meta-disk        internal;
    address          ipv4 10.255.254.2:7789;
  }
  net {
    verify-alg       md5;
  }
}

# resource r2 on node1: not ignored, not stacked
# defined at /etc/drbd.d/r2.res:2
resource r2 {
  on node1 {
    device           /dev/drbd2 minor 2;
    disk             /dev/n1_array0/r2;
    meta-disk        internal;
    address          ipv4 10.255.254.1:7790;
  }
  on node2 {
    device           /dev/drbd2 minor 2;
    disk             /dev/n1_array0/r2;
    meta-disk        internal;
    address          ipv4 10.255.254.2:7790;
  }
  net {
    verify-alg       md5;
  }
}

Now, we must initialize and start the resources on each node:

# On Node 1
drbdadm create-md r{0,1,2}

# On Node 2:
drbdadm create-md r{0,1,2}

# On Node 1:
drbdadm up r0
drbdadm up r1
drbdadm up r2

# On Node 2:
drbdadm up r0
drbdadm up r1
drbdadm up r2

# Bring the resources to the primary state, allowing simultaneous access.
# The first promotion must be forced on ONE node because both copies start
# out Inconsistent; this kicks off the initial synchronization.
# On Node 1:
drbdadm primary --force r{0,1,2}

# On Node 2, once the initial sync has completed and the resources are UpToDate:
drbdadm primary r{0,1,2}

We can check the status of DRBD by looking at the ‘/proc/drbd’ file. After your resources have been synchronized and promoted, the file should look like this:

version: 8.4.10-1 (api:1/proto:86-101)
GIT-hash: a4d5de01fffd7e4cde48a080e2c686f9e8cebf4c build by mockbuild@, 2017-09-15 14:23:22
 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:1073710108 nr:0 dw:0 dr:1073712236 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:1073710108 nr:0 dw:0 dr:1073712236 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
 2: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:1073710108 nr:0 dw:0 dr:1073712236 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:0
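
If you prefer a more compact summary, the drbd-overview script that ships with drbd-utils reports roughly the same information per resource:

drbd-overview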

Clustered LVM Configuration

At this point, we’ve finished configuring DRBD and have three empty but synchronized resources. Now we will add clustered LVM (cLVM) partitioning on top of each resource. We need cLVM because DRBD is only responsible for replication and the low-level integrity of our data; it knows nothing about the LVM metadata or files stored on top of it. Coordinating concurrent access, so that both nodes don’t modify the same LVM metadata or the same file at the same time, requires cluster-aware locking, and that must come from the layers above DRBD.

First, on each node, we edit ‘/etc/lvm/lvm.conf’ to enable clustering and to ignore the nested LVM volumes that our cluster appliances may create.

# Set a device filter to avoid duplicate PV detection.
# Accept sd* and drbd* devices and reject everything else, including the
# /dev/mapper devices belonging to nested LVM.
# This step is often overlooked so don't skip it!
filter = [ "a|/dev/sd.*|", "a|/dev/drbd.*|", "r|.*|" ]

# locking_type 3 uses the DLM (Distributed Lock Manager) via clvmd
locking_type = 3

# Because we're using nested LVM, we need to be able to fall back to local
# locking; otherwise, the backing LVM for DRBD won't be available before the
# cluster locking infrastructure comes up.
fallback_to_local_locking = 1

# Disable the on-disk device cache; stale entries confuse LVM when nesting
write_cache_state = 0

# Disable lvmetad since it's not cluster-aware
use_lvmetad = 0

Next, we must disable and enable some services for cLVM:

# Disable and stop lvmetad since we're replacing it with
# cluster-aware locking
systemctl disable lvm2-lvmetad.service
systemctl disable lvm2-lvmetad.socket
systemctl stop lvm2-lvmetad.service

# Start the DLM and cLVM daemons. Note that dlm needs a running
# corosync/pacemaker cluster underneath it; in a full setup these daemons
# are usually managed by the cluster manager itself.
systemctl start dlm.service
clvmd

All that’s left for us to do is create the LVM volumes on top of DRBD.

# Perform these commands on a single node

# Create physical volumes
pvcreate /dev/drbd0 # shared FS
pvcreate /dev/drbd1 # node1 resource
pvcreate /dev/drbd2 # node2 resource

# Create clustered volume groups
vgcreate --clustered y cluster_r0 /dev/drbd0
vgcreate --clustered y cluster_r1 /dev/drbd1
vgcreate --clustered y cluster_r2 /dev/drbd2
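
We can sanity-check the result with vgs (clustered volume groups carry a ‘c’ in the attribute bits), and later carve logical volumes for the cluster’s appliances out of these groups; the LV name below is just an example:

# Verify the new volume groups are marked clustered ('c' in vg_attr)
vgs -o vg_name,vg_size,vg_attr

# Example: a raw virtual disk for an appliance on node 1's resource
lvcreate --name vm1_disk0 --size 10G cluster_r1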

GFS Configuration

For r1 and r2, the LVM layer on top of DRBD is all we need: it carves the replicated storage into block devices for cluster services; in my case, that means raw disks used by VMs. For r0, which shares files between the nodes, we need more than replicated block-level storage; we need a fully functional filesystem that both nodes can access at the same time. For this, we’ll use GFS2.

GFS2, the Global File System 2, is designed for concurrent access to shared storage, which is exactly what we want on top of our r0 resource. We set it up much as we would ext4 or XFS: first we create a logical volume in our clustered volume group, then we format it with GFS2:

lvcreate -l 100%FREE -n shared cluster_r0
mkfs.gfs2 -j 2 -p lock_dlm -t cluster:shared /dev/cluster_r0/shared

Here’s what our mkfs.gfs2 arguments mean:

  • -j 2 ⇒ How many journals GFS2 will create and keep. We need one journal for every node that will mount the filesystem concurrently.
  • -p lock_dlm ⇒ Tells GFS2 to use the DLM for locking
  • -t cluster:shared ⇒ Sets the lock table name. The first part (‘cluster’) must match the cluster name defined in the cluster configuration, and the second part (‘shared’) is a unique name for this filesystem; see the snippet below.
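
On a corosync 2.x cluster, for example, the cluster name is the cluster_name directive in corosync.conf; this excerpt is only meant to show where to look:

# /etc/corosync/corosync.conf (excerpt)
totem {
    version: 2
    cluster_name: cluster    # must match the first part of "-t cluster:shared"
    ...
}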

All that’s left to do is mount our new filesystem:

# On each node:
mkdir /shared
mount /dev/cluster_r0/shared /shared/
df -hP

# Result:
Filesystem                 Size  Used Avail Use% Mounted on
...
/dev/mapper/cluster_r0-shared   20G  259M   20G   2% /shared

We can test GFS2 by creating some files on one node and verifying the data on another.
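Something as simple as this will do:

# On node 1
echo "hello from node1" > /shared/hello.txt

# On node 2
cat /shared/hello.txt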

Summary

DRBD isn’t very useful by itself; it’s designed to be used in conjunction with a cluster manager running other services. The manager starts prerequisite services in the proper order and handles promoting resources from Secondary to Primary. It also helps during failure scenarios by fencing nodes and restarting critical services. This adds some automated intelligence to how the cluster operates, freeing the administrator to do other things.
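As a rough sketch of what that looks like with Pacemaker on CentOS 7, a dual-primary DRBD resource might be defined along these lines; the pcs commands and names here are illustrative, not a drop-in configuration:

# Wrap r0 in the ocf:linbit:drbd resource agent (shipped with the DRBD packages)
pcs resource create drbd_r0 ocf:linbit:drbd drbd_resource=r0 op monitor interval=30s

# Run it as a master/slave clone promoted on both nodes (dual-primary)
pcs resource master drbd_r0_clone drbd_r0 \
    master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true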

Keep in mind that setting up DRBD in production is not to be taken lightly even if the configuration is fairly simple. Plenty of hardware and use-case testing is required before deciding on a stable configuration that can be supported in the long term. It’s essential to have the proper hardware and networking capability that can support high rates of data transfer with low latency. Perhaps the most important consideration is the speed of the backing storage devices; LinBit recommends highly performant 10K or 15K RPM disks. Less expensive 7.2K RPM disks can be used, but they’ll likely be a bottleneck in your configuration, limiting the overall performance of your cluster. Choose wisely!