Setting up a RAID volume in Linux with >2TB disks

A RAID volume is useful when building a file server: a failing disk must not result in a total breakdown of the service. When dealing with large disks over 2 TB, new tools are needed. This article describes how to set up this kind of storage volume.

Overview
Linux provides a feature-rich set of software RAID services. It is often recommended over the “fake” RAID solutions included in many motherboards, which use an operating system driver rather than dedicated hardware, and it also gives more flexibility. Dedicated high-end hardware is still the best solution since it offloads the CPU and performs very well, but for many purposes the Linux solution is more than enough.

N+1 or 1+1 redundancy is common in this context. In this particular case a mirroring RAID-1 setup is selected, which tolerates one failing disk at a given point in time. One issue with disks larger than 2 TB is that the traditional MBR partition tables used for the last few decades lack support for such partitions. Physical sector sizes have also changed, from 512 bytes to e.g. 4096 bytes, which puts new requirements on alignment between the physical sectors and the file system.

Enabling relevant drivers in the Linux kernel
The following drivers are needed (expressed in ‘make menuconfig’ style):

  • Device Drivers —> Multiple devices driver support (RAID and LVM) —> RAID support (CONFIG_BLK_DEV_MD)
  • Device Drivers —> Multiple devices driver support (RAID and LVM) —> RAID-1 (mirroring) mode (CONFIG_MD_RAID1)
  • File systems —> Partition Types —> EFI GUID Partition support (CONFIG_EFI_PARTITION)
  • File systems —> The Extended 4 (ext4) filesystem (CONFIG_EXT4_FS)
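To verify that a running kernel was built with these options, the in-kernel configuration can be checked, assuming CONFIG_IKCONFIG_PROC is enabled so that /proc/config.gz exists (otherwise inspect the config file the kernel was built from):

$ zgrep -E 'CONFIG_BLK_DEV_MD|CONFIG_MD_RAID1|CONFIG_EFI_PARTITION|CONFIG_EXT4_FS' /proc/config.gz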

Partitioning the disks
This article assumes that the RAID volume isn’t the root volume of the system (the kernel isn’t booted from it).

Two 3 TB disks have been chosen for the array, in this case exposed as /dev/sdb and /dev/sdc by the kernel. They need to have identical partition tables so that the RAID volume can be defined. Regular system tools such as fdisk and cfdisk do not support the GUID Partition Table (GPT) that is required for disks over 2 TB, so GNU Parted is used instead. GPT stores the partition information at the beginning of the disk and keeps a backup copy at the end; a protective MBR is also written for compatibility with older tools.

One big partition covering the whole disk will be created. Alignment is a performance concern and needs to be taken into account: since RAID is involved, physical sector sizes, RAID chunk sizes and file system block/fragment sizes all need to line up. As most of this is handled in software, the main restriction comes from the physical hard drive. To check the real sector size, the sysfs file system can be consulted (the kernel exposes a logical 512-byte sector size for backwards compatibility):

$ cat /sys/block/sdb/queue/logical_block_size
512

$ cat /sys/block/sdc/queue/logical_block_size
512

$ cat /sys/block/sdb/queue/physical_block_size
4096

$ cat /sys/block/sdc/queue/physical_block_size
4096

There is a trick in GNU Parted to get 4096-byte alignment: by specifying an offset of 1M for the partition start, the alignment will be correct for any physical sector size that divides 1 MiB. If the alignment is wrong, GNU Parted prints a warning. The commands below define a GPT partition table with a single partition covering the whole disk:

$ parted /dev/sdb

$ (parted) mklabel gpt
Warning: The existing disk label on /dev/sdb will be destroyed and all data on
this disk will be lost. Do you want to continue?
Yes/No? Yes

$ (parted) mkpart primary 1M 3001GB

$ (parted) p
Model: ATA WDC WD30EZRX-00M (scsi)
Disk /dev/sdb: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  3001GB  3001GB               primary

$ (parted) q
Information: You may need to update /etc/fstab.

The same procedure as above is repeated for /dev/sdc.
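If you prefer to script the second disk rather than repeat the interactive session, parted can also be run non-interactively. The following should be equivalent to the steps above (double-check the result with print afterwards):

$ parted -s /dev/sdc mklabel gpt mkpart primary 1M 3001GB
$ parted /dev/sdc print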

Creating the RAID-1 volume
The partitions have been defined. The next step is to create the RAID volume. The system tool mdadm is used.

$ mdadm --verbose --create /dev/md0 --level=raid1 --raid-devices=2 /dev/sd[bc]1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 2930263928K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

Query operations
To check the current state of the RAID array, a couple of commands can be used. It is important to see that all disks are included in the array:

fileborm ~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sat Nov  5 20:25:33 2011
     Raid Level : raid1
     Array Size : 2930263928 (2794.52 GiB 3000.59 GB)
  Used Dev Size : 2930263928 (2794.52 GiB 3000.59 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sat Nov  5 20:25:33 2011
          State : clean, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 0% complete

           Name : fileborm:0  (local to host fileborm)
           UUID : 4b4a858f:adafd25f:1eec6390:b2e40f75
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1

As can be seen above, the two disks are not yet synchronized (Rebuild Status). The synchronization may run quite slowly to avoid degrading performance for the user. The speed limits can be changed by updating the files ‘sync_speed_min’ and ‘sync_speed_max’ in the /sys/devices/virtual/block/md0/md directory (unit: KiB/s).
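For example, to raise the floor and ceiling of the resync speed (the values below are arbitrary and only meant as an illustration):

$ echo 50000 > /sys/devices/virtual/block/md0/md/sync_speed_min
$ echo 200000 > /sys/devices/virtual/block/md0/md/sync_speed_max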

UUIDs are used to identify the member devices as well as the array itself. This is relevant to be aware of, especially if the system will host several arrays:

$ mdadm --examine /dev/sdb1
/dev/sdb1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 4b4a858f:adafd25f:1eec6390:b2e40f75
           Name : fileborm:0  (local to host fileborm)
  Creation Time : Sat Nov  5 20:25:33 2011
     Raid Level : raid1
   Raid Devices : 2

 Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB)
     Array Size : 5860527856 (2794.52 GiB 3000.59 GB)
  Used Dev Size : 5860527856 (2794.52 GiB 3000.59 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 18b2bfa5:3bbf7a65:aa48ca77:b1536d37

    Update Time : Sat Nov  5 20:27:00 2011
       Checksum : 829673eb - correct
         Events : 1

   Device Role : Active device 0
   Array State : AA ('A' == active, '.' == missing)

The --examine option can also be used when analyzing unknown disks on a system. This information can then be used to configure the system properly.
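If your distribution reads /etc/mdadm.conf (this varies between systems), a convenient way to record the array UUID is to let mdadm generate the ARRAY line and append it to the configuration file:

$ mdadm --detail --scan >> /etc/mdadm.conf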

A less mdadm-centric description of the volumes is given by reading the proc file system:

$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid1 sdc1[1] sdb1[0]
      2930263928 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.3% (11368832/2930263928) finish=460.0min speed=105734K/sec

unused devices: <none>

A good use of the command above is to get an overview of which volumes actually exist in the system without polling every device that may have been activated.

Creating the file system
When the md device is up and running it can be accessed like any other block device. The ext4 file system has been selected because it supports large volumes and performs well. mkfs.ext4 will warn if it detects improper alignment. The default block and fragment size of 4096 bytes works in this case.

$ mkfs.ext4 /dev/md0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1 blocks, Stripe width=1 blocks
183148544 inodes, 732565982 blocks
36628299 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
22357 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
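If the periodic checks are unwanted on a volume of this size, they can be disabled with tune2fs as the message suggests (an optional tweak, not required for the setup):

$ tune2fs -c 0 -i 0 /dev/md0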

Starting and stopping the storage service
This article tries to avoid assumptions about a particular Linux distribution. Since they all differ, it is better to describe the start and stop operations themselves. The commands are placed in a suitable script file that the init process triggers when switching runlevels.

Starting the service
I had the problem of an unwanted md device (/dev/md127) being started during boot. First I thought the kernel was the perpetrator, but the problem remained after disabling CONFIG_MD_AUTODETECT. Looking more closely at the kernel log, the device start turned out to be triggered by udev. After a lot of searching I found that the udev rule file /lib/udev/rules.d/64-md-raid.rules should be removed from the system, or edited to disable the activation; after that the problem went away. With that fixed, the commands below start the md0 device without an already started device stealing /dev/sdb1 for its own purposes.
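One way of neutralizing the rule without deleting the original file, assuming a udev version where a file in /etc/udev/rules.d overrides one with the same name in /lib/udev/rules.d (verify this on your system), is to mask it with an empty file:

$ touch /etc/udev/rules.d/64-md-raid.rules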

The RAID volume is started or assembled with the --assemble option to mdadm:

$ mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
mdadm: /dev/md0 has been started with 2 drives.

Then the volume needs to be mounted:

$ mkdir -p /mnt/md0
$ mount -t ext4 /dev/md0 /mnt/md0
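Alternatively, the mount point can be recorded in /etc/fstab (as parted also hinted). A sketch of such an entry is shown below; the noauto option is an assumption that the array is assembled by a script before mounting rather than during normal fstab processing:

# /etc/fstab entry (noauto: the array must be assembled before mounting)
/dev/md0   /mnt/md0   ext4   noauto   0   0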

Stopping the service
First we umount the file system:

$ umount /mnt/md0

Then the RAID array is stopped:

$ mdadm --stop /dev/md0
mdadm: stopped /dev/md0

Note that the full paths to the mount, umount and mdadm commands may need to be supplied in these scripts, since the PATH environment variable is quite minimal in these runlevels.
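Put together, a minimal init-style script could look like the sketch below. The binary paths (/sbin, /bin) and the device and mount point names are assumptions taken from this article; adjust them to your distribution.

#!/bin/sh
# Minimal start/stop sketch for the RAID storage service.
# Paths and device names are assumptions; adapt to your system.
case "$1" in
  start)
    /sbin/mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
    /bin/mount -t ext4 /dev/md0 /mnt/md0
    ;;
  stop)
    /bin/umount /mnt/md0
    /sbin/mdadm --stop /dev/md0
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    ;;
esac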

Now the RAID service is up and running, and it will come back up after a reboot as well. It’s time to create files!

Tips and Tricks
If you want to re-create a RAID volume, the superblocks describing the old volume should be cleared first. Then mdadm won’t warn about existing metadata during creation.

$ mdadm --zero-superblock /dev/sdb1
$ mdadm --zero-superblock /dev/sdc1

If performance is very important together with redundancy, it is possible to create a persistent RAM disk: the advantage is immense speed combined with data that survives a system reboot. Linux can expose a portion of RAM as block devices through /dev/ram*, and such a device can be combined with a regular disk partition (e.g. /dev/sdb1) when creating a RAID volume. The RAM disk gives the speed and the other disk gives persistence. What is important is that the array’s primary device is the RAM disk, otherwise the speed boost isn’t achieved. The primary device is the one that, at a given point in time, defines what data the other disk should replicate (the first device passed to mdadm below).

The default RAM disk size in Linux is 4096 kB, which is too small, so it needs to be changed in the kernel. I tested this with a 1000 MB RAM disk (1024000 kB). The kernel option CONFIG_BLK_DEV_RAM_SIZE can be found here:
Device Drivers —> Block devices —> Default RAM disk size (kbytes)
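To verify that the new size took effect after rebooting into the rebuilt kernel, the block device size can be queried; 1024000 kB corresponds to 1024000 * 1024 = 1048576000 bytes:

$ blockdev --getsize64 /dev/ram0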

Creating the persistent RAM disk (assuming /dev/ram0 and /dev/sdb1 have the same size):

$ mdadm --create /dev/md1 --level=raid1 --raid-devices=2 /dev/ram0 /dev/sdb1

After a reboot the RAM disk is blank. This means that the array needs to be started with only a single member, and when the RAM disk is then added to the array, md restores its data from the regular disk. To get the order right (and thereby the proper speed), the regular disk is removed, making the RAM disk the primary, and is finally added back again as a secondary device.

The --run option to mdadm tells the tool to start the array even if only a single device is specified:

$ mdadm --assemble --run /dev/md1 /dev/sdb1
$ mdadm /dev/md1 --add /dev/ram0
$ sleep 60 # wait for the RAM disk device to become active in the raid
$ mdadm /dev/md1 --fail /dev/sdb1
$ mdadm /dev/md1 --remove /dev/sdb1
$ mdadm /dev/md1 --add /dev/sdb1

Performance result with incorrect primary device:

$ hdparm -t /dev/md1
/dev/md1:
 Timing buffered disk reads: 378 MB in  3.00 seconds = 125.88 MB/sec

Performance result with correct primary device:

$ hdparm -t /dev/md1
/dev/md1:
 Timing buffered disk reads: 996 MB in  0.58 seconds = 1721.02 MB/sec

Additional RAID volumes
Additional RAID volumes can be created with the same commands as described in this article; only the device names will differ. The md driver supports devices such as /dev/md0, /dev/md1, /dev/md2 and so on. Creating several volumes on the same disks, with more than a single partition per disk, is also possible (e.g. using /dev/sdb1 for one volume and /dev/sdb2 for another), but it makes failure scenarios more complicated. Using dedicated disks for each volume is therefore recommended.
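For example, assuming two additional dedicated disks exposed as /dev/sdd and /dev/sde (hypothetical names) that have been partitioned the same way as above, a second mirror on the next free md device could be created like this:

$ mdadm --verbose --create /dev/md1 --level=raid1 --raid-devices=2 /dev/sd[de]1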



Comments

Maurice Seniw on 5 March, 2012 at 03:35 #

Thankyou for the article about linux and bigger blocks of data used and the possibility of having immense speed with big RAM size.
I wonder if there is a way to make the EEPROM microprosser on cars into Linux or fast or a superfast RAM.???
Maybe you know?????

Thankyou very much for a very interesting article on RAID and Linux and RAM.
Respectfully, Maurice Seniw


Engineer on 20 April, 2012 at 23:45 #

> unit: kb/s

Isn’t it kB/s?


Lonezor on 28 April, 2012 at 16:39 #

Thanks for the comment about the unit. Looked up the kernel documentation. The unit is kibibyte. The article has been updated.

