A RAID volume is useful when building a file server: a failing disk must not bring down the whole service. Disks larger than 2 TB also require new tools. This article describes how to set up this kind of storage volume.
Linux provides a feature-rich set of software RAID services. It is often recommended over the “fake” RAID solutions included with many motherboards, which use an operating-system driver rather than dedicated hardware. Linux software RAID also gives more flexibility. Dedicated high-end hardware is still the best solution, since it offloads the CPU and performs very well, but for many purposes the Linux solution is more than enough.
N+1 or 1+1 redundancy is common in this context. In this particular case a mirroring RAID-1 setup is selected, which tolerates one failing disk at a given point in time. One issue with disks larger than 2 TB is that the traditional partition tables used for decades lack support for such partitions. Physical sector sizes have also grown from 512 bytes to e.g. 4096 bytes, which puts new requirements on alignment between physical sectors and the file system.
Enabling relevant drivers in the Linux kernel
The following drivers are needed (expressed in ‘make menuconfig’ style):
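The exact menu locations vary between kernel versions; for a kernel of roughly this vintage the relevant entries are (this listing is a hedged reconstruction, since the driver set for software RAID-1 is well established):

```
Device Drivers  --->
    [*] Multiple devices driver support (RAID and LVM)  --->
        <*> RAID support
        <*>   RAID-1 (mirroring) mode
```

These correspond to the options CONFIG_MD, CONFIG_BLK_DEV_MD and CONFIG_MD_RAID1.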
Partitioning the disks
This article assumes that the RAID volume isn’t the root volume of the system (the kernel isn’t booted from it).
Two 3 TB disks have been chosen for the array, exposed as /dev/sdb and /dev/sdc by the kernel. They need identical partition tables so that the RAID volume can be defined. Regular system tools such as fdisk and cfdisk do not support the GUID Partition Table (GPT) that is required for disks over 2 TB, so a tool called GNU Parted is used instead. GPT stores the partition information at the start of the disk, with a backup copy at the end. A protective MBR is also written for compatibility with the ways of old.
One big partition covering the whole disk will be created. Memory alignment is a performance concern and needs to be taken into account: since RAID is involved, physical sector sizes, RAID block sizes and file system fragment sizes all need to line up. As most of this is handled in software, the main restriction comes from the physical hard drive. To check the real sector size, the sys file system can be consulted (the kernel exposes a logical 512 byte sector to keep things backwards compatible):
$ cat /sys/block/sdb/queue/logical_block_size
512
$ cat /sys/block/sdc/queue/logical_block_size
512
$ cat /sys/block/sdb/queue/physical_block_size
4096
$ cat /sys/block/sdc/queue/physical_block_size
4096
There is a trick in GNU Parted to get 4096 byte alignment: by specifying an offset of 1M, the alignment will be correct. If the alignment is wrong, GNU Parted prints a message to inform about this. The commands below define a GPT partition table with a single partition entry covering the whole disk:
$ parted /dev/sdb
(parted) mklabel gpt
Warning: The existing disk label on /dev/sdb will be destroyed and all data on
this disk will be lost. Do you want to continue?
Yes/No? Yes
(parted) mkpart primary 1M 3001GB
(parted) p
Model: ATA WDC WD30EZRX-00M (scsi)
Disk /dev/sdb: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  3001GB  3001GB               primary

(parted) q
Information: You may need to update /etc/fstab.
The same procedure as above is repeated for /dev/sdc.
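The 1M offset works because 1 MiB is an even multiple of every common power-of-two sector size. A quick shell sanity check (a sketch of the arithmetic, not from the parted session above):

```shell
# Partition start offset of 1 MiB, expressed in bytes
offset=$((1024 * 1024))

# The start is aligned when the offset is an exact multiple of the
# physical sector size; check both common sector sizes
for sector in 512 4096; do
    if [ $((offset % sector)) -eq 0 ]; then
        echo "offset ${offset} is aligned to ${sector}-byte sectors"
    fi
done
```

The same arithmetic holds for any future power-of-two sector size up to 1 MiB, which is why the 1M convention is so robust.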
Creating the RAID-1 volume
The partitions have been defined. The next step is to create the RAID volume. The system tool mdadm is used.
$ mdadm --verbose --create /dev/md0 --level=raid1 --raid-devices=2 /dev/sd[bc]1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 2930263928K
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
To check the current state of the RAID array, a couple of commands can be used. It is important to see that all disks are included in the array:
fileborm ~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Sat Nov  5 20:25:33 2011
     Raid Level : raid1
     Array Size : 2930263928 (2794.52 GiB 3000.59 GB)
  Used Dev Size : 2930263928 (2794.52 GiB 3000.59 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Sat Nov  5 20:25:33 2011
          State : clean, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 0% complete

           Name : fileborm:0  (local to host fileborm)
           UUID : 4b4a858f:adafd25f:1eec6390:b2e40f75
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       1       8       33        1      active sync   /dev/sdc1
As can be seen above, the two disks are not yet synchronized (Rebuild Status). Resynchronization may run very slowly to avoid degrading performance for the user. The speed can be changed by writing to the files ‘sync_speed_min’ and ‘sync_speed_max’ in the /sys/devices/virtual/block/md0/md directory (unit: KiB/s).
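For example, the following sysfs writes raise the resync speed window (a sketch assuming the md0 array from this article; values and paths should be checked on your own system, and root privileges are required):

```shell
# Raise the resync speed limits for md0; the unit is KiB/s,
# so these values set a window of roughly 100-200 MiB/s
echo 102400 > /sys/devices/virtual/block/md0/md/sync_speed_min
echo 204800 > /sys/devices/virtual/block/md0/md/sync_speed_max
```

The limits apply only while the array resyncs; they can be lowered again afterwards if the defaults are preferred.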
UUIDs are used to identify resources in the array as well as the array itself. It is relevant to be aware of this, especially if the system will host several arrays:
$ mdadm --examine /dev/sdb1 /dev/sdb1: Magic : a92b4efc Version : 1.2 Feature Map : 0x0 Array UUID : 4b4a858f:adafd25f:1eec6390:b2e40f75 Name : fileborm:0 (local to host fileborm) Creation Time : Sat Nov 5 20:25:33 2011 Raid Level : raid1 Raid Devices : 2 Avail Dev Size : 5860528128 (2794.52 GiB 3000.59 GB) Array Size : 5860527856 (2794.52 GiB 3000.59 GB) Used Dev Size : 5860527856 (2794.52 GiB 3000.59 GB) Data Offset : 2048 sectors Super Offset : 8 sectors State : active Device UUID : 18b2bfa5:3bbf7a65:aa48ca77:b1536d37 Update Time : Sat Nov 5 20:27:00 2011 Checksum : 829673eb - correct Events : 1 Device Role : Active device 0 Array State : AA ('A' == active, '.' == missing)
The --examine option can also be used when analyzing unknown disks on a system. This information can then be used to configure the system properly.
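For instance, the array UUID from the output above can be pinned in /etc/mdadm.conf so that assembly is done by identity rather than by device name (a sketch; the file location and exact syntax can vary between distributions):

```
DEVICE /dev/sd[bc]1
ARRAY /dev/md0 metadata=1.2 UUID=4b4a858f:adafd25f:1eec6390:b2e40f75
```

With this in place, device renames (e.g. /dev/sdb becoming /dev/sdd after adding a disk) no longer confuse assembly.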
A non-mdadm-centric description of the volumes is available by reading the proc file system:
$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid1 sdc1[1] sdb1[0]
      2930263928 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.3% (11368832/2930263928) finish=460.0min speed=105734K/sec

unused devices: <none>
One good use of the command above is to get an overview of which volumes actually exist on the system, without probing every device that may have been activated.
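For scripting, the same overview can be extracted with a small filter. The helper below is a sketch (the function name is my own) that works on any file in mdstat format:

```shell
# Print "name: state level" for every md array in an mdstat-format file
md_summary() {
    awk '/^md/ { printf "%s: %s %s\n", $1, $3, $4 }' "$1"
}

# On a live system: md_summary /proc/mdstat
# Demonstrated here on a saved sample line:
printf 'md0 : active raid1 sdc1[1] sdb1[0]\n' > /tmp/mdstat.sample
md_summary /tmp/mdstat.sample   # prints: md0: active raid1
```

Reading /proc/mdstat this way avoids running mdadm against every block device just to discover what arrays exist.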
Creating the file system
When the md device is up and running it can be accessed like any other block device. The ext4 file system has been selected because it supports large volumes and has good performance. The tool will complain if improper alignment is detected. The default fragment size of 4096 bytes works in this case.
$ mkfs.ext4 /dev/md0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1 blocks, Stripe width=1 blocks
183148544 inodes, 732565982 blocks
36628299 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
22357 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
Starting and stopping the storage service
This article tries to avoid assumptions about any particular Linux distribution. Since they all differ, it is better to discuss the start and stop operations themselves. The commands are placed in a suitable script file that the init process triggers when switching run levels.
Starting the service
I had a problem with an unwanted md device being started during boot (/dev/md127). At first I thought the kernel was the perpetrator, but after disabling CONFIG_MD_AUTODETECT the problem remained. Looking closely at the kernel log, the device start was associated with udev. After a lot of searching I finally found that the udev rule file /lib/udev/rules.d/64-md-raid.rules should be removed from the system, or edited to disable the activation. Then the problem went away. With that fixed, the commands below start the md0 device without an already started device stealing /dev/sdb1 for its own purposes.
The RAID volume is started, or assembled, with the --assemble option to mdadm:
$ mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
mdadm: /dev/md0 has been started with 2 drives.
Then the volume needs to be mounted:
$ mkdir -p /mnt/md0
$ mount -t ext4 /dev/md0 /mnt/md0
Stopping the service
First we unmount the file system:
$ umount /mnt/md0
Then the RAID array is stopped:
$ mdadm --stop /dev/md0
mdadm: stopped /dev/md0
Note that the full paths to the commands mount, umount and mdadm may need to be supplied in these scripts, since the PATH environment variable is quite minimal in these run levels.
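A minimal init-style sketch combining the start and stop commands above (the binary paths are typical but should be verified on your distribution, and the function name is my own):

```shell
# Start/stop helper for the md0 storage service; absolute paths
# are used since PATH is minimal when init runs these scripts
MDADM=/sbin/mdadm
MOUNT=/bin/mount
UMOUNT=/bin/umount

md0_service() {
    case "$1" in
        start)
            "$MDADM" --assemble /dev/md0 /dev/sdb1 /dev/sdc1 && \
            "$MOUNT" -t ext4 /dev/md0 /mnt/md0
            ;;
        stop)
            "$UMOUNT" /mnt/md0 && \
            "$MDADM" --stop /dev/md0
            ;;
        *)
            echo "Usage: md0_service {start|stop}" >&2
            return 1
            ;;
    esac
}
```

The && chaining makes each phase stop at the first failure, so a failed assemble never leads to a mount attempt.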
Now the RAID service is up and running, and it will be ready to serve after a reboot as well. It’s time to create files!
Tips and Tricks
In case you want to re-create a RAID volume, the superblocks describing the old volume should be cleared first. Then mdadm won’t warn about them during creation.
$ mdadm --zero-superblock /dev/sdb1
$ mdadm --zero-superblock /dev/sdc1
If performance is very important together with redundancy, it is possible to create a persistent RAM disk, combining immense speed with data that survives a system reboot. Linux can expose a portion of RAM through the /dev/ram devices. Such a block device can be combined with a regular disk partition (e.g. /dev/sdb1) when creating a RAID volume: the RAM disk gives the speed and the other disk gives the persistence. What’s important is that the array’s primary device is the RAM disk, otherwise the speed boost isn’t achieved. The primary device is the device that, at a given point in time, defines what data the other disk should replicate (the first device specified to the mdadm tool below).
The default RAM disk size in Linux is 4096 kB, which is too small, so it needs to be changed in the kernel. I tested this with a 1000 MB RAM disk (1024000 kB). The kernel option CONFIG_BLK_DEV_RAM_SIZE can be found here:
Device Drivers —> Block devices —> Default RAM disk size (kbytes)
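In .config terms this corresponds to the following (the size value is taken from the 1000 MB example above; the RAM block device option itself must also be enabled):

```
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_SIZE=1024000
```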
Creating the persistent RAM disk (assuming /dev/ram0 and /dev/sdb1 have the same size):
$ mdadm --create /dev/md1 --level=raid1 --raid-devices=2 /dev/ram0 /dev/sdb1
After a reboot the RAM disk is blank, which means the array needs to be started with only a single member. When the RAM disk is then added to the array, the system restores its data. To get the order right (and thus proper speed), the regular disk is removed, making the RAM disk the primary, and is finally added back as a secondary device.
The --run option to mdadm tells the tool to start the array even if only a single device is specified:
$ mdadm --assemble --run /dev/md1 /dev/sdb1
$ mdadm /dev/md1 --add /dev/ram0
$ sleep 60 # wait for the RAM disk device to become active in the raid
$ mdadm /dev/md1 --fail /dev/sdb1
$ mdadm /dev/md1 --remove /dev/sdb1
$ mdadm /dev/md1 --add /dev/sdb1
Performance result with incorrect primary device:
$ hdparm -t /dev/md1
/dev/md1:
 Timing buffered disk reads: 378 MB in 3.00 seconds = 125.88 MB/sec
Performance result with correct primary device:
$ hdparm -t /dev/md1
/dev/md1:
 Timing buffered disk reads: 996 MB in 0.58 seconds = 1721.02 MB/sec
Additional RAID volumes
Additional RAID volumes can be created with the same commands as described in this article; only the device names will differ. The md driver supports devices such as /dev/md0, /dev/md1, /dev/md2 and so on. Creating several volumes on the same disks, with more than one partition per disk (e.g. /dev/sdb1 for one volume and /dev/sdb2 for another), is also possible, but it complicates failure scenarios. Using dedicated disks is therefore recommended.