Lustre file system

A Lustre file system was built in a correlation room on the first floor of the Jangyoung-Sil Hall. The primary use case of the Lustre file system is the correlation of VLBI data from the KVN (Korean VLBI Network) or the EAVN (East Asia VLBI Network). Currently, the FPGA-based Daejeon correlator writes its output to a 100 TB Isilon system, which is now quite old and has limited storage space. The new Lustre file system will replace the Isilon machine, and it will also be used as scratch space for the Polaris cluster.

JK wanted to keep a record of his experience of building the Lustre file system, so he made these web pages. He also thinks that the information in these pages might be useful to anyone who wants to build a decent Lustre system.

JK would like to thank Dong-Sik Kim, who devoted his time and energy to setting up the current Lustre file system.

If you have any questions about the Lustre system, please contact Dr. Jongsoo Kim (jskim@kasi.re.kr).


Hardware

The Lustre system is composed of five nodes (Dell PowerEdge R730, each with two Intel Xeon E5-2699 v3 CPUs @ 2.3 GHz), 240 3 TB disks (20 Dell PowerVault MD1200 boxes), 4 RAID controller cards (LSI MegaRAID SAS 9286CV-8eCC), one FDR InfiniBand switch (Mellanox SX6025), and 5 FDR IB (InfiniBand) adapter cards (Mellanox ConnectX-3 VPI MCX353A-FCBT). Except for the five nodes, all other components come from the pre-existing Polaris cluster. One node is reserved for the combined MGS/MDS, and the other four nodes are OSSs. There is no failover capability right now, but JK plans to add it later.

MGS/MDS

The MGS/MDS node has 36 Intel Xeon E5-2699 v3 cores @ 2.3 GHz, 256 GB of memory, two 200 GB SSDs, and two 1.8 TB SSDs. The two small SSDs (RAID 1) are used for the OS, and the two large SSDs (RAID 1) for the MDT.

OSS

Each OSS node has 36 Intel Xeon E5-2699 v3 cores @ 2.3 GHz, 64 GB of memory, two 500 GB SATA disks (RAID 1, OS), one RAID controller card, and five Dell PowerVault MD1200 boxes. Each MD1200 holds twelve 3 TB disks, so each OSS has 60 disks with a total raw capacity of 180 TB.

IB

Each node has one IB card and is connected to the single IB switch. The Polaris cluster nodes are also connected to the same switch; they will mount the Lustre file system as Lustre clients.


RAID configuration

RAID 6 (8+2 disks) is used for all OSTs. Each OSS has five MD1200 boxes with 12 disks per box, 60 disks in total. These are grouped into 6 OSTs, each composed of 10 disks, two from each box. Because RAID 6 tolerates the loss of two disks, and each OST takes only two disks from any given box, the RAID volumes keep operating even if one entire box becomes unavailable.

The following are important properties of a RAID 6 logical volume:

  • Stripe Size: 128 KiB
  • Disk Cache Policy: Disabled
  • Read Policy: Always Read Ahead
  • IO Policy: Direct IO
  • Write Policy: Write Back with BBU
  • Data Protection: Disabled

It is quite important to pick the right “strip size” for a logical RAID 6 volume composed of 10 disks. Since the default stripe size of a Lustre file system is 1 MiB, 128 KiB was chosen as the strip (chunk) size of the RAID volume. With 8 data disks, 8 × 128 KiB = 1 MiB, which is both the Lustre stripe size and the full stripe width of the RAID 6 volume.
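
For reference, a RAID 6 volume with these properties could be created from the command line with the MegaCLI utility. The following is only a sketch, not a record of the exact commands used here: the binary name (MegaCli64), the adapter number, and especially the enclosure:slot identifiers are placeholders that must be replaced with the values reported by the controller.

# List physical disks to find the real enclosure:slot IDs
MegaCli64 -PDList -aALL | grep -E "Enclosure Device ID|Slot Number"

# Create a RAID 6 logical volume from 10 disks (two from each MD1200 box) with a
# 128 KB strip size, Always Read Ahead, Direct IO, and Write Back backed by the BBU.
# The [enclosure:slot] list below is hypothetical.
MegaCli64 -CfgLdAdd -r6[8:0,8:1,9:0,9:1,10:0,10:1,11:0,11:1,12:0,12:1] \
    WB RA Direct NoCachedBadBBU -strpsz128 -a0

# Disable the on-disk write caches (Disk Cache Policy: Disabled)
MegaCli64 -LDSetProp -DisDskCache -LALL -a0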

We observed that the “Write Back” option gave better performance than the “Write Through” option. With “Write Through”, the OSS node carries a much heavier disk I/O load.
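
The write policy of an existing logical volume can also be switched on the fly with MegaCLI, which makes it easy to repeat such a comparison. A sketch (the adapter number is a placeholder):

# Switch every logical drive on adapter 0 to Write Through, run the benchmark, ...
MegaCli64 -LDSetProp WT -LALL -a0
# ... and switch back to Write Back with BBU protection afterwards
MegaCli64 -LDSetProp WB -LALL -a0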


Prerequisites

Here we assume that there are one MGS/MDS node and four OSS nodes. Their host names are mds0 and oss[0-3]. All nodes have both the Lustre server and client software installed. The IB IP addresses of mds0 and oss[0-3] are 192.168.1.99 through 192.168.1.103, respectively.

OS installation

As of February 2015, Lustre 2.6 was available. We chose CentOS 6.5, whose kernel version matches the one supported by Lustre 2.6.
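
Because the Lustre server modules are built for a specific kernel, it is worth confirming that the running kernel matches the kernel version the installed Lustre packages expect. A minimal check, assuming the packages are already installed (package names differ between Lustre releases):

# Running kernel version
uname -r
# Installed kernel and Lustre packages; the kernel versions should agree
rpm -qa | grep -Ei '^kernel|lustre' | sort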

Disabling SELinux (the Security Enhanced Linux)

On each node, set the SELINUX=disabled option in /etc/selinux/config.
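
Assuming pdsh is already available, the setting can be pushed to all nodes at once with a one-liner such as the following sketch; the change takes effect at the next boot.

# Set SELINUX=disabled on every node (effective after a reboot)
pdsh -w mds0,oss[0-3] "sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config"

The state on each node can then be checked with sestatus: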

[root@mds0]# pdsh -w mds0,oss[0-3] sestatus
mds0: SELinux status:     disabled
oss0: SELinux status:     disabled
oss1: SELinux status:     disabled
oss2: SELinux status:     disabled
oss3: SELinux status:     disabled

ntpd installation

On each node, the following commands install, enable, and start ntpd.

yum install ntp ntpdate ntp-doc
chkconfig ntpd on
ntpdate pool.ntp.org
/etc/init.d/ntpd start

Infiniband OFED installation

JK first tried to install the Mellanox OFED, but matching the OFED packages to the OS kernel version makes that installation procedure quite time-consuming. Instead, he decided to use the OFED stack shipped with RHEL6/CentOS 6, which is much simpler.

JK installed OFED on CentOS 6.5 using the following commands.

yum groupinstall "Infiniband Support"
yum install infiniband-diags perftest qperf opensm
chkconfig rdma on
reboot
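
After the reboot, the tools installed above can be used to check that the IB link is up and performing as expected. A quick sketch (the host names are those of this setup):

# The port state should be Active and the rate 56 (FDR)
ibstat
# List all hosts visible on the IB fabric
ibhosts
# Raw RDMA bandwidth test between two nodes, from the perftest package:
# run "ib_write_bw" on oss0, then "ib_write_bw oss0" on mds0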

One problem with the RHEL6 OFED is that the system failed to unload the rdma_cm, ib_cm, and iw_cm modules at shutdown. JK found a solution to this problem: he added “lustre_rmmod” to the stop section of the rdma init script. The solution is not elegant, but it works very well.
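
A sketch of what that added line might look like in the stop section of /etc/init.d/rdma; the surrounding lines only indicate the structure of the stock script and are not copied from it:

stop() {
    # Unload the Lustre modules first, so that rdma_cm, ib_cm, and iw_cm are no
    # longer held busy when the script tries to remove them (added line).
    lustre_rmmod
    # ... the rest of the original stop section follows unchanged ...
}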


Lustre installation

Here we assume, as above, that there are one MGS/MDS node and four OSS nodes. Their host names are mds0 and oss[0-3]. All nodes have both the Lustre server and client software installed. The IB IP addresses of mds0 and oss[0-3] are 192.168.1.99 through 192.168.1.103, respectively.

Lustre Networking (LNET)

Right now IB is the only network for the LNET. JK created a new file, lustre.conf, in /etc/modprobe.d, and added the following line.

options lnet networks=o2ib(ib0)
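
Once the Lustre modules are loaded on a node, the LNET configuration can be verified with lctl. A quick sanity check might be:

# The local NID should appear as <IB address>@o2ib
lctl list_nids
# LNET-level ping of another node, e.g. the MGS/MDS from an OSS
lctl ping 192.168.1.99@o2ib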

MGS/MDS

JK built the combined MGS/MDT on the server with the following commands. The /dev/sdb device is a RAID 1 volume composed of the two 1.8 TB SSDs.

mkfs.lustre --fsname=scratch --mgs --mdt --index=0 /dev/sdb
mkdir /mnt/scratch-mdt0
mount -t lustre /dev/sdb /mnt/scratch-mdt0

And JK included the following line in /etc/fstab.

LABEL=scratch-MDT0000 /mnt/scratch-mdt0 lustre defaults,_netdev 0 0

OSS0

On the OSS0 node, there are six RAID 6 volumes, from /dev/sdc to /dev/sdh. The following commands create six OSTs.

mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=0 /dev/sdc
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=1 /dev/sdd
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=2 /dev/sde
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=3 /dev/sdf
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=4 /dev/sdg
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=5 /dev/sdh

JK made six mount-points, named from /mnt/scratch-ost00 to /mnt/scratch-ost05.

mkdir /mnt/scratch-ost00
mkdir /mnt/scratch-ost01
mkdir /mnt/scratch-ost02
mkdir /mnt/scratch-ost03
mkdir /mnt/scratch-ost04
mkdir /mnt/scratch-ost05

He also added the following lines in /etc/fstab,

LABEL=scratch-OST0000 /mnt/scratch-ost00 lustre defaults,_netdev 0 0
LABEL=scratch-OST0001 /mnt/scratch-ost01 lustre defaults,_netdev 0 0
LABEL=scratch-OST0002 /mnt/scratch-ost02 lustre defaults,_netdev 0 0
LABEL=scratch-OST0003 /mnt/scratch-ost03 lustre defaults,_netdev 0 0
LABEL=scratch-OST0004 /mnt/scratch-ost04 lustre defaults,_netdev 0 0
LABEL=scratch-OST0005 /mnt/scratch-ost05 lustre defaults,_netdev 0 0

And then mount the six lustre file systems using the following command.

mount -a -t lustre

OSS1

On OSS1, JK also created six OSTs using the following commands.

mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=6 /dev/sdc
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=7 /dev/sdd
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=8 /dev/sde
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=9 /dev/sdf
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=10 /dev/sdg
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=11 /dev/sdh

JK made six mount-points.

mkdir /mnt/scratch-ost06
mkdir /mnt/scratch-ost07
mkdir /mnt/scratch-ost08
mkdir /mnt/scratch-ost09
mkdir /mnt/scratch-ost0a
mkdir /mnt/scratch-ost0b

Then he added the following lines in /etc/fstab.

LABEL=scratch-OST0006 /mnt/scratch-ost06 lustre defaults,_netdev 0 0
LABEL=scratch-OST0007 /mnt/scratch-ost07 lustre defaults,_netdev 0 0
LABEL=scratch-OST0008 /mnt/scratch-ost08 lustre defaults,_netdev 0 0
LABEL=scratch-OST0009 /mnt/scratch-ost09 lustre defaults,_netdev 0 0
LABEL=scratch-OST000a /mnt/scratch-ost0a lustre defaults,_netdev 0 0
LABEL=scratch-OST000b /mnt/scratch-ost0b lustre defaults,_netdev 0 0

And then mount the six lustre file systems using the following command.

mount -a -t lustre

On OSS2, JK also created six OSTs using the same commands, bringing the total number of OSTs to 18.
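
For completeness, the OSS2 commands follow the same pattern with OST indices 12 through 17; a sketch, assuming the same /dev/sdc to /dev/sdh device names as on the other OSS nodes:

mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=12 /dev/sdc
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=13 /dev/sdd
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=14 /dev/sde
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=15 /dev/sdf
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=16 /dev/sdg
mkfs.lustre --fsname=scratch --mgsnode=192.168.1.99@o2ib --ost --index=17 /dev/sdh

The mount points (/mnt/scratch-ost0c to /mnt/scratch-ost11) and the /etc/fstab entries (LABEL=scratch-OST000c to LABEL=scratch-OST0011) follow the same hexadecimal naming scheme as on OSS1.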

Clients

After installing the Lustre client software, make a mount-point and then mount the file system using the following commands.

mkdir /scratch
mount -t lustre -o defaults,_netdev 192.168.1.99@o2ib:/scratch /scratch

JK also added the following line to /etc/fstab on each client so that the /scratch file system is mounted when the client boots up.

192.168.1.99@o2ib:/scratch /scratch lustre defaults,_netdev 0 0

The output of the following command, lfs df -h, is:

# lfs df -h
UUID                       bytes        Used   Available Use% Mounted on
scratch-MDT0000_UUID        1.1T        4.2G        1.0T   0% /scratch[MDT:0]
scratch-OST0000_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:0]
scratch-OST0001_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:1]
scratch-OST0002_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:2]
scratch-OST0003_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:3]
scratch-OST0004_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:4]
scratch-OST0005_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:5]
scratch-OST0006_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:6]
scratch-OST0007_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:7]
scratch-OST0008_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:8]
scratch-OST0009_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:9]
scratch-OST000a_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:10]
scratch-OST000b_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:11]
scratch-OST000c_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:12]
scratch-OST000d_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:13]
scratch-OST000e_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:14]
scratch-OST000f_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:15]
scratch-OST0010_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:16]
scratch-OST0011_UUID       21.8T        6.6T       14.1T  32% /scratch[OST:17]

filesystem summary:       392.8T      118.2T      254.5T  32% /scratch
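
From a client, the striping defaults discussed earlier (e.g. the 1 MiB stripe size) can be inspected, and striping can be adjusted per directory, with the lfs tool. A short sketch; the directory name is arbitrary:

# Show the default striping of the file system root
lfs getstripe -d /scratch
# Stripe files created in this directory across all OSTs (stripe count -1)
mkdir /scratch/correlator_output
lfs setstripe -c -1 /scratch/correlator_output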


Performance

On OSS2, the performance of OSTs was tested using the following command,

# nobjlo=2 thrhi=16 size=24576 case=disk sh /usr/bin/obdfilter-survey

where size is given in MB. The value 24576 corresponds to 24 GiB per OST; with six OSTs this gives the total size of 150994944K (144 GiB) shown in the output below. The output of the command is as follows:

Sun Mar 8 22:01:19 KST 2015 Obdfilter-survey for case=disk from oss2
ost  6 sz 150994944K rsz 1024K obj   12 thr   12 write 1642.82 [  86.98, 486.95] rewrite 1638.34 [ 181.99, 330.98] read 1559.87 [ 174.99, 581.98]
ost  6 sz 150994944K rsz 1024K obj   12 thr   24 write 1639.56 [ 101.00, 450.98] rewrite 1636.98 [ 208.99, 329.99] read 1836.15 [ 193.00,2808.95]
ost  6 sz 150994944K rsz 1024K obj   12 thr   48 write 1638.58 [ 116.99, 450.99] rewrite 1633.04 [ 184.00, 369.99] read 2006.77 [  51.00, 547.99]
ost  6 sz 150994944K rsz 1024K obj   12 thr   96 write 1636.80 [  99.00, 436.97] rewrite 1596.56 [ 172.00, 755.98] read 1879.86 [   5.00, 877.98]
ost  6 sz 150994944K rsz 1024K obj   24 thr   24 write 1639.44 [  79.99, 478.99] rewrite 1633.90 [ 211.00, 336.99] read 1592.10 [  68.00,1497.95]
ost  6 sz 150994944K rsz 1024K obj   24 thr   48 write 1635.25 [  80.00, 473.98] rewrite 1626.25 [ 198.00, 648.99] read 1943.95 [  42.00, 497.95]
ost  6 sz 150994944K rsz 1024K obj   24 thr   96 write 1626.13 [ 139.99, 519.99] rewrite 1571.59 [ 153.99,1027.98] read 1785.66 [  26.00, 533.99]
ost  6 sz 150994944K rsz 1024K obj   48 thr   48 write 1629.72 [ 178.00, 479.98] rewrite 1623.61 [ 206.98, 350.99] read 1820.49 [  97.00, 397.99]
ost  6 sz 150994944K rsz 1024K obj   48 thr   96 write 1599.62 [ 149.00, 686.98] rewrite 1584.76 [ 170.00,1035.97] read 1905.63 [  31.00, 449.99]
ost  6 sz 150994944K rsz 1024K obj   96 thr   96 write 1608.10 [ 132.00, 355.99] rewrite 1603.28 [ 215.97, 469.99] read 1938.97 [  58.00, 414.99]

The aggregate write performance of the six OSTs on OSS2 is about 1.6 GB/s and depends little on the number of objects and threads. The aggregate read performance ranges from about 1.6 to 2.0 GB/s.
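
As a separate, rough client-side check (not part of the survey above), single-stream throughput through the full Lustre stack can be estimated with dd from any client; the file name and size are arbitrary:

# Write 16 GiB with direct I/O to bypass the client page cache, then clean up
dd if=/dev/zero of=/scratch/ddtest bs=1M count=16384 oflag=direct
rm /scratch/ddtest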