NFS Cluster Woes

A far too venerable cluster (Scientific Linux release 6.2, 2.6.32 kernel, Opteron 6212 processors) with more than 800 user accounts makes use of NFSv4 to access storage directories. It is a typical architecture: a management node and a login node, plus a number of compute nodes. The directory /usr/local is on the management node and mounted across to the login and compute nodes. User and project directories are distributed across two storage arrays, appropriately named storage1 and storage2.


[root@edward-m ~]# cat /etc/fstab
/dev/mapper/vg_edwardm-lv_root / ext4 defaults 1 1
UUID=ae88f150-2b1f-4760-940a-242b6deea5a7 /boot ext4 defaults 1 2
/dev/mapper/vg_edwardm-lv_swap swap swap defaults 0 0
/dev/mapper/vg_edwardm-usrlocal /usr/local xfs defaults 0 0
/dev/mapper/vg_edwardm-install /install xfs defaults 0 0
tmpfs /dev/shm tmpfs defaults 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
sysfs /sys sysfs defaults 0 0
proc /proc proc defaults 0 0
stg1-s:/data/users /data/user1 nfs hard,noatime,tcp 0 0
stg1-s:/data/projects /data/project1 nfs hard,noatime,tcp,fg 0 0
stg2-s:/data/users /data/user2 nfs hard,noatime,tcp 0 0
stg2-s:/data/projects /data/project2 nfs hard,noatime,tcp 0 0

A user, or users, ran some jobs with the results output to their home directories. This filled the /data/users partition to close to 100% on storage1 and to 100% on storage2.


root@storage2:/data/users# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg-root 263G 14G 237G 6% /
tmpfs 16G 0 16G 0% /lib/init/rw
udev 16G 240K 16G 1% /dev
tmpfs 16G 0 16G 0% /dev/shm
/dev/sdg1 228M 19M 198M 9% /boot
/dev/mapper/data-users
6.0T 6.0T 80K 100% /data/users
/dev/mapper/data-projects
10T 9.7T 341G 97% /data/projects

These are the two directories on each storage node that should be mounted across to the management and compute nodes.


root@storage1:~# cat /etc/exports
/data/users 192.168.0.0/24(rw,async,fsid=0,no_subtree_check,no_root_squash)
/data/projects 192.168.0.0/24(rw,async,fsid=1,no_subtree_check,no_root_squash)
root@storage2:~# cat /etc/exports
/data/users 192.168.0.0/24(rw,async,fsid=0,no_subtree_check,no_root_squash)
/data/projects 192.168.0.0/24(rw,async,fsid=1,no_subtree_check,no_root_squash)
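
A quick way to confirm these exports are actually visible from the management node is sketched below, assuming `showmount` and `exportfs` from the standard NFS utility packages:


# from edward-m: list what each storage node is exporting
showmount -e stg1-s
showmount -e stg2-s
# on a storage node, after any change to /etc/exports, re-export with:
exportfs -ra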

As tends to be the case when this happens, NFS went a bit mad. The symptoms in this case were that home and project directories were not accessible, and running a `df` on edward-m would cause it to hang. I could still log on to stg-1 and stg-2, and a quick check confirmed that user data was still available. There were some directories on stg-2, however, where a `cd` or `ls` would hang. I also checked the PBS queue, which showed that whilst 70 jobs were running, PBS was unable to get a full list of nodes to provide an accurate view of utilisation.
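
The queue check itself was nothing fancier than the standard Torque client commands; a rough sketch of the sort of thing, assuming `qstat` and `pbsnodes` are on the path:


# count jobs currently in the running state
qstat | grep -c " R "
# list the nodes the PBS server considers down, offline, or unknown
pbsnodes -l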

I went through the process of checking dmesg, syslog, and /var/log/messages on edward-m, stg-1, and stg-2, restarting the NFS server and client services on stg-1 and stg-2, and remounting the NFS mounts from edward-m. This brought stg-1 back up and allowed NFS connections to the user and project directories on that system. A mild cheer at this point.
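
The log trawling was nothing more than grepping the usual suspects on each host; a rough sketch (the storage nodes look Debian-flavoured, so the main log there is /var/log/syslog rather than /var/log/messages):


# on each of edward-m, stg-1, and stg-2
dmesg | grep -iE 'nfs|rpc' | tail -20
grep -iE 'nfs|rpc' /var/log/messages | tail -50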

On the storage nodes

/etc/init.d/nfs-kernel-server restart;/etc/init.d/nfs-common restart

On the management node

mount -a

This is the usual process when such events happen: restart NFS on the storage nodes, move some data elsewhere (or reboot if that is not possible), and `mount -a` on the management node. It did not work in this instance.

Notably, if `mount -a` fails to mount a filesystem it can fail silently. It is probably preferable to mount the points individually, e.g.,


mount stg1-s:/data/users /data/user1
mount stg2-s:/data/users /data/user2
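
Since the failure mode is silence, a small loop that mounts each point explicitly and reports the exit status is useful; a minimal sketch against the fstab entries above (note that with `hard` mounts and a dead server, any one of these can simply block):


#!/bin/bash
# mount each NFS point from fstab individually and report failures
for mp in /data/user1 /data/project1 /data/user2 /data/project2; do
    if mount "$mp"; then
        echo "mounted: $mp"
    else
        echo "FAILED:  $mp" >&2
    fi
done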

Running a `tcpdump` was illustrative of the problem.


root@storage2:~# tcpdump -n -i eth1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), capture size 65535 bytes
10:48:02.625435 IP 192.168.0.81.747688002 > 192.168.0.246.2049: 84 getattr fh Unknown/0100010003000000000000000000000000000000000000000000000000000000
10:48:02.625442 IP 192.168.0.246.2049 > 192.168.0.81.675: Flags [.], ack 3523948961, win 70, options [nop,nop,TS val 1991902835 ecr 3394660396], length 0
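
For a less noisy view, the capture can be restricted to the NFS traffic itself; a sketch, assuming the server is on the standard port:


# capture only NFS traffic (TCP port 2049) on the storage interface
tcpdump -n -i eth1 tcp port 2049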

I checked the port mapping between the management node and stg-2 with `rpcinfo -p`, and that seemed fine.


[root@edward-m ~]# rpcinfo -p 192.168.0.246
program vers proto port service
100000 2 tcp 111 portmapper
100021 1 udp 58959 nlockmgr

etc, including nfs


root@storage2:~# rpcinfo -p 192.168.0.236
program vers proto port
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper

etc, including nfs
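
Beyond listing the registered services, rpcinfo can also call the NFS program directly, which is a slightly stronger check that the service is actually answering; a sketch from edward-m:


# probe the NFS and mount daemons on stg-2 over TCP
rpcinfo -t 192.168.0.246 nfs 4
rpcinfo -t 192.168.0.246 mountd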

I went through the stg-1 data directories and found two users who were using an inordinate amount of storage space. One of these had not logged in since 2012, and I have confirmed that they are happy for the terabyte(!) of data they were using to be deleted or archived. That has significantly improved the state of /data/users on stg-1. Another mild cheer.
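
Finding those offenders was a one-liner on the storage node itself; a minimal sketch (`sort -h` needs a reasonably recent coreutils):


# per-user usage under /data/users, largest last
du -sh /data/users/* 2>/dev/null | sort -h | tail -20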

For stg-2, restarting the NFS servers had not helped. I wrote a short script that went through the user directories and ran a disk usage summary on each. Whilst this took quite a while to run and eventually ended up hanging, I did identify at least one user who was using 717G of storage in their home directory and hadn't touched the system in almost a year, and thus began the process of shifting this data to another location.


#!/bin/bash
cd /data/users/user1; /root/diskuse.sh ;
cd /data/users/user2; /root/diskuse.sh ;
..
cd /data/users/usern; /root/diskuse.sh ;


#!/bin/bash
DU=/root/diskuse$(date +%Y%m%d);
echo $(pwd) >> $DU
du -sh >> $DU
exit
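
The same report can be produced without hard-coding each user directory; a sketch consolidating the two scripts above, with the same caveat that it will hang on any directory the I/O problems have made unreachable:


#!/bin/bash
# write a dated per-user disk usage report
DU=/root/diskuse$(date +%Y%m%d)
for dir in /data/users/*/; do
    echo "$dir" >> "$DU"
    du -sh "$dir" >> "$DU"
done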

At this stage around half of the users (those on stg-2) and projects could not access their data, and jobs could not be submitted, which is pretty scary under any circumstances. I rebooted stg-2 to clear the cache (usually an act of last resort) and, whilst it came back up, it was still unable to make the necessary NFS connections. Or at least I thought I rebooted stg-2; the backlog of blocked I/O was so great at this point that even a shutdown command was failing and falling into an uninterruptible sleep, which was sort of comic in its own right.
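
Processes stuck in uninterruptible sleep are at least easy to spot, which confirms the problem is blocked I/O rather than anything more exotic; a quick sketch:


# list processes in uninterruptible (D) sleep and what they are waiting on
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'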

With sufficient space now available on the storage array and the system rebooted, NFS was still refusing to mount properly; the only option left was to forcibly unmount the partitions and remount them.


[root@edward-m ~]# umount -f /data/user2
umount2: Device or resource busy
umount.nfs: /data/user2: device is busy
umount2: Device or resource busy
umount.nfs: /data/user2: device is busy
[root@edward-m ~]# umount -f /data/project2
umount2: Device or resource busy
umount.nfs: /data/project2: device is busy


[root@edward ~]# umount -f /data/user2
[root@edward ~]# umount -f /data/project2

Then remount as above.
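
Where a forced unmount keeps reporting busy, it can help to see what is holding the mount, or to fall back to a lazy unmount; a hedged sketch (`fuser` comes from psmisc, and may itself hang against a dead NFS server):


# show processes holding the mount point
fuser -vm /data/user2
# detach the filesystem lazily if the forced unmount will not release it
umount -l /data/user2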

This brought the symlinks back online; users and projects could access their data again, and some nodes started coming back to accept jobs as well. A much larger cheer!

But not everything was perfect:


[user2@edward ~]$ qsub -l nodes=edward047,nodes=1:ppn=2,walltime=0:10:0 -I
qsub: waiting for job 2045154.edward-m to start
qsub: job 2045154.edward-m ready
PBS: chdir to '/home/user2' failed: Stale file handle

A filehandle becomes stale whenever the file or directory referenced by the handle is removed by another host. That was certainly the case here.

Thus the compute nodes needed forcible remounting as well. Boo!


[root@edward047 ~]# umount -f /data/user2
[root@edward047 ~]# umount -f /data/project2
[root@edward047 ~]# mount -a
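
With a few dozen compute nodes this is tedious by hand; a minimal sketch of looping it over the nodes with ssh, using a hypothetical hostname range (edward001 through edward080) and assuming passwordless root ssh from the management node:


#!/bin/bash
# force-remount the stg-2 mounts on every compute node (hostname range is illustrative)
for n in $(seq -w 1 80); do
    ssh -o ConnectTimeout=5 root@edward0${n} \
        'umount -f /data/user2 /data/project2; mount -a' &
done
wait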


[user2@edward ~]$ qsub -l nodes=edward047,nodes=1:ppn=2,walltime=0:10:0 -I
qsub: waiting for job 2045157.edward-m to start
qsub: job 2045157.edward-m ready
qsub: job 2045157.edward-m completed

The job completed a little too quickly; disconcertingly, `mdiag -n` and `pbsnodes -ln` gave different results. So a number of users, whilst they have access to their data again, still cannot submit jobs successfully.