NFS problems under investigation

Synopsis
Begin	2025-09-12 08:25:00
End	2025-12-04 12:00:00
Affected	clusternodes, lilo’s using /vol and /home

Since we moved the gateways of our networks from the old location to the new firewalls, we have received complaints about NFS filesystems being slow, showing longer delays, or being unavailable.

In general, a clusternode job should not depend on NFS if you can avoid it, because NFS I/O is always much slower than local I/O on /scratch.

We are investigating how we can optimise the network to resolve this issue, but we have not yet been able to pin down the exact cause. Our working assumption is that the extra traffic now passing through our firewalls degrades performance in some way.

Note that NFS shares always have performance variability, due to the shared nature of the network. We currently have no QoS in place for NFS/Samba (meaning that any one user can use all the bandwidth and, unintentionally, make life worse for all other users).

Update 2025-10-27

The memory in the home servers has been doubled. See cpk 1411.

Update 2025-11-05

We are rolling out a change that mounts NFS shares from the home6 server with the soft option instead of the current hard option; see nfs(5). If this is successful, we will enable it for all NFS shares.
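To check which mode a machine currently uses, the mount options listed in /proc/mounts can be inspected. The following is a minimal sketch, not our official tooling: the function names nfs_mount_mode and list_nfs_modes are made up for illustration. It assumes that a mount without an explicit soft option is a hard mount, which is the default described in nfs(5).

```shell
# Print "soft" or "hard" for a comma-separated NFS option string.
# Assumption: anything without an explicit "soft" option is a hard mount.
nfs_mount_mode() {
    case ",$1," in
        *,soft,*) echo soft ;;
        *)        echo hard ;;
    esac
}

# Print "<mountpoint> <mode>" for every NFS mount in a mounts table.
# Fields in /proc/mounts: device mountpoint fstype options dump pass
list_nfs_modes() {
    while read -r _dev mnt fstype opts _rest; do
        case "$fstype" in
            nfs|nfs4) printf '%s %s\n' "$mnt" "$(nfs_mount_mode "$opts")" ;;
        esac
    done < "${1:-/proc/mounts}"
}
```

Running list_nfs_modes without arguments reads /proc/mounts on the local machine. With a soft mount, an NFS request that keeps timing out eventually returns an error to the application instead of being retried indefinitely.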

Update 2025-11-24

We haven’t found any bottlenecks in our testing, and user feedback also suggests things are better. We are waiting for the monthly reboot of the home server and will then decide whether to roll out the soft mount for all NFS mounts.

Update 2025-11-25

One thing that caught our eye is heavy writes from SLURM cluster jobs that, in this case, made home6 less responsive. In general, intermediate results should be written to /scratch and not to a home directory. We have stopped these heavy-writing jobs.
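As an illustration of that advice, a job can do all its work in a directory on /scratch and copy only the final results to the home directory at the end. This is a rough sketch under assumptions, not the text of the SLURM howto: the function name run_on_scratch, the per-job directory layout, the output subdirectory and the RESULT_DIR variable are all hypothetical.

```shell
# Assumed locations; adjust to your own setup.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch}"
RESULT_DIR="${RESULT_DIR:-$HOME/results}"

# Run a command inside a per-job scratch directory.
# Convention (hypothetical): the command writes final results to ./output,
# everything else it writes is treated as an intermediate file.
run_on_scratch() {
    workdir="$SCRATCH_ROOT/${USER:-nobody}/${SLURM_JOB_ID:-$$}"
    mkdir -p "$workdir/output" "$RESULT_DIR" || return 1
    # Intermediate files stay on local disk while the job runs.
    # On failure the scratch directory is left in place for inspection.
    ( cd "$workdir" && "$@" ) || return 1
    # A single copy over NFS at the end of the job.
    cp -r "$workdir/output/." "$RESULT_DIR/" || return 1
    rm -rf "$workdir"
}
```

In a batch script this could be called as run_on_scratch followed by the absolute path of your program and its arguments; an absolute path is needed because the command runs from inside the scratch directory.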

Update 2025-12-04

The soft mount option has not led to an increase in issues, so we will enable it for all NFS mounts. We are also looking more aggressively at /home (ab)use from cluster nodes, as we see (on average) 40 MB/s of writes from the clusters. We’ve updated the SLURM howto and documented that /scratch should be used instead.
