CPK messages are initially sent to the CPK mailing list; you can (un)subscribe via this link. You can also follow the service interruption messages via RSS, using the link under the RSS icon in the title. If a CPK takes more time to resolve, updates are published on this website.

For RU-wide service interruptions, see meldingen.ru.nl.


Service Interruptions


1407: NFS problems under investigation

When we moved the gateways of our networks from the old location to the new firewalls, we received complaints about NFS filesystems being slow, showing longer delays, or being unavailable. In general, NFS should never be a requirement for a clusternode job if you can avoid it, because NFS I/O is always much slower than local I/O from /scratch. We are investigating how we can optimise the network to resolve this issue, but we are hard pressed to pinpoint the exact cause of the problem....
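
As an illustration of that advice, here is a minimal sketch of such a job (in Python; the input path, results path and the my_analysis program are hypothetical placeholders): copy the input from the NFS volume to node-local /scratch once, do all the work there, and copy the results back once.

    import shutil
    import subprocess
    import tempfile
    from pathlib import Path

    # Hypothetical NFS locations; substitute your own volume and filenames.
    NFS_INPUT = Path("/vol/mydata/input.dat")
    NFS_RESULTS = Path("/vol/mydata/results")

    # Stage the job in node-local /scratch: local I/O there is much faster
    # than doing many small reads and writes over NFS.
    with tempfile.TemporaryDirectory(dir="/scratch") as workdir:
        work = Path(workdir)
        local_input = work / NFS_INPUT.name
        shutil.copy2(NFS_INPUT, local_input)  # one bulk copy in
        # Hypothetical analysis program, assumed to write *.out files
        # into its working directory.
        subprocess.run([str(Path.home() / "bin" / "my_analysis"), str(local_input)],
                       cwd=work, check=True)
        NFS_RESULTS.mkdir(parents=True, exist_ok=True)
        for result in work.glob("*.out"):  # one bulk copy out
            shutil.copy2(result, NFS_RESULTS / result.name)

This way the NFS volume is touched only twice per job, instead of for every read and write during the computation.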

1415: Clusternode maintenance day - February 6th 2026

Every half year we do clusternode maintenance, with at least a package upgrade and a reboot; sometimes other maintenance happens as well, such as changes in filesystems or network configurations. The upcoming date for this maintenance is Friday, February 6th, 2026.

Resolved Reports


1396: Sending science mail work in progress

We are working to upgrade the mail servers. Due to caching in various parts, you may get messages about incorrect certificates or other problems; we expect this to resolve itself once we finish the updates. Update: we apparently still had problems with the certificates, but somehow missed this after last week’s fix. There’s a temporary fix in place, so we can work further on it tomorrow.

1395: Sending science mail broken certificate

A change in the configuration management code broke the certificate on the mail servers. We are trying to fix this properly, but until it is fixed, mail clients will complain about the server certificate. Update: fixed by restarting sendmail.
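
If you want to check for yourself whether the server certificate is fixed, a small diagnostic sketch like the following can help (Python; the hostname and port are assumptions, substitute the mail server and port your client is configured to use):

    import smtplib
    import ssl

    HOST = "smtp.science.ru.nl"  # assumption: substitute your configured mail server
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with smtplib.SMTP(HOST, 587, timeout=10) as smtp:
        # starttls() raises an ssl.SSLCertVerificationError if the
        # server presents a broken or mismatched certificate.
        smtp.starttls(context=context)
        print("certificate OK, expires:", smtp.sock.getpeercert()["notAfter"])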

1394: Clusternode maintenance day - August 5th

We picked August 5th for a planned downtime/reboot of all clusternodes in the Science Cluster. This will involve at least package updates and a reboot; if other maintenance can be included, we will try to fit that in as well.

1393: Servers and services unavailable due to physical moving on August 7th

We will be moving servers from server room ak008 (Huygens A-2.008) to other locations. Many services and virtual machines will be offline for a few hours while we disconnect, move and reconnect the hardware. Assuming everything goes well, the services should come back up once the machines are turned on again. This CPK will be updated with more servers and information about affected services in the weeks before the move on August 7th, 2025....

1392: Network failure in part of the network

Some servers are unreachable due to an unknown problem with the 25Gbit network switch in room ak008. We don’t know the root cause yet. Update: rebooting the affected switch has resolved the problem. The affected servers were:

DOWN amanda22.science.ru.nl
DOWN cephgrafana.science.ru.nl
DOWN cephgw3.science.ru.nl
DOWN cephgw4.science.ru.nl
DOWN cephmon2.science.ru.nl
DOWN cephosd07.science.ru.nl
DOWN cephosd08.science.ru.nl
DOWN cephosd09.science.ru.nl
DOWN cephosd10.science.ru.nl
DOWN cephosd11.science.ru.nl
DOWN cephosd12.science.ru.nl
DOWN cephosd13.science.ru.nl
DOWN cephosd23.science.ru.nl
DOWN cephosd26.science.ru.nl
DOWN cephosd27.science.ru.nl
DOWN chemotionvm.science.ru.nl
DOWN containervm02.science.ru.nl
DOWN dockervm01.science.ru.nl
DOWN dockervm02....

1391: Planned network disruption on some server networks

Services on networks that are currently behind our old picos switches will be moved to sit behind our firewalls. This may cause a few minutes of downtime due to the changes needed for moving the gateway functionality. If all goes well, it will be a single period of a few minutes; if we need to roll back and fix things, there may be repeat downtime. Update: all went well; there were a few minutes of interruption on the cncz homepage due to ARP caching (hosts keep the old gateway’s MAC address cached until their ARP entries expire)....

1390: Authentication downtime

Following the recent update of our LDAP servers’ certificates, multiple users have reported authentication failures when attempting to log in via RADIUS. The issue appears to affect only users with RADIUS-based authentication, while LDAP-based authentication continues to function normally. Affected users typically include HFML technicians, guest console users, and Science logins with Wi-Fi access.

Updated Jun 6, 2025 · Erik Joost Visser · Created Jun 6, 2025 · Erik Visser

1389: Servers unreachable

In the transition from the older 25Gbit switches to newer switches, two blocks of 4 connections were temporarily unavailable due to a peculiarity of the switches: ports are operated in blocks of 4, and the removal of an unused cable disabled the other ports in that block. Reseating one of the connections in one case, and re-inserting the unused cable in the other, got the blocks of ports working again....

1388: /vol/astro2 and /vol/astro5 unavailable

Due to errors on the /vol/astro2 filesystem, we had to reboot the fileserver comas1 and take it offline to perform repairs. During this process, both /vol/astro2 and /vol/astro5 were unavailable.

Updated May 12, 2025 · Erik Joost Visser · Created May 12, 2025 · Erik Visser

1387: /vol/astro6 and /vol/astro7 unavailable

Due to errors on the /vol/astro7 filesystem, we had to reboot the fileserver comas2 and take it offline to run repairs on the filesystems. During those repairs, /vol/astro7 and /vol/astro6 were unavailable. The last time this happened, it took 24 hours to complete.

Updated May 13, 2025 · Erik Joost Visser · Created May 12, 2025 · Eric Lieffers