CPK messages are initially sent to the CPK mailing list; you can (un)subscribe via this link. You can also follow the service interruption messages via RSS using the link in the title under the RSS icon. If a CPK takes longer to resolve, updates are published on this website.

 

Service Interruptions


1365: restore old ceph shares to new locations

Even though we closed CPK #1359, not all user shares have been restored. We have recovered the data from cephfs to temporary storage that is not accessible to users. It will take a bit more time to find new permanent locations for this data. The storage will no longer be cluster-based, but single-server ZFS with snapshots.

Resolved Reports


1337: Cephfs offline

After the power-down of the Huygens building, we are experiencing a problem bringing the Ceph file system back online. We currently do not know when the Ceph cluster will be operational again. Update 2023-08-01 10:30: Ceph is working again. This CPK is now closed. CPK#1338 is also closed. Update 2023-07-31 12:30: After some more support from 42on, we managed to restart the cephfs. We cannot be sure all files are there, but almost all files are....

Updated Aug 1, 2023  ·  Miek Gieben · Created Jul 22, 2023

1336: VPN service downtime

The VPNsec service will be moved to a new server. This move will cause downtime and existing VPN connections will be dropped. The downtime is not expected to exceed several minutes.

Updated Jul 6, 2023  ·  Wim Janssen · Created Jul 4, 2023

1335: Mailman disruption

Last Friday, a change to the Mailman configuration was rolled out that had the inadvertent effect that mail was no longer delivered to external addresses. Posts to internal Science mail addresses were still delivered successfully. The change has been rolled back for the moment, but it remains necessary, so we are looking for another solution.

Updated Sep 28, 2023  ·  Miek Gieben · Created Jul 3, 2023

1334: router change for most Science services (dr-huyg)

The connecting router (dr-huyg) for all servers in the subnets 131.174.30.0/24, 131.174.31.0/24 and 131.174.16.128/26 will be replaced. This is expected to interrupt connectivity for ca. 10 minutes, but unforeseen circumstances may lengthen the interruption. We are doing this now because of the planned power interruption on July 22, which the old router hardware is unlikely to survive.
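To check whether a particular server falls within the subnets named above, a minimal sketch using Python's standard `ipaddress` module could look like the following. The function name `is_affected` and the constant `AFFECTED_SUBNETS` are illustrative assumptions, not part of any Science IT tooling.

```python
import ipaddress

# Subnets listed in the announcement for the dr-huyg router replacement.
AFFECTED_SUBNETS = [
    ipaddress.ip_network("131.174.30.0/24"),
    ipaddress.ip_network("131.174.31.0/24"),
    ipaddress.ip_network("131.174.16.128/26"),
]


def is_affected(address: str) -> bool:
    """Return True if the IPv4 address lies in one of the listed subnets.

    Hypothetical helper for illustration only.
    """
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in AFFECTED_SUBNETS)
```

For example, `is_affected("131.174.30.5")` returns `True`, while `is_affected("131.174.16.200")` returns `False`, because the /26 subnet only covers 131.174.16.128 through 131.174.16.191.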

1333: Science IT services down July 21 and 22 - Huygens building power outage

On Friday July 21 from 17:00, we will start shutting down compute cluster nodes to prepare for the power outage of the Huygens building on Saturday July 22. Other servers will be shut down later. The most important servers (mail, home, file, Ceph, gitlab, loginservers) will be shut down starting Saturday morning at 7:00. We will try to keep basic services (DNS/DHCP, SMTP (mail) and license servers) up during this power outage. RU services are not served from the Huygens building, so they will not be affected....

1332: Certificate of authentication server expired

Due to the expiration of an LDAP certificate, it is temporarily not possible to log in to various services. A new certificate is being installed urgently. Affected services include Eduroam in combination with Science logins, VPN, GitLab and Mattermost.

1331: Downtime Felixdisk and bioboost

Due to a failure in a power distribution unit (PDU), the servers felixdisk and bioboost went down. Both servers have been connected to another PDU and are now working again.

1330: networking problems due to routing change

The planned routing change, which should not have caused issues for more than a few seconds, did not work as planned and caused problems for up to 15 minutes. Update 2023-06-12 - 22:00: The situation has become worse. DNS resolution, some fileservers and jupyterhub are having problems due to the network change. We will attempt to resolve the issue as soon as possible. Update 2023-06-13 - 11:30: After correcting errors (fixed IP addresses) all services are up again....

1329: DDOS on Science mailservers

Our SMTP mailservers were under attack. To prevent other problems, our configuration limits the number of connections that can be kept open at the same time. We cannot easily distinguish between connections made by the attacker(s) and those made by regular users. When this limit is reached, no new connections can be made. Sending e-mail through our mailservers can therefore take a long time or may not work at all. There’s a good chance that your IP address will be blocked (max....

1328: Climate control failure in Huygens Datacenter

The datacenter cooling failed around 07:00 this morning. To prevent damage, all non-essential systems are being turned off (cluster nodes first). Most fileservers have also been turned off. Due to the urgency, some systems that have been turned off may not be in the location with the problem (Huygens HG04.070). Around 07:50 the cooling system came online again; about 30 minutes later the temperature dropped to under 25 degrees Celsius. After ca....