1288: Ceph storage expansion caused performance issues
As a result of the expansion of the Ceph storage cluster, the cluster had performance and availability issues. The problems were resolved this morning.
CPK messages are first sent to the CPK mailing list; you can (un)subscribe via this link. You can also follow the service interruption messages via RSS, using the link in the title under the RSS icon. If a CPK takes longer to resolve, updates are published on this website.
For RU-wide service interruptions, see meldingen.ru.nl.
Two modules of an important switch in the main C&CZ server room lost power during preparations for planned maintenance. This disconnected ca. 75% of the servers in the room from the network. Moving the modules to new PDUs limited the downtime to ca. 15 minutes.
An error in the management software prevented all license processes from starting correctly at the reboot of the license server. After fixing this error, all licenses were available again.
Course software that had been tested caused an overload of the fileserver when it was used by 100 students. The performance of the fileserver was impaired for all users of network shares of this server.
A broken PDU took a switch offline, which made the VPN server unreachable (along with several other things that do not affect users).
Due to emergency maintenance, the central Microsoft Exchange server is unavailable for 4 hours. This may also affect systems that depend on Exchange. E-mail and calendar functionality is expected to be restored when the maintenance is done, around 13:30 today.
During a routine upgrade of Ceph, a bug in the latest version manifested itself and made the Ceph manager unreachable. After aborting the upgrade, and with help from the ceph-users mailing list, everything became available again using a workaround.
Because of security issues, the last remaining Windows 7 machines will be disabled as members of the Active Directory Domain B-FAC, effective 24-03-2021. Please upgrade these computers to a more up-to-date OS.
To change the network of lilo7, we need to reboot this loginserver. If you want a stable connection to a loginserver during this downtime, please use lilo6 or the soon to be taken down lilo5. For more info see the page on C&CZ loginservers.
Yesterday the SSD boot disk of this VM host first reported problems. This morning, this caused all VMs running on this host to stop. The problem was solved by moving the VMs to a different VM host. We will investigate how best to prevent this problem in the future or lessen its impact.