Ceph storage interruption

Synopsis
Begin	2022-10-23 23:06:45
End	2022-10-23 23:45:00
Affected	cephstorage pages.science.ru.nl ftp.science.ru.nl astro

Just before 23:09 a large number of Ceph storage nodes became unreachable. This appears to have been caused by one of the redundant links between two datacenter locations failing for about 4 seconds. The link failure led to many Ceph OSD processes being killed and then failing to start again. A generic configuration change made to all our servers had created an extra interface, which confused some of the OSD processes at startup, depending on the order of the interfaces. We are reasonably confident we can prevent this from happening again.
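
To illustrate why interface ordering can matter, here is a minimal Python sketch. It assumes the extra interface is a network interface, and it does not reproduce Ceph's actual address selection logic. The interface names, addresses and both helper functions are hypothetical; "extra0" stands in for the newly created interface.

```python
import ipaddress


def pick_by_order(interfaces):
    """Naive choice: use the address of whichever interface is listed first."""
    return interfaces[0][1]


def pick_by_network(interfaces, network):
    """Order-independent choice: use the address that lies inside `network`."""
    net = ipaddress.ip_network(network)
    for _name, addr in interfaces:
        if ipaddress.ip_address(addr) in net:
            return addr
    raise LookupError(f"no interface address in {network}")


# Hypothetical interface lists before and after the configuration change.
before = [("eth0", "10.0.0.5")]
after = [("extra0", "192.168.100.1"), ("eth0", "10.0.0.5")]

# Picking by position changes once the extra interface is listed first;
# picking by network does not.
print(pick_by_order(before), pick_by_order(after))
print(pick_by_network(after, "10.0.0.0/24"))
```

In Ceph itself, the `public_network` and `cluster_network` options in ceph.conf restrict daemons to specific subnets. Whether that matches the fix applied here is not stated in this report.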

Ceph storage automatically disables writing, and even reading, when not enough storage units are available. The data itself is still safe.
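
The rule can be sketched as follows. In Ceph, each pool has a `size` (number of replicas to keep) and a `min_size` (number of replicas that must be available before I/O is served). The `PlacementGroup` class below is a hypothetical simplification, and the values 3 and 2 are common settings, not necessarily the ones used on our cluster.

```python
from dataclasses import dataclass


@dataclass
class PlacementGroup:
    size: int      # replicas the pool wants to keep
    min_size: int  # replicas that must be reachable before I/O is served
    up: int        # replicas currently reachable

    def serves_io(self) -> bool:
        """Reads and writes are only served with at least min_size replicas up."""
        return self.up >= self.min_size


# With only one of three replicas reachable, I/O is blocked.
degraded = PlacementGroup(size=3, min_size=2, up=1)
print(degraded.serves_io())

# Once a second replica returns, I/O resumes.
recovered = PlacementGroup(size=3, min_size=2, up=2)
print(recovered.serves_io())
```

Blocking I/O in this state means no client can read or write a copy that may be out of date. The replicas on the unreachable storage units stay on disk and are used again once those units return.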

Some of our websites depend on the Ceph storage, such as sites on GitLab Pages and ftp.science.ru.nl. The storage unavailability also caused a high load on the web server, so other sites may have been affected as well.