Synopsis
Begin 2024-01-19 09:22:00
End 2024-04-15 11:00:00
Affected CephFS storage, ftp, install, RDR

Ceph filesystem failure.

Services required for the CephFS filesystem cannot start. While these services are down, files on Ceph cannot be accessed.
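
For reference, a minimal sketch of how such a failure typically shows up, assuming the affected services are the CephFS metadata servers (MDS); these are standard Ceph status commands, not output from our cluster:

    # Overall cluster health; a failed CephFS normally shows up here
    ceph health detail

    # Per-filesystem view: ranks, which MDS (if any) is active, degraded/failed state
    ceph fs status
    ceph mds stat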

We are in contact with our support partner 42on to remedy the problem. The duration of the outage is currently unknown.

Update 2024-01-22

We’re working with 42on on the issue and have a meeting scheduled for 17:00 today.

Update 2024-01-23

An initial dentry_recover was successful and according to 42on the Ceph journals are OK. We are hopeful that we can fix the issue with Ceph in the coming days.
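
For those interested in the details: a dentry recovery and journal check like this are normally done with cephfs-journal-tool, as described in the Ceph disaster-recovery documentation. The sketch below is illustrative only; the filesystem name and rank are placeholders, not necessarily our exact invocation.

    # Back up the MDS journal before touching it
    cephfs-journal-tool --rank=<fs_name>:0 journal export backup.bin

    # Check that the journal is readable and consistent
    cephfs-journal-tool --rank=<fs_name>:0 journal inspect

    # Recover dentries from the journal into the metadata store
    cephfs-journal-tool --rank=<fs_name>:0 event recover_dentries summary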

Update 2024-01-24

We made some progress yesterday, but CephFS isn’t up and running yet. Another meeting with 42on is scheduled for today.

We need to run another dentry_recover; this is a long-running process (4+ hours). After that we will meet with 42on again to consider the next steps; that meeting is scheduled for 15:00 tomorrow.

Update 2024-01-25

CephFS is back online.

We need to run some maintenance processes; this will happen during the weekend.

Update 2024-02-05

CephFS is still not healthy: first it was slow, then an attempt to improve the speed made the problem worse.

CephFS will possibly be unavailable again for part of this week in order to fix it.

Update 2024-02-06

Part of the filesystem is still accessible, notably the shares using 3copy and ec54, and to a lesser extent ec83. Unfortunately the biggest user of all, under ec83/rimlsfnwi, seems to be completely unusable at the moment. We are working on a solution with our support partner; meanwhile we are trying to find interim solutions for as many users as we can. We expect CephFS to be offline again for an extended period as part of the repair process.

Update 2024-02-15

We are working to copy as much data as we can from CephFS to other storage. We expect to take CephFS down (early?) next week and start the recovery.

Update 2024-02-19

It seems the problem has become worse: now everything on CephFS is unusable. We are therefore forced to abort our efforts to save data, and we have started the repair operation.

Update 2024-02-20

The recovery process is running… Earlier, this CPK didn’t mention the affected services; they have now been added to the metadata: cpk_affected: CephFS storage, ftp, install, RDR.

Update 2024-02-22

The install share is being restored in another location; the repair operation is still running.

Update 2024-02-26

The repair operation continues with the next steps; we think this may take a very long time (weeks?).

Update 2024-02-28

We are exploring alternative strategies for recovering the CephFS data. Tomorrow (the 29th) we expect guidance from 42on on how to proceed.

Update 2024-03-05

The scan is still running; we expect it to take another two weeks before we can continue with the next step, which should also take about three weeks. After that, another two steps follow which should be faster. At this time we expect access to CephFS within two months (lots of uncertainty in this estimate!).
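
For the technically curious: the scan referred to here is part of the standard cephfs-data-scan recovery procedure from the Ceph documentation, which matches the step names mentioned in the later updates. A simplified sketch, with the data pool name as a placeholder:

    # Rebuild CephFS metadata from the objects in the data pool (simplified outline)
    cephfs-data-scan init                       # prepare recovery state
    cephfs-data-scan scan_extents <data_pool>   # reconstruct file sizes and mtimes
    cephfs-data-scan scan_inodes <data_pool>    # recreate inode/backtrace metadata
    cephfs-data-scan scan_links                 # repair dentries and link counts
    cephfs-data-scan cleanup <data_pool>        # remove temporary recovery attributes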

There is an alternative, high-risk strategy, in which we skip the checks and try to mount the filesystem. We are reluctant to try this because of the risk of damaging the metadata.

Update 2024-03-07

The scan_extents step finished earlier than expected, and we are now running the next step (scan_inodes).

Update 2024-03-08

The scan_inodes step has already finished, so we have started the next step: scan_links. This step will probably take a while, because it cannot run in parallel.
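
For context: unlike scan_links, the scan_extents and scan_inodes steps can, according to the Ceph documentation, be sharded across multiple workers, which is part of why they can finish relatively quickly. An illustrative sketch (the worker count and pool name are placeholders, not the configuration actually used):

    # Example: split scan_inodes over 4 workers, each handling a slice of the objects
    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data_pool> &
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data_pool> &
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data_pool> &
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data_pool> &
    wait
    # scan_links has no worker options and runs as a single sequential process
    cephfs-data-scan scan_links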

Update 2024-03-15

After scan_links stopped on a problematic object without a parent, we had to wait for advice from our support partner. Something seems to be very wrong in our metadata, and the start of the recovery has to be delayed further.

Update 2024-03-19

Somewhat good news: we are now copying the RDR data out of CephFS. We have been able to mount CephFS again with help from 42on (our support partner). Copying to temporary storage is going OK, but we could run into issues at any time (or not). We’re hopeful that this process will let us recover the data stored in CephFS, after which we can look at future solutions.
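
For reference, mounting the recovered filesystem read-only and copying data off it comes down to something like the sketch below. The monitor address, paths, and rsync options are placeholders and assumptions, not our exact setup:

    # Let an MDS take the rank again so the filesystem becomes mountable
    ceph fs set <fs_name> joinable true

    # Mount read-only to avoid writing to the damaged filesystem
    mount -t ceph <mon_host>:/ /mnt/cephfs-recovery -o name=admin,secretfile=/etc/ceph/admin.secret,ro

    # Copy a tree (e.g. the RDR data) to temporary storage; restartable if it aborts
    rsync -aHAX --partial /mnt/cephfs-recovery/<rdr_path>/ /srv/tmp-storage/rdr/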

NB: contact postmaster if you have urgent requests for small sets of files in a particular location, so that we can restore these with priority. A petabyte of data takes weeks or months to copy, but a small amount (< 1 TB) can be retrieved relatively quickly.

Update 2024-03-22

We are still copying the data. To clarify the situation: we plan to copy all the data out of CephFS in order to reset CephFS completely. We are still considering options for after this reset is done.

Update 2024-03-26

We have already copied well over 50% of the data out of CephFS onto temporary storage. We are expecting new storage and more space in permanent locations for the CephFS data. We have been lucky so far in that we haven’t run into issues while copying the data.