Cephocalypse


This story began when we upgraded Ceph from Hammer to Infernalis, thinking it was the next stable release.

The upgrade was a little rough. There were some major changes to Ceph, and a big one was that it switched file ownership from root to ceph, an unprivileged account.

With some downtime scheduled, we ran a recursive chown across all OSD volumes (we spun up a tmux session for each drive, 34 drives in each server).
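For anyone facing the same chown, here is a rough sketch of doing it with xargs instead of one tmux session per drive. It is not what we ran. The OSD_ROOT path, the JOBS count, and the DRY_RUN switch are all assumptions for illustration, and it only prints the commands unless DRY_RUN is set to 0.

```shell
#!/usr/bin/env bash
# Sketch only: chown each OSD data directory to ceph:ceph, several in parallel.
# OSD_ROOT, JOBS and DRY_RUN are illustrative assumptions, not Ceph settings.
OSD_ROOT="${OSD_ROOT:-/var/lib/ceph/osd}"   # assumed default OSD data location
JOBS="${JOBS:-34}"                          # one worker per drive, like our tmux setup
DRY_RUN="${DRY_RUN:-1}"                     # 1 = print commands, 0 = really chown

chown_osds() {
    local runner=()
    # In dry-run mode, prefix each command with echo so nothing is changed.
    if [ "$DRY_RUN" != "0" ]; then runner=(echo); fi
    # One chown -R per top-level OSD directory, up to $JOBS at a time.
    find "$OSD_ROOT" -mindepth 1 -maxdepth 1 -type d -print0 |
        xargs -0 -r -n 1 -P "$JOBS" "${runner[@]}" chown -R ceph:ceph
}
```

Calling chown_osds with the defaults just lists what would run; DRY_RUN=0 chown_osds does the real thing and needs root.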

This ran overnight. In the morning, the cluster was still down. After a long morning of reading mailing list posts, we found this entry: http://www.spinics.net/lists/ceph-users/msg24220.html

Following the example, we marked all OSDs as down, then restarted all OSDs.
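Marking every OSD down can be looped rather than typed out. This is a minimal sketch, assuming the standard "ceph osd ls" and "ceph osd down" subcommands; the CEPH variable is only a hypothetical hook so the binary can be swapped out, and restarting the OSD daemons afterwards is left to whatever init system the node uses.

```shell
#!/usr/bin/env bash
# Sketch only: mark every OSD in the cluster as down.
# CEPH is an illustrative override for the ceph binary, not a Ceph setting.
CEPH="${CEPH:-ceph}"

mark_all_osds_down() {
    local id
    # "ceph osd ls" prints one OSD id per line.
    for id in $("$CEPH" osd ls); do
        "$CEPH" osd down "$id"
    done
    # Restarting the OSD daemons is a separate step and depends on the init system.
}
```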

We were back up!

After a day or so in production, we noticed some odd directory listing behavior. Using the built-in CephFS kernel module, all files showed up. However, since we share Ceph through NFS and Samba, our Windows and *nix clients using these protocols were showing an inconsistent directory listing.

I decided to compose an email to the ceph-users mailing list. One of the devs got back to me and recommended I set “mds bal frag” to true.

This was the beginning of a series of failures. That particular setting required a feature that was unsupported in Ceph at the time: an active/active MDS environment. Until then, we had only run active/standby.

Once we enabled this, the cephfs mount became more unstable.

So, we tried to correct the issue. We found someone’s blog entry on how to “fix” this situation, but it turned out to be very bad advice for us. We ran “ceph mds rmfailed 1”, which we thought would remove one of the active MDS servers.

It did, but it was a very destructive command, and it took the entire cluster out. It was akin to removing the partition table from a drive: all your data is still there. Sure, the bits exist, but they are inaccessible and basically useless.

Zheng, a brilliant developer who works for Red Hat, responded to our situation on the mailing list. He gave us the correct steps to remove an MDS and go back to active/standby, but we had already made the critical mistake of running the rmfailed command. There is no coming back from that.

He quickly wrote a patch against the git tag v9.2.0, which is what we were on.

I cloned the git repo, checked out that tag, and applied the patch.
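Roughly, that step looks like the sketch below. The function name and its arguments are made up for illustration, and the patch file is whatever you saved from the mailing list. The submodule line is an assumption based on the Ceph tree using git submodules.

```shell
#!/usr/bin/env bash
# Sketch only: clone a repo, check out a tag, and apply a patch on top.
# Usage: prepare_patched_tree REPO_URL TAG PATCH_FILE DEST_DIR
# PATCH_FILE must be an absolute path, because git runs inside DEST_DIR.
prepare_patched_tree() {
    local repo="$1" tag="$2" patch="$3" dest="$4"
    git clone "$repo" "$dest" || return 1
    git -C "$dest" checkout "$tag" || return 1
    # Assumption: pull in submodules at the versions the tag expects.
    git -C "$dest" submodule update --init --recursive || return 1
    # Confirm the patch applies cleanly before modifying the tree.
    git -C "$dest" apply --check "$patch" || return 1
    git -C "$dest" apply "$patch"
}
```

For our case the call would have been something like: prepare_patched_tree https://github.com/ceph/ceph.git v9.2.0 /root/mds.patch /root/ceph, where the patch path and destination are placeholders.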

The next part was fun: we had to recompile all of Ceph.

I’m used to the FreeBSD ports tree, which makes compiling software easier, as the Ports framework really does a lot of the heavy lifting. I’m no stranger to compiling by hand though, so we read the README, ran the helpful install-deps.sh script, and began the hour-long make process.

After the make completed, we then built the Debian packages. This was around 8pm mind you, and Steve and I had been on edge all day planning for the worst: total data loss.

It turns out, the dpkg build tool cleans up any previously built code. It took another 1.5 hours to re-make everything that we just made, and then tar up packages.

Around 9:30pm, we had our packages built.

First things first: we made a backup of /etc/ceph and /var/lib/ceph:

mkdir backup
rsync -av /etc/ceph backup/etc_ceph
rsync -av /var/lib/ceph backup/var_lib_ceph

Next, we uninstalled the current ceph packages:

dpkg -r ceph ceph-mds ceph-common radosgw 

The instructions were to “install the new monitor package on all of the nodes”. There is no individual monitor package (it is in ceph-common), and the dependency chain was large enough that we had to install pretty much every package we built:

dpkg -i ceph_9.2.0-1_amd64.deb  ceph-mds_9.2.0-1_amd64.deb ceph-common_9.2.0-1_amd64.deb radosgw_9.2.0-1_amd64.deb librbd1_9.2.0-1_amd64.deb librados2_9.2.0-1_amd64.deb libradosstriper1_9.2.0-1_amd64.deb python-cephfs_9.2.0-1_amd64.deb python-rados_9.2.0-1_amd64.deb python-rbd_9.2.0-1_amd64.deb

Finally, when everything was installed, we started the entire cluster back up.

We set the max mds back to 2, and ran our new command:

# ceph mds addfailed 1

No message, it just returned to the prompt.

We watched the MDS logs on both MDS servers, and it was like starting a car back up after you rebuilt the engine. I think I heard the server groan and sputter. After a few minutes where Steve and I just watched, paralyzed, we saw that both MDSs replayed and became active.

Finally we saw:

root@lts-osd1:~# ceph -s
    cluster cabd1728-2eca-4e18-a581-b4885364e5a4
     health HEALTH_OK
     monmap e2: 3 mons at {lts-mon=,lts-osd1=,lts-osd2=}
            election epoch 1462, quorum 0,1,2 lts-osd1,lts-osd2,lts-mon
     mdsmap e7962: 2/2/2 up {0=lts-osd1=up:active,1=lts-mon=up:active}
     osdmap e10299: 102 osds: 101 up, 101 in
      pgmap v6716380: 4192 pgs, 7 pools, 31748 GB data, 23494 kobjects
            96180 GB used, 273 TB / 367 TB avail
                4191 active+clean
                   1 active+clean+scrubbing+deep
  client io 1536 kB/s rd, 17 op/s

We were elated! We could not believe we had our cluster back.

The cephfs filesystem was mounted, and we immediately started to rsync the critical data back to a ZFS server. We left around 10pm.

The following morning, I came in to see the rsync had stalled with an I/O error, just as we left, no less.

Well, the active/active MDS scenario was still a bad idea, so I followed the 3 simple steps to take the entire cluster back to where we were the day before:

  1. make sure all active mds are running
  2. run ‘ceph mds set max_mds 1’
  3. run ‘ceph mds stop 1’
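Those three steps fit in a few lines of shell. This is just a sketch of the sequence above: the function name and the CEPH override are hypothetical, and "ceph mds stat" is used here only as one way to eyeball step 1 before continuing.

```shell
#!/usr/bin/env bash
# Sketch only: go from two active MDS daemons back to active/standby.
# CEPH is an illustrative override for the ceph binary, not a Ceph setting.
CEPH="${CEPH:-ceph}"

back_to_single_mds() {
    local rank="${1:-1}"    # rank of the extra active MDS to stop
    # Step 1: print MDS state so the operator can confirm every active MDS is up.
    "$CEPH" mds stat || return 1
    # Step 2: allow only one active MDS.
    "$CEPH" mds set max_mds 1 || return 1
    # Step 3: stop the extra rank so it falls back to standby.
    "$CEPH" mds stop "$rank"
}
```

Running back_to_single_mds 1 walks the steps in order and stops early if one of the first two fails.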

So, running those, we got back to the correct operating environment:

    cluster cabd1728-2eca-4e18-a581-b4885364e5a4
     health HEALTH_OK
     monmap e2: 3 mons at {lts-mon=,lts-osd1=,lts-osd2=}
            election epoch 1462, quorum 0,1,2 lts-osd1,lts-osd2,lts-mon
     mdsmap e7999: 1/1/1 up {0=lts-osd1=up:active}, 1 up:standby
     osdmap e10304: 102 osds: 101 up, 101 in
      pgmap v6724626: 4192 pgs, 7 pools, 31748 GB data, 23494 kobjects
            96179 GB used, 273 TB / 367 TB avail
                4191 active+clean
                   1 active+clean+scrubbing+deep
  client io 16362 kB/s rd, 22 op/s

Zheng had said it would probably not work the first time, and if so, to just restart the MDS processes and re-run step 3. It went exactly as he said: after restarting the MDS processes and re-running the last command, it all came back. The cluster seems okay again, and the rsync has been running strong.

This is now where we are leaving it. Steve and I have mapped out our next steps.

We will take one of our 3 nodes, which has 34 4 TB drives, and build a temporary ZFS pool.

Then, over the 10 Gb network, we are going to rsync all of the data we currently have, which is around 34 TB.

Once that is done, we will rebuild the remaining two nodes and admin node as a brand new Ceph cluster, running the Hammer release.