The leader election and failure detection mechanisms are fairly mature, and typically just work… until they don't. How can this be? Well, after a lengthy investigation, we managed to uncover four different bugs coming together to conspire against us, resulting in random cluster-wide lockups. Two of those bugs lay in ZooKeeper, and the other two were lurking in the Linux kernel. This is our story.
I love reading bug investigation write-ups like this.