A few months ago, a routine deployment in a core service of M3, our open source metrics and monitoring platform, caused a doubling in overall latency for collecting and persisting metrics to storage [...]. Mitigating the issue was simple - we just reverted to the last known good build, but we still needed to figure out the root cause so we could fix it.
And it's a doozy. Great post on some in-depth debugging.