
What did a broken wheel teach Google site reliability engineers?

Google Cloud reliability advocate Steve McGhee once shared a key truth that the company’s Site Reliability Engineering (SRE) teams have learned.

“At Google’s scale, million-to-one odds happen all the time.”

Sooner or later, this perfect storm of bizarre conditions triggers “complex and emergent failure modes not seen elsewhere,” McGhee wrote in a Google Cloud blog post. “So the SREs within Google have become adept at developing systems to track failures deep into the many layers of our infrastructure….” And that sometimes takes them to surprising places.

But it is also part of a larger phenomenon of treating problems as learning opportunities – something to study, analyze and ultimately share with a global community striving to do better. It’s a favorite pastime for geeks, turning learning into something as easy as swapping stories about the trickiest puzzles ever solved, plus a dollop of advice, and some fun lore and legends from days gone by.

And along the way, you will hear some truly fascinating stories.

The bottom of the stack

Google engineers had encountered an issue on servers that cache frequently accessed content on Google’s low-latency edge network. After swapping in a replacement server, they discovered the old one had been having routing issues, with kernel messages warning of CPU throttling. Google’s hardware team finally pinpointed the source of the problem – and it was surprisingly low-tech.

“The rear wheel casters have failed,” they wrote, “and the machines are overheating due to tilting.” The tilt affected coolant flow, meaning the rack’s broken wheels ultimately ended up “causing some processors to heat up to the point of being throttled.”

The title of the blog post? “Finding a problem at the bottom of the Google stack.” Developer Dilip Kumar later joked on Hacker News that “I can’t think of a better way to use the phrase ‘bottom of the Google stack’.”

But like any good story, there is a lesson to be learned. “Another phrase we use a lot here on SRE teams is that all incidents should be novel,” McGhee wrote – meaning “they should never occur more than once.” In this case, the SRE and hardware operations teams worked together to ensure that this class of failure would never happen again.

The hardware team then suggested future fixes, including wheel repair kits and better procedures for installing and moving racks (to avoid damage). More importantly, they now knew what to look for in other racks that might be on the verge of having their own wheel issues, which “resulted in a systematic replacement of all racks with the same issue,” wrote McGhee, “while avoiding any impact on the customer.”

But in a broader sense, it shows how problems can also be “teachable moments” – and those lessons can run surprisingly deep. McGhee recently co-wrote Enterprise Roadmap to SRE with Google Cloud solutions architect James Brookbank (published earlier this year by O’Reilly).

And at one point, the 62-page report argues that SRE is happening now precisely because “the complexity of internet-based services has clearly increased recently, and most notable is the rise of cloud computing,” a world of “architectural choices that expect failure, because only a subset of components needs to be available for the system to work.”

It requires a new way of thinking.

The lessons continue

McGhee’s story also attracted more tales of “novel” issues when the post first appeared in a discussion on Hacker News. One commenter recalled a colocation setup where “some missing blanking panels in a rack” left the top-of-rack switches recirculating hot air…

“The system temperatures of both switches were north of 100°C, and to their credit (Dell/Force10 S4820Ts) they performed flawlessly and didn’t degrade any traffic, sending the appropriate email notifications.” Something as benign as this can destroy an entire infrastructure if left unchecked.

They went on to say they had heard stories of even worse infrastructure problems. “A data center manager told the story of a rack falling through the raised floor…” (They added that even after the disaster, “it continued to operate until a technician noticed it on a guided tour.”) And that led to another, more philosophical comment: “One of the byproducts of the public cloud era is losing the need to consider the physical side of things when thinking about the art of operations.”

“Don’t people visit their data centers often enough to notice a tilted rack?”

In a community that always strives to learn more, commenters soon wondered why Google seemed to focus its monitoring less on the causes of problems and more on “user-visible” problems. One user eventually tracked down Google’s comprehensive documentation on monitoring distributed systems – covering dashboards, alerts, and a handy glossary of commonly used terms. In effect, it covers both “black-box monitoring” (defined as “testing externally visible behavior as a user would see it”) and “white-box monitoring” – that is, “metrics exposed by the internals of the system, including logs…” Later, the document explains that black-box monitoring can alert you that “the system isn’t working correctly, right now,” ultimately exploring how this fits into a core philosophy that every page should be actionable.

“It’s better to spend much more effort on catching symptoms than causes; when it comes to causes, only worry about very definite, very imminent causes.”

And Google site reliability engineer Rob Ewaschuk once shared his own philosophy, writing that cause-based alerts “are bad (but sometimes necessary),” while arguing that symptom-based alerts help SREs focus on resolving the “alerts that matter.”
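To make that distinction concrete, here is a minimal sketch of what a symptom-based, black-box probe might look like – written in Python purely as an illustration, not Google’s actual tooling, with a made-up endpoint URL, latency threshold and alert hook. It checks only what a user would see (errors and slowness) and says nothing about internal causes:

# A minimal black-box, symptom-based probe (illustrative sketch only).
# It checks what a user would see – availability and latency – and stays
# silent about internal causes. The URL, threshold and alert hook below
# are hypothetical stand-ins.
import time
import urllib.error
import urllib.request

SERVICE_URL = "https://example.com/healthz"  # hypothetical user-facing endpoint
LATENCY_THRESHOLD_SECONDS = 1.0              # hypothetical "too slow for users" limit

def probe(url):
    """Fetch the URL the way a user would; report success, latency and detail."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            ok, detail = True, f"HTTP {response.status}"
    except urllib.error.HTTPError as exc:  # the server answered with an error code
        ok, detail = False, f"HTTP {exc.code}"
    except OSError as exc:                 # DNS failure, timeout, connection refused...
        ok, detail = False, f"request failed: {exc}"
    return ok, time.monotonic() - start, detail

def alert(message):
    # Stand-in for a real pager or notification integration.
    print(f"ALERT: {message}")

if __name__ == "__main__":
    ok, latency, detail = probe(SERVICE_URL)
    if not ok:
        alert(f"{SERVICE_URL} is failing user requests ({detail})")
    elif latency > LATENCY_THRESHOLD_SECONDS:
        alert(f"{SERVICE_URL} is slow for users: {latency:.2f}s")
    # Otherwise, no page: causes (CPU throttling, a broken caster) belong on
    # dashboards and in debugging sessions, not in the pager.

Run from a few places outside the serving infrastructure, a probe like this pages on the symptom – users can’t load the page, or it’s too slow – regardless of whether the cause turns out to be a bad deploy, a saturated link or a wobbly rack.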

The bugs of legend

It is perhaps proof that discussions of tricky incidents lead in surprisingly productive directions.

System administrator Evan Anderson once fixed an intermittent Wi-Fi connection that only dropped during “one particular recurring lunchtime meeting,” he recalled on Hacker News – a meeting “scheduled just as a steady stream of workers went to the break room across the hall and reheated food in a microwave…”

And then there’s the legendary ticket complaining that “OpenOffice doesn’t print on Tuesdays.” (That one turned out to involve a file-type-detection utility on the system that sniffed the headers inside files – headers which often contained the day of the week – and misidentified the format of a PostScript file, but only on Tuesdays.)
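As a purely hypothetical illustration of how that kind of failure can happen – the offsets, patterns and header layout below are invented, not the actual rules from that bug report – here is a tiny Python sketch of header-based type detection that misfires only when the weekday abbreviation lands in the wrong place:

# Toy magic-pattern file-type detection (hypothetical rules, for illustration only).
# Rules are checked in order; each matches a byte pattern at a fixed offset.
MAGIC_RULES = [
    (5, b"Tue", "LegacyFooFormat data"),  # overly broad rule: just three bytes at offset 5
    (0, b"%!", "PostScript document"),
]

def detect_type(data: bytes) -> str:
    for offset, pattern, name in MAGIC_RULES:
        if data[offset:offset + len(pattern)] == pattern:
            return name
    return "unknown"

# Made-up print jobs whose banner line embeds the creation date:
monday_job = b"%!PS Mon Mar 11 12:00:00 2024 ...rest of the job..."
tuesday_job = b"%!PS Tue Mar 12 12:00:00 2024 ...rest of the job..."

print(detect_type(monday_job))   # PostScript document
print(detect_type(tuesday_job))  # LegacyFooFormat data - misdetected, but only on Tuesdays

The point of the sketch is simply that header-sniffing rules match bytes, not intent – so anything that varies with the calendar can leak into the result.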

Software architect Andreas Zwinkau has collected dozens of similar stories on his “software folklore” site – and IT professionals are always ready to join the conversation with more stories of their own.

But it is more than just proof that you never know where an investigation will lead. In a data-rich business – with all its alerts, notifications, pages and dashboards – there is still only so much that can be automated. So there will always be a role for the very human faculties of inquiry and intuition, for skill paired with curiosity.

And then, afterwards, for that all-important celebratory storytelling to share what you have learned.

