Computer science has largely neglected to define a methodology, past the release stage, for delivering ongoing reliability in continuously available distributed systems (system = hardware + software + operators). What follows is a collection of thoughts on this aspect of reliability, and on how we can draw analogies from other industries to fill the gap.
= Elements of reliability
We can break down the problem of offering reliability into the problems of providing resiliency (in the form of fault prevention/tolerance/removal) against:
(1) At the system level:
* Service or node outage (e.g. a failure detector suspects a node as unresponsive, and the cluster layer fences off the node.)
(2) At the service/node level:
* Hardware failure (e.g. storage device failed),
* Software error (e.g. design errors, a new bug fix breaking something, dormant errors, a memory leak crashing a process)
* Input and load error (e.g. resource exhaustion, malformed inputs, unexpected/unhandled workload)
(3) Operations error (e.g. configuration updates, deployment and installation, exception and change management, specification/communication errors)
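The node-outage case in (1) usually starts with a failure detector of the kind mentioned above. As a minimal sketch of the heartbeat-timeout idea (class name, node ids, and the timeout value are all illustrative, not taken from any particular system):

```python
import time

class HeartbeatDetector:
    """Toy failure detector: a node is suspected once it has not sent
    a heartbeat within `timeout` seconds. A real cluster layer would
    then fence the suspected node."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_seen = {}  # node id -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        self.last_seen[node] = time.time() if now is None else now

    def suspects(self, now=None):
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items()
                if now - t > self.timeout]

d = HeartbeatDetector(timeout=3.0)
d.heartbeat("node-a", now=100.0)
d.heartbeat("node-b", now=100.0)
d.heartbeat("node-a", now=104.0)   # node-b goes silent
print(d.suspects(now=105.0))       # -> ['node-b']
```

Real detectors are more subtle (accrual detectors, quorum membership), but the suspect-on-timeout skeleton is the same.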
In theory, all manual interaction for keeping the system running is automated away. In practice this is true only up to a point. While (1) and (2) are hard problems, they are well researched, and solutions exist. The research and tooling for (3) are comparatively lacking. As a result (personal experience and anecdotal evidence suggest), for complex continuously available systems, (3) tends to be the weakest link in the uptime chain.
= The operator/support function is integral to system design
There are three aspects to this:
1. Fundamentally, human beings are error prone, some less than others, but as long as any single part of the system depends on an operator being error-free, there will be uptime holes. Our goal as system builders should be to identify these human dependencies and add processes and tools that minimize the risk from them.
2. Conventionally, the job of responding to monitoring alerts is left to the operator, who may or may not have a diagnosis ready in time. Often, when we analyse a crash after the fact, the data shows we could have “seen it coming”. We need to build systems that alert the operator and engineering team with detailed diagnostic information before the root-cause error becomes observable as system downtime.
3. No matter how well engineered a system is, guaranteeing 100% uptime is hard; things do go wrong. The next best thing is minimizing the downtime period and quickly restoring state, i.e. reducing MTTR (mean time to repair). We need tools and procedures in place for the team to quickly diagnose and then remediate uptime failures, ideally covering all conceivable Byzantine failures as well.
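To make point 2 concrete, here is a sketch of an alert that carries a diagnostic snapshot and suspected causes with it, instead of a bare “service down” page. Every function name, field name, and threshold here is invented for illustration:

```python
import json
import time

def diagnose(metrics):
    """Return likely causes from a metrics snapshot.
    Thresholds are illustrative, not recommendations."""
    causes = []
    if metrics.get("heap_used_pct", 0) > 90:
        causes.append("possible memory leak: heap above 90%")
    if metrics.get("open_fds", 0) > 10_000:
        causes.append("approaching file-descriptor exhaustion")
    return causes

def alert(service, metrics):
    """Build an alert payload that bundles the raw metrics snapshot
    and the suspected causes for the responding engineer."""
    payload = {
        "service": service,
        "ts": time.time(),
        "metrics": metrics,
        "suspected_causes": diagnose(metrics),
    }
    return json.dumps(payload)  # would go to the pager/alert queue

msg = alert("billing", {"heap_used_pct": 93, "open_fds": 120})
print("memory leak" in msg)  # -> True
```

The point is that the diagnosis work happens at alert time, while the signal is fresh, rather than after the operator is paged.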
= Analogies from other industries
Borrowing from other industries, here are some analogies we can draw on to fill this gap:
“Security is a process not a product” – Bruce Schneier
“Uptime is a process not a product” – me
The security industry has historically relied on periodic security audits to ascertain whether any points of attack are vulnerable. Reliability demands a formal “uptime audit”: every time something in the architecture changes, a team brainstorms about what else can fail as a result, how the uptime risk is affected, and whether the failure detection scheme remains sound; it then tests these assumptions with the appropriate failover tests. Reliability also demands uptime processes for responding to application- and system-level monitoring alerts, handling failovers and recovery, and preventing recurrences.
When delivering a horizontally scaled cloud service, or a vertically scaled service with large bursty loads, outages at the hardware or software level are a given. “You need to design for failures.”
Most aircraft have a Pilot's Operating Handbook with instructions on what to do when something fails. Our continuously available systems need well-documented operator manuals too. Anyone who has seen a live system go down or run amok knows that adrenaline runs high during this time, and the operator has to juggle diagnosing the failure and communicating with stakeholders and engineers all at the same time, making it easy to forget a small detail.
At a scaled out web company “platform developers are required to write down detailed procedures called run-books that contain step-by-step information about what needs to be done in the event of a failed node. Every new release mandates an updated run book.”
A 2009 study by researchers from the Harvard School of Public Health found that major complications in operating rooms fell by a third when hospitals introduced a single-page checklist for use by doctors and nursing staff.
We can create similar checklists for handling failures and Byzantine errors, and possibly enforce them using workflow software. This is like documenting code: just as developers inherit code, operations people inherit operations. When people leave, all their knowledge leaves with them.
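A minimal sketch of what “enforcing a checklist with workflow software” could mean: each step must be acknowledged in order before the incident can be closed. The class name and the step wording are invented for illustration:

```python
class Checklist:
    """Toy incident checklist: steps must be acknowledged in order;
    the incident is complete only when every step is done."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.done = set()

    def ack(self, step):
        idx = self.steps.index(step)
        if any(s not in self.done for s in self.steps[:idx]):
            raise RuntimeError(f"earlier steps incomplete before {step!r}")
        self.done.add(step)

    @property
    def complete(self):
        return self.done == set(self.steps)

failover = Checklist([
    "confirm alert is not a false positive",
    "notify stakeholders",
    "fence the failed node",
    "promote the standby",
    "verify service health",
])
failover.ack("confirm alert is not a false positive")
failover.ack("notify stakeholders")
print(failover.complete)  # -> False: three steps remain
```

Even this much structure removes the “forgot a small detail under adrenaline” failure mode, which is the whole point of the surgical checklist study.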
The recent Mars mission used a form of runtime verification that applies temporal-logic assertions to structured log data, primarily for automated testing. We can do the same, but with online runtime information, since our systems already have probes monitoring all sorts of runtime state for us. In some ways, the evolution can be seen as keepalives -> granular monitoring -> health checks -> assertions on trace data. This can surface runtime faults before they become observable any other way. (edit: I started putting these notes together before Splunk, which offers runtime log analytics.)
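As a toy version of such an assertion over trace data, in the spirit of the temporal-logic log checking cited above: “every request start must be followed by a matching end within a deadline.” The event schema and the 500 ms bound are invented for illustration:

```python
def check_bounded_response(events, deadline_ms=500):
    """events: iterable of (ts_ms, kind, request_id) tuples, in time
    order, where kind is "start" or "end". Returns the ids of requests
    that finished too late or never finished at all."""
    started, violations = {}, []
    for ts, kind, rid in events:
        if kind == "start":
            started[rid] = ts
        elif kind == "end":
            if ts - started.pop(rid, ts) > deadline_ms:
                violations.append(rid)
    violations.extend(started)  # started but never ended
    return violations

trace = [(0, "start", "r1"), (100, "end", "r1"),
         (200, "start", "r2"), (900, "end", "r2"),
         (950, "start", "r3")]
print(check_bounded_response(trace))  # -> ['r2', 'r3']
```

Run continuously against a live event stream instead of an archived log, a check like this flags the fault (a stuck or slow request) before it accumulates into visible downtime.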
“Splunk indexes data across all tiers of the IT environment and helps users rapidly perform root cause analysis to determine the source of the problem. Ongame, a leading online gaming company has reduced their downtime by over 30% by using Splunk to gain visibility across 3 production environments”
“The Workplace Complacency Trend in accident prevention is the theory that there is occasionally a level of complacency present in the workplace prior to the occurrence of a major accident. Then, during a span of time following an accident, complacency will eventually return accident prevention efforts to pre-accident levels.” 
Preventing complacency is in large part a people and culture issue not a technical one, and therefore less interesting to most of us, but someone’s got to do it. That someone will probably not be found on the engineering team.
Manufacturing and supply chain:
“No one argues against continuous improvement. The concept of improving results and performance on a continual basis is universally hailed as a great idea. Doing it is another matter.” – a Xerox Lean Six Sigma paper
If one looks at manufacturing defects (say, from operating a lathe) as operational errors, we have four more decades of research to borrow from, culminating in “lean six sigma”, a rigorous, data-driven, results-oriented approach to process improvement. Several technology organizations already hold “root cause analysis” meetings after an incident, to generate long-term solutions through a blame-free conversation.
“At the minimum this means periodically looking at failures that are the most common and trying to automate those recovery processes.”
From our own industry:
Most technology organizations have a rigorous code review process. We need to extend best practices such as peer reviews and testing to configuration files, scripts, and deployments as well.
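One way to extend testing to configuration is to validate a config file against explicit invariants before it ships, exactly as a unit test would. The schema, keys, and sample config below are all invented for illustration:

```python
import json

# Hypothetical schema: required keys and their expected types.
REQUIRED = {"listen_port": int, "max_connections": int, "log_level": str}

def validate(cfg):
    """Return a list of human-readable errors; empty means the
    config passes review-time checks."""
    errors = []
    for key, typ in REQUIRED.items():
        if key not in cfg:
            errors.append(f"missing key: {key}")
        elif not isinstance(cfg[key], typ):
            errors.append(f"{key}: expected {typ.__name__}")
    # Value-level invariants, not just types:
    if isinstance(cfg.get("max_connections"), int) and cfg["max_connections"] <= 0:
        errors.append("max_connections must be positive")
    return errors

cfg = json.loads('{"listen_port": 8080, "max_connections": 0}')
print(validate(cfg))
# -> ['missing key: log_level', 'max_connections must be positive']
```

Wired into the deployment pipeline, a check like this gives config changes the same safety net code already gets from review and CI.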
Second, if we are monitoring well, we can build non-linear models that predict when certain variables will breach a limit based on other measurable/controllable variables, giving system operators better lead time to react.
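To illustrate the lead-time idea with the simplest possible model: fit a trend line to recent samples of a monitored variable (here, disk usage) and extrapolate to the limit. A real system would use richer, possibly non-linear models; the least-squares fit below is just a sketch, and the numbers are made up:

```python
def predict_breach(samples, limit):
    """samples: list of (t, value) pairs. Fits an ordinary
    least-squares line and returns the estimated time t at which
    the value reaches `limit`, or None if the trend is flat or
    decreasing (no breach predicted)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    return (limit - intercept) / slope

# Disk fills at ~2 GB per unit time; 100 GB limit reached at t = 50.
usage = [(0, 0.0), (1, 2.0), (2, 4.0), (3, 6.0)]
print(predict_breach(usage, limit=100.0))  # -> 50.0
```

The gap between “now” and the predicted breach time is the operator's lead time; alerting on that gap shrinking is what turns monitoring into early warning.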
= SA Forum and ITIL
The closest things to a methodology for uptime come from:
1. The SA Forum: offers a system developer perspective. (OpenAIS, the predecessor to Corosync, was originally an implementation of an SA Forum Service Availability interface.)
2. The ITIL framework: offers an organizational IT operations perspective. (Service design processes for capacity management and availability management; and service delivery processes for continuity and availability.)
= The business perspective
Typically, the business will identify uptime goals, and the system design follows from that.
Building in reliability has costs in terms of engineering effort and redundant hardware. Early stage companies with funding constraints might have to make trade-offs in how they allocate resources. As with performance, the business might not want to pay for uptime until the system is proven successful, or until these issues pose an obvious revenue risk. However, also like performance, building reliability into a mature core architecture can be much harder and more expensive than doing it from stage zero or v0.1.
For early stage or unfunded companies, mature processes cost nothing to adopt, yet can still improve uptime figures.
= References
* Barringer et al., “Runtime Verification of Log Files: A Trojan Horse for Formal Methods”
* Havelund et al., “Monitoring the Execution of Spacecraft Flight Software”
* Folk, D.W., “The Workplace Complacency Trend in Accident Prevention”