Building Trust in Someone Else’s Computers: 4 Essential Elements of Cloud Resilience and Reliability

April 13, 2023
Mika Bostrom

Very few companies want to maintain their own computers—it’s just not part of their competitive advantage. Eventually the cloud came along and made it feasible to use slices of someone else’s computers as virtual machines. Using cloud applications and services requires having trust in those other computers. You need to be confident that the system is resilient enough to quickly recover from various types of failures, such that the expected services continue to function reliably.

Building a cloud-native platform for financial services means delivering on 4 essential elements that make cloud computing resilient and reliable. Workloads and supporting components must be:

Ephemeral: Short-lived or temporary, not persisting for a long time
Immutable: Unchanging or unable to be changed
Replicable: Able to be copied or reproduced exactly
Observable: Able to be noticed or seen

Ephemeral

One of the core ideas of the cloud, an extension of “elastic computing”, is that you spin up resources only when you need them and release or remove them when they are no longer needed. But the concept of short-lived or ephemeral resources is much broader than that, and lies at the heart of cloud resilience. The assumption that individual nodes or workloads can vanish at any time means that the system as a whole must be prepared to rapidly notice and recover from such a sudden disappearance. Technology companies have learned these lessons well, and have turned the natural fragility of computers and networks into architectures that expect failures to happen at any time. Modern cloud platforms manage ephemeral resources programmatically and automatically: replacements are often up and running before users notice anything has gone wrong. Those same capabilities are used to add or remove resources based on user demand, matching capacity to load while keeping costs under control.
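The notice-and-recover behavior described above can be pictured as a reconciliation loop. The following is a minimal sketch under stated assumptions, not any cloud provider's actual API: the `Fleet` class, its `reconcile` method, and the `vm-N` naming scheme are all hypothetical, and the "fleet" is just an in-memory set.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Fleet:
    """Hypothetical in-memory model of a group of ephemeral instances."""
    desired: int
    running: set = field(default_factory=set)
    _ids: itertools.count = field(
        default_factory=lambda: itertools.count(1), repr=False
    )

    def launch(self) -> str:
        """Start a fresh instance and return its (made-up) name."""
        name = f"vm-{next(self._ids)}"
        self.running.add(name)
        return name

    def reconcile(self, healthy: set) -> list:
        """Forget instances that vanished, then move toward the desired count."""
        # Anything that no longer reports healthy is treated as gone.
        self.running &= healthy
        actions = []
        # Too few instances: replace the ones that disappeared.
        while len(self.running) < self.desired:
            actions.append(("launch", self.launch()))
        # Too many instances: release what is no longer needed.
        while len(self.running) > self.desired:
            victim = sorted(self.running)[-1]
            self.running.remove(victim)
            actions.append(("terminate", victim))
        return actions
```

Calling `reconcile` repeatedly with the latest set of healthy instances covers both cases from the paragraph: a failed instance is replaced, and a lower `desired` value releases capacity. The same loop handles failure and changing demand because it only compares what exists with what should exist.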

Immutable

Adding and removing resources quickly, whether as a result of a failure or just a change in demand, requires having systems that are clearly defined and unchanging, or immutable. Cloud providers offer a variety of mechanisms in support of immutable systems, such as user-defined and pre-built system images, separation of code and data, and robust change controls. If a workload needs to be replaced, clients can be confident that it will come from an exact copy of the one it is replacing. Naturally these images must be updated often, and change controls ensure that updates are appropriately reviewed, tested, and approved before they are made available to production. Once a new image is available, workloads created from older images are cycled out and new ones brought in. In addition to being essential to reliability, immutable and ephemeral systems also come with beneficial security properties. When compromising systems, attackers want persistence. With individual hosts coming and going, any foothold gained in the systems is continuously lost. And because data and workloads are kept separate, the data is not at risk.
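One way to picture immutable images and the cycling-out of older workloads is to identify each image by a digest of its contents. This is an illustrative sketch only: the `Image` class, the `cycle_out` function, and the workload names are hypothetical, and real platforms track images and rollouts with far more machinery (review, testing, and approval steps are assumed to have happened before `approved` is passed in).

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Image:
    """An image that cannot be modified after creation (frozen dataclass)."""
    contents: bytes

    @property
    def digest(self) -> str:
        # Identical contents always produce the identical digest,
        # so the digest names "an exact copy" unambiguously.
        return hashlib.sha256(self.contents).hexdigest()


def cycle_out(
    workloads: Dict[str, str], approved: Image
) -> Tuple[Dict[str, str], List[str]]:
    """Return a new workload map plus the names that had to be replaced.

    `workloads` maps a workload name to the digest of the image it was
    created from. Workloads on an older image are replaced, never patched,
    and the input map is left untouched.
    """
    replaced = [name for name, d in workloads.items() if d != approved.digest]
    updated = {name: approved.digest for name in workloads}
    return updated, replaced
```

Because nothing is modified in place, a workload either runs the approved image or is scheduled for replacement. There is no partially updated state to reason about.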

Replicable

Immutable images make it possible to rapidly scale compute resources by adding as many copies of virtual machines as are needed. But this replicability also applies to delivering resilient, reliable, and high-performance data. In a distributed database, each node holds a full or partial copy of the data, which spreads the read load and lets more workloads access the data they need without running into delays or conflicts. In the finance industry, we often see data replicated and distributed within a region, or even across regions for greater fault tolerance. If a single data store fails, only the workloads using that store need to be restarted. Modern cloud systems will often break the data up into smaller segments or “shards”, and distribute those shards across a larger number of nodes. To ensure that a complete set of data is always available, shards often contain overlapping and replicated segments of data.
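The sharding and replication idea can be sketched in a few lines. This is a simplified illustration, not how any particular database places data: the function names, the node names, and the round-robin placement rule are all assumptions chosen for brevity.

```python
import hashlib
from typing import Dict, List, Set


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def placement(num_shards: int, nodes: List[str], replicas: int) -> Dict[int, List[str]]:
    """Place each shard on `replicas` distinct nodes, round-robin."""
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    return {
        shard: [nodes[(shard + i) % len(nodes)] for i in range(replicas)]
        for shard in range(num_shards)
    }


def readable_shards(layout: Dict[int, List[str]], failed: Set[str]) -> Set[int]:
    """Shards that still have at least one replica on a surviving node."""
    return {
        shard
        for shard, holders in layout.items()
        if any(node not in failed for node in holders)
    }
```

With two replicas per shard, the loss of any single node leaves every shard readable in this layout. Losing two nodes that hold the same shard does not, which is why replica counts and placement across zones or regions are chosen according to the fault tolerance required.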

Observable

When everything is running in the cloud, the physical machines are remote and the ephemeral virtual machines keep popping in and out of existence. The ability to see how everything is running is essential, from the machine and operating system to the workloads, applications, and data flows. Observability covers a wide range of tasks, such as performance monitoring, error tracking, and application telemetry. In traditional or legacy architectures, when a system misbehaves, system administrators expect to be able to log into the physical machine and inspect it in real time. With cloud computing, management dashboards and logging tools provide a similar level of information about the virtual machines and workloads running on them. Orchestration systems manage resource allocation, automatically scaling machine groups up and down according to the client’s configuration and restarting workloads as needed. Logging, monitoring, and alerting tools provide clients with detailed information on the status of their cloud domain.
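To make the link between observation and orchestration concrete, here is a small sketch of how monitored CPU figures might feed a scaling decision. It is a simplified illustration under stated assumptions: the `ScalingPolicy` class, its fields, and the proportional rule are hypothetical and do not represent a specific provider's autoscaler.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical client configuration for a machine group."""
    min_size: int
    max_size: int
    target_cpu_percent: int  # average utilization the group should run at


def desired_size(policy: ScalingPolicy, cpu_by_instance: Dict[str, int]) -> int:
    """Pick a group size that would bring average CPU back to the target.

    `cpu_by_instance` maps an instance name to its observed CPU percentage.
    The total observed load divided by the per-instance target gives the
    number of instances needed; integer arithmetic rounds it up.
    """
    if not cpu_by_instance:
        return policy.min_size
    total = sum(cpu_by_instance.values())
    target = policy.target_cpu_percent
    needed = (total + target - 1) // target  # ceiling division
    # Never go outside the bounds the client configured.
    return max(policy.min_size, min(policy.max_size, needed))
```

For example, four instances each at 90% CPU against a 60% target yield a desired size of six, while the same four instances at 10% fall back to the configured minimum. The decision is only as good as the telemetry behind it, which is why observability comes before automation.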

Building trust by assuming that everything breaks

Starting with the assumption that anything and everything will break helps cloud providers and cloud-native platforms deliver systems that are highly tolerant of individual component failures. Fully automated build and refresh of workloads, based on immutable images, containerized architecture, and separate data stores, all support rapid recovery, with return-to-service times measured in seconds and sometimes even less. Secure and segregated client environments provide each client with dashboards and controls to configure and manage their own system, independent of anyone else. This separation of tasks and responsibilities is part of the cloud’s shared responsibility model between the infrastructure provider, the service or application provider, and the client. Clearly defining roles and responsibilities, and leveraging each participant’s capabilities, results in systems that deliver the desired levels of resilience and reliability.