Master the Fundamentals of Availability for Systems Design Interview


Knowing about fundamental concepts of availability is crucial for systems design. But high availability comes with a cost. One important point we must remember when designing a system is that every system design decision will have a trade-off. There is no silver bullet, but there are technologies that will solve specific problems.

Therefore, if the system you are designing performs critical operations, such as payments, you need to make design decisions that keep it highly available.

Why Is System Availability Important?

Imagine you use Slack at your company and rely on it even more because you work remotely. Then the service suddenly goes down for many hours, and you cannot communicate with your team. That would likely delay your work.

Imagine if the Amazon e-commerce website goes down for 1 hour. How many millions would they lose? Also, how would their reputation be harmed in this process?

Stripe is another excellent example: it is a SaaS (Software as a Service) company that processes payments for other companies. If Stripe goes down, its customers lose money.

Let’s list the main reasons why availability is essential:

Business Continuity: Availability is vital for businesses as it ensures uninterrupted operations by keeping systems and services operational. This high availability is essential for maintaining productivity, meeting customer demands, and preventing financial losses due to downtime.

Customer Satisfaction: Customers expect reliable and accessible services. When systems are consistently available, customers can access products, services, or information without inconvenience or delays. High availability enhances the customer experience, builds trust, and fosters customer satisfaction.

Revenue Generation: Many businesses use their systems to generate revenue. For example, E-commerce platforms, online services, and digital marketplaces require continuous availability to process transactions and serve customers. Downtime can lead to lost sales opportunities and revenue. High availability ensures that revenue-generating systems are accessible, minimizing the risk of financial losses.

Data Integrity and Security: Availability is closely tied to data integrity and security. Systems with high availability typically incorporate measures such as data replication, backups, and disaster recovery mechanisms. These safeguards protect against data loss, ensure data integrity, and help maintain the confidentiality of sensitive information.

Compliance and Legal Requirements: In specific industries, businesses must adhere to regulations and legal requirements regarding system availability and data protection. Non-compliance can lead to penalties, legal consequences, and reputational damage. By maintaining availability, organizations can meet these obligations and demonstrate their commitment to data privacy and security.

Competitive Advantage: Availability can be a competitive differentiator. Businesses that provide reliable and accessible services have an advantage over competitors experiencing frequent downtime or performance issues. Customers are more likely to choose and remain loyal to businesses that offer dependable and available systems.

Availability Legal Agreement

When we develop web systems for other customers, we need to agree on the application’s availability. Customers need the application to be available for their businesses. That’s why we have to create agreement documents to build trust with our customers, and, of course, if what we wrote in the document is not met, there will be penalties. Therefore, it’s essential to make sure we provide the agreed availability.

Now let’s see how those documents and metrics work.

SLA – Service Level Agreement

SLA stands for Service Level Agreement. It is a contractual agreement between a service provider and a customer that defines the expected level of service and outlines the responsibilities of both parties. Companies use SLAs in various business arrangements, mainly IT services, cloud computing, telecommunications, and outsourcing.

The purpose of an SLA is to establish clear expectations and provide a framework for measuring and managing the performance of services. It typically includes the following components:

Service Description: The SLA begins by defining the services we provide. Therefore, we must include a detailed description of the service’s scope, features, and functionality. It outlines what the service provider will deliver to the customer.

Service Levels and Metrics: The SLA specifies the expected service performance levels, such as availability, response time, resolution time, or throughput. It also defines the metrics and measurement methods we will use to assess and monitor the service performance.

Responsibilities and Roles: The SLA clearly outlines the responsibilities of both the service provider and the customer. It defines who is accountable for what tasks, actions, and deliverables. This agreement ensures clarity and alignment in terms of roles and expectations.

Performance Targets: The SLA sets specific performance or service level targets (SLTs) that the service provider must meet. For example, an SLA might state that the service should be available 99.9% of the time or that response times should be within a specific timeframe.

Reporting and Monitoring: The SLA specifies the reporting and monitoring mechanisms we will use to track the service provider’s performance. It may include requirements for regular reporting, key performance indicators (KPIs), performance dashboards, or other monitoring tools.

Escalation and Dispute Resolution: The SLA includes procedures for handling issues, escalations, and dispute resolution. It outlines the steps to follow when there are service disruptions, breaches of the SLA, or disagreements between the parties.

Remedies and Penalties: The SLA may outline remedies and penalties that apply if the service provider fails to meet the agreed-upon service levels. These could include financial penalties, service credits, or other forms of compensation.

The SLA document is a legal agreement stating how available our applications will be. If the company does not meet the SLA, there are consequences: the company providing the service might have to pay a penalty.
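To make the idea of penalties concrete, here is a minimal Python sketch of how a service credit could be derived from the measured monthly uptime. The tier thresholds and credit percentages are hypothetical and chosen only for illustration; every real SLA defines its own values.

```python
# Hypothetical credit tiers: (minimum monthly uptime %, credit as % of the bill).
# Real SLAs define their own thresholds; these values are for illustration only.
CREDIT_TIERS = [
    (99.9, 0),   # SLA met: no credit
    (99.0, 10),
    (95.0, 25),
    (0.0, 100),
]


def monthly_uptime_percent(total_minutes: int, downtime_minutes: float) -> float:
    """Percentage of the month during which the service was up."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes


def service_credit_percent(uptime_percent: float) -> int:
    """Return the credit for the first tier whose minimum uptime was reached."""
    for minimum_uptime, credit in CREDIT_TIERS:
        if uptime_percent >= minimum_uptime:
            return credit
    return 100


minutes_in_30_days = 30 * 24 * 60
uptime = monthly_uptime_percent(minutes_in_30_days, downtime_minutes=120)
print(f"uptime {uptime:.3f}% -> credit {service_credit_percent(uptime)}%")
```

With two hours of downtime in a 30-day month, the uptime falls just below the hypothetical 99.9% commitment, so the first credit tier applies.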

We can memorize some of the information in this document. Knowing the general idea will help when we need to provide a cloud solution and agree on how available the application will be.

Let’s have a look at the SLA from Amazon API Gateway:

Amazon API gateway SLA diagram

You can check it further in the following link:
https://aws.amazon.com/api-gateway/sla

SLO – Service Level Objective

SLO stands for Service Level Objective. It is a target or goal set by a service provider to define the desired level of performance or quality for a particular service. SLOs are typically defined within the context of a Service Level Agreement (SLA) and provide specific, measurable metrics that the service provider aims to achieve.

Here are some critical aspects of SLOs:

Performance Metrics: SLOs are defined using specific performance metrics relevant to the service provided. These metrics can vary depending on the nature of the service, but common examples include availability, response time, throughput, error rate, or latency. The selection of appropriate metrics depends on the service’s objectives and the customer’s expectations.

Quantifiable Targets: SLOs set quantifiable targets or thresholds for the defined performance metrics. For example, an SLO for a web application’s response time might state that 95% of requests should get a response within 200 milliseconds. These targets provide a clear benchmark against which we can measure the service provider’s performance.

Measurable and Monitorable: SLOs should be measurable and monitorable, meaning there should be mechanisms to collect relevant data and track performance against the defined targets. This often involves using monitoring tools, analytics, or performance measurement systems to monitor the service’s performance continuously.

Alignment with Customer Expectations: We design SLOs to align the service provider’s performance with the customer’s expectations. They consider the needs, priorities, and requirements of the customer, ensuring that the service meets or exceeds their desired level of performance. SLOs help set clear expectations and provide a basis for evaluating the service provider’s performance.

Continuous Improvement: SLOs are not static; they evolve over time. As technology advances, customer expectations change, and business requirements evolve, we must review and update SLOs to reflect new targets and goals. Continuous monitoring and analysis of performance data help identify areas for improvement and drive ongoing optimization of the service.
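The response-time target mentioned above (95% of requests answered within 200 milliseconds) can be checked with a few lines of Python. This is only a sketch: the function names and the sample latencies are made up for illustration, and a real system would read these values from its monitoring tools.

```python
def fraction_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests answered within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests measured")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


def meets_latency_slo(
    latencies_ms: list[float], threshold_ms: float = 200.0, target: float = 0.95
) -> bool:
    """True when at least `target` of the requests met the latency threshold."""
    return fraction_within(latencies_ms, threshold_ms) >= target


# Hypothetical sample: 19 fast requests and 1 slow one.
sample = [120.0] * 19 + [450.0]
print(fraction_within(sample, 200.0))
print(meets_latency_slo(sample))
```

In this sample exactly 95% of the requests are fast enough, so the objective is met with no margin; one more slow request would break it.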

SLI – Service Level Indicator

SLI stands for Service Level Indicator. It is a quantitative measurement or metric that provides objective data about the performance or quality of a service. SLIs are used to monitor and evaluate the actual performance of a service and are often defined within the context of a Service Level Agreement (SLA) or Service Level Objective (SLO).

SLIs are essential for objectively measuring and evaluating the performance of a service. They provide a quantifiable and objective basis for assessing the service’s quality, ensuring that it meets the agreed-upon targets and aligns with customer expectations. By monitoring SLIs, service providers can identify areas for improvement, make data-driven decisions, and continuously enhance the service’s performance.
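As a rough illustration of how an SLI relates to an SLO, the Python sketch below computes a request-based availability indicator and compares it with a target. The `RequestCounts` type, the 99.9% target, and the traffic numbers are assumptions made for this example, not values taken from any real service.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    total: int
    failed: int  # e.g. requests that ended in a server-side error


def availability_sli(counts: RequestCounts) -> float:
    """Measured availability: percentage of requests that succeeded."""
    if counts.total == 0:
        raise ValueError("no traffic in the measurement window")
    return 100 * (counts.total - counts.failed) / counts.total


SLO_TARGET_PERCENT = 99.9  # hypothetical objective

month = RequestCounts(total=1_000_000, failed=700)
sli = availability_sli(month)
print(f"SLI: {sli:.3f}%  SLO met: {sli >= SLO_TARGET_PERCENT}")
```

The SLI is the measured number, the SLO is the target it is compared against, and the SLA is the contract that says what happens when the target is missed.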

Nines of Availability

One of the most important metrics in the SLA is the uptime agreement. If we are talking about critical systems, we need to have high availability. The gold standard in the market is five nines, which means only 5.26 minutes of downtime per year.

However, if we don’t need the application to be available all the time, three nines will be more than sufficient. There is a trade-off when choosing to go for five nines; therefore, it’s better to choose an availability level that suits the system.

These are the most common availability percentages used by applications in the market, with the downtime they allow per year:

Availability Percentage | Downtime per Year
99.9% (“three nines”) | 8.77 hours
99.99% (“four nines”) | 52.60 minutes
99.999% (“five nines”) | 5.26 minutes
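The figures in the table above follow directly from the availability percentage. The short Python sketch below reproduces them, assuming an average year of 365.25 days; the function name is just an illustrative choice.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def yearly_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    minutes = yearly_downtime_minutes(pct)
    print(f"{pct}% ({label}): {minutes:.2f} minutes (~{minutes / 60:.2f} hours)")
```

Each additional nine divides the allowed downtime by ten, which is why moving from three to five nines is such a large step.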

Now let’s see all the other nines in the following image:

Nines of Availability percentages table

Source from Wikipedia in the following link: https://en.wikipedia.org/wiki/High_availability

Redundancy

Redundancy refers to the duplication of critical components, systems, or processes within a larger system or organization. It involves having backup or redundant elements in place to ensure continued operation and mitigate the impact of failures or disruptions. The purpose of redundancy is to enhance reliability, fault tolerance, and resilience.

Here are a few key points about redundancy:

Backup and Failover: Redundancy often involves having duplicate components or systems that can take over the functionality of the primary component or system in the event of a failure. For example, in a computer network, redundant network switches or servers can act as backup devices and seamlessly take over if the primary ones fail.

Fault Tolerance: Redundancy enhances fault tolerance by minimizing the impact of failures. When redundant components or systems are in place, the failure of one element does not lead to a complete system failure. Instead, the redundant element can step in and maintain system operation, ensuring continuity and minimizing downtime.

Reliability and Resilience: Redundancy improves the overall reliability and resilience of a system. By having redundant elements, the system becomes less vulnerable to single points of failure. It can withstand failures, disruptions, or malfunctions without significant impact, ensuring that critical functions and services remain available.

Data Redundancy: In the context of data storage and backup, redundancy refers to storing multiple copies of data across different storage devices or locations. This ensures that if one copy is lost or corrupted, there are still additional copies available for recovery. Redundant data storage helps protect against data loss and increases data reliability.

Cost and Complexity: Implementing redundancy often comes with additional costs and complexity. Redundant components or systems require additional resources, such as hardware, infrastructure, or maintenance. Managing and synchronizing redundant elements can also introduce complexity in system design and maintenance.

Redundancy is a commonly used strategy in various domains, including information technology, telecommunications, power distribution, transportation, and disaster recovery. It helps organizations ensure continuous operation, minimize downtime, and increase the reliability and resilience of their systems.
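A back-of-the-envelope calculation shows why redundancy helps. The Python sketch below uses the standard formulas for components in parallel (one working copy is enough) and in series (every component must be up). It assumes that failures are independent, which real systems only approximate, and the 99% figure is a hypothetical component availability.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components when one working copy is enough.

    Assumes failures are independent, which real systems only approximate.
    """
    return 1 - (1 - component_availability) ** copies


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


# A hypothetical component that is available 99% of the time.
for copies in (1, 2, 3):
    print(copies, f"{parallel_availability(0.99, copies):.6f}")

# Two such components chained together, with no redundancy.
print(f"{serial_availability(0.99, 0.99):.4f}")
```

Under these assumptions, each extra copy of a 99% component adds two nines, while chaining two non-redundant components lowers the overall availability below that of either one.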

Passive redundancy

Imagine you are a student who needs to submit a very important assignment online. The submission system is crucial for you to complete your task successfully. In this case, passive redundancy is like having a backup plan to ensure that your assignment gets submitted even if something goes wrong with the primary submission system.

Here’s how it works:

Primary Submission System: The primary submission system is the main system you use to submit your assignment. It’s the system you rely on and interact with initially.

Backup Submission System: The backup submission system is like a spare or backup option that is ready to take over if the primary system fails or has a problem. It’s not actively processing submissions, but it’s there as a standby.

Standby Mode: The backup submission system is in a standby mode, patiently waiting for any issues to occur with the primary system. It’s not actively processing submissions or doing anything until it’s needed.

Failover Process: If the primary submission system encounters a problem or fails, a failover process is triggered. This process involves activating the backup system and diverting your assignment submission to it.

Continued Service: Once the failover process is completed, the backup submission system takes over seamlessly, allowing you to submit your assignment without any disruption. From your perspective as a student, it appears as if nothing went wrong, and your assignment gets successfully submitted.

Recovery Time: In case of a failure or problem with the primary system, the recovery time is the time it takes for the backup system to become active and start accepting submissions. During this time, you may experience a brief delay or interruption, but the backup system ensures that the service is restored as quickly as possible.
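The steps above can be sketched in Python as a primary server with an idle standby. The class names and the in-memory "servers" are hypothetical stand-ins; a real failover would involve health checks, DNS or load balancer changes, and data synchronization that are not shown here.

```python
class SubmissionServer:
    """Hypothetical stand-in for a real submission backend."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def submit(self, assignment: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{assignment} accepted by {self.name}"


class PassiveFailover:
    """Sends all traffic to the active server; the standby stays idle until failover."""

    def __init__(self, primary: SubmissionServer, standby: SubmissionServer) -> None:
        self.active = primary
        self.standby = standby

    def submit(self, assignment: str) -> str:
        try:
            return self.active.submit(assignment)
        except ConnectionError:
            # Failover: promote the standby and retry the request once.
            self.active, self.standby = self.standby, self.active
            return self.active.submit(assignment)


primary = SubmissionServer("primary")
backup = SubmissionServer("backup")
system = PassiveFailover(primary, backup)

print(system.submit("essay-1"))  # handled by the primary
primary.healthy = False          # simulate an outage
print(system.submit("essay-2"))  # handled by the backup after failover
```

In this sketch the failover is instantaneous. In practice, detecting the failure and activating the standby takes time, and that delay is the recovery time described above.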

Active Redundancy

Imagine that you have an assignment submission system that utilizes active redundancy to ensure a smooth and uninterrupted submission process. Here’s how it works:

Active Components: In active redundancy, there are multiple active components or systems working simultaneously to handle the assignment submissions. These components are actively processing and delivering services concurrently.

Load Balancing: The active components distribute the workload evenly among themselves through load balancing. This ensures that each component shares the processing load and can handle a portion of the submissions efficiently.

Seamless Failover: If one of the active components fails or experiences an issue, the workload is automatically shifted to the remaining active components without any interruption or impact on the submission process. The failover process is seamless and transparent to you as a student.

High Availability: With active redundancy, the submission system maintains high availability, meaning it remains operational and accessible even if one or more active components encounter problems. The remaining active components continue to process submissions, ensuring that the service remains uninterrupted.

Increased Performance: Active redundancy can also improve the performance of the system. By distributing the workload among multiple active components, the overall processing capacity and throughput are increased. This allows for faster and more efficient handling of assignment submissions.

Redundancy Monitoring: In an actively redundant system, there is continuous monitoring of the health and performance of the active components. This monitoring allows for proactive detection of any issues or failures, enabling quick remediation and failover processes to ensure uninterrupted service.

In this scenario, active redundancy in the assignment submission system ensures that multiple components are actively working together to handle submissions. If one component fails, the others seamlessly take over the workload to ensure continuous availability and smooth operation. This redundancy configuration enhances reliability, fault tolerance, and performance, providing you with a reliable and efficient submission experience.
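As a minimal Python sketch of this idea, the example below rotates submissions across several active nodes and skips any node marked unhealthy. The `Node` and `RoundRobinBalancer` classes are hypothetical, and the health flag is set by hand here, where a real system would rely on continuous monitoring.

```python
from itertools import cycle


class Node:
    """Hypothetical worker that processes submissions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True
        self.handled = 0

    def handle(self, assignment: str) -> str:
        self.handled += 1
        return f"{assignment} accepted by {self.name}"


class RoundRobinBalancer:
    """Rotates requests across all nodes and skips the ones marked unhealthy."""

    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = nodes
        self._rotation = cycle(nodes)

    def submit(self, assignment: str) -> str:
        # Try each node at most once per request.
        for _ in range(len(self.nodes)):
            node = next(self._rotation)
            if node.healthy:
                return node.handle(assignment)
        raise RuntimeError("no healthy nodes available")


nodes = [Node("a"), Node("b"), Node("c")]
balancer = RoundRobinBalancer(nodes)

for i in range(3):
    balancer.submit(f"essay-{i}")  # one request lands on each node

nodes[1].healthy = False           # node b fails
results = [balancer.submit(f"essay-{i}") for i in range(3, 7)]
print(results)
print({node.name: node.handled for node in nodes})
```

After node b fails, the remaining nodes absorb its share of the traffic without the caller noticing. Note that the balancer in this sketch is itself a single component, so in a real deployment it would need to be redundant as well.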

What is the Cost of Highly-Available Systems?

The cost of high-availability can vary depending on several factors, such as the specific requirements of the system, the level of redundancy needed, the technologies utilized, and the scale of the infrastructure. Here are some aspects to consider when evaluating the cost of high-availability:

Hardware and Infrastructure: High-availability often requires redundant hardware components and infrastructure to ensure continuous operation. This may include redundant servers, storage systems, network equipment, and power supplies. The cost of these components can vary significantly based on the required capacity, performance, and reliability.

Software and Licensing: High-availability solutions may involve specialized software, such as load balancers, clustering software, or failover mechanisms. The cost of these software licenses can add to the overall expense.

Network Connectivity: Redundant network connectivity is crucial for high-availability systems. It may involve multiple internet service providers (ISPs), redundant network links, or even geographically diverse data centers. The cost will depend on the chosen providers, the required bandwidth, and the complexity of the network setup.

Data Replication and Storage: High-availability often involves replicating data across multiple systems or data centers to ensure redundancy and eliminate single points of failure. The cost will depend on the amount of data, the replication method used (synchronous or asynchronous), and the storage infrastructure required.

Monitoring and Management: Effective high-availability systems require robust monitoring and management tools to detect issues, perform failovers, and maintain system health. The cost may include licensing fees for monitoring software, employing dedicated staff, or outsourcing these tasks to a managed service provider.

Staffing and Expertise: Maintaining a high-availability system usually requires skilled staff with expertise in system administration, network management, and troubleshooting. The cost will depend on the size of the team required and their level of expertise.

Downtime Impact: It’s important to consider the potential cost of downtime and the impact it may have on your business. High-availability solutions aim to minimize downtime, which can help avoid financial losses due to service disruptions, loss of productivity, or damage to the organization’s reputation.
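One way to weigh these costs is to estimate what the allowed downtime would cost the business at each availability level. The Python sketch below does that with a hypothetical revenue of 10,000 per hour; the figure is an assumption made purely for illustration, and it ignores indirect costs such as reputation damage.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year, including leap years


def yearly_downtime_cost(availability_percent: float, revenue_per_hour: float) -> float:
    """Revenue lost per year if the service is down for the full allowed time."""
    downtime_hours = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return downtime_hours * revenue_per_hour


REVENUE_PER_HOUR = 10_000.0  # hypothetical figure

for pct in (99.9, 99.99, 99.999):
    cost = yearly_downtime_cost(pct, REVENUE_PER_HOUR)
    print(f"{pct}%: about {cost:,.0f} lost per year")
```

If the extra infrastructure, software, and staff needed to gain one more nine cost more than the revenue it protects, the lower availability level may be the better trade-off.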

Summary

Availability is a crucial characteristic in systems design, and it makes a big difference in the technologies we choose. A highly available system requires many more resources, so it is more expensive and more complex. Let’s review the key points of availability:

  • The more available the system is, the more expensive and complex it will be.
  • The nines of availability express the percentage of the year during which an application is available.
  • 3 Nines have the yearly downtime of 8.77 hours.
  • 4 Nines have the yearly downtime of 52.60 minutes.
  • 5 Nines have the yearly downtime of 5.26 minutes.
  • SLA means Service Level Agreement and is an agreement document between the service provider and the customer.
  • SLO means Service Level Objective: the measurable targets, such as an availability percentage, that the provider aims to meet, usually defined within the SLA.
  • SLI means Service Level Indicator: the measured data showing how available the service actually is and how well it is performing.
  • To make an application highly available, it’s necessary to have redundancy. This means that services have to be replicated, and so does the load balancer. Otherwise, we will have a single point of failure.
  • Redundancy involves having backup or duplicate systems, components, or resources in place.
  • Redundancy ensures the availability, continuity, and reliability of the system or process.
  • Redundancy can be applied to various areas, such as data storage, power supply, and networking.
  • Redundancy is commonly used in critical systems, industries, and infrastructure where failure can have significant consequences.
  • Redundancy requires careful planning, design, and maintenance to balance the costs and benefits effectively.
  • Redundancy is part of a broader strategy for achieving fault tolerance, resilience, and system reliability.

  • Active redundancy refers to duplicate components or systems that operate simultaneously and share the workload. If one of them fails, the remaining components immediately absorb its load, which ensures continuous operation and minimizes downtime.
  • Passive redundancy refers to the presence of duplicate components or systems that are not actively operating but can be activated manually or automatically when needed as a backup in case of failure.
Written by
Rafael del Nero