AWS High Availability and Fault Tolerance

High Availability

  • High availability ensures a system maintains uptime even when facing failures, though sometimes in a degraded state.
  • A common benchmark for a highly available system is 99.999% uptime, also known as “five nines”. Over a 365-day year this allows about five minutes and fifteen seconds of downtime.
  • Achieving high availability involves removing single points of failure by introducing system redundancy. For example, instead of relying on a single server, a highly available system would utilize multiple servers configured in the same way across different availability zones.
  • If one availability zone experiences an outage, traffic can be redirected automatically to the servers in the remaining availability zone, keeping downtime to a minimum.
  • High availability prioritizes guaranteeing essential services over maintaining complete functionality. During an outage, users may experience limited features, such as read-only access to a database, while critical services remain operational.
  • Mobile Banking App Example
    • Imagine a mobile banking app designed for high availability. Servers for the app are distributed across multiple availability zones, with each zone containing multiple, identically configured EC2 instances.
    • This setup ensures that if one availability zone fails, the app can seamlessly switch to the backup availability zone, eliminating single points of failure.
    • User data is also protected through redundancy. The primary database is replicated to a read-only database in a different availability zone.
    • If the primary availability zone fails, users can still view their account balances (read operation), but actions requiring write operations (transfers, withdrawals) will be unavailable until the primary zone is restored.
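The "five nines" figure and the benefit of redundancy can be checked with a little arithmetic. The sketch below is only illustrative: the function names are made up for these notes, a 365-day year is assumed, and the redundancy formula assumes that zones fail independently and that failover is instantaneous, which real systems only approximate.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def annual_downtime_seconds(availability: float) -> float:
    """Downtime allowed per year for an availability fraction such as 0.99999."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumes failures are independent and failover is instant, so the system
    is down only when every copy is down at the same time.
    """
    return 1.0 - (1.0 - single) ** copies


def format_duration(seconds: float) -> str:
    """Render a duration in seconds as hours, minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    # Downtime budgets for two to five nines.
    for availability in (0.99, 0.999, 0.9999, 0.99999):
        budget = format_duration(annual_downtime_seconds(availability))
        print(f"{availability:.5f} -> {budget} per year")

    # Two zones that are each 99% available, under the independence assumption.
    print(f"two zones at 99% each: {combined_availability(0.99, 2):.4%}")
```

Under these assumptions, two zones that are each only 99% available combine to roughly 99.99%, which is why spreading identical servers across availability zones is the usual first step toward high availability.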

Fault Tolerance

  • Fault tolerance goes beyond high availability. It ensures that a system remains operational even during outages, and users do not experience any noticeable disruption or degradation in service.
  • Implementing fault tolerance traditionally involved complex and expensive measures, including load balancing across multiple servers, database replication, and multi-region availability.
  • AWS simplifies fault tolerance and lowers its cost by providing the necessary infrastructure as managed services, so companies no longer need to buy and operate the hardware themselves or maintain the in-house expertise to run it.
  • E-Commerce App Example
    • Consider an e-commerce application built with a microservices architecture, where each service runs in a Docker container. High availability is achieved using Amazon Elastic Container Service (ECS) with Fargate, allowing containerized applications to run without managing the underlying infrastructure.
    • Fault tolerance is achieved by using Amazon Aurora Global Database and AWS Lambda. Aurora Global Database replicates data across multiple AWS regions, ensuring data availability even if one region experiences an outage.
    • AWS Lambda functions, invoked through Amazon API Gateway, form a serverless tier that scales automatically with demand. Because the service runs functions across multiple availability zones, no single server becomes a point of failure.
    • Newer AWS services also support fault tolerance and high availability. AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) runs controlled experiments that test how an application responds to failure scenarios, helping identify and address weaknesses before a real outage does.
    • Amazon CloudWatch Container Insights provides detailed metrics and logs, allowing for quick identification and resolution of performance issues or bottlenecks.
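The difference between the two examples in these notes (read-only degradation in the banking app versus uninterrupted service in a fault-tolerant design) can be sketched with a toy model. Nothing here is an AWS API: `Node`, `Cluster`, `WriteUnavailable` and the zone names are hypothetical, replication is modeled as an instant copy, and promotion of a replica is treated as immediate, whereas real failover takes time.

```python
from dataclasses import dataclass, field


class WriteUnavailable(Exception):
    """Raised when no healthy writable node is left."""


@dataclass
class Node:
    zone: str
    writable: bool
    healthy: bool = True
    data: dict = field(default_factory=dict)


class Cluster:
    """Toy database cluster (hypothetical, not an AWS service)."""

    def __init__(self, nodes, promote_on_failure):
        self.nodes = nodes
        # True models the fault-tolerant design, False the highly available one.
        self.promote_on_failure = promote_on_failure

    def _writer(self):
        """Return a healthy writable node, or None if there is none."""
        return next((n for n in self.nodes if n.healthy and n.writable), None)

    def fail_zone(self, zone):
        """Simulate an outage of every node in `zone`."""
        for node in self.nodes:
            if node.zone == zone:
                node.healthy = False
        # Fault-tolerant variant: promote a surviving replica to writer.
        if self.promote_on_failure and self._writer() is None:
            for node in self.nodes:
                if node.healthy:
                    node.writable = True
                    break

    def read(self, key):
        """Serve a read from any healthy node."""
        for node in self.nodes:
            if node.healthy:
                return node.data.get(key)
        raise RuntimeError("no healthy node available")

    def write(self, key, value):
        """Accept a write only if a writer exists, then copy it to healthy nodes."""
        if self._writer() is None:
            raise WriteUnavailable("cluster is in read-only mode")
        for node in self.nodes:
            if node.healthy:
                node.data[key] = value


def make_cluster(promote_on_failure):
    """Build a two-zone cluster with the writer in zone-a and a replica in zone-b."""
    nodes = [Node("zone-a", writable=True), Node("zone-b", writable=False)]
    return Cluster(nodes, promote_on_failure)
```

With `promote_on_failure=False`, failing `zone-a` leaves reads working while writes raise `WriteUnavailable`, matching the degraded banking app. With `promote_on_failure=True`, the surviving replica takes over writes, which is the behavior a fault-tolerant design aims for.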