My EC2 Instance Is Dead

Oct 8, 2020

Troubleshooting instances with failed status checks, and how to improve reliability.

image source: https://pixabay.com/illustrations/traffic-lights-problem-analysis-466950/

Recently, we have been trying to improve our failure management, because one of our EC2 instances became unavailable out of the blue. Failure management, as the Reliability pillar of the AWS Well-Architected Framework describes it, means:

In any system of reasonable complexity, it is expected that failures will occur. Reliability requires that your workload be aware of failures as they occur and take action to avoid impact on availability. Workloads must be able to both withstand failures and automatically repair issues.

We have an internal application running on an EC2 instance, and it had worked well for a long time. We therefore assumed it would keep working without any changes. However, one day, the EC2 instance died.

To elaborate, the CPU and memory metrics looked normal in the monitoring dashboard, but both status checks had failed. That is to say, our small application running on the EC2 instance became unavailable, and we could not SSH into the instance.

To solve this problem, we rebooted the instance several times, and fortunately, it became available again. Finally, with help from AWS Support, we found that the root cause of the malfunction was a hardware problem.

So, after the incident, the question is how to manage such failures, and how to improve the reliability and availability of the system.

Troubleshooting instances with failed status checks

As the AWS article "Why is my EC2 Linux instance unreachable and failing one or both of its status checks?" says:

Amazon EC2 monitors the health of each EC2 instance with two status checks:
1. System status check: The system status check detects issues with the underlying host that your instance runs on. If the underlying host is unresponsive or unreachable due to network, hardware, or software issues, then this status check fails.
2. Instance status check: An instance status check failure indicates a problem with the instance due to operating system-level errors such as the following:
a. Failure to boot the operating system
b. Failure to mount volumes correctly
c. File system issues
d. Incompatible drivers
e. Kernel panic
Instance status checks might also fail due to severe memory pressures caused by over-utilization of instance resources.

If one of your EC2 instances fails a status check, first determine which type of check failed, and then follow the corresponding documentation to resolve the issue. For example, if an instance failed the system status check, see "My instance failed the system status check. How do I troubleshoot this?". For more information about debugging, you can refer to "Troubleshooting instances with failed status checks".
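To find out which of the two checks failed without opening the console, something like the following sketch could be used. It assumes a boto3 EC2 client is passed in and relies on the `describe_instance_status` call; the function names `failed_checks` and `check_instance` are made up for this example.

```python
def failed_checks(entry):
    """Return which status checks are impaired for one InstanceStatuses entry."""
    failed = []
    # "SystemStatus" reflects the underlying host, "InstanceStatus" the OS level.
    if entry.get("SystemStatus", {}).get("Status") == "impaired":
        failed.append("system")
    if entry.get("InstanceStatus", {}).get("Status") == "impaired":
        failed.append("instance")
    return failed


def check_instance(ec2_client, instance_id):
    """Query EC2 and map each returned instance ID to its failed checks."""
    response = ec2_client.describe_instance_status(
        InstanceIds=[instance_id],
        IncludeAllInstances=True,  # also report instances that are not running
    )
    return {e["InstanceId"]: failed_checks(e) for e in response["InstanceStatuses"]}
```

With boto3 installed and credentials configured, this could be called as `check_instance(boto3.client("ec2"), "i-0123456789abcdef0")`, where the instance ID is a placeholder. A result containing "system" points to the underlying host, while "instance" points to the operating system.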

How to improve?

There are several ways to manage failures and improve the reliability and availability of a system, such as:

Backup

Backups are used to restore servers after a failure. They can help you meet your requirements for recovery time objectives (RTO) and recovery point objectives (RPO). Depending on the amount of data, a restore can take from minutes to days to complete. Backing up and restoring an EC2 instance requires more protection than just the instance's individual EBS volumes. To restore an instance, you need to restore all of its EBS volumes and also recreate an identical instance: instance type, VPC, security group, IAM role, and so on. For more information, please refer to AWS Backup: EC2 Instances, EFS Single File Restore, and Cross-Region Backup.
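As a rough illustration, the sketch below takes an image (AMI) of an instance with the EC2 `create_image` call. This is a simpler stand-in for AWS Backup, not the service itself, and it only covers the volumes: settings such as the instance type, security group, and IAM role still have to be recorded elsewhere. The helper names and the `backup-...` naming scheme are assumptions made for this example.

```python
from datetime import datetime, timezone


def ami_name(instance_id, when):
    """Build an image name from the instance ID and a timestamp (no colons)."""
    return f"backup-{instance_id}-{when:%Y%m%d-%H%M%S}"


def backup_instance(ec2_client, instance_id, when=None):
    """Create an AMI of the instance and return the new image ID."""
    when = when or datetime.now(timezone.utc)
    response = ec2_client.create_image(
        InstanceId=instance_id,
        Name=ami_name(instance_id, when),
        Description=f"Backup of {instance_id}",
        # Skipping the reboot avoids downtime, but file system
        # consistency of the image is then not guaranteed.
        NoReboot=True,
    )
    return response["ImageId"]
```

With a real client this would be called as `backup_instance(boto3.client("ec2"), "i-0123456789abcdef0")`, where the instance ID is a placeholder. Running it on a schedule determines the recovery point you can achieve.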

Create alarms that stop, terminate, reboot, or recover an instance

You can create alarms that monitor metrics such as CPU utilization and memory utilization, and that send you an email when a threshold is breached. Note that EC2 does not publish memory metrics by default; they require the CloudWatch agent. Furthermore, you can add actions to an alarm so that it stops, terminates, reboots, or recovers an instance when a status check fails. At the time of writing, the monthly price of a standard (1-minute) alarm is $0.10 per alarm metric. For more information, please refer to Amazon CloudWatch pricing.
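A minimal sketch of such an alarm is shown below. It builds the arguments for CloudWatch's `put_metric_alarm` so that the instance is recovered when the system status check fails for two consecutive minutes. The alarm name, the two-period threshold, and the helper names are assumptions chosen for this example, and the recover action is not supported on every instance type.

```python
def recover_alarm_params(instance_id, region):
    """Build put_metric_alarm arguments that recover an instance on host failure."""
    return {
        "AlarmName": f"recover-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # evaluate the metric once per minute
        "EvaluationPeriods": 2,  # require two failing minutes in a row
        "Threshold": 1.0,        # the metric is 1 when the check fails
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:recover"],
    }


def create_recover_alarm(cloudwatch_client, instance_id, region):
    """Create the alarm and return its name."""
    params = recover_alarm_params(instance_id, region)
    cloudwatch_client.put_metric_alarm(**params)
    return params["AlarmName"]
```

With a real client this would be called as `create_recover_alarm(boto3.client("cloudwatch"), "i-0123456789abcdef0", "us-east-1")`, where both the instance ID and the region are placeholders. Swapping the metric for `StatusCheckFailed_Instance` and the action suffix for `reboot` would give a reboot-on-OS-failure alarm instead.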

Make your system highly available

If you want to make your system highly available, it is better to deploy your application on a cluster spread across multiple Availability Zones (AZs) behind a load balancer. If one of the instances becomes unavailable, the load balancer routes requests to the other healthy instances in your cluster. In addition, spreading the cluster across multiple AZs reduces the impact of a failure in any single Availability Zone. To achieve this, you can place your EC2 instances in multiple AZs with EC2 Auto Scaling, or use EKS to deploy your application.
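For the EC2 Auto Scaling option, a sketch of the group definition might look like the following. It builds the arguments for the Auto Scaling `create_auto_scaling_group` call; the helper name, the default sizes, and the assumption that each subnet lives in a different AZ are all choices made for this example.

```python
def asg_params(name, launch_template_name, subnet_ids, target_group_arn,
               min_size=2, max_size=4):
    """Build create_auto_scaling_group arguments for a multi-AZ group."""
    # Assumption: each subnet ID belongs to a different Availability Zone.
    if len(set(subnet_ids)) < 2:
        raise ValueError("use subnets in at least two Availability Zones")
    return {
        "AutoScalingGroupName": name,
        "LaunchTemplate": {
            "LaunchTemplateName": launch_template_name,
            "Version": "$Latest",
        },
        "MinSize": min_size,
        "MaxSize": max_size,
        "VPCZoneIdentifier": ",".join(subnet_ids),  # comma-separated subnet IDs
        "TargetGroupARNs": [target_group_arn],      # registers instances with the load balancer
        "HealthCheckType": "ELB",                   # replace instances the load balancer marks unhealthy
        "HealthCheckGracePeriod": 300,              # seconds to wait after launch before checking
    }
```

The resulting dictionary could then be passed to a real client with `boto3.client("autoscaling").create_auto_scaling_group(**params)`. Keeping the minimum size at two or more means a single instance or AZ failure still leaves a healthy instance to serve requests.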

Reference

AWS Well-Architected Framework

This whitepaper describes the AWS Well-Architected Framework. It provides guidance to help customers apply best…

wa.aws.amazon.com

Troubleshooting EC2 Linux Instance System Status Check Failure

My Amazon Elastic Compute Cloud (Amazon EC2) instance failed its system status check and is no longer accessible. How…

aws.amazon.com

Troubleshooting instances with failed status checks

The following information can help you troubleshoot issues if your instance fails a status check. First determine…

docs.aws.amazon.com

Troubleshoot Status Check Failures on an Unreachable EC2 Linux Instance

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance is unreachable, and is failing one or both of its status…

aws.amazon.com

Reliability vs Availability: What's the Difference?

When you pay for a service or invest in the underlying technology infrastructure, you expect the service to be…

www.bmc.com

Snapshots and Backups: What is the Difference? | Pair Knowledge Base

"Backups" and "snapshots" are terms that you may often hear in the web hosting space. They seem similar, but they are…

Written by Jo-Yu Liao