
How to Test ElastiCache Resilience

February 24, 2026

Proactively testing your application’s resilience is crucial for production readiness. Using AWS Fault Injection Service (FIS), you can simulate a primary node failure in Amazon ElastiCache to verify your application handles automatic failover without manual intervention.

This guide walks through configuring and running a failover experiment on an ElastiCache for Redis cluster. AWS recommends using a non-production cluster with cluster mode disabled to more easily observe role changes.

Prerequisites

An ElastiCache for Redis cluster (non-production recommended)
Cluster mode disabled for clear primary/replica observation
At least one read replica configured
CloudWatch log group for experiment logging
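
Before building the experiment, you may want to confirm the first three prerequisites programmatically. The following is a minimal sketch, assuming boto3 is installed and credentials are configured. The function names are illustrative, and the check relies on the ClusterEnabled and CurrentRole fields that describe_replication_groups returns for replication groups with cluster mode disabled.

```python
def check_prerequisites(group):
    """Return a list of problems; an empty list means the group looks ready.

    `group` is one entry from the "ReplicationGroups" list returned by
    the ElastiCache DescribeReplicationGroups API.
    """
    problems = []
    if group.get("ClusterEnabled"):
        problems.append("cluster mode is enabled")
    # Flatten the members of every node group (shard) into one list.
    members = [
        member
        for node_group in group.get("NodeGroups", [])
        for member in node_group.get("NodeGroupMembers", [])
    ]
    if not any(m.get("CurrentRole") == "replica" for m in members):
        problems.append("no read replica found")
    return problems


def fetch_group(replication_group_id):
    """Fetch one replication group description (needs boto3 and credentials)."""
    import boto3  # imported lazily so check_prerequisites works offline

    client = boto3.client("elasticache")
    response = client.describe_replication_groups(
        ReplicationGroupId=replication_group_id
    )
    return response["ReplicationGroups"][0]
```

Calling check_prerequisites(fetch_group("my-test-group")) with your own replication group ID (the name here is a placeholder) returns an empty list when the group is ready for the experiment.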

Step 1: Create the Experiment Template

Navigate to the AWS FIS console and create a new experiment template:

Choose Create experiment template
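Under Actions, add the action aws:elasticache:interrupt-cluster-az-power (current FIS documentation lists this action as aws:elasticache:replicationgroup-interrupt-az-power; use whichever name your console offers)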
Configure the target to select your ElastiCache cluster by resource ID or tag
Under Logging configuration, point to a CloudWatch log group (create new if needed)
Note the IAM role ARN FIS generates automatically

During setup, FIS provides a sample IAM policy JSON for CloudWatch logging. Copy this for the next step.
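
If you prefer to script the template instead of using the console, the same settings can be expressed as arguments to the FIS create_experiment_template call in boto3. This is a sketch under stated assumptions rather than a verified configuration: the target key ReplicationGroups, the target parameter availabilityZoneIdentifier, and the duration value are assumptions to check against the FIS actions reference. The action ID and resource type are passed in so that you can copy the exact strings from your console. When scripting, you also supply the role ARN yourself.

```python
import uuid


def build_template(action_id, resource_type, replication_group_arn,
                   availability_zone, role_arn, log_group_arn, duration="PT5M"):
    """Build keyword arguments for the FIS CreateExperimentTemplate call."""
    return {
        "clientToken": str(uuid.uuid4()),
        "description": "ElastiCache failover test",
        "roleArn": role_arn,
        # No alarm guards this run; add a CloudWatch alarm stop condition
        # for anything beyond a sandbox.
        "stopConditions": [{"source": "none"}],
        "targets": {
            "cache": {
                "resourceType": resource_type,
                "resourceArns": [replication_group_arn],
                "selectionMode": "ALL",
                # Assumption: the action needs the AZ hosting the primary node.
                "parameters": {"availabilityZoneIdentifier": availability_zone},
            }
        },
        "actions": {
            "interrupt-primary": {
                "actionId": action_id,
                "parameters": {"duration": duration},
                # Assumption: "ReplicationGroups" is the target key for this action.
                "targets": {"ReplicationGroups": "cache"},
            }
        },
        "logConfiguration": {
            "cloudWatchLogsConfiguration": {"logGroupArn": log_group_arn},
            "logSchemaVersion": 2,
        },
    }


def create_template(template):
    """Submit the template and return its ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so build_template works offline

    response = boto3.client("fis").create_experiment_template(**template)
    return response["experimentTemplate"]["id"]
```

Keeping the template as a plain dictionary makes it easy to review in version control before anything is submitted to AWS.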

Step 2: Update IAM Role Permissions

The auto-generated IAM role lacks CloudWatch write permissions. Fix this manually:

Open the IAM console
Find the role created by FIS (check the ARN from Step 1)
Click Add permissions → Create inline policy
Paste the CloudWatch logging policy JSON from the FIS template creation screen
Name the policy (e.g., FIS-CloudWatch-Logs) and save

This grants the role logs:CreateLogStream and logs:PutLogEvents permissions on your specified log group.
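
The inline policy can also be attached with boto3. The sketch below builds a policy containing only the two permissions named above. The JSON that FIS shows during template creation remains the authoritative version and may include further actions, so treat this document shape, the resource ARN pattern, and the function names as assumptions.

```python
import json


def build_logging_policy(log_group_arn):
    """Build an IAM policy document granting the two log permissions.

    `log_group_arn` is assumed to be the plain log group ARN, without a
    trailing ":*".
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                # Cover the log group itself and the streams inside it.
                "Resource": [log_group_arn, f"{log_group_arn}:*"],
            }
        ],
    }


def attach_logging_policy(role_name, log_group_arn,
                          policy_name="FIS-CloudWatch-Logs"):
    """Attach the policy inline to the role (needs boto3 and credentials)."""
    import boto3  # imported lazily so build_logging_policy works offline

    boto3.client("iam").put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(build_logging_policy(log_group_arn)),
    )
```

Note that put_role_policy takes the role name, not the full ARN noted in Step 1. The name is the final path segment of that ARN.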

Step 3: Run the Experiment

Return to the AWS FIS console:

Select your experiment template
Click Start experiment
Confirm the action when prompted
Monitor the state change to Running

The experiment executes immediately. You can view real-time logs via the CloudWatch log destination link in the experiment details.
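
The console steps above can be approximated in code with the FIS start_experiment and get_experiment calls. This sketch takes the client as a parameter, for example boto3.client("fis"), so that it can be exercised with a stub. The polling interval, timeout, and function name are illustrative choices, not values from this guide.

```python
import time
import uuid

# Statuses after which an experiment no longer changes state.
TERMINAL_STATUSES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis, template_id, poll_seconds=5.0,
                   timeout_seconds=900.0, sleep=time.sleep):
    """Start an experiment and poll until it reaches a terminal status."""
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    waited = 0.0
    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
        if state["status"] in TERMINAL_STATUSES:
            return state["status"]
        if waited >= timeout_seconds:
            raise TimeoutError(f"experiment {experiment_id} still {state['status']}")
        sleep(poll_seconds)
        waited += poll_seconds
```

A return value other than "completed" is a signal to open the experiment details and the CloudWatch logs before drawing conclusions about the application.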

Step 4: Monitor and Validate Resilience

The primary node will fail over, and a replica will be promoted. This typically completes in 5-15 seconds. Validate the following:

ElastiCache Events: In the ElastiCache console, check for:

Failover from master node [node-id] to replica node [node-id] completed
Recovering cache nodes [node-id]
Finished recovery for cache nodes [node-id]
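
To check for these events without the console, you could query the ElastiCache describe_events API and look for the three messages. In this sketch the marker strings are taken from the list above. Which source type (replication group or cache cluster) carries each message is not stated in this guide, so you may need to query both. Pagination is ignored for brevity.

```python
EXPECTED_MARKERS = (
    "Failover from master node",
    "Recovering cache nodes",
    "Finished recovery for cache nodes",
)


def missing_markers(events):
    """Return the expected message prefixes not found in `events`.

    `events` is the "Events" list from the DescribeEvents API response.
    """
    messages = [event.get("Message", "") for event in events]
    return [
        marker
        for marker in EXPECTED_MARKERS
        if not any(marker in message for message in messages)
    ]


def fetch_recent_events(source_identifier, source_type, minutes=60):
    """Fetch recent events for one source (needs boto3 and credentials)."""
    import boto3  # imported lazily so missing_markers works offline

    response = boto3.client("elasticache").describe_events(
        SourceIdentifier=source_identifier,
        SourceType=source_type,  # e.g. "replication-group" or "cache-cluster"
        Duration=minutes,
    )
    return response["Events"]
```

An empty list from missing_markers means all three events were recorded in the time window you queried.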

Application Behavior: A well-architected application should experience brief connection errors during failover before automatically reconnecting. Monitor for:

Temporary increase in cache misses (expected)
Graceful fallback to the primary database
No cascading performance failures downstream
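
One way an application can produce this behavior is a cache-aside read with bounded retries and a database fallback. The sketch below is generic and not tied to a particular Redis client: cache_get and db_get are hypothetical callables you would supply, and the retry count and delays are illustrative.

```python
import time


def get_with_fallback(key, cache_get, db_get,
                      retryable=(ConnectionError, TimeoutError),
                      attempts=3, base_delay=0.1, sleep=time.sleep):
    """Read from the cache, retrying briefly, then fall back to the database."""
    for attempt in range(attempts):
        try:
            value = cache_get(key)
        except retryable:
            # Likely mid-failover: back off exponentially, then try again.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
            continue
        if value is not None:
            return value
        break  # a clean cache miss needs no retry
    # Cache miss or cache unavailable: serve from the source of truth.
    return db_get(key)
```

If you use redis-py, pass its own exception classes in retryable, because redis.exceptions.ConnectionError does not derive from the built-in ConnectionError. Bounding the retries keeps a failover from turning into a pile-up of blocked requests.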

Performance Metrics: Measure application response times. Temporary latency spikes during failover are normal, but sustained high latency indicates issues with connection pooling or retry logic. Consult the official ElastiCache best practices documentation for tuning guidance.
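
To tell a temporary spike from sustained latency, it can help to measure how long response times stayed above a threshold. This is a minimal sketch that assumes you have collected (timestamp, latency) samples yourself. The function name and the threshold you choose are hypothetical and should reflect your own baseline.

```python
def longest_slow_streak(samples, threshold_ms):
    """Return the longest continuous span, in seconds, above the threshold.

    `samples` is a list of (timestamp_seconds, latency_ms) pairs sorted
    by timestamp.
    """
    longest = 0.0
    streak_start = None
    for timestamp, latency_ms in samples:
        if latency_ms > threshold_ms:
            if streak_start is None:
                streak_start = timestamp  # a new slow period begins
            longest = max(longest, timestamp - streak_start)
        else:
            streak_start = None  # latency recovered; reset the streak
    return longest
```

A streak that lasts well beyond the failover window points toward the connection pooling or retry issues described above.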

Simulating failures with FIS moves resilience from theory to testable reality. Instead of discovering connection handling bugs during an actual outage, you identify and fix them under controlled conditions. Regular chaos engineering exercises ensure your caching layer is genuinely highly available, not just theoretically resilient.

Cleanup

After testing, delete the FIS experiment template and its associated IAM role to keep your AWS environment clean. If you created a cluster or log group only for this test, remove those as well to avoid unnecessary charges.
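
The cleanup can also be scripted. In this sketch the clients are passed in, for example boto3.client("fis") and boto3.client("iam"), and the function name is illustrative. IAM refuses to delete a role that still has policies, so inline policies are deleted and managed policies detached first. Pagination is ignored for brevity.

```python
def cleanup(fis, iam, template_id, role_name):
    """Delete the experiment template, then the role and its policies."""
    fis.delete_experiment_template(id=template_id)

    # Inline policies (such as the logging policy from Step 2) go first.
    inline = iam.list_role_policies(RoleName=role_name)["PolicyNames"]
    for policy_name in inline:
        iam.delete_role_policy(RoleName=role_name, PolicyName=policy_name)

    # Managed policies are detached rather than deleted.
    attached = iam.list_attached_role_policies(RoleName=role_name)
    for policy in attached["AttachedPolicies"]:
        iam.detach_role_policy(RoleName=role_name, PolicyArn=policy["PolicyArn"])

    iam.delete_role(RoleName=role_name)
```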
