
Cache Me If You Can: How Caching Goes Wrong and How to Fix It

by Thuso Kharibe
Published on: 5/26/2025

Caching is a fundamental technique in modern system design, promising lightning-fast response times and reduced load on backend services. From client-side browser caches to distributed in-memory stores like Redis, and even down to the database’s own buffer pools, caching layers are ubiquitous. However, while the benefits are clear, a misconfigured or misunderstood cache can quickly turn from a performance booster into a system’s Achilles’ heel.

In this post, we’ll dive into common caching catastrophes, explore their impact, and discuss robust solutions to ensure your caching strategy remains a powerful asset.


I. The Usual Suspects: 4 Common Caching Catastrophes

When caching systems are not implemented with foresight, several predictable issues can arise, leading to degraded performance or even system outages.

A. The Thundering Herd Problem (Cache Stampede)

  • What it is: This occurs when a popular cached item (or many items simultaneously) expires or is invalidated. The subsequent flood of requests for this now-uncached data all try to regenerate it from the underlying data store (e.g., a database) at the same time.
  • Impact: A massive, sudden spike in load on your database, potentially overwhelming it and causing slow responses or even a complete outage. It’s like a flash mob all trying to get through a single turnstile at once.
  • Symptoms:
    • Sudden, sharp increase in database CPU/IO.
    • Increased application latency.
    • Potential for cascading failures as downstream services time out.

B. Cache Penetration

  • What it is: This problem arises when requests are made for data that does not exist in the cache and also does not exist in the primary data store. Each such request will bypass the cache and hit the database, fruitlessly searching for non-existent data.
  • Impact: The database is burdened with unnecessary queries, consuming resources that could be used for valid requests. Malicious actors can exploit this by deliberately requesting non-existent keys to overload the system.
  • Symptoms:
    • High number of database queries returning “not found” or empty results.
    • Increased database load without a corresponding increase in useful work.

C. Cache Breakdown (Hot Key Expiry)

  • What it is: Similar to a cache stampede but focused on a single, extremely popular (hot) piece of data. When this hot key expires, the first request to miss the cache triggers a database lookup. While this lookup is in progress, subsequent requests for the same hot key also miss the cache and pile up, all hitting the database.
  • Impact: Concentrated load on the database for a specific query, leading to significant latency for requests involving that hot data.
  • Symptoms:
    • A specific application feature or data point becomes extremely slow.
    • Monitoring shows repeated, concurrent queries for the same data against the database.

D. Cache Crash/Failure

  • What it is: The cache service or cluster itself becomes unavailable (e.g., Redis server crashes, network partition isolates the cache).
  • Impact: All requests that would normally be served by the cache now fall through to the primary data store. If the database isn’t provisioned to handle this sudden full load, it can quickly become overwhelmed.
  • Symptoms:
    • Application errors related to cache connectivity.
    • Massive, system-wide increase in database load and application latency.

II. Fighting Back: Effective Solutions and Mitigation Strategies

Understanding these problems is the first step; implementing effective solutions is key to building resilient systems.

A. Taming the Thundering Herd & Cache Breakdown:

Randomized Expiry (Jitter): Instead of setting a fixed TTL (Time-To-Live) for all similar items, add a small random deviation to each item’s expiry time. This spreads out the expirations, preventing a mass expiry event.

// Example using a generic cache interface (e.g., IMemoryCache or IDistributedCache in .NET)
using System;
using Microsoft.Extensions.Caching.Memory;

public class CacheService
{
    private readonly IMemoryCache _cache; // Or IDistributedCache
    private const int BASE_TTL_SECONDS = 3600;    // 1 hour
    private const int JITTER_RANGE_SECONDS = 300; // up to 5 minutes of jitter

    public CacheService(IMemoryCache cache)
    {
        _cache = cache;
    }

    public void SetWithJitter<T>(string key, T value)
    {
        // Random.Shared is thread-safe (.NET 6+); a single shared 'new Random()' instance is not.
        int randomJitter = Random.Shared.Next(JITTER_RANGE_SECONDS + 1); // 0 to JITTER_RANGE_SECONDS
        TimeSpan actualTtl = TimeSpan.FromSeconds(BASE_TTL_SECONDS + randomJitter);
        _cache.Set(key, value, actualTtl); // absolute expiration relative to now
    }

    // For IDistributedCache, the options would be slightly different:
    // public async Task SetDistributedWithJitterAsync<T>(string key, T value)
    // {
    //     int randomJitter = Random.Shared.Next(JITTER_RANGE_SECONDS + 1);
    //     var options = new DistributedCacheEntryOptions
    //     {
    //         AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(BASE_TTL_SECONDS + randomJitter)
    //     };
    //     // Assuming value is serializable to byte[] or string
    //     await _distributedCache.SetAsync(key, Serialize(value), options);
    // }
}

Distributed Locking/Semaphores (for Hot Keys): When a hot key expires and a request misses the cache, acquire a distributed lock. Only the request holding the lock regenerates the data from the database and repopulates the cache.

// Simplified example using a hypothetical IDistributedLockService.
// Real implementations might use RedLock.net for Redis or similar.
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class DataService
{
    private readonly IMemoryCache _cache;
    private readonly IDatabaseService _dbService;
    private readonly IDistributedLockService _lockService; // Hypothetical service

    public DataService(IMemoryCache cache, IDatabaseService dbService, IDistributedLockService lockService)
    {
        _cache = cache;
        _dbService = dbService;
        _lockService = lockService;
    }

    public async Task<DataObject> GetHotDataAsync(string key)
    {
        if (_cache.TryGetValue(key, out DataObject cachedData))
        {
            return cachedData;
        }

        var lockKey = $"lock:{key}";
        // Attempt to acquire the lock with a timeout
        if (await _lockService.AcquireLockAsync(lockKey, TimeSpan.FromSeconds(5)))
        {
            try
            {
                // Double-check if another thread populated the cache while waiting for the lock
                if (_cache.TryGetValue(key, out DataObject dataAfterLock))
                {
                    return dataAfterLock;
                }

                var dbData = await _dbService.FetchHotDataAsync(key);
                if (dbData != null)
                {
                    _cache.Set(key, dbData, TimeSpan.FromHours(1)); // Set appropriate TTL
                }
                return dbData;
            }
            finally
            {
                await _lockService.ReleaseLockAsync(lockKey);
            }
        }
        else
        {
            // Lock not acquired: another caller is already regenerating this key.
            // Options: wait and retry, return an error, or serve stale data if acceptable.
            _cache.TryGetValue(key, out DataObject staleData); // Serve stale if it exists
            return staleData; // May be null if no stale copy is available
        }
    }
}
// public interface IDistributedLockService { /* ... */ }
// public interface IDatabaseService { /* ... */ }
// public class DataObject { /* ... */ }

Proactive Cache Refresh (for Hot Keys): For critical hot keys, implement a background process (e.g., a scheduled job or a background service) that refreshes the cache before it expires.
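
A minimal sketch of such a refresher, reusing the hypothetical IDatabaseService from the locking example and a placeholder key name; BackgroundService and PeriodicTimer are standard .NET (6+) building blocks:

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Hosting;

public class HotKeyRefresher : BackgroundService
{
    private readonly IMemoryCache _cache;
    private readonly IDatabaseService _dbService; // hypothetical, as above
    private const string HOT_KEY = "hot:popular-item"; // placeholder key

    public HotKeyRefresher(IMemoryCache cache, IDatabaseService dbService)
    {
        _cache = cache;
        _dbService = dbService;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        // Refresh every 50 minutes, comfortably inside a 1-hour TTL,
        // so readers should never observe an expired entry.
        using var timer = new PeriodicTimer(TimeSpan.FromMinutes(50));
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            var fresh = await _dbService.FetchHotDataAsync(HOT_KEY);
            if (fresh != null)
            {
                _cache.Set(HOT_KEY, fresh, TimeSpan.FromHours(1));
            }
        }
    }
}
// Registered at startup: services.AddHostedService<HotKeyRefresher>();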

Prioritize Core Data (During Stampedes): This is more of an architectural decision. If a stampede is detected (e.g., via monitoring high DB load), the application or an API gateway could temporarily rate-limit or queue requests for non-critical data, allowing essential operations to proceed.
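
One way to sketch the rate-limiting half of this idea in .NET 7+ is the System.Threading.RateLimiting package; which requests count as “non-critical”, and how a stampede is detected, are application-specific assumptions here:

using System;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

public class NonCriticalGate
{
    // Allow 20 concurrent non-critical regenerations; queue up to 100 more, oldest first.
    private readonly ConcurrencyLimiter _limiter = new(new ConcurrencyLimiterOptions
    {
        PermitLimit = 20,
        QueueLimit = 100,
        QueueProcessingOrder = QueueProcessingOrder.OldestFirst
    });

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> nonCriticalWork) where T : class
    {
        using RateLimitLease lease = await _limiter.AcquireAsync();
        if (!lease.IsAcquired)
        {
            return null; // Queue full: shed this request so core traffic can proceed
        }
        return await nonCriticalWork();
    }
}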

Never Expire Critical Hot Keys (or Use Very Long TTLs): For data that changes infrequently but is accessed very often, set a very long TTL or no expiry, relying on explicit cache invalidation when the source data changes.
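
The trade-off is that writes must invalidate explicitly. A minimal sketch, assuming cache and repository fields shaped like the ProductService example in the next subsection (SaveAsync and product.Id are hypothetical):

// No TTL: the entry lives until a write explicitly evicts it.
public async Task UpdateProductAsync(Product product)
{
    await _repository.SaveAsync(product); // hypothetical write method
    _cache.Remove(product.Id);            // next read repopulates with fresh data
    // Or write-through instead: _cache.Set(product.Id, product); (no expiry)
}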

B. Preventing Cache Penetration:

Cache Null Values: If a query to the database for a specific key returns no data, store a special “null” or “sentinel” value in the cache for that key with a short TTL.

using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class ProductService
{
    private readonly IMemoryCache _cache;
    private readonly IProductRepository _repository;
    private const string NULL_SENTINEL = "##NULL##"; // Or a dedicated object instance
    private static readonly TimeSpan SHORT_TTL_FOR_NULL = TimeSpan.FromMinutes(5);
    private static readonly TimeSpan REGULAR_TTL = TimeSpan.FromHours(1);

    public ProductService(IMemoryCache cache, IProductRepository repository)
    {
        _cache = cache;
        _repository = repository;
    }

    public async Task<Product> GetProductByIdAsync(string productId)
    {
        if (_cache.TryGetValue(productId, out object cachedValue))
        {
            return cachedValue as string == NULL_SENTINEL ? null : (Product)cachedValue;
        }

        var product = await _repository.GetByIdAsync(productId);

        if (product == null)
        {
            _cache.Set(productId, NULL_SENTINEL, SHORT_TTL_FOR_NULL);
            return null;
        }

        _cache.Set(productId, product, REGULAR_TTL);
        return product;
    }
}
// public interface IProductRepository { /* ... */ }
// public class Product { /* ... */ }

Bloom Filters: A Bloom filter is a space-efficient probabilistic data structure that answers “possibly present” or “definitely not present”. Populate it with every valid key (e.g., all product IDs), then consult it before touching the cache or database: if it says the key definitely doesn’t exist, skip both lookups entirely. In production you’d typically use a library for this; a toy sketch follows the conceptual usage below.

Conceptual Usage:

// Assuming 'bloomFilter' is an initialized BloomFilter instance
// if (!bloomFilter.MightContain(requestedKey)) {
//     return null; // Key definitely not in DB
// }
// // Proceed to check cache, then DB
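
For intuition only, here is a toy sketch of the core mechanics (double hashing over a bit array). It is deliberately not production-grade: .NET string hash codes are randomized per process, so a real filter needs stable hash functions, which is one more reason to reach for a library:

using System;
using System.Collections;
using System.Collections.Generic;

public class ToyBloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public ToyBloomFilter(int sizeInBits, int hashCount)
    {
        _bits = new BitArray(sizeInBits);
        _hashCount = hashCount;
    }

    // Call Add for every key that exists in the primary data store.
    public void Add(string key)
    {
        foreach (int index in Indexes(key)) _bits[index] = true;
    }

    // False = definitely absent (safe to skip cache and DB); true = possibly present.
    public bool MightContain(string key)
    {
        foreach (int index in Indexes(key))
            if (!_bits[index]) return false;
        return true;
    }

    // Double hashing: derive k bit positions from two base hash values.
    private IEnumerable<int> Indexes(string key)
    {
        int h1 = key.GetHashCode();
        int h2 = StringComparer.OrdinalIgnoreCase.GetHashCode(key);
        for (int i = 0; i < _hashCount; i++)
            yield return (int)((uint)(h1 + i * h2) % (uint)_bits.Length);
    }
}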

C. Surviving a Cache Crash:

Circuit Breaker Pattern: Libraries like Polly for .NET can implement circuit breakers. If the cache service is down, the circuit breaker “opens,” and the application stops trying to contact the cache for a cooldown period.

// Using the Polly library (v7-style policies). CacheAccessException and the
// Attempt*/GetDataFromPrimarySource helpers below are placeholders for your
// actual cache client and data source.
using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;

public class MyServiceUsingCache
{
    // After 3 consecutive cache failures, stop calling the cache for 30 seconds.
    private static readonly AsyncCircuitBreakerPolicy _cacheBreaker =
        Policy
            .Handle<CacheAccessException>() // or whatever exceptions your cache client throws
            .CircuitBreakerAsync(
                exceptionsAllowedBeforeBreaking: 3,
                durationOfBreak: TimeSpan.FromSeconds(30));

    public async Task<string> GetDataFromCacheOrSourceAsync(string key)
    {
        try
        {
            // While the circuit is open, ExecuteAsync throws BrokenCircuitException
            // immediately instead of waiting on a dead cache.
            string cachedData = await _cacheBreaker.ExecuteAsync(() => AttemptGetFromCacheAsync(key));
            if (cachedData != null) return cachedData;
        }
        catch (BrokenCircuitException)
        {
            // Circuit is open: skip the cache entirely and fall back to the source.
            return await GetDataFromPrimarySourceAsync(key);
        }
        catch (CacheAccessException)
        {
            // Cache error (counted by the breaker): fall back to the source.
            return await GetDataFromPrimarySourceAsync(key);
        }

        // Cache miss without error: load from the source and try to repopulate the cache.
        var sourceData = await GetDataFromPrimarySourceAsync(key);
        try
        {
            await _cacheBreaker.ExecuteAsync(() => AttemptSetToCacheAsync(key, sourceData));
        }
        catch (Exception)
        {
            // A failed cache write should not fail the request.
        }
        return sourceData;
    }

    // Placeholders standing in for a real cache client and database access:
    private Task<string> AttemptGetFromCacheAsync(string key) { /* ... actual cache get ... */ return Task.FromResult<string>(null); }
    private Task AttemptSetToCacheAsync(string key, string value) { /* ... actual cache set ... */ return Task.CompletedTask; }
    private Task<string> GetDataFromPrimarySourceAsync(string key) { /* ... get from DB ... */ return Task.FromResult("data_from_db"); }
}
// public class CacheAccessException : Exception { /* thrown by your cache client wrapper */ }
// (If your cache sits behind HTTP, the same breaker can instead be attached to an
// HttpClient via services.AddHttpClient("CacheClient").AddPolicyHandler(...).)

High Availability Cache Cluster: Deploy your cache (e.g., Redis) in a clustered configuration with replication and automatic failover (like Redis Sentinel or Redis Cluster). This is a configuration aspect of the cache system itself.
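
From the application side, failover is mostly the client library’s job. As a rough sketch, StackExchange.Redis can connect through Sentinel by naming the monitored service; the host names and service name here are placeholders:

using StackExchange.Redis;

// Connect via Sentinel: the client discovers the current master for "mymaster"
// and re-discovers it after a failover.
var options = new ConfigurationOptions
{
    ServiceName = "mymaster",   // the master name configured in Sentinel
    AbortOnConnectFail = false  // keep retrying instead of failing at startup
};
options.EndPoints.Add("sentinel1.example.internal", 26379); // placeholder hosts
options.EndPoints.Add("sentinel2.example.internal", 26379);
options.EndPoints.Add("sentinel3.example.internal", 26379);

IConnectionMultiplexer redis = ConnectionMultiplexer.Connect(options);
IDatabase db = redis.GetDatabase();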

Graceful Degradation: Design your application to keep functioning, perhaps with reduced performance or fewer features, even when the cache is unavailable. For example, serve a slightly stale local copy of the data when the distributed cache is down, or temporarily disable cache-dependent features rather than failing requests outright.
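
A minimal sketch of that stale-copy idea, keeping a last-known-good value in process memory; the fetch delegate stands in for whatever distributed-cache read you use:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class StaleFallbackCache
{
    // Last-known-good values survive a distributed-cache outage.
    private readonly ConcurrentDictionary<string, (string Value, DateTime FetchedAt)> _localCopy = new();

    public async Task<string> GetAsync(string key, Func<string, Task<string>> fetchFromDistributedCache)
    {
        try
        {
            var fresh = await fetchFromDistributedCache(key);
            if (fresh != null)
            {
                _localCopy[key] = (fresh, DateTime.UtcNow);
                return fresh;
            }
        }
        catch (Exception)
        {
            // Distributed cache unavailable: degrade rather than fail.
        }

        // Serve the stale local copy if we have one; callers may surface a "data may be outdated" hint.
        return _localCopy.TryGetValue(key, out var stale) ? stale.Value : null;
    }
}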


III. The Art of Letting Go: Smart Cache Eviction Strategies

Caches have finite memory. When a cache is full and a new item needs to be added, an existing item must be evicted. The choice of eviction policy significantly impacts cache hit rates.

  • LRU (Least Recently Used): Evicts the item that hasn’t been accessed for the longest time. Good for data with temporal locality (recently accessed items are likely to be accessed again). Most built-in IMemoryCache implementations in .NET offer LRU-like behavior by default when size limits are enforced.
  • LFU (Least Frequently Used): Evicts the item that has been accessed the fewest times. Useful when some items are consistently more popular than others. Requires more overhead to track frequencies.
  • MRU (Most Recently Used): Evicts the most recently accessed item.
  • FIFO (First-In, First-Out): Evicts the oldest item, regardless of access patterns.
  • Sliding Window Expiration (related to TTL): Items expire after a fixed period of inactivity. In .NET’s IMemoryCache: SetSlidingExpiration(TimeSpan).
  • Absolute Expiration (TTL): Items expire at a specific time or after a specific duration, regardless of activity. In .NET’s IMemoryCache: SetAbsoluteExpiration(TimeSpan). (See the configuration snippet after this list.)
  • SLRU (Segmented LRU): Divides the cache into a “probationary” segment and a “protected” segment; items are promoted to the protected segment only if accessed again while on probation, which shields genuinely hot items from one-off scans.
  • Two-Tiered Caching: Utilizes multiple cache layers, often a fast in-memory L1 cache checked first and a distributed L2 cache as the shared fallback.
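
As a concrete reference for the .NET options mentioned above, here is how sliding and absolute expiration combine on a size-limited MemoryCache (the size units are whatever convention you choose):

using System;
using Microsoft.Extensions.Caching.Memory;

// A size-limited cache: entries must declare a Size, and the cache
// compacts (evicting by priority, then recency) once SizeLimit is reached.
var cache = new MemoryCache(new MemoryCacheOptions { SizeLimit = 1024 });

var options = new MemoryCacheEntryOptions()
    .SetSize(1)                                     // this entry's share of SizeLimit
    .SetSlidingExpiration(TimeSpan.FromMinutes(10)) // reset on each access
    .SetAbsoluteExpiration(TimeSpan.FromHours(1))   // hard upper bound on lifetime
    .SetPriority(CacheItemPriority.High);           // evicted later than Normal/Low

cache.Set("user:42", "cached-value", options);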

Choosing the right eviction policy depends heavily on your application’s specific data access patterns. Many caching libraries allow configuration of these policies.


IV. Layers of Caching: It’s Not Just One Server

It’s crucial to remember that caching isn’t monolithic. Caching occurs at multiple levels in a typical system:

  1. Client-Side Cache: Web browsers cache static assets and API responses.
  2. CDN (Content Delivery Network): Caches static content at edge locations.
  3. Load Balancer Cache: Some load balancers can cache content.
  4. Application/Service-Level Cache: In-memory caches within the application (e.g., IMemoryCache in ASP.NET Core).
  5. Distributed Cache: Shared services like Redis or Memcached (IDistributedCache in ASP.NET Core).
  6. Database Cache: Internal mechanisms like buffer pools.

A holistic caching strategy considers how these layers interact.


V. Conclusion: Caching Wisely for Resilient Systems

Caching is an indispensable tool for building high-performance, scalable applications. However, it introduces its own complexities. By understanding common pitfalls and implementing appropriate solutions and eviction strategies, you can harness the true power of caching.

Proactive monitoring of cache hit/miss rates, latency, and eviction counts, combined with a willingness to adapt your strategy based on observed access patterns, will ensure your cache layers contribute positively to your system’s health and user experience.
