Caching is a fundamental technique in modern system design, promising lightning-fast response times and reduced load on backend services. From client-side browser caches to distributed in-memory stores like Redis, and even down to the database’s own buffer pools, caching layers are ubiquitous. However, while the benefits are clear, a misconfigured or misunderstood cache can quickly turn from a performance booster into a system’s Achilles’ heel.
In this post, we’ll dive into common caching catastrophes, explore their impact, and discuss robust solutions to ensure your caching strategy remains a powerful asset.
When caching systems are not implemented with foresight, several predictable issues can arise, leading to degraded performance or even system outages: a cache stampede (many requests hammering the database at once when popular entries expire), cache penetration (repeated lookups for keys that exist in neither the cache nor the database), and outright failure of the caching layer itself.
Understanding these problems is the first step; implementing effective solutions is key to building resilient systems.
Randomized Expiry (Jitter): Instead of setting a fixed TTL (Time-To-Live) for all similar items, add a small random deviation to each item’s expiry time. This spreads out the expirations, preventing a mass expiry event.
// Example using a generic cache interface (e.g., IMemoryCache or IDistributedCache in .NET)
using System;
using Microsoft.Extensions.Caching.Memory;

public class CacheService
{
    private readonly IMemoryCache _cache; // Or IDistributedCache

    private const int BASE_TTL_SECONDS = 3600;    // 1 hour
    private const int JITTER_RANGE_SECONDS = 300; // 5 minutes

    public CacheService(IMemoryCache cache)
    {
        _cache = cache;
    }

    public void SetWithJitter<T>(string key, T value)
    {
        // Random.Shared is thread-safe (.NET 6+); a shared `new Random()` instance is not.
        int randomJitter = Random.Shared.Next(JITTER_RANGE_SECONDS + 1); // 0 to JITTER_RANGE_SECONDS inclusive
        TimeSpan actualTtl = TimeSpan.FromSeconds(BASE_TTL_SECONDS + randomJitter);
        _cache.Set(key, value, actualTtl);
    }

    // For IDistributedCache, the options would be slightly different:
    // public async Task SetDistributedWithJitterAsync<T>(string key, T value)
    // {
    //     int randomJitter = Random.Shared.Next(JITTER_RANGE_SECONDS + 1);
    //     var options = new DistributedCacheEntryOptions
    //     {
    //         AbsoluteExpirationRelativeToNow = TimeSpan.FromSeconds(BASE_TTL_SECONDS + randomJitter)
    //     };
    //     // Assuming value is serializable to byte[] or string
    //     await _distributedCache.SetAsync(key, Serialize(value), options);
    // }
}
Distributed Locking/Semaphores (for Hot Keys): When a hot key expires and a request misses the cache, acquire a distributed lock. Only the request holding the lock regenerates the data from the database and repopulates the cache.
// Simplified example using a hypothetical IDistributedLockService.
// Real implementations might use RedLock.net for Redis or similar.
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class DataService
{
    private readonly IMemoryCache _cache;
    private readonly IDatabaseService _dbService;
    private readonly IDistributedLockService _lockService; // Hypothetical service

    public DataService(IMemoryCache cache, IDatabaseService dbService, IDistributedLockService lockService)
    {
        _cache = cache;
        _dbService = dbService;
        _lockService = lockService;
    }

    public async Task<DataObject> GetHotDataAsync(string key)
    {
        if (_cache.TryGetValue(key, out DataObject cachedData))
        {
            return cachedData;
        }

        var lockKey = $"lock:{key}";
        // Attempt to acquire the lock with a timeout
        if (await _lockService.AcquireLockAsync(lockKey, TimeSpan.FromSeconds(5)))
        {
            try
            {
                // Double-check whether another caller populated the cache while we waited for the lock
                if (_cache.TryGetValue(key, out DataObject dataAfterLock))
                {
                    return dataAfterLock;
                }
                var dbData = await _dbService.FetchHotDataAsync(key);
                if (dbData != null)
                {
                    _cache.Set(key, dbData, TimeSpan.FromHours(1)); // Set an appropriate TTL
                }
                return dbData;
            }
            finally
            {
                await _lockService.ReleaseLockAsync(lockKey);
            }
        }
        else
        {
            // Lock not acquired: serve stale data if available and acceptable,
            // wait and retry, or surface an error. For simplicity, we try stale data.
            _cache.TryGetValue(key, out DataObject staleData);
            return staleData; // Could be null
        }
    }
}

// public interface IDistributedLockService { /* ... */ }
// public interface IDatabaseService { /* ... */ }
// public class DataObject { /* ... */ }
Proactive Cache Refresh (for Hot Keys): For critical hot keys, implement a background process (e.g., a scheduled job or a background service) that refreshes the cache before it expires.
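A minimal sketch of such a refresher using .NET’s BackgroundService; the key list, the refresh interval, and the IDatabaseService dependency (as in the locking example above) are illustrative assumptions:
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;
using Microsoft.Extensions.Hosting;

public class HotKeyRefreshService : BackgroundService
{
    private readonly IMemoryCache _cache;
    private readonly IDatabaseService _dbService; // Hypothetical, as above

    private static readonly string[] HotKeys = { "config:featured", "catalog:top10" }; // Illustrative keys
    private static readonly TimeSpan RefreshInterval = TimeSpan.FromMinutes(45);       // Refresh before the 1-hour TTL lapses

    public HotKeyRefreshService(IMemoryCache cache, IDatabaseService dbService)
    {
        _cache = cache;
        _dbService = dbService;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            foreach (var key in HotKeys)
            {
                var fresh = await _dbService.FetchHotDataAsync(key);
                if (fresh != null)
                {
                    // Repopulate before expiry so readers never hit a cold key.
                    _cache.Set(key, fresh, TimeSpan.FromHours(1));
                }
            }
            await Task.Delay(RefreshInterval, stoppingToken);
        }
    }
}
Register it with services.AddHostedService<HotKeyRefreshService>(); the read path stays unchanged.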
Prioritize Core Data (During Stampedes): This is more of an architectural decision. If a stampede is detected (e.g., via monitoring high DB load), the application or an API gateway could temporarily rate-limit or queue requests for non-critical data, allowing essential operations to proceed.
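As a rough sketch of the rate-limiting variant, here is one way to shed non-critical reads using System.Threading.RateLimiting (.NET 7+); the permit budget and the decision to apply it only to non-critical data are assumptions for illustration:
using System;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

public class LoadSheddingReader
{
    // Allow at most 50 non-critical DB reads per second; reject the rest immediately.
    private static readonly FixedWindowRateLimiter Limiter = new(
        new FixedWindowRateLimiterOptions
        {
            PermitLimit = 50,
            Window = TimeSpan.FromSeconds(1),
            QueueLimit = 0 // Fail fast rather than queue during a stampede
        });

    private readonly IDatabaseService _dbService; // Hypothetical, as above

    public LoadSheddingReader(IDatabaseService dbService)
    {
        _dbService = dbService;
    }

    public async Task<DataObject> GetNonCriticalDataAsync(string key)
    {
        using RateLimitLease lease = await Limiter.AcquireAsync();
        if (!lease.IsAcquired)
        {
            // Shed the request so core traffic keeps flowing; callers must tolerate null.
            return null;
        }
        return await _dbService.FetchHotDataAsync(key);
    }
}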
Never Expire Critical Hot Keys (or Use Very Long TTLs): For data that changes infrequently but is accessed very often, set a very long TTL or no expiry, relying on explicit cache invalidation when the source data changes.
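A minimal sketch of the write path that makes this safe, reusing the _cache and _dbService fields from the DataService above; SaveHotDataAsync is a hypothetical persistence method:
public async Task UpdateHotDataAsync(string key, DataObject newValue)
{
    await _dbService.SaveHotDataAsync(key, newValue); // Hypothetical update method
    // Write-through: overwrite the entry so readers never see stale data.
    _cache.Set(key, newValue, TimeSpan.FromDays(365)); // Effectively "never expires"
    // Or invalidate and let the next read repopulate:
    // _cache.Remove(key);
}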
Cache Null Values: If a query to the database for a specific key returns no data, store a special “null” or “sentinel” value in the cache for that key with a short TTL.
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public class ProductService
{
    private readonly IMemoryCache _cache;
    private readonly IProductRepository _repository;

    private const string NULL_SENTINEL = "##NULL##"; // Or a dedicated object instance
    private static readonly TimeSpan SHORT_TTL_FOR_NULL = TimeSpan.FromMinutes(5);
    private static readonly TimeSpan REGULAR_TTL = TimeSpan.FromHours(1);

    public ProductService(IMemoryCache cache, IProductRepository repository)
    {
        _cache = cache;
        _repository = repository;
    }

    public async Task<Product> GetProductByIdAsync(string productId)
    {
        if (_cache.TryGetValue(productId, out object cachedValue))
        {
            // A sentinel hit means "known missing": short-circuit without touching the DB.
            return cachedValue as string == NULL_SENTINEL ? null : (Product)cachedValue;
        }
        var product = await _repository.GetByIdAsync(productId);
        if (product == null)
        {
            _cache.Set(productId, NULL_SENTINEL, SHORT_TTL_FOR_NULL);
            return null;
        }
        _cache.Set(productId, product, REGULAR_TTL);
        return product;
    }
}

// public interface IProductRepository { /* ... */ }
// public class Product { /* ... */ }
Bloom Filters: A Bloom filter is a space-efficient probabilistic data structure: it can tell you a key definitely does not exist (no false negatives), though it may occasionally report a false positive. Before querying the cache or database, check the Bloom filter. If it indicates the key definitely doesn’t exist, skip the lookup entirely. (Implementing a full Bloom filter is beyond a simple snippet; you’d typically use a library. A sketch of the read path follows the conceptual usage below.)
Conceptual usage:
// Assuming 'bloomFilter' is an initialized Bloom filter instance:
// if (!bloomFilter.MightContain(requestedKey))
// {
//     return null; // Key definitely not in the DB
// }
// // Otherwise, proceed to check the cache, then the DB
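A slightly fuller sketch of that read path, assuming a hypothetical IBloomFilter abstraction pre-populated with every valid key (in practice backed by a library or by RedisBloom):
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Memory;

public interface IBloomFilter
{
    bool MightContain(string key); // False means "definitely absent"; true means "possibly present"
    void Add(string key);
}

public class FilteredProductService
{
    private readonly IBloomFilter _bloomFilter; // Pre-populated with all known product IDs
    private readonly IMemoryCache _cache;
    private readonly IProductRepository _repository;

    public FilteredProductService(IBloomFilter bloomFilter, IMemoryCache cache, IProductRepository repository)
    {
        _bloomFilter = bloomFilter;
        _cache = cache;
        _repository = repository;
    }

    public async Task<Product> GetProductByIdAsync(string productId)
    {
        // Definitely-absent keys never touch the cache or the database.
        if (!_bloomFilter.MightContain(productId))
        {
            return null;
        }
        if (_cache.TryGetValue(productId, out Product cached))
        {
            return cached;
        }
        var product = await _repository.GetByIdAsync(productId);
        if (product != null)
        {
            _cache.Set(productId, product, TimeSpan.FromHours(1));
        }
        return product;
    }
}
Because false positives are possible, a nonexistent key will occasionally slip past the filter, so pairing this with null-value caching still pays off.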
Circuit Breaker Pattern: Libraries like Polly for .NET can implement circuit breakers. If the cache service is down, the circuit breaker “opens,” and the application stops trying to contact the cache for a cooldown period.
// Using the Polly library (via the Microsoft.Extensions.Http.Polly package)
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly.CircuitBreaker;
using Polly.Extensions.Http;

// In Program.cs or Startup.cs, configure the named HttpClient:
// services.AddHttpClient("CacheClient")
//     .AddPolicyHandler(
//         HttpPolicyExtensions
//             .HandleTransientHttpError() // Or specific exceptions from your cache client
//             .CircuitBreakerAsync(
//                 handledEventsAllowedBeforeBreaking: 3,
//                 durationOfBreak: TimeSpan.FromSeconds(30)));

// In your service:
public class MyServiceUsingCache
{
    private readonly HttpClient _cacheClient;

    public MyServiceUsingCache(IHttpClientFactory clientFactory)
    {
        _cacheClient = clientFactory.CreateClient("CacheClient");
    }

    public async Task<string> GetDataFromCacheOrSource(string key)
    {
        try
        {
            // Simplified representation; a real cache client would be used here.
            // Once the circuit breaker is open, the call throws BrokenCircuitException immediately.
            string cachedData = await AttemptGetFromCacheAsync(key);
            if (cachedData != null) return cachedData;
        }
        catch (BrokenCircuitException)
        {
            // Log: circuit is open, falling back to source for this key
            return await GetDataFromPrimarySourceAsync(key);
        }
        catch (HttpRequestException ex) // Or a custom exception from your cache client
        {
            // Log: cache access error, falling back to source. Error: {ex.Message}
            return await GetDataFromPrimarySourceAsync(key);
        }

        // Cache miss with no error: fetch from the source and repopulate.
        var sourceData = await GetDataFromPrimarySourceAsync(key);
        await AttemptSetToCacheAsync(key, sourceData); // The set also flows through the circuit breaker
        return sourceData;
    }

    private async Task<string> AttemptGetFromCacheAsync(string key) { /* ... actual cache get via _cacheClient ... */ return null; }
    private async Task AttemptSetToCacheAsync(string key, string value) { /* ... actual cache set ... */ }
    private async Task<string> GetDataFromPrimarySourceAsync(string key) { /* ... get from DB ... */ return "data_from_db"; }
}
High Availability Cache Cluster: Deploy your cache (e.g., Redis) in a clustered configuration with replication and automatic failover (like Redis Sentinel or Redis Cluster). This is a configuration aspect of the cache system itself.
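For example, with StackExchange.Redis the client can connect through Sentinel and follow the master across automatic failovers; the host names and the "mymaster" service name below are illustrative assumptions:
using System.Threading.Tasks;
using StackExchange.Redis;

public static class RedisConnection
{
    public static async Task<IDatabase> ConnectAsync()
    {
        // The sentinels report the current master; the client reconnects on failover.
        var muxer = await ConnectionMultiplexer.ConnectAsync(
            "sentinel1:26379,sentinel2:26379,sentinel3:26379,serviceName=mymaster");
        return muxer.GetDatabase();
    }
}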
Graceful Degradation: Design your application to keep functioning, perhaps with reduced performance or features, even if the cache is unavailable, for example by serving slightly stale or generic data when the fresh path depends on a cache that is down.
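As a small sketch of that idea, a non-essential feature falls back to a generic result instead of failing the whole request; IRecommendationCache and the fallback list are illustrative assumptions:
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class RecommendationService
{
    private readonly IRecommendationCache _recommendationCache; // Hypothetical cache client

    public RecommendationService(IRecommendationCache recommendationCache)
    {
        _recommendationCache = recommendationCache;
    }

    public async Task<IReadOnlyList<string>> GetRecommendationsAsync(string userId)
    {
        try
        {
            var personalized = await _recommendationCache.GetAsync(userId);
            if (personalized != null) return personalized;
        }
        catch (Exception)
        {
            // Cache unreachable: degrade to a generic list rather than fail the page.
        }
        return new[] { "bestseller-1", "bestseller-2", "bestseller-3" };
    }
}

// public interface IRecommendationCache { /* ... */ }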
Caches have finite memory. When a cache is full and a new item needs to be added, an existing item must be evicted. The choice of eviction policy, such as LRU (least recently used), LFU (least frequently used), or FIFO (first in, first out), significantly impacts cache hit rates.
Choosing the right eviction policy depends heavily on your application’s specific data access patterns, and many caching libraries let you configure these policies, as the sketch below shows.
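A minimal sketch with .NET’s MemoryCache: once SizeLimit is set, every entry must declare a Size, and Priority influences what gets compacted first; the numbers are illustrative.
using System;
using Microsoft.Extensions.Caching.Memory;

var cache = new MemoryCache(new MemoryCacheOptions
{
    SizeLimit = 1024 // Total budget in "units" you define (entries, bytes, etc.)
});

cache.Set("hot:config", "value", new MemoryCacheEntryOptions
{
    Size = 1,                                 // This entry's cost against SizeLimit
    Priority = CacheItemPriority.NeverRemove, // Pin truly critical entries
    SlidingExpiration = TimeSpan.FromMinutes(30)
});

// For Redis, the equivalent knobs are server-side, e.g. in redis.conf:
//   maxmemory 2gb
//   maxmemory-policy allkeys-lru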
It’s crucial to remember that caching isn’t monolithic. Caching occurs at multiple levels in a typical system: client-side browser caches, application-level in-memory caches, distributed stores like Redis, and the database’s own buffer pools.
A holistic caching strategy considers how these layers interact.
Caching is an indispensable tool for building high-performance, scalable applications. However, it introduces its own complexities. By understanding common pitfalls and implementing appropriate solutions and eviction strategies, you can harness the true power of caching.
Proactive monitoring of cache hit/miss rates, latency, and eviction counts, combined with a willingness to adapt your strategy based on observed access patterns, will ensure your cache layers contribute positively to your system’s health and user experience.
WRITTEN BY
Thuso Kharibe