Solving Race Conditions in Distributed Cache Systems with External Data Sources
Recently, I encountered a critical concurrency issue in an API that integrates with external data platforms: duplicate records in the Redis cache, caused by race conditions around expensive third-party API calls.
Problem Context
Our dashboard system provides analytics by fetching data from an external business intelligence platform. The process involves:
- Receiving client requests for specific date ranges and customer filters
- Checking Redis cache for existing aggregated data
- If cache miss: querying the external BI platform API (expensive operation)
- Processing and storing results in Redis for future requests
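The steps above are a classic cache-aside read path. Here is a minimal sketch of that flow; an in-memory `Map` stands in for Redis so the example is self-contained, and `buildCacheKey` and `fetchFromBiPlatform` are hypothetical names, not the original system's API:

```typescript
type Params = { startDate: string; endDate: string; customer: string };

const cache = new Map<string, string>(); // stand-in for Redis

// Hypothetical key scheme: one entry per parameter combination.
function buildCacheKey(p: Params): string {
  return `analytics:${p.startDate}:${p.endDate}:${p.customer}`;
}

// Stand-in for the expensive external BI platform call (2-5 s in production).
async function fetchFromBiPlatform(p: Params): Promise<string> {
  return JSON.stringify({ rows: 42, ...p });
}

async function getAnalytics(p: Params): Promise<string> {
  const key = buildCacheKey(p);
  const hit = cache.get(key);
  if (hit !== undefined) return hit;          // cache hit: fast path
  const data = await fetchFromBiPlatform(p);  // cache miss: expensive call
  cache.set(key, data);                       // populate for future requests
  return data;
}
```

This flow is correct for a single request at a time; the problem described next only appears under concurrency.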
After opening the API to public use, we found that multiple concurrent requests with identical parameters were triggering duplicate external API calls and creating conflicting cache entries.
Problem Analysis
The expected flow was straightforward:
- Check if processed data exists in cache
- If missing: fetch from external platform and populate cache
- If present: return cached results
However, the actual behavior revealed a classic race condition:
const hasResultCount = await query.return.count();
if (hasResultCount === 0) {
  // Multiple requests pass this check simultaneously
  // Each triggers an expensive external API call
  await this.externalDataProvider.execute(params);
}
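The check-then-act window can be reproduced without Redis at all: any `await` between the existence check and the cache write lets a second caller slip through. A minimal simulation, with an artificial delay standing in for the slow external call:

```typescript
const store = new Map<string, string>();
let externalCalls = 0;

// Stand-in for the expensive external API; the await inside it yields
// control, which is exactly what opens the check-then-act window.
async function fetchExpensive(key: string): Promise<string> {
  externalCalls++;
  await new Promise(r => setTimeout(r, 10)); // simulated latency
  return `data-for-${key}`;
}

async function naiveGet(key: string): Promise<string> {
  const hit = store.get(key);
  if (hit !== undefined) return hit;
  const data = await fetchExpensive(key); // both callers reach this line
  store.set(key, data);                   // ...because neither has set yet
  return data;
}

async function demo(): Promise<number> {
  await Promise.all([naiveGet("q1"), naiveGet("q1")]); // identical params
  return externalCalls; // 2: the race doubled the expensive call
}
```

Both callers observe an empty store before either has a chance to populate it, so the expensive fetch runs twice.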
Business Impact
This race condition had significant implications:
- Cost: Each external API call had associated costs
- Performance: External queries took 2–5 seconds vs sub-100ms cache hits
- Rate Limits: Risk of hitting third-party API throttling
- Data Consistency: Duplicate processing created inconsistent aggregations
Root Cause
Multiple requests were checking cache existence simultaneously, before any had completed the external data fetching and cache population. This resulted in:
- All requests executing expensive external API calls
- Multiple processing of the same dataset
- Duplicate cache entries with potential inconsistencies
- Unnecessary costs and performance degradation
Solution Implementation
I implemented a distributed lock mechanism using Redis atomic operations to serialize access to external data fetching:
if (hasResultCount === 0) {
  // Create unique lock key based on query parameters
  const lockKey = `lock:external_data:${startDate}:${endDate}:${customer}`;

  // Atomic lock acquisition using Redis SET NX EX
  const lockAcquired = await redis.set(lockKey, '1', {
    NX: true, // Only set if key doesn't exist
    EX: 60,   // Auto-expire after 60 seconds
  });

  if (lockAcquired) {
    try {
      // Double-check pattern after acquiring lock
      const recheckCount = await query.return.count();
      if (recheckCount === 0) {
        // Only one request executes the expensive external call
        await this.externalDataProvider.execute(params);
      }
    } finally {
      await redis.del(lockKey);
    }
  } else {
    // Other requests wait briefly, then fall through to re-check the cache
    await new Promise(resolve => setTimeout(resolve, 100));
  }
}
Technical Decisions
Lock Granularity: Used parameter-specific locks to allow concurrent processing of different datasets while preventing duplication of identical requests.
External API Protection: Ensured only one request per parameter combination hits the external platform, regardless of concurrent load.
Timeout Strategy: Set 60-second lock expiry to handle cases where external API calls might timeout or fail.
Fallback Mechanism: A non-blocking approach: when lock acquisition fails, the request waits briefly and then re-checks the cache, eventually reading the data populated by the lock holder.
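That fallback can be generalized from a single fixed wait into a small bounded-retry loop. This is a sketch, not the original implementation; `readCache` and the retry parameters are illustrative:

```typescript
// Bounded polling for requests that fail to acquire the lock: retry the
// cache check a few times, backing off linearly between attempts.
async function waitForCache<T>(
  readCache: () => Promise<T | undefined>,
  attempts = 5,
  delayMs = 100,
): Promise<T | undefined> {
  for (let i = 0; i < attempts; i++) {
    const value = await readCache();
    if (value !== undefined) return value; // lock holder populated the cache
    await new Promise(r => setTimeout(r, delayMs * (i + 1)));
  }
  return undefined; // caller decides: fail the request or fetch itself
}
```

Returning `undefined` after the retry budget is exhausted keeps the policy decision (error out, or fall back to a direct fetch) with the caller.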
Architecture Considerations
The solution addresses the distributed nature of the problem:
- Horizontal Scaling: Works across multiple application instances
- Fault Tolerance: Auto-expiring locks prevent permanent deadlocks
- Cost Optimization: Dramatically reduces unnecessary external API calls
- Performance: Maintains low latency for cache hits while controlling expensive operations
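One caveat with auto-expiring locks: if the external call outlives the 60-second TTL, the lock can expire and be re-acquired by another instance, and a plain `DEL` in the `finally` block would then release someone else's lock. A common hardening (not shown in the code above) is to store a random token at acquisition and delete only on match; against real Redis the check-and-delete must be a small Lua script so it runs atomically. A self-contained sketch, again with a `Map` standing in for Redis:

```typescript
import { randomUUID } from "node:crypto";

const locks = new Map<string, string>(); // stand-in for Redis keys

// SET key token NX: acquire only if absent; returns the token on success.
function acquire(key: string): string | null {
  if (locks.has(key)) return null;
  const token = randomUUID();
  locks.set(key, token);
  return token;
}

// Check-and-delete: release only if we still hold the lock. Against real
// Redis this comparison + delete must be one atomic step (a Lua script),
// otherwise the same expiry race reappears between GET and DEL.
function release(key: string, token: string): boolean {
  if (locks.get(key) !== token) return false; // another holder owns it now
  locks.delete(key);
  return true;
}
```

With this pattern, a lock holder whose lock has already expired and been taken over simply fails the release, instead of silently unlocking a competitor's critical section.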
Results and Impact
The implementation delivered significant improvements:
- Cost Reduction: 80% decrease in external API calls
- Performance: Maintained sub-100ms response times for cached data
- Reliability: Eliminated data inconsistencies from duplicate processing
- Scalability: System now handles 10x concurrent load without external API stress
Lessons Learned
When integrating with external data sources in distributed systems, race conditions can have amplified business impact beyond simple data duplication. The cost and latency of external operations make proper synchronization critical.
Redis atomic operations provide an elegant solution for distributed coordination, especially when protecting expensive external resources. The key insight was recognizing that the lock granularity should match the external resource's cost structure — in this case, per-query-parameter combinations.
This experience reinforced that understanding the full data flow, including external dependencies and their constraints, is essential for designing robust caching strategies in distributed architectures.