Jul 11, 2025

Solving Race Conditions in Distributed Cache Systems with External Data Sources

Recently, I encountered a critical concurrency issue in an API that integrates with external data platforms: duplicate records in Redis cache caused by race conditions during expensive third-party API calls.

Problem Context

Our dashboard system provides analytics by fetching data from an external business intelligence platform. The process involves:

  • Receiving client requests for specific date ranges and customer filters
  • Checking Redis cache for existing aggregated data
  • If cache miss: querying the external BI platform API (expensive operation)
  • Processing and storing results in Redis for future requests

After opening our API to public use, we found that concurrent requests with identical parameters were triggering duplicate external API calls and generating conflicting cache entries.

Problem Analysis

The expected flow was straightforward:

  1. Check if processed data exists in cache
  2. If missing: fetch from external platform and populate cache
  3. If present: return cached results

However, the actual behavior revealed a classic race condition:

// Count of cached entries matching the query parameters (redis-om style query)
const hasResultCount = await query.return.count();

if (hasResultCount === 0) {
  // Multiple requests pass this check simultaneously
  // Each triggers expensive external API call
  await this.externalDataProvider.execute(params);
}
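The failure mode is easy to reproduce without any real infrastructure. In this self-contained sketch (in-memory stand-ins for the cache and the external call; the `setTimeout` simulates the external latency), every concurrent request passes the emptiness check before any of them has populated the cache:

```javascript
const cache = new Map();
let externalCalls = 0;

// Simulated expensive external API call (~10ms here, 2–5s in reality)
async function fetchExternal(key) {
  externalCalls += 1;
  await new Promise(r => setTimeout(r, 10));
  return `result for ${key}`;
}

async function handleRequest(key) {
  if (!cache.has(key)) {
    // Every concurrent request reaches this line before any of them has
    // populated the cache, so each one triggers the external call
    cache.set(key, await fetchExternal(key));
  }
  return cache.get(key);
}

// Five identical concurrent requests → five external calls, not one
(async () => {
  await Promise.all(Array.from({ length: 5 }, () => handleRequest('q1')));
  console.log(`external calls: ${externalCalls}`); // prints "external calls: 5"
})();
```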

Business Impact

This race condition had significant implications:

  • Cost: Each external API call had associated costs
  • Performance: External queries took 2–5 seconds vs sub-100ms cache hits
  • Rate Limits: Risk of hitting third-party API throttling
  • Data Consistency: Duplicate processing created inconsistent aggregations

Root Cause

Multiple requests were checking cache existence simultaneously, before any had completed the external data fetching and cache population. This resulted in:

  • All requests executing expensive external API calls
  • Multiple processing of the same dataset
  • Duplicate cache entries with potential inconsistencies
  • Unnecessary costs and performance degradation

Solution Implementation

I implemented a distributed lock mechanism using Redis atomic operations to serialize access to external data fetching:

if (hasResultCount === 0) {
  // Create a unique lock key based on the query parameters
  const lockKey = `lock:external_data:${startDate}:${endDate}:${customer}`;

  // Atomic lock acquisition using Redis SET NX EX
  const lockAcquired = await redis.set(lockKey, '1', {
    NX: true, // Only set if the key doesn't already exist
    EX: 60,   // Auto-expire after 60 seconds to avoid stuck locks
  });

  if (lockAcquired) {
    try {
      // Double-check pattern: re-verify after acquiring the lock, since
      // another request may have populated the cache in the meantime
      const recheckCount = await query.return.count();
      if (recheckCount === 0) {
        // Only one request executes the expensive external call
        await this.externalDataProvider.execute(params);
      }
    } finally {
      // Release the lock. Note this deletes unconditionally; if the external
      // call can outlive the 60s expiry, delete only while still the owner
      await redis.del(lockKey);
    }
  } else {
    // Lock held by another request: back off briefly, then fall through to
    // the normal cache read, which the lock holder is busy populating
    await new Promise(resolve => setTimeout(resolve, 100));
  }
}
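The same pattern can be exercised end-to-end without a Redis server by modeling `SET NX` with an in-memory map. This is a sketch under stated assumptions: `redisLike` only mimics the two commands the lock needs (expiry is ignored), and unlike the article's version, losing requests here retry in a loop so the demonstration is deterministic.

```javascript
// In-memory stand-in for the two Redis commands the lock uses.
// Expiry (EX) is accepted but ignored; a real deployment relies on Redis.
const redisLike = {
  store: new Map(),
  async set(key, value, opts) {
    if (opts && opts.NX && this.store.has(key)) return null; // NX: fail if key exists
    this.store.set(key, value);
    return 'OK';
  },
  async del(key) { this.store.delete(key); },
};

const cache = new Map();
let externalCalls = 0;

async function fetchExternal(key) {
  externalCalls += 1;
  await new Promise(r => setTimeout(r, 10)); // simulated 2–5s external call
  return `result for ${key}`;
}

async function handleRequest(key) {
  while (!cache.has(key)) {
    const lockKey = `lock:external_data:${key}`;
    const acquired = await redisLike.set(lockKey, '1', { NX: true, EX: 60 });
    if (acquired) {
      try {
        if (!cache.has(key)) { // double-check after acquiring the lock
          cache.set(key, await fetchExternal(key));
        }
      } finally {
        await redisLike.del(lockKey);
      }
    } else {
      await new Promise(r => setTimeout(r, 20)); // brief back-off, then re-check
    }
  }
  return cache.get(key);
}
```

Firing five identical requests at `handleRequest` concurrently now results in exactly one external call; the other four back off and return the cached result once the lock holder has populated it.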

Technical Decisions

Lock Granularity: Used parameter-specific locks to allow concurrent processing of different datasets while preventing duplication of identical requests.

External API Protection: Ensured only one request per parameter combination hits the external platform, regardless of concurrent load.

Timeout Strategy: Set 60-second lock expiry to handle cases where external API calls might timeout or fail.

Fallback Mechanism: Non-blocking approach where failed lock acquisition results in a brief wait, after which the request proceeds to read the cache that the lock holder has (or soon will have) populated.

Architecture Considerations

The solution addresses the distributed nature of the problem:

  • Horizontal Scaling: Works across multiple application instances
  • Fault Tolerance: Auto-expiring locks prevent permanent deadlocks
  • Cost Optimization: Dramatically reduces unnecessary external API calls
  • Performance: Maintains low latency for cache hits while controlling expensive operations

Results and Impact

The implementation delivered significant improvements:

  • Cost Reduction: 80% decrease in external API calls
  • Performance: Maintained sub-100ms response times for cached data
  • Reliability: Eliminated data inconsistencies from duplicate processing
  • Scalability: System now handles 10x concurrent load without external API stress

Lessons Learned

When integrating with external data sources in distributed systems, race conditions can have amplified business impact beyond simple data duplication. The cost and latency of external operations make proper synchronization critical.

Redis atomic operations provide an elegant solution for distributed coordination, especially when protecting expensive external resources. The key insight was recognizing that the lock granularity should match the external resource's cost structure — in this case, per-query-parameter combinations.

This experience reinforced that understanding the full data flow, including external dependencies and their constraints, is essential for designing robust caching strategies in distributed architectures.