6.3.4 Preventing race conditions
As you saw when building locks in section 6.2, dealing with race conditions that cause
retries or data corruption can be difficult. In this case, the semaphores that we created
have race conditions that we alluded to earlier, which can cause incorrect operation.
We can see the problem in the following example. If we have two processes A and
B that are trying to get one remaining semaphore, and A increments the counter first
but B adds its identifier to the ZSETs and checks its identifier’s rank first, then B will
get the semaphore. When A then adds its identifier and checks its rank, it’ll “steal” the
semaphore from B, but B won’t know until it tries to release or renew the semaphore.
When we were using the system clock as a way of getting a lock, the likelihood of
this kind of a race condition coming up and resulting in more than the desired number
of semaphore owners was related to the difference in system times—the greater
the difference, the greater the likelihood. After introducing the counter with the
owner ZSET, this problem became less likely (just by virtue of removing the system
clock as a variable), but because we have multiple round trips, it’s still possible.
To fully handle all possible race conditions for semaphores in Redis, we need to
reuse the earlier distributed lock with timeouts that we built in section 6.2.5. We need
to use our earlier lock to help build a correct counting semaphore. Overall, to acquire
the semaphore, we’ll first try to acquire the lock for the semaphore with a short timeout.
If we got the lock, we then perform our normal semaphore acquire operations
with the counter, owner ZSET, and the system time ZSET. If we failed to acquire the
lock, then we say that we also failed to acquire the semaphore. The code for performing
this operation is shown next.
I know, it can be disappointing to come so far only to end up needing to use a lock at
the end. But that’s the thing with Redis: there are usually a few ways to solve the same
or a similar problem, each with different trade-offs. Here are some of the trade-offs
for the different counting semaphores that we’ve built:
- If you’re happy with using the system clock, never need to refresh the semaphore,
and are okay with occasionally going over the limit, then you can use the
first semaphore we created.
- If you can only really trust system clocks to be within 1 or 2 seconds, but are still
okay with occasionally going over your semaphore limit, then you can use the
- If you need your semaphores to be correct every single time, then you can use a
lock to guarantee correctness.
Now that we’ve used our lock from section 6.2 to help us fix the race condition, we
have varying options for how strict we want to be with our semaphore limits. Generally it’s a good idea to stick with the last, strictest version. Not only is the last semaphore
actually correct, but whatever time we may save using a simpler semaphore, we could lose by using too many resources.
In this section, we used semaphores to limit the number of API calls that can be running
at any one time. Another common use case is to limit concurrent requests to databases
to reduce individual query times and to prevent dogpiling like we talked about at
the end of section 6.2.4. One other common situation is when we’re trying to download
many web pages from a server, but their robots.txt says that we can only make (for example)
three requests at a time. If we have many clients downloading web pages, we can
use a semaphore to ensure that we aren’t pushing a given server too hard.
As we finish with building locks and semaphores to help improve performance for
concurrent execution, it’s now time to talk about using them in more situations. In
the next section, we’ll build two different types of task queues for delayed and concurrent