In this article, I want to talk to you about the “safety” of Redis distributed locks.

So much has already been written about Redis distributed locks. Why write another one?

Because I found that 99% of the articles out there never really get to the bottom of the issue. Many readers work through article after article and are still in a fog. For example, can you clearly answer the following questions?

How do you implement a distributed lock on top of Redis?

Is a Redis distributed lock really safe?

What is wrong with Redlock? Is it really safe?

The industry has been arguing about Redlock. What exactly is the argument about, and which side is right?

Should distributed locks be built on Redis or Zookeeper?

What issues need to be considered when implementing a “fault-tolerant” distributed lock?

In this article, I will clarify these issues thoroughly.

After reading this article, you will not only have a thorough understanding of distributed locks, but also have a deeper understanding of “distributed systems”.

The article is a bit long, but it is packed with substance. I hope you can read it patiently.

Why do you need distributed locks?

Before we start talking about distributed locks, it is worth briefly explaining why we need them.

The counterpart of the distributed lock is the “single-machine lock”. When we write multi-threaded programs, to avoid data problems caused by multiple threads operating a shared variable at the same time, we usually use a lock for “mutual exclusion” to ensure the correctness of the shared variable. Such a lock only works within the “same process”.
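As a quick in-process illustration (a minimal Python sketch; the counter and thread count are arbitrary):

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    for _ in range(100_000):
        with lock:  # mutual exclusion within a single process
            counter += 1

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # always 400000; without the lock, updates can be lost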

But what if it is multiple processes that need to operate a shared resource at the same time? How do we achieve mutual exclusion then?

For example, business applications today usually adopt a micro-service architecture, which means one application runs as multiple processes. If these processes all need to modify the same row of records in MySQL, then to avoid data errors caused by out-of-order operations, we need to introduce a “distributed lock”.

To implement a distributed lock, an external system must be used, and every process goes to that system to apply for the “lock”.

This external system must provide “mutual exclusion”: if two requests arrive at the same time, only one process gets a success response; the other fails (or waits).

This external system can be MySQL, Redis, or Zookeeper, but for better performance we usually choose Redis or Zookeeper.

Next, taking Redis as the main thread, I will go from shallow to deep and analyze the various “safety” issues of distributed locks, to help you understand them thoroughly.

How to implement a distributed lock?

Let’s start with the simplest.

To implement a distributed lock, Redis must provide “mutual exclusion”. We can use the SETNX command, which stands for SET if Not eXists: if the key does not exist, set its value; otherwise, do nothing.

Two client processes can race on this command; whichever succeeds holds the lock, which gives us mutual exclusion and thus a distributed lock.

Client 1 applies for the lock and succeeds:

127.0.0.1:6379> SETNX lock 1
(integer) 1 // Client 1 locked successfully

Client 2 applies for the lock later and fails:

127.0.0.1:6379> SETNX lock 1
(integer) 0 // Client 2 failed to lock

At this point, the client holding the lock can operate the “shared resource”: modify a row of MySQL data, for example, or call an API.

After the operation is complete, the lock should be released promptly to give latecomers a chance to operate the shared resource. How do we release it?

It is also very simple, just use the DEL command to delete the key directly:

127.0.0.1:6379> DEL lock // release the lock
(integer) 1

However, this has a big problem: after client 1 acquires the lock, a “deadlock” results if either of the following happens:

The program hits an exception while handling the business logic and never releases the lock

The process crashes and has no chance to release the lock

In either case, the client occupies the lock forever, and no other client can “ever” acquire it.

How to solve this problem?

How to avoid deadlock?

The solution that comes to mind easily is to set a “lease” on the lock when acquiring it.

In Redis, that means giving the key an “expiration time”. Assuming the operation on the shared resource never takes more than 10s, we set the key to expire in 10s when locking:

127.0.0.1:6379> SETNX lock 1 // lock
(integer) 1

127.0.0.1:6379> EXPIRE lock 10 // automatically expires after 10s
(integer) 1

In this way, no matter what goes wrong on the client, the lock is “automatically released” after 10s and other clients can still acquire it.

But is this really okay?

There are still problems.

As written, locking and setting the expiration are two separate commands. What if only the first command executes and the second never gets the chance? For example:

SETNX succeeds, but EXPIRE fails due to a network problem

SETNX succeeds, but Redis crashes before EXPIRE can run

SETNX succeeds, but the client crashes before EXPIRE can run

In short, these two commands are not guaranteed to be atomic (to succeed or fail together). If setting the expiration fails, the “deadlock” problem reappears.

So what can we do?

Before Redis 2.6.12, we had to do our best to make SETNX and EXPIRE execute atomically, and also think about how to handle every kind of exception.

Since Redis 2.6.12, however, the SET command has been extended with new options, and a single command suffices:

// one command, executed atomically
127.0.0.1:6379> SET lock 1 EX 10 NX
OK

This solves the deadlock problem, and it is simple enough.
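For reference, a minimal client-side sketch using the redis-py library (the host, key name, and timeout here are illustrative assumptions):

import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# SET lock 1 EX 10 NX -- acquire the lock and set the expiry atomically.
# Returns True on success, None if someone else holds the lock.
if r.set("lock", 1, ex=10, nx=True):
    try:
        pass  # operate the shared resource here
    finally:
        r.delete("lock")  # naive release -- improved later in the article
else:
    print("lock is held by another client")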

Now let’s analyze further: what problems remain?

Consider such a scenario:

Client 1 is successfully locked and starts to operate shared resources

The time for client 1 to operate the shared resource “exceeds” the expiration time of the lock, and the lock is “automatically released”

Client 2 locks successfully and starts to operate shared resources

Client 1 finishes operating the shared resource and releases the lock (but what it releases is client 2’s lock)

See, there are two serious problems here:

Lock expired early: client 1 took too long operating the shared resource, so the lock was automatically released and then acquired by client 2

Releasing someone else’s lock: after finishing its work, client 1 released the lock that client 2 was holding

What is the cause of these two problems? Let’s look at them one by one.

The first problem is likely caused by an inaccurate estimate of how long operating the shared resource takes.

For example, the “slowest” run of the operation may take 15s, but we set the lock to expire in only 10s, so there is a risk of it expiring prematurely.

If the expiration time is too short, can we just add some headroom, say set it to 20s, and be done with it?

This can indeed “mitigate” the problem and reduce the probability of problems, but it still cannot “completely solve” the problem.

Why?

The reason is that after the client acquires the lock, it may hit all kinds of complications while operating the shared resource: an exception inside the program, a network request timing out, and so on.

Since the expiration time is an “estimate”, it can only be a rough figure; unless you can predict and cover every scenario that makes the operation run longer, which is very difficult in practice, the risk remains.

Is there any better solution?

Don’t worry, about this problem, I will explain the corresponding solution in detail later.

Let’s move on to the second question.

The second problem is that one client releases locks held by other clients.

Think about it, what is the key point that leads to this problem?

The key point is that each client releases the lock “blindly”, without checking whether the lock is “still held by itself”, so it risks releasing someone else’s lock. Such an unlocking process is not “rigorous” at all!

How to solve this problem?

What if the lock is released by someone else?

The solution is: when locking, the client sets the value to a “unique identifier” that only it knows.

For example, it can be its own thread ID, or a UUID (random and unique). Let’s take a UUID as the example:

// set the lock’s VALUE to a UUID
127.0.0.1:6379> SET lock $uuid EX 20 NX
OK

Here we assume that 20s is plenty of time for operating the shared resource, and set aside the automatic-expiry problem for now.

After that, when releasing the lock, the client must first check whether the lock still belongs to it. The pseudo-code looks like this:

// release the lock only if it is our own
if redis.get("lock") == $uuid:
    redis.del("lock")

But here the lock is released with two commands, GET + DEL, and we run into the atomicity problem again:

Client 1 executes GET and confirms the lock is its own

Before client 1 can act, the lock expires and client 2 executes SET, acquiring the lock (the probability is low, but a lock’s safety model must account for it)

Client 1 executes DEL, releasing client 2’s lock

It can be seen that these two commands still have to be executed atomically.

How do we make it atomic? With a Lua script.

We can write this logic as a Lua script and let Redis execute it.

Because Redis processes every request on a “single thread”, while a Lua script is executing all other requests must wait for it to finish, so no other command can slip in between the GET and the DEL.

The Lua script to safely release the lock is as follows:

// release the lock only if it is our own
if redis.call("GET", KEYS[1]) == ARGV[1]
then
    return redis.call("DEL", KEYS[1])
else
    return 0
end

Well, with this, the whole locking and unlocking process is much more “rigorous”.

Let’s pause and summarize. A rigorous Redis-based distributed-lock flow looks like this (a client-side sketch follows the list):

Lock: SET lock_key $unique_id EX $expire_time NX

Manipulate shared resources

Release lock: Lua script, first GET to determine whether the lock belongs to itself, and then DEL to release the lock
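Putting the three steps together, here is a minimal sketch with redis-py (the key name and timings are illustrative assumptions):

import uuid
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

# release only if the lock still holds our unique id (the Lua script above)
RELEASE_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1]
then
    return redis.call("DEL", KEYS[1])
else
    return 0
end
"""
release = r.register_script(RELEASE_SCRIPT)

unique_id = str(uuid.uuid4())

# 1. Lock: SET lock_key $unique_id EX $expire_time NX
if r.set("lock_key", unique_id, ex=20, nx=True):
    try:
        pass  # 2. operate the shared resource
    finally:
        # 3. Release: GET + DEL executed atomically via Lua
        release(keys=["lock_key"], args=[unique_id])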

Well, with this complete lock model, let’s go back to the first question mentioned earlier.

What should I do if the lock expiration time is not well evaluated?

As mentioned earlier, if the lock’s expiration time is misjudged, the lock risks expiring “early”.

The stopgap given then was to pad the expiration time with some redundancy, reducing the probability of early expiry.

In fact, this solution does not solve the problem perfectly, so what should we do?

Can we design a scheme like this: when locking, set an initial expiration time, then start a “daemon thread” that periodically checks the lock’s remaining lifetime; if the lock is about to expire but the operation on the shared resource is not finished yet, automatically “renew” the lock by resetting its expiration time.

This is indeed a better solution.
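As a rough illustration of the idea (not Redisson’s actual implementation; names and timings are assumptions), a renewal thread might look like this. The Lua script extends the TTL only while the lock still holds our id:

import threading

# extend the TTL only if we still own the lock
RENEW_SCRIPT = """
if redis.call("GET", KEYS[1]) == ARGV[1]
then
    return redis.call("EXPIRE", KEYS[1], ARGV[2])
else
    return 0
end
"""

def start_watchdog(r, key, unique_id, ttl=20):
    renew = r.register_script(RENEW_SCRIPT)
    stop_event = threading.Event()

    def loop():
        # renew at one third of the TTL, so there are several chances
        while not stop_event.wait(ttl / 3):
            if renew(keys=[key], args=[unique_id, ttl]) == 0:
                break  # lock lost or already released -- stop renewing

    threading.Thread(target=loop, daemon=True).start()
    return stop_event  # the caller sets this after releasing the lock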

If you are on the Java stack, luckily there is a library that wraps up all of this work: Redisson.

Redisson is a Redis client SDK implemented in Java. For distributed locks it adopts exactly this “automatic renewal” scheme to avoid lock expiry, and the daemon thread is commonly called the “watchdog” thread.

In addition, this SDK also encapsulates many easy-to-use functions:

reentrant lock

optimistic locking

fair lock

Read-write lock

Redlock (described in detail below)

The API this SDK provides is very friendly: it lets you operate a distributed lock the same way as a local lock. If you are on the Java stack, you can use it directly.

I will not focus on Redisson’s usage here; the official GitHub explains it, and it is fairly simple.

Let’s summarize again: the Redis-based implementation so far, the problems encountered, and the corresponding solutions:

Deadlock: set expiration time

The expiration time is poorly evaluated, and the lock expires early: daemon thread, automatic renewal

The lock is released by someone else: write a unique identifier into the lock, and check the identifier before releasing

What other problem scenarios will endanger the security of Redis locks?

The scenarios analyzed so far all concern locking on a “single” Redis instance and do not touch on the Redis deployment architecture.

In practice, Redis is generally deployed as a master-slave cluster plus sentinels. The advantage is that when the master goes down, a sentinel performs “automatic failover”, promoting a slave to master so the service stays available.

But when a “master-slave switchover” occurs, is the distributed lock still safe?

Consider this scenario:

Client 1 executes the SET command on the master and acquires the lock

The master crashes before the SET command is replicated to the slave (master-slave replication is asynchronous)

The sentinel promotes the slave to the new master, on which the lock does not exist: the lock is lost!

As you can see, once Redis replication is introduced, the distributed lock may still be affected.

How do we solve this problem?

For this, the Redis author proposed a solution: the Redlock we often hear about.

Does it really solve the above problem?

Is Redlock really safe?

Well, we finally arrive at the main point of this article. What? All those problems above were just the basics?

Yes. Those were just appetizers; the real main course starts here.

If you don’t understand the content mentioned above, I suggest you read it again and clarify the basic process of locking and unlocking first.

If you already know something about Redlock, you can follow me here to review it again. If you don’t know Redlock, it’s okay, I will take you to know it again.

One reminder: I will not only cover the principle of Redlock, but also raise many questions about “distributed systems” along the way. You would do well to follow my train of thought and work out the answers in your own mind.

Now let’s see how the Redlock solution proposed by the author of Redis solves the problem of lock failure after master-slave switching.

Redlock’s solution is based on 2 premises:

There is no longer any need for slaves or sentinel instances; only masters are deployed

But multiple masters are required; the official recommendation is at least 5 instances

In other words, to use Redlock you deploy at least 5 Redis instances, all masters, with no relationship between them: completely isolated instances.

Note: this is not a Redis Cluster deployment, but 5 plain, independent Redis instances.

How exactly is Redlock used?

The overall process is as follows, divided into 5 steps:

The client first obtains the “current timestamp T1”

The client sends lock requests to the 5 Redis instances in turn (using the SET command above), setting a timeout on each request (at the millisecond level, much shorter than the lock’s validity time). If locking one instance fails (network timeout, the lock being held by someone else, etc.), it immediately moves on to the next instance

If the client locks successfully on >= 3 instances (the majority), it fetches “the current timestamp T2” again. If T2 - T1 < the lock’s expiration time, the client is considered to have locked successfully; otherwise locking has failed

After locking succeeds, operate the shared resource (for example, modify a row in MySQL, or initiate an API request)

If locking fails, the client sends release requests to “all nodes” (using the Lua release script above)

Let me briefly summarize for you, there are 4 key points:

The client applies for locks on multiple Redis instances

Must ensure that most nodes are successfully locked

The total time spent on locking most nodes is less than the expiration time set by the lock

To release the lock, a lock release request must be issued to all nodes

This may not be easy to absorb on a first read; it is worth going over the steps a few times to fix them in your memory.

Keep these 5 steps in mind, because the analysis below of the various scenarios that can break the lock follows this exact flow. A simplified sketch of the acquire step is shown next.
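Here is a much-simplified sketch of Redlock’s acquire logic (real implementations, such as the published redlock client libraries, also handle retries, clock-drift allowances, and id-checked release; the node addresses are assumptions):

import time
import uuid
import redis

NODES = [redis.Redis(host=h, port=6379, socket_timeout=0.05)  # per-request timeout
         for h in ("10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5")]

def redlock_release(key):
    for node in NODES:  # step 5: release on ALL nodes
        try:
            node.delete(key)  # simplified; real code deletes only if the id matches
        except redis.RedisError:
            pass

def redlock_acquire(key, ttl_ms=10_000):
    unique_id = str(uuid.uuid4())
    t1 = time.monotonic()  # step 1: timestamp T1
    locked = 0
    for node in NODES:  # step 2: try each instance in turn
        try:
            if node.set(key, unique_id, px=ttl_ms, nx=True):
                locked += 1
        except redis.RedisError:
            pass  # a failed or unreachable node counts as not locked
    elapsed_ms = (time.monotonic() - t1) * 1000  # step 3: T2 - T1
    if locked >= len(NODES) // 2 + 1 and elapsed_ms < ttl_ms:
        return unique_id  # majority locked within the validity time
    redlock_release(key)  # failed: clean up everywhere
    return None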

Good. Now that we understand the Redlock flow, let’s look at why Redlock does things this way.

1) Why lock on multiple instances?

In essence, for “fault tolerance”: even if some instances crash, as long as the remaining instances lock successfully, the lock service as a whole remains usable.

2) Why is it considered a success when most of the locking is successful?

When multiple Redis instances are used together, they actually form a “distributed system”.

In a distributed system there will always be “failed nodes”, so when reasoning about a distributed system you must ask how many nodes can fail while the system as a whole still behaves “correctly”.

This is a “fault-tolerance” problem. Its conclusion: if nodes can only fail by crashing, then as long as the majority of nodes are healthy, the whole system can still provide correct service.

This class of problem is often introduced through the famous “Byzantine Generals” problem, which deals with the even harsher case of nodes that can misbehave arbitrarily rather than just crash; if you are interested, you can look into the derivations of those algorithms.

3) Why is it necessary to calculate the cumulative time spent on locking after successful locking in step 3?

Because multiple nodes are involved, locking inevitably takes longer than on a single instance. Moreover, these are network requests, and network conditions are complex: there may be delays, packet loss, timeouts, and so on; the more requests, the higher the probability of an anomaly.

So even if the majority of nodes lock successfully, if the cumulative locking time already “exceeds” the lock’s expiration time, the lock on some instances may have expired by then, and the lock is meaningless.

4) Why release the lock and operate all nodes?

When a Redis node is locked, the locking may fail due to “network reasons”.

For example, the client successfully locks a Redis instance, but the response is lost to a network problem while being read back; from the client’s point of view locking failed, yet on Redis the lock was in fact set.

Therefore, when releasing, the client must release the locks on “all nodes”, regardless of which nodes it believes it locked, to clean up any such “residual” locks.

Well, that covers Redlock’s flow and the reasoning behind it. It seems Redlock does solve the lock-loss problem caused by a crashing Redis node, ensuring the lock’s “safety”.

But is this really the case?

Who is right and who is wrong in the Redlock debate?

As soon as the Redis author proposed this solution, it was questioned by a famous distributed-systems expert!

That expert is Martin Kleppmann, a distributed-systems researcher at the University of Cambridge, UK. Before that he was a software engineer and entrepreneur working on large-scale data infrastructure. He also speaks frequently at conferences, blogs, writes books, and contributes to open source.

He promptly wrote an article questioning the Redlock algorithm’s model and laying out his own views on how distributed locks should be designed.

In response, Redis author Antirez, not to be outdone, wrote a rebuttal of those points, analyzing many more design details of the Redlock algorithm.

Moreover, the debate on this issue also caused a very heated discussion on the Internet at that time.

Both sides argue with clear reasoning and solid evidence; it is a masterful exchange, and a genuinely valuable collision of ideas in the field of distributed systems. Yet two experts in the same field reached opposite conclusions on the same question. How come?

Below I will extract important points from their debates and present them to you.

Reminder: the information density from here on is very high, and it may not sink in on a first read; it is best to slow down.

Distributed-systems expert Martin’s doubts about Redlock

In his article, he mainly elaborated on 4 arguments:

1) What is the purpose of distributed locks?

Martin said that you must first understand what is the purpose of using distributed locks?

He sees two purposes.

First, efficiency.

Here the mutual exclusion of a distributed lock is used to avoid needlessly doing the same work twice (such as some expensive computation). If the lock fails, nothing “malignant” happens; sending 2 emails instead of 1, say, is harmless.

Second, correctness.

Here locks prevent concurrent processes from interfering with each other. If the lock fails, multiple processes operate the same data at once, causing serious trouble: data errors, permanent inconsistency, data loss, and other malignant problems, like administering a repeated dose of a drug to a patient; the consequences are severe.

He believes that if you only want efficiency, a single-node Redis is enough; even if the lock occasionally fails (crash, master-slave switchover), the consequences are mild, and using Redlock is too heavyweight and unnecessary.

But if you want correctness, Martin believes Redlock simply cannot meet the safety requirements: the lock-failure problem is still there!

2) Problems encountered by locks in distributed systems

Martin said that a distributed system is more like a complex “beast”, with all kinds of anomalies that you can’t think of.

These anomalies fall into three main categories, the three great mountains every distributed system must face: NPC.

N: Network Delay

P: Process Pause, e.g. a long GC pause

C: Clock Drift

Martin illustrates the Redlock safety issue with an example of a process pause (GC):

Client 1 requests to lock nodes A, B, C, D, E

After client 1 gets the lock, it enters GC (it takes a long time)

Locks expired on all Redis nodes

Client 2 acquires locks on A, B, C, D, E

Client 1’s GC ends, and it still believes it holds the lock

Client 2 also believes it holds the lock: a “conflict” occurs

Martin notes that a GC pause can occur at any point in a program, and its duration is uncontrollable.

Note: even in a programming language without GC, network delay or clock drift can cause the same kind of problem for Redlock; Martin merely uses GC as an example.

3) It is unreasonable to assume that the clock is correct

Moreover, when the “clocks” of several Redis nodes misbehave, the Redlock lock also fails:

Client 1 acquires locks on nodes A, B, C, but cannot access D and E due to network issues

The clock on node C “jumps forward”, causing the lock to expire

Client 2 acquires locks on nodes C, D, E, and cannot access A and B due to network problems

Both clients 1 and 2 now believe they hold the lock (conflict)

Martin argues that Redlock must “strongly depend” on the clocks of multiple nodes staying synchronized; as soon as one node’s clock goes wrong, the algorithm’s model breaks down.

A similar problem occurs even without a clock jump, if node C “restarts immediately after a crash” and comes back without the lock key.

Martin went on to explain that machine clocks going wrong is very plausible:

The system administrator “manually modifies” the machine clock

The machine clock makes a large “jump” while synchronizing with NTP

In short, Martin believes Redlock’s algorithm rests on the “synchronous system model”, and ample research shows that the assumptions of the synchronous model are problematic in real distributed systems.

In a messy distributed system, you cannot assume the system clock is right, so you must be very careful with your assumptions.

4) Proposing a fencing token scheme to ensure correctness

Correspondingly, Martin proposed a scheme called fencing token to ensure the correctness of distributed locks.

The model flow is as follows:

When the client acquires the lock, the lock service also issues an “incrementing” token

The client carries this token when operating the shared resource

The shared resource can use the token to reject requests from “latecomers”, i.e. stale lock holders whose requests arrive late

In this way, no matter which NPC anomaly occurs, the safety of the distributed lock is guaranteed, because the scheme is built on the “asynchronous model”: it makes no timing assumptions.
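As a sketch of how a resource server might enforce this (the class and field names are illustrative assumptions): keep the highest token seen so far and reject anything older.

import threading

class FencedResource:
    """A resource server that rejects writes carrying stale fencing tokens."""

    def __init__(self):
        self._highest_token = 0
        self._guard = threading.Lock()  # local lock inside the resource server
        self.value = None

    def write(self, token: int, new_value) -> bool:
        with self._guard:
            if token <= self._highest_token:
                return False  # stale token: a newer lock holder exists
            self._highest_token = token
            self.value = new_value
            return True

# e.g.: client 1 holds token 33 and pauses; client 2 gets token 34 and writes;
# when client 1 wakes up and writes with 33, the request is rejected.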

And since Redlock cannot provide anything like a fencing token, it cannot guarantee correctness.

He also said: a good distributed lock, no matter what NPC anomaly occurs, may fail to deliver a result within the expected time, but it must never deliver a wrong result. In other words, anomalies may hurt the lock’s “performance” (liveness), but never its “correctness”.

Martin’s conclusion:

1. Redlock is neither fish nor fowl: for efficiency it is too heavyweight and unnecessary; for correctness it is not safe enough.

2. Unreasonable clock assumptions: The algorithm makes dangerous assumptions about the system clock (assuming that multiple node machine clocks are consistent), and if these assumptions are not met, the lock will fail.

3. Correctness cannot be guaranteed: Redlock cannot provide anything like a fencing token, so it cannot solve the correctness problem. For correctness, use software built on a “consensus system”, such as Zookeeper.

Alright, those are Martin’s arguments against Redlock, and they look well-founded.

Let’s see how Redis author Antirez refutes it.

A rebuttal from Redis author Antirez

The Redis author’s rebuttal makes 3 key points:

1) Explain the clock problem

First of all, the Redis author saw straight through the core of the other side’s critique: the clock problem.

He stated that Redlock does not require perfectly synchronized clocks, only roughly synchronized ones: some “error” is allowed.

For example, to time 5s, a node might actually measure 4.5s or 5.5s. As long as the error stays well within the lock’s validity time, the demands on clock accuracy are not high, and this matches real-world environments.

Regarding the “clock modification” issue mentioned by the other party, the Redis author retorted:

Manually modifying the clock: just don’t do that. Otherwise it is like directly editing a Raft log by hand, which would break Raft too...

Clock jumps: with “proper operations”, ensure the machine clock never jumps wildly (NTP can apply many small adjustments instead of one big one); this is achievable in practice

Why did the Redis author address the clock problem first? Because the rebuttals that follow all rely on this premise.

2) Explaining network delay and GC issues

After that, the Redis author rebutted the claim that network delay and process GC can make Redlock fail:

Let’s revisit the problem assumptions that Martin posed:

Client 1 requests to lock nodes A, B, C, D, E

After client 1 gets the lock, it enters GC

Locks expired on all Redis nodes

Client 2 acquires locks on nodes A, B, C, D, E

Client 1 GC is over, it is considered that the lock was successfully acquired

Client 2 also thinks that the lock is acquired and a “conflict” occurs

The author of Redis countered that this assumption is actually problematic, and Redlock can guarantee lock security.

What’s going on here?

Remember the 5 steps of the Redlock flow introduced earlier? Here they are again for review.

The client first obtains the “current timestamp T1”

The client sends lock requests to the 5 Redis instances in turn (using the SET command above), setting a timeout on each request (at the millisecond level, much shorter than the lock’s validity time). If locking one instance fails (network timeout, the lock being held by someone else, etc.), it immediately moves on to the next instance

If the client locks successfully on >= 3 instances (the majority), it fetches “the current timestamp T2” again. If T2 - T1 < the lock’s expiration time, the client is considered to have locked successfully; otherwise locking has failed

After the lock is successful, operate the shared resource (for example, modify a row in MySQL, or initiate an API request)

If the lock fails, a lock release request is issued to “all nodes” (the Lua script mentioned above releases the lock)

Note that the key is in steps 1-3. In step 3, why must the client re-fetch “the current timestamp T2” after locking succeeds, and compare T2 - T1 against the lock’s expiration time?

The Redis author emphasizes: if network delay, a process GC, or any other time-consuming anomaly occurs during steps 1-3, it is caught in step 3 by the T2 - T1 check. If the lock’s expiration time has already been exceeded, the client simply treats locking as failed and releases the locks on all nodes. Done!

The Redis author continues: if the other side’s network delay or GC occurs after step 3, that is, after the client has confirmed it holds the lock, and something then goes wrong while operating the shared resource so the lock fails, then this is not just Redlock’s problem; any other lock service, Zookeeper included, has the same problem. It is beyond the scope of this discussion.

Here I give an example to explain the problem:

The client successfully obtains the lock via Redlock (it passed the majority-lock check and the elapsed-time check)

The client starts operating the shared resource, and at this point a lengthy network delay, process GC, or similar occurs

At this point, the lock expires and is automatically released

The client then operates MySQL (by now the lock may be held by someone else: the lock has failed)

The conclusion of the Redis author here is:

Whatever time-consuming problem the client experiences before obtaining the lock, Redlock detects it in step 3

If an NPC occurs after the client has obtained the lock, neither Redlock nor Zookeeper can do anything about it

Therefore, the Redis author believes that, on the premise that clocks are correct, Redlock does guarantee correctness.

3) Question the fencing token mechanism

The Redis author also questioned the fencing token mechanism the other side proposed. There are 2 points, and this is the hardest part to follow; please stay with my train of thought.

First, the scheme requires the “shared resource server” being operated to be able to reject “old tokens”.

For example, to operate MySQL: you obtain a monotonically increasing token from the lock service, and the client must carry this token when modifying a row, which relies on MySQL’s “transactions and isolation” to take effect:

// both clients must rely on transactions and isolation to achieve this
// note the token condition in the WHERE clause
UPDATE table T SET val = $new_val WHERE id = $id AND current_token < $token

But what if the resource is not MySQL? For example, writing a file to disk, or issuing an HTTP request: there the scheme is powerless, which places higher demands on the resource server being operated.

That is, most resource servers being operated simply do not have this mutual-exclusion capability.

Furthermore, if the resource server already has “mutual exclusion”, why would it need a distributed lock at all?

Therefore, the Redis author believes that this scheme is untenable.

Second, taking a step back: even if Redlock does not provide a fencing token, it already provides a random value (the UUID mentioned above), and with this random value the same effect as a fencing token can be achieved.

How to do it?

The Redis author only mentioned that something equivalent to a fencing token can be achieved, without expanding on the details. Based on the material I have read, the rough flow should be as follows; if I have it wrong, corrections are welcome:

The client uses Redlock to get the lock

Before operating the shared resource, the client first writes the lock’s VALUE onto the shared resource as a mark

The client processes its business logic, and at the very end, when modifying the shared resource, checks whether the mark is still the same as before, modifying it only if so (similar to the CAS idea)

Or take MySQL as an example, an example is this:

The client uses Redlock to get the lock

Before modifying a row in the MySQL table, the client first updates the lock’s VALUE into a field of that row (assume a current_token field)

The client processes its business logic

The client modifies the row, using the VALUE as a WHERE condition:

UPDATE table T SET val = $new_val WHERE id = $id AND current_token = $redlock_value

As you can see, this scheme relies on MySQL’s transaction mechanism and achieves the same effect as the fencing token proposed by the other side.

But a small problem remains, raised by netizens during the discussion: with this scheme, two clients each “mark” first and then “check + modify” the shared resource, so the order in which the two clients operate cannot be guaranteed, can it?

With Martin’s fencing token, because the token is a monotonically increasing number, the resource server can reject any request with a smaller token, guaranteeing the “order” of operations!

The Redis author sees this differently, and I think his explanation holds: the essence of a distributed lock is “mutual exclusion”. As long as it guarantees that of two concurrent clients one succeeds and the other fails, that is enough; “ordering” does not need to be a concern.

Throughout his critique Martin cares about this ordering problem, but the Redis author holds a different view.

To sum up, the conclusion of the Redis author:

1. He agrees with the other side about the impact of “clock jumps” on Redlock, but believes clock jumps can be avoided, given proper infrastructure and operations.

2. Redlock’s design already accounts for the NPC problem: an NPC occurring before step 3 of Redlock is detected there, so the lock’s correctness is preserved; an NPC occurring after step 3 afflicts not just Redlock but every distributed lock service, and is therefore out of scope.

Isn’t this fascinating?

In a distributed system, a humble little lock can run into this many problem scenarios that threaten its safety!

Having read both sides’ arguments, which do you find more convincing?

Don’t rush; later I will pull these arguments together and give my own take.

OK, that wraps up the two sides’ debate over the Redis distributed lock. You may have noticed that in his article Martin recommends Zookeeper for implementing distributed locks, believing it to be safer. Is that really so?

Is a Zookeeper-based lock safe?

If you have looked into Zookeeper, a distributed lock based on it works like this:

Clients 1 and 2 both try to create an “ephemeral node”, e.g. /lock

Suppose client 1 arrives first; it locks successfully, and client 2 fails to lock

Client 1 operates the shared resource

Client 1 deletes the /lock node, releasing the lock

As you can see, unlike Redis, Zookeeper does not need to worry about lock expiration times: it uses “ephemeral nodes”. After client 1 gets the lock, it keeps holding the lock as long as its connection stays up.

Moreover, if client 1 crashes, the ephemeral node is deleted automatically, guaranteeing that the lock will be released.
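For reference, a minimal sketch of this flow with the Python kazoo client (the server address, node path, and payload are illustrative assumptions; kazoo also ships a ready-made Lock recipe):

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

try:
    # ephemeral node: deleted automatically when our session ends
    zk.create("/lock", b"client-1", ephemeral=True)
    print("lock acquired")
    # ... operate the shared resource ...
    zk.delete("/lock")  # explicit release
except NodeExistsError:
    print("lock is held by another client")
finally:
    zk.stop()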

Nice: no lock-expiry headaches, and the lock is released automatically on failure. Sounds perfect, doesn’t it?

Actually, no.

Think about it: after client 1 creates the ephemeral node, how does Zookeeper ensure this client keeps holding the lock?

The reason is that client 1 maintains a Session with the Zookeeper server, kept alive by the client’s “periodic heartbeats”.

If Zookeeper does not receive heartbeats from the client for a long time, it considers the Session expired and deletes the ephemeral node.

With that in mind, let’s look at how a GC pause affects a Zookeeper lock:

Client 1 creates the ephemeral node /lock successfully and gets the lock

Client 1 enters a long GC pause

Client 1 cannot send heartbeats to Zookeeper, so Zookeeper “deletes” the ephemeral node

Client 2 creates the ephemeral node /lock successfully and gets the lock

Client 1’s GC ends, and it still believes it holds the lock (conflict)

So even with Zookeeper, safety cannot be guaranteed under process GC pauses or extreme network delays.

This is exactly what the Redis author pointed out in his rebuttal: if a client has already acquired the lock but then “loses contact” with the lock server (e.g. due to GC), it is not only Redlock that has a problem; every lock service has a similar problem, Zookeeper included!

So here we can draw a conclusion: a distributed lock, under extreme conditions, is not necessarily safe.

If your business data is extremely sensitive, keep this in mind when using a distributed lock: you must not assume the lock is 100% safe.

Now let’s summarize the pros and cons of Zookeeper as a distributed lock.

Zookeeper’s advantages:

No need to worry about the lock’s expiration time

The watch mechanism: on lock failure, you can watch and wait for the lock to be released, achieving optimistic locking

But its disadvantages are:

Lower performance than Redis

Higher deployment and operations cost

If the client loses contact with Zookeeper for too long, the lock is released

My take on distributed locks

We have now covered in detail the safety of Redis-based Redlock and Zookeeper-based locks under various anomalies. Below is my own view, for reference only.

1) Should you use Redlock at all?

As analyzed earlier, Redlock only works correctly on the premise of “correct clocks”. If you can guarantee that premise, you can use it.

But keeping clocks correct is, I think, not as simple as you might imagine.

First, from the hardware angle, clock drift happens all the time and cannot be avoided.

For example, CPU temperature, machine load, and chip material can all cause a clock to drift.

Second, from my own work experience: I have run into wrong clocks, and operations staff brute-force fixing clocks by hand, which then affected the correctness of the system. Human error is also very hard to eliminate.

So my personal view on Redlock: avoid it if you can. Its performance is worse than a single Redis instance and its deployment cost is higher. I would still prefer a master-slave + sentinel setup for implementing distributed locks.

Then how can correctness be guaranteed? The second point gives the answer.

2) How to use distributed locks correctly?

When analyzing Martin’s views, the fencing token scheme he proposed inspired me greatly. It has significant limitations, but for scenarios that demand “correctness”, it is a very good line of thought.

So we can combine the two:

1. Use the distributed lock at the upper layer for “mutual exclusion”. Even though the lock can fail in extreme cases, it blocks most concurrent requests at the top, relieving pressure on the resource layer.

2. For business that demands absolutely correct data, the resource layer must have its own “safety net”; the design can borrow from the fencing token idea.

Combining the two approaches is, I believe, sufficient for most business scenarios.

Summary

OK, let’s wrap up.

In this article we mainly explored whether distributed locks built on Redis are actually safe.

From the simplest lock implementation, to handling various failure scenarios, to Redlock and the debate between the two distributed-systems experts, we arrived at Redlock’s applicable scenarios.

Finally, we compared the problems Zookeeper may run into as a distributed lock, and how it differs from Redis.

I have condensed all of this into a mind map to make it easier to digest.

Afterword

This article carries a huge amount of information, and I feel it should settle the distributed-lock question once and for all.

If something did not click, I suggest reading it a few more times, building the various hypothetical scenarios in your head and reasoning them through repeatedly.

While writing this article, I re-read the two experts’ articles in the Redlock debate and gained a great deal. Let me share some takeaways with you.

1. In a distributed environment, a seemingly perfect design may not be so “airtight”; a little scrutiny reveals all sorts of problems. So when thinking about distributed-systems problems, be cautious, and then more cautious.

2. In the Redlock debate, do not focus too much on who is right and who is wrong; learn instead how these masters think, and their rigor in strictly scrutinizing a problem.
