Raft Locking Advice

http://nil.csail.mit.edu/6.824/2020/labs/raft-locking.txt

If you are wondering how to use locks in the 6.824 Raft labs, here are some rules and ways of thinking that might be helpful.

Rule 1: use the Go race detector

Rule 1: Whenever you have data that more than one goroutine uses, and at least one goroutine might modify the data, the goroutines should use locks to prevent simultaneous use of the data. The Go race detector is pretty good at detecting violations of this rule (though it won’t help with any of the rules below).


Rule 2: lock the whole write sequence

Rule 2: Whenever code makes a sequence of modifications to shared data, and other goroutines might malfunction if they looked at the data midway through the sequence, you should use a lock around the whole sequence.

An example:

rf.mu.Lock()
rf.currentTerm += 1
rf.state = Candidate
rf.mu.Unlock()

It would be a mistake for another goroutine to see either of these updates alone (i.e. the new term with the old state, or the new state with the old term). So we need to hold the lock continuously over the whole sequence of updates. All other code that uses rf.currentTerm or rf.state must also hold the lock, in order to ensure exclusive access for all uses.

The code between Lock() and Unlock() is often called a “critical section.” The locking rules a programmer chooses (e.g. “a goroutine must hold rf.mu when using rf.currentTerm or rf.state”) are often called a “locking protocol.”

Rule 3: the same idea as Rule 2, applied to reads

Rule 3: Whenever code does a sequence of reads of shared data (or reads and writes), and would malfunction if another goroutine modified the data midway through the sequence, you should use a lock around the whole sequence.

An example that could occur in a Raft RPC handler:

rf.mu.Lock()
if args.Term > rf.currentTerm {
    rf.currentTerm = args.Term
}
rf.mu.Unlock()

This code needs to hold the lock continuously for the whole sequence. Raft requires that currentTerm only increases, and never decreases. Another RPC handler could be executing in a separate goroutine; if it were allowed to modify rf.currentTerm between the if statement and the update to rf.currentTerm, this code might end up decreasing rf.currentTerm. Hence the lock must be held continuously over the whole sequence. In addition, every other use of currentTerm must hold the lock, to ensure that no other goroutine modifies currentTerm during our critical section.

Real Raft code would need to use longer critical sections than these examples; for example, a Raft RPC handler should probably hold the lock for the entire handler.

Rule 4: don't hold a lock while waiting

Rule 4: It’s usually a bad idea to hold a lock while doing anything that might wait: reading a Go channel, sending on a channel, waiting for a timer, calling time.Sleep(), or sending an RPC (and waiting for the reply). One reason is that you probably want other goroutines to make progress during the wait. Another reason is deadlock avoidance. Imagine two peers sending each other RPCs while holding locks; both RPC handlers need the receiving peer’s lock; neither RPC handler can ever complete because it needs the lock held by the waiting RPC call.

Code that waits should first release locks. If that’s not convenient, sometimes it’s useful to create a separate goroutine to do the wait.


Rule 5: Be careful about assumptions across a drop and re-acquire of a lock. One place this can arise is when avoiding waiting with locks held. For example, this code to send vote RPCs is incorrect:

rf.mu.Lock()
rf.currentTerm += 1
rf.state = Candidate
for <each peer> {
    go func() {
        rf.mu.Lock()
        args.Term = rf.currentTerm
        rf.mu.Unlock()
        Call("Raft.RequestVote", &args, ...)
        // handle the reply...
    } ()
}
rf.mu.Unlock()

The code sends each RPC in a separate goroutine. It’s incorrect because args.Term may not be the same as the rf.currentTerm at which the surrounding code decided to become a Candidate. Lots of time may pass between when the surrounding code creates the goroutine and when the goroutine reads rf.currentTerm; for example, multiple terms may come and go, and the peer may no longer be a candidate. One way to fix this is for the created goroutine to use a copy of rf.currentTerm made while the outer code holds the lock. Similarly, reply-handling code after the Call() must re-check all relevant assumptions after re-acquiring the lock; for example, it should check that rf.currentTerm hasn’t changed since the decision to become a candidate.


How to lock correctly

It can be difficult to interpret and apply these rules. Perhaps most puzzling is the notion in Rules 2 and 3 of code sequences that shouldn’t be interleaved with other goroutines’ reads or writes. How can one recognize such sequences? How should one decide where a sequence ought to start and end?

One approach is to start with code that has no locks, and think carefully about where one needs to add locks to attain correctness. This approach can be difficult since it requires reasoning about the correctness of concurrent code.

A more pragmatic approach starts with the observation that if there were no concurrency (no simultaneously executing goroutines), you would not need locks at all. But concurrency is forced on you: the RPC system creates goroutines to execute RPC handlers, and you need to send RPCs in separate goroutines to avoid waiting. You can effectively eliminate this concurrency by identifying all places where goroutines start (RPC handlers, background goroutines you create in Make(), &c), acquiring the lock at the very start of each goroutine, and only releasing the lock when that goroutine has completely finished and returns. This locking protocol ensures that nothing significant ever executes in parallel; the locks ensure that each goroutine executes to completion before any other goroutine is allowed to start. With no parallel execution, it's hard to violate Rules 1, 2, 3, or 5. If each goroutine's code is correct in isolation (when executed alone, with no concurrent goroutines), it's likely to still be correct when you use locks to suppress concurrency. So you can avoid explicit reasoning about correctness, or explicitly identifying critical sections.

However, Rule 4 is likely to be a problem. So the next step is to find places where the code waits, and to add lock releases and re-acquires (and/or goroutine creation) as needed, being careful to re-establish assumptions after each re-acquire. You may find this process easier to get right than directly identifying sequences that must be locked for correctness.

(As an aside, what this approach sacrifices is any opportunity for better performance via parallel execution on multiple cores: your code is likely to hold locks when it doesn’t need to, and may thus unnecessarily prohibit parallel execution of goroutines. On the other hand, there is not much opportunity for CPU parallelism within a single Raft peer.)
