解析 Viewstamped Replication (VSR)：与 Paxos 并行的另一种分布式共识演进路径-洪萨配资

各位同学，各位专家，大家好！

今天，我们齐聚一堂，共同探讨分布式系统领域一个核心且迷人的课题：分布式共识。当谈及分布式共识，一个名字几乎立刻就会浮现在每个人的脑海中——Paxos。它以其严谨的数学证明和对两阶段提交的超越，成为了学术界和工业界公认的基石。然而，历史的长河中，往往并非只有一条主干道。今天，我将带领大家踏上一条与 Paxos 并行，同样精妙且富有洞察力的演进路径——Viewstamped Replication (VSR)。

VSR，由 MIT 的 Barbara Liskov 和 Oki 等人在上世纪 80 年代末 90 年代初提出，与 Paxos 几乎是同时期的产物，但其设计理念和关注点略有不同。如果说 Paxos 更多地是从单个值（或一系列值）的原子性协议角度出发，那么 VSR 则更直接地为“状态机复制”（State Machine Replication, SMR）这一分布式系统的核心范式量身定制。它提供了一种强大而优雅的方式，确保在部分节点失效的情况下，一组服务器能够像一个整体一样，以相同的顺序执行相同的操作，从而维持强一致性。

本次讲座，我们将深入剖析 VSR 的核心机制，理解其在正常操作、故障恢复及视图变更中的精妙设计。我们还将将其与 Paxos 进行细致的对比，从而更好地理解这两种协议的异同，以及它们各自的适用场景和设计哲学。我的目标是，通过严谨的逻辑和具体的代码示例（尽管是伪代码形式，但足以表达核心思想），让大家能够透彻地理解 VSR 的运作原理，并对其在构建高可用、强一致性分布式系统中的价值有一个清晰的认识。

1. 分布式共识的根本挑战与状态机复制

在深入 VSR 之前，让我们快速回顾一下分布式系统面临的根本挑战。想象一个由多个独立计算机组成的系统，它们需要协同工作，共同维护一个共享状态。例如，一个银行账户余额系统，或是一个分布式文件系统。在这样的环境中，我们面临着以下挑战：

节点故障（Crash Faults）：任何一台机器都可能随时崩溃、重启或永久失效。
网络分区（Network Partitions）：网络连接可能中断，导致部分机器之间无法通信。
消息丢失、乱序或重复：网络不可靠，消息可能在传输过程中丢失、延迟，甚至重复投递。
异步性：没有全局时钟，消息传输时间不确定。

在这些挑战下，我们仍然希望系统能够满足一些关键属性：

安全性（Safety）：永远不会发生错误的事情，例如，所有副本上的状态机状态最终一致，且符合操作的语义。
活性（Liveness）：好的事情最终会发生，例如，客户端的请求最终会被处理。

状态机复制 (State Machine Replication, SMR)是解决这些挑战的强大范式。其核心思想是：如果所有副本都从相同的初始状态开始，并以相同的顺序执行相同的确定性操作序列，那么它们最终将达到相同的状态。因此，分布式共识问题就转化为：如何让所有副本对操作的顺序达成一致。VSR 和 Paxos 都是解决这一问题的优秀协议。

2. Viewstamped Replication 的起源与核心概念

VSR 的设计哲学，从一开始就瞄准了 SMR。它提供了一套完整的协议，涵盖了客户端请求处理、副本间日志同步、以及最关键的故障检测与主副本切换（即视图变更）。

2.1 核心术语与角色

在 VSR 中，我们主要有以下几个核心概念和角色：

副本 (Replica)：组成分布式系统的一组服务器。它们维护着系统的状态机和操作日志。
主副本 (Primary)：在给定时间内，负责协调客户端请求处理和日志复制的特定副本。客户端通常只与主副本交互。
备份副本 (Backup)：除了主副本之外的所有副本。它们被动地接收并应用主副本发来的操作。
视图 (View)：一个逻辑上的时代或周期，由一个唯一的递增整数view_num标识。每个视图都与一个特定的主副本相关联。当主副本发生故障或被替换时，系统会进入一个新的视图。
操作日志 (Log)：每个副本都维护一个有序的操作序列。日志中的每个条目都包含一个客户端请求及其在日志中的序列号op_num。
已提交操作 (Committed Operation)：当一个操作被大多数副本确认写入其日志后，就被认为是已提交的。已提交的操作可以安全地应用到状态机。
法定人数 (Quorum)：为了确保安全性和活性，VSR 中的许多操作都需要得到大多数副本的响应。通常，一个法定人数是指超过半数的副本集合。对于N个副本，法定人数的大小是floor(N/2) + 1。

2.2 副本状态

每个副本都维护着以下关键状态变量：

replica_id: 副本的唯一标识符。
current_view_num: 当前副本所处的视图号。
is_primary: 布尔值，指示当前副本是否是主副本。
log: 一个操作列表，存储着所有已接收但可能尚未提交的操作。
commit_num: 已知最后一个已提交操作的日志序列号op_num。所有op_num <= commit_num的操作都已安全地应用到状态机。
last_applied_op_num: 最后一个已应用到状态机的操作的op_num。通常last_applied_op_num <= commit_num。
replica_state: 副本的状态机当前状态。
client_table: 存储每个客户端的最新请求 ID 和响应，用于处理重复请求。

3. VSR 协议详解：正常操作流程

我们首先来看 VSR 在没有故障发生时的正常操作流程。这部分相对直观，它确保了所有副本以相同的顺序复制客户端请求。

3.1 客户端请求处理流程

客户端发送请求: 客户端将操作请求发送给当前的主副本。
主副本预处理: 主副本接收请求，检查是否为重复请求。如果是新请求，主副本会将其添加到自己的操作日志中，并分配一个递增的op_num。
主副本复制日志: 主副本向所有备份副本发送Prepare消息，其中包含新的日志条目和当前视图号。
备份副本响应: 备份副本接收Prepare消息。如果视图号匹配，并且它能正确地将操作添加到自己的日志中（即op_num是日志中的下一个预期序列号），它会向主副本发送PrepareOK消息。
主副本提交: 当主副本收到来自大多数副本（包括自身）的PrepareOK消息后，它就知道这个操作已经被法定数量的副本接受。此时，主副本将该操作标记为已提交，更新commit_num。
主副本应用并响应: 主副本将所有last_applied_op_num到commit_num之间的操作应用到自己的状态机，然后向客户端发送响应。
备份副本应用: 主副本会通过后续的Prepare消息（或者专门的Commit消息，取决于具体实现）告知备份副本commit_num的更新。备份副本在收到更新后，也会将相应的操作应用到自己的状态机。

我们用伪代码来模拟这个过程：

# 假设我们有一个抽象的Replica类 class Replica: def __init__(self, id, num_replicas): self.replica_id = id self.num_replicas = num_replicas self.current_view_num = 0 self.is_primary = (id == 0) # 假设Replica 0 是初始Primary self.log = [] # 存储 (op_num, client_id, request) self.commit_num = -1 # Index of the last committed operation self.last_applied_op_num = -1 # Index of the last applied operation self.state_machine = {} # 模拟状态机 self.client_table = {} # {client_id: (last_req_id, last_response)} self.peers = {} # {replica_id: network_connection} # For primary only self.prepare_ok_counts = {} # {op_num: set_of_replica_ids} self.last_sent_prepare_log_len = {} # {replica_id: log_length} def apply_committed_ops(self): while self.last_applied_op_num < self.commit_num: self.last_applied_op_num += 1 op_num, client_id, request_data = self.log[self.last_applied_op_num] # Apply request_data to self.state_machine print(f"Replica {self.replica_id} applied op {op_num}: {request_data}") # In a real system, this would interact with the actual state machine logic # For simplicity, we just print here. def send_message(self, target_id, message_type, payload): # Simulate network send print(f"Replica {self.replica_id} sending {message_type} to {target_id}: {payload}") # In a real system, this would go through a network layer # For this example, we assume messages are delivered to the target's handle_message method # This is a conceptual representation pass # --- Primary Role --- def handle_client_request(self, client_id, request_id, operation): if not self.is_primary: # Redirect or error print(f"Replica {self.replica_id} is not primary, cannot handle client request.") return # Check for duplicate request if client_id in self.client_table and self.client_table[client_id][0] >= request_id: # Return cached response print(f"Returning cached response for client {client_id}, request {request_id}") return self.client_table[client_id][1] # Assign new op_num and add to log new_op_num = len(self.log) self.log.append((new_op_num, client_id, operation)) self.prepare_ok_counts[new_op_num] = {self.replica_id} # Primary itself counts as one OK print(f"Primary {self.replica_id} received client request {request_id} for op {new_op_num}") # Send Prepare messages to backups for peer_id in range(self.num_replicas): if peer_id == self.replica_id: continue self.send_message(peer_id, "Prepare", { "view_num": self.current_view_num, "op_num": new_op_num, "request": (client_id, request_id, operation), "commit_num": self.commit_num # Include current commit_num for backups to catch up }) self.last_sent_prepare_log_len[peer_id] = len(self.log) # For simplicity, we'll assume a synchronous wait for PrepareOKs here. # In reality, this would be event-driven. self.check_commit_progress() def check_commit_progress(self): # This method is called after receiving PrepareOKs or sending new prepares while True: # Find the highest op_num for which a quorum has responded next_commit_candidate = self.commit_num + 1 if next_commit_candidate >= len(self.log): break # No new operations to commit yet required_quorum_size = self.num_replicas // 2 + 1 if next_commit_candidate in self.prepare_ok_counts and len(self.prepare_ok_counts[next_commit_candidate]) >= required_quorum_size: self.commit_num = next_commit_candidate print(f"Primary {self.replica_id} committed op {self.commit_num}") self.apply_committed_ops() # Store client response if this was the client's request # For simplicity, we just simulate a generic response if self.log[self.commit_num][0] == next_commit_candidate: # Just a sanity check client_id, request_id, _ = self.log[self.commit_num][1:] self.client_table[client_id] = (request_id, f"Response for {request_id}") else: break # Cannot commit further # --- Backup Role --- def handle_prepare(self, view_num, op_num, request, primary_commit_num): if view_num < self.current_view_num: print(f"Replica {self.replica_id} ignoring old Prepare for view {view_num} (current {self.current_view_num})") return if view_num > self.current_view_num: print(f"Replica {self.replica_id} received future view Prepare. Initiate view change or update view.") # This is a simplification; a real system would trigger view change or update its view # based on more robust logic. For now, we assume primary is in sync. return # Check if op_num is the next expected in log if op_num != len(self.log): print(f"Replica {self.replica_id} received out-of-order Prepare. Expected {len(self.log)}, got {op_num}. Requesting state transfer or view change.") # This indicates a gap; backup needs to catch up, possibly via state transfer. # For simplicity, we'll assume it's always in order for normal ops. return self.log.append((op_num, request[0], request[2])) # client_id, operation print(f"Backup {self.replica_id} appended op {op_num} to log.") # Update commit_num based on primary's commit_num # This implicitly tells backups what to commit if primary_commit_num > self.commit_num: self.commit_num = primary_commit_num self.apply_committed_ops() self.send_message(0, "PrepareOK", { # Assuming replica 0 is primary "view_num": self.current_view_num, "op_num": op_num, "replica_id": self.replica_id }) def handle_prepare_ok(self, view_num, op_num, replica_id): if not self.is_primary or view_num != self.current_view_num: return # Ignore if not primary or old view if op_num not in self.prepare_ok_counts: self.prepare_ok_counts[op_num] = set() self.prepare_ok_counts[op_num].add(replica_id) print(f"Primary {self.replica_id} received PrepareOK for op {op_num} from {replica_id}. Quorum count: {len(self.prepare_ok_counts[op_num])}") self.check_commit_progress() # --- Simulation Example --- # num_replicas = 3 # replicas = [Replica(i, num_replicas) for i in range(num_replicas)] # # Simulate primary receiving a client request # replicas[0].handle_client_request("client_A", 1, "deposit 100") # # Simulate messages being handled (in a real system, this is event-driven) # # Replica 0 sends Prepare to 1, 2 # # Replica 1, 2 receive Prepare and send PrepareOK to 0 # # Replica 0 receives PrepareOKs and commits

这段伪代码展示了正常操作下，主副本如何接收请求、复制日志、等待法定人数确认并最终提交和应用操作。备份副本则负责接收并确认日志条目，并根据主副本的指示提交和应用操作。

4. VSR 的精髓：视图变更 (View Change)

VSR 的核心复杂性和鲁棒性体现在其视图变更协议中。当主副本失效或网络分区导致其无法与大多数副本通信时，VSR 必须能够安全地选举一个新的主副本，并确保新的主副本拥有所有已提交的日志信息。这个过程至关重要，因为它直接关系到系统的安全性和活性。

视图变更协议的目标是：

选举新的主副本：确定一个健康的副本作为新的主副本。
确保日志一致性：新的主副本必须拥有所有已提交的操作，并且其日志不能包含任何未提交但可能与其他副本冲突的操作。

VSR 的视图变更是一个多阶段的协议：

4.1 阶段一：检测主副本故障并启动视图变更 (StartViewChange)

故障检测: 每个副本都维护一个定时器。如果一个副本长时间未收到主副本的消息（例如Prepare消息或心跳），它会认为主副本可能已经失效。
发起视图变更: 任何副本（包括主副本本身，如果它发现自己被隔离）都可以发起视图变更。它会递增current_view_num，并向所有其他副本发送StartViewChange消息，请求进入新的视图。

# Part of Replica class class Replica: # ... (previous attributes) ... def __init__(self, id, num_replicas): # ... (previous init) ... self.view_change_timer = None # Timer object self.current_primary_id = 0 # Assume replica 0 is initially primary self.last_primary_message_time = time.time() # For timeout detection # For view change process self.next_view_num_candidate = -1 self.received_do_view_change_messages = {} # {replica_id: DoViewChange_message} self.start_view_change_sent = False def start_view_change_timer(self): # Simulate starting a timer if self.view_change_timer: self.view_change_timer.cancel() # In a real system, this would be a proper timer callback self.view_change_timer = threading.Timer(PRIMARY_TIMEOUT, self.on_primary_timeout) self.view_change_timer.start() def on_primary_timeout(self): print(f"Replica {self.replica_id} detected primary timeout.") self.initiate_view_change() def initiate_view_change(self): if self.start_view_change_sent and self.next_view_num_candidate == self.current_view_num + 1: # Already initiated for this next view return self.next_view_num_candidate = self.current_view_num + 1 self.start_view_change_sent = True self.received_do_view_change_messages = {} # Reset for new view change print(f"Replica {self.replica_id} initiating view change for view {self.next_view_num_candidate}") for peer_id in range(self.num_replicas): if peer_id == self.replica_id: continue self.send_message(peer_id, "StartViewChange", { "view_num": self.next_view_num_candidate, "replica_id": self.replica_id }) # Also send to self (conceptually) self.handle_start_view_change(self.next_view_num_candidate, self.replica_id) def handle_start_view_change(self, new_view_num, sender_id): if new_view_num <= self.current_view_num: print(f"Replica {self.replica_id} ignoring old StartViewChange from {sender_id} for view {new_view_num}") return if new_view_num > self.next_view_num_candidate: # A higher view change is being initiated, follow that one self.next_view_num_candidate = new_view_num self.start_view_change_sent = False # Re-initiate for the higher view self.initiate_view_change() # Recursive call to send StartViewChange for the new higher view return # If we reach here, new_view_num == self.next_view_num_candidate # This replica should participate in this view change if new_view_num == self.current_view_num + 1: # Only respond to the immediate next view self.send_message(sender_id, "DoViewChange", { "view_num": new_view_num, "replica_id": self.replica_id, "log": list(self.log), # Send a copy of the log "commit_num": self.commit_num, "last_applied_op_num": self.last_applied_op_num }) print(f"Replica {self.replica_id} received StartViewChange from {sender_id} for view {new_view_num}. Sent DoViewChange.")

4.2 阶段二：收集副本状态 (DoViewChange)

当一个副本收到StartViewChange消息时，如果它同意进入这个新的视图（即new_view_num比它当前的视图号高），它会停止处理正常的客户端请求，并向发起者（或其他所有副本，取决于实现）发送DoViewChange消息。
DoViewChange消息包含了该副本的关键状态信息：当前的视图号、完整的操作日志、已提交的操作序列号commit_num和已应用的操作序列号last_applied_op_num。这些信息对于新主副本确定最新的日志状态至关重要。

# Part of Replica class class Replica: # ... (previous methods) ... def handle_do_view_change(self, view_num, sender_id, sender_log, sender_commit_num, sender_last_applied_op_num): if view_num != self.next_view_num_candidate: print(f"Replica {self.replica_id} ignoring DoViewChange for incorrect view {view_num} (expected {self.next_view_num_candidate})") return self.received_do_view_change_messages[sender_id] = { "log": sender_log, "commit_num": sender_commit_num, "last_applied_op_num": sender_last_applied_op_num } print(f"Replica {self.replica_id} received DoViewChange from {sender_id} for view {view_num}. Count: {len(self.received_do_view_change_messages)}") # Check if we have received a quorum of DoViewChange messages required_quorum_size = self.num_replicas // 2 + 1 if len(self.received_do_view_change_messages) >= required_quorum_size: # This replica has collected enough messages to potentially become the new primary # Or, if it's the designated new primary, it can proceed. self.become_new_primary_candidate() def become_new_primary_candidate(self): # This replica has collected a quorum of DoViewChange messages # It now needs to determine the correct log for the new view. # The new primary is usually determined by (view_num % num_replicas) or some other deterministic rule. # For simplicity, let's assume the replica with the lowest ID among the quorum that collected enough messages becomes primary. # Or, the one that initiated the view change and collected a quorum. # In a more robust system, a specific replica (e.g., (new_view_num % num_replicas)) is designated to become primary # and it's responsible for collecting DoViewChange messages. # Here, we let any replica that collects a quorum to try and become primary. # This will be refined. Let's assume the replica with ID (self.next_view_num_candidate % self.num_replicas) # is the designated new primary. designated_new_primary_id = self.next_view_num_candidate % self.num_replicas if self.replica_id == designated_new_primary_id: print(f"Replica {self.replica_id} is the designated new primary for view {self.next_view_num_candidate}.") self.select_new_primary_log_and_start_view() else: print(f"Replica {self.replica_id} collected quorum but is not designated primary for view {self.next_view_num_candidate}.") # It just waits for the designated primary to send StartView. pass

4.3 阶段三：选择新主副本日志并启动新视图 (StartView)

选择新日志: 某个副本（通常是view_num % N决定的那个，或者第一个收集到法定人数DoViewChange消息的副本）被选为新的主副本。新的主副本必须从它收集到的DoViewChange消息中，选择一个“最完整”的日志作为新视图的初始日志。
- 选择规则: 通常，新主副本会选择拥有最高view_num的日志。如果多个副本拥有相同的最高view_num，则选择其中日志长度最长的那个。如果仍有多个，则选择具有最高op_num的commit_num的那个。这个规则确保了所有已提交的操作都被包含在新主副本的日志中，并且尽可能地保留了未提交但可能最终提交的操作。
发送StartView: 选定新日志后，新主副本会更新自己的current_view_num和log，然后向所有其他副本发送StartView消息。StartView消息包含了新的视图号、新主副本的完整日志以及commit_num。
备份副本确认: 备份副本收到StartView消息后，如果view_num匹配且日志有效，它们会更新自己的视图号、日志和commit_num，并向新主副本发送StartViewOK消息。

# Part of Replica class class Replica: # ... (previous methods) ... def select_new_primary_log_and_start_view(self): # This replica is the designated new primary for self.next_view_num_candidate # Include its own state self.received_do_view_change_messages[self.replica_id] = { "log": list(self.log), "commit_num": self.commit_num, "last_applied_op_num": self.last_applied_op_num } # 1. Find the log with the highest view_num in the collected DoViewChange messages # 2. If tie, find the longest log # 3. If tie, find the highest commit_num best_log_data = None best_log_replica_id = -1 for r_id, data in self.received_do_view_change_messages.items(): candidate_log = data["log"] candidate_commit_num = data["commit_num"] if best_log_data is None: best_log_data = data best_log_replica_id = r_id continue # This part is crucial: how to determine the "most up-to-date" log # The standard VSR paper suggests a more nuanced approach. # Simplified rule: pick the one with highest commit_num, then longest log. # A more accurate VSR rule would be: # 1. From all DoViewChange messages, find the highest view number in which # any operation was committed. Let this be V_max. # 2. Among logs that are consistent with V_max (i.e., contain all ops from V_max), # choose the one with the largest op_num. # This is complex because DoViewChange only contains current view_num and commit_num. # A more practical approach might be to just take the longest log from the highest commit_num seen. # Let's use a common heuristic: prioritize higher commit_num, then longer log. if candidate_commit_num > best_log_data["commit_num"]: best_log_data = data best_log_replica_id = r_id elif candidate_commit_num == best_log_data["commit_num"]: if len(candidate_log) > len(best_log_data["log"]): best_log_data = data best_log_replica_id = r_id # Update self with the chosen log self.current_view_num = self.next_view_num_candidate self.is_primary = True self.log = list(best_log_data["log"]) # Copy the log self.commit_num = best_log_data["commit_num"] self.last_applied_op_num = best_log_data["last_applied_op_num"] self.current_primary_id = self.replica_id # I am the new primary! print(f"Replica {self.replica_id} became primary for view {self.current_view_num} with log from replica {best_log_replica_id}.") print(f"New log length: {len(self.log)}, commit_num: {self.commit_num}") # Send StartView messages to all other replicas for peer_id in range(self.num_replicas): if peer_id == self.replica_id: continue self.send_message(peer_id, "StartView", { "view_num": self.current_view_num, "primary_id": self.replica_id, "log": list(self.log), # Send the full consolidated log "commit_num": self.commit_num }) # Reset for normal operation self.prepare_ok_counts = {} # Prime the prepare_ok_counts for existing committed ops for i in range(self.commit_num + 1): if i not in self.prepare_ok_counts: self.prepare_ok_counts[i] = set() for r_id in self.received_do_view_change_messages: # Assume all replicas that sent DoViewChange have committed this self.prepare_ok_counts[i].add(r_id) self.prepare_ok_counts[self.current_view_num].add(self.replica_id) # Primary itself self.start_view_change_sent = False self.next_view_num_candidate = -1 self.received_do_view_change_messages = {} self.apply_committed_ops() # Apply any newly committed ops self.start_view_change_timer() # Start timer to detect future primary failures def handle_start_view(self, view_num, primary_id, new_log, new_commit_num): if view_num < self.current_view_num: print(f"Replica {self.replica_id} ignoring old StartView for view {view_num}") return if view_num > self.current_view_num: # Accept the new view self.current_view_num = view_num self.current_primary_id = primary_id self.is_primary = (self.replica_id == primary_id) self.log = list(new_log) self.commit_num = new_commit_num # Apply any newly committed operations self.apply_committed_ops() self.send_message(primary_id, "StartViewOK", { "view_num": self.current_view_num, "replica_id": self.replica_id }) print(f"Replica {self.replica_id} accepted StartView for view {view_num} from primary {primary_id}.") # Reset view change state self.start_view_change_sent = False self.next_view_num_candidate = -1 self.received_do_view_change_messages = {} self.last_primary_message_time = time.time() # Reset timeout self.start_view_change_timer() # Start timer to detect future primary failures else: # This can happen if multiple replicas try to become primary for the same view_num # Or if a replica is just late in processing StartView. print(f"Replica {self.replica_id} received StartView for current view {view_num} from {primary_id}. Ignoring.") pass # Already in this view, or a concurrent view change is happening. def handle_start_view_ok(self, view_num, replica_id): if not self.is_primary or view_num != self.current_view_num: return # Ignore if not primary or old view # Count StartViewOK messages, if a quorum responds, the new view is stable. # This part is implicit in the original VSR paper, but good for robust implementations. # For simplicity, we assume once primary sends StartView, the view is established. print(f"Primary {self.replica_id} received StartViewOK from {replica_id} for view {view_num}.") # No explicit action needed here beyond acknowledging.

5. VSR 与 Paxos：并发演进路径的对比

VSR 和 Paxos 都是解决分布式共识问题的强大协议，但它们在设计哲学、复杂性焦点和实现细节上存在显著差异。理解这些差异有助于我们根据具体需求选择合适的协议。

5.1 相似之处

分布式共识: 两者都旨在让一组节点对某个值或一系列值达成一致。
法定人数 (Quorum): 都依赖于法定人数机制来确保安全性和活性，即大多数节点的同意才能使操作生效。
领导者 (Leader): 都采用了一种领导者机制来简化正常操作流程。在 VSR 中是“主副本” (Primary)，在 Paxos 中是“提议者” (Proposer) 或“领导者” (Leader in Multi-Paxos)。
强一致性: 都能提供线性一致性 (Linearizability) 或顺序一致性 (Sequential Consistency)。
容错性: 都能在部分节点（通常是少于法定人数）故障的情况下继续运行。

5.2 关键差异

特性	Viewstamped Replication (VSR)	Paxos (核心 / Multi-Paxos)
主要目标	状态机复制 (State Machine Replication, SMR)；维护有序操作日志	对单个值达成共识 (可以扩展为 SMR)
领导者概念	明确的“主副本” (Primary)，在特定“视图” (View) 中活跃	“提议者” (Proposer) 角色，在特定“轮次” (Round) 中活跃；Multi-Paxos 引入“Leader”概念
领导者选举/切换	显式且重量级的“视图变更”协议，涉及`StartViewChange`,`DoViewChange`,`StartView`消息交换，并可能包含完整日志传输。	隐式或轻量级：核心 Paxos 中，任何提议者都可以尝试成为领导者。Multi-Paxos 有单独的领导者选举阶段 (`Prepare`/`Promise`)，但侧重于协调序列号，而非日志内容。
日志管理	日志是协议的一等公民。视图变更时，新主副本会整合并传输完整的日志。	核心 Paxos 关注对单个槽位的值达成一致。日志是 SMR 扩展后的结果，通过对连续槽位运行 Paxos 协议构建。
状态同步/追赶	视图变更协议直接处理日志同步，新主副本会向所有副本发送其确定的最新日志。	Paxos 本身不直接包含状态同步机制，需要上层协议（如 Multi-Paxos）或单独的机制来帮助落后节点追赶。
协议复杂性	对于 SMR 场景，VSR 通常被认为更直接，更容易理解和实现。其复杂性集中在视图变更阶段。	核心 Paxos 算法以其难以理解而闻名。Multi-Paxos 简化了重复共识，但整体概念依然抽象。
消息类型	`Prepare`,`PrepareOK`,`StartViewChange`,`DoViewChange,`StartView`,`StartViewOK`等。 \|`Prepare`,`Promise`,`Accept`,`Accepted`(核心 Paxos);`Request`,`Response`(Multi-Paxos)。
历史背景	MIT (Barbara Liskov, Oki 等)	Leslie Lamport

5.3 深度解读差异

领导者选举/视图变更的哲学差异：

VSR的视图变更是一个整体性的、状态感知的过程。当主副本失效时，整个系统会进入一个明确的“视图变更模式”。新的主副本在被选出之前，会积极地从其他副本收集它们各自的完整日志和提交信息。这个过程的目的是重建全局的正确日志状态，确保新主副本拥有所有已提交的操作，并且其日志是最新的。这意味着在视图变更期间，日志可能会被部分或完全替换。这种“一揽子”解决方式使得 VSR 在处理主副本失效和日志恢复时，显得非常直接和完整。
Paxos，尤其是核心 Paxos，更侧重于对单个“提案号”下的“值”达成一致。领导者选举（或者说提议者竞争）是通过尝试发出Prepare消息来隐式进行的。一个提议者在成功收集到法定多数的Promise响应后，就获得了在特定轮次中提议的权力。Multi-Paxos 在此基础上构建了一个更稳定的领导者，但其领导者切换仍然是通过竞争Prepare轮次来实现的。Paxos 的关注点在于序列号的递增和值的安全性，而不是直接传输或合并整个日志。日志的一致性是通过确保每个槽位的提案都遵循 Paxos 协议来实现的。

日志管理与状态传输：

VSR 将日志视为核心数据结构。在视图变更中，新主副本会从所有DoViewChange消息中选择一个最完整的日志来初始化自己的状态，并将其推送到所有备份副本。这使得 VSR 在故障恢复时，能够非常明确地进行日志的“校准”和“同步”。
Paxos 协议本身不直接处理日志的传输。它关注的是如何在一个槽位上达成共识。当一个节点落后时，它可能需要从领导者或其他副本请求缺失的日志条目，或者通过重新执行历史操作来追赶。这通常需要一个额外的、独立于核心 Paxos 的“日志快照”或“状态传输”机制。

实现复杂性感知：

许多人认为，虽然 Paxos 在理论上非常优雅，但其直接实现往往比看起来要复杂得多，尤其是要处理所有边缘情况和优化。Multi-Paxos 解决了连续提案的效率问题，但理解其多轮次、多角色交互仍然需要深刻的洞察。
VSR，因为它直接为 SMR 设计，所以其概念模型与 SMR 的需求（有序日志、主从复制、故障切换）更为贴合。虽然视图变更的细节也很精巧，但其整体流程对于想要实现 SMR 的开发者来说，可能更容易映射到实际代码结构。

6. 实际实现考虑与优化

在实际部署 VSR 时，除了上述协议核心，还需要考虑以下几个方面：

持久化存储 (Stable Storage): 为了在节点崩溃后能够恢复，副本的关键状态必须持久化。这包括：
- current_view_num：当前的视图号。
- log：操作日志。
- commit_num：已提交的日志序列号。
- client_table：客户端请求的最新状态和响应，用于去重。
  这些数据需要在写入内存后，立即刷盘，以确保崩溃恢复时的正确性。
消息传递可靠性: 尽管 VSR 协议能够容忍消息丢失，但底层网络层仍然需要提供基本的可靠性（例如，重传机制）。VSR 通常假定消息最终会送达，或者发送方会通过超时进行重试。
心跳机制: 主副本除了发送Prepare消息来同步日志外，还需要定期向所有备份副本发送心跳消息，以证明自己仍然存活。备份副本利用这些心跳来重置它们的视图变更定时器。
客户端交互: 客户端需要知道当前的主副本是谁。这可以通过一个“配置服务”来提供，或者客户端可以简单地轮询所有副本，直到找到响应其请求的主副本。当主副本变更时，旧的主副本可以通知客户端新的主副本，或者客户端在收到错误响应后重试。
批量处理 (Batching): 为了提高吞吐量，主副本可以将多个客户端请求批处理成一个Prepare消息发送给备份副本。这可以减少网络往返次数和磁盘写入次数。
日志压缩/快照: 随着操作日志的不断增长，它会消耗大量存储空间。定期进行日志压缩或创建状态快照是必要的。已提交并应用到状态机的旧日志条目可以被安全地移除。新加入或从故障中恢复的副本可以直接下载快照，然后从快照点开始追赶剩余的日志。
成员变更 (Membership Changes): 动态地添加或移除副本是一个复杂的挑战，无论是 VSR 还是 Paxos 都需要额外的协议来支持。这通常涉及到一个多阶段的变更过程，以确保在新旧配置之间能够安全地过渡。

7. VSR 的应用与展望

尽管 Paxos 及其变体（如 Raft）在工业界获得了更广泛的关注和实现，但 VSR 仍然是一个非常值得研究和理解的协议。它的设计理念，特别是其视图变更协议中对日志完整性和一致性的直接处理，为我们提供了另一种解决 SMR 问题的视角。

VSR 及其思想在一些系统中有所体现，例如早期的一些分布式存储系统，以及学术界对容错分布式系统的研究。理解 VSR 能够帮助我们更深入地理解分布式共识协议的本质，并为我们设计和实现自己的分布式系统提供灵感。例如，Raft 协议在设计时，也从 VSR 和 Paxos 中汲取了灵感，并试图提供一个“更容易理解”的共识算法，其日志复制和领导者选举机制与 VSR 有异曲同工之处。

展望

Viewstamped Replication，如同分布式共识领域的其他伟大协议一样，是人类在应对分布式系统复杂性方面智慧的结晶。它以其独特的方式，解决了在不可靠环境中构建可靠状态机的核心难题。通过对 VSR 的深入理解，我们不仅掌握了一种强大的共识算法，更重要的是，我们学会了如何从不同的角度审视和解决分布式系统中的挑战。无论是 Paxos、Raft 还是 VSR，它们都为我们构建高性能、高可用的分布式系统奠定了坚实的基础，共同推动着分布式计算领域的发展。未来，随着云计算和边缘计算的普及，对这些基础共识协议的理解和创新，将变得愈发重要。

我的讲座到此结束，感谢大家的聆听！