Protocol-Aware Recovery for Consensus-Based Storage

Ramnatthan Alagappan, Aishwarya Ganesan, Eric Lee†, Aws Albarghouthi, Vijay Chidambaram†, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
University of Wisconsin–Madison    †University of Texas at Austin

https://www.usenix.org/conference/fast18/presentation/alagappan

This paper is included in the Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST '18), February 12–15, 2018, Oakland, CA, USA. ISBN 978-1-931971-42-3. Open access to the Proceedings is sponsored by USENIX.

Abstract

We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions have little performance overhead.

1 Introduction

Failure recovery using redundancy is central to improved reliability of distributed systems [14,22,31,35,61,67]. Distributed systems recover from node crashes and network failures using copies of data and functionality on several nodes [6,47,55]. Similarly, bad or corrupted data on one node should be recovered from redundant copies.

In a static setting where all nodes always remain reachable and where clients do not actively update data, recovering corrupted data from replicas is straightforward; in such a setting, a node could repair its state by simply fetching the data from any other node.

In reality, however, a distributed system is a dynamic environment, constantly in a state of flux. In such settings, orchestrating recovery correctly is surprisingly hard. As a simple example, consider a quorum-based system in which a piece of data is corrupted on one node. When the node tries to recover its data, some nodes may fail and be unreachable, and some nodes may have recently recovered from a failure and so lack the required data or hold a stale version. If enough care is not exercised, the node could "fix" its data from a stale node, overwriting the new data and potentially leading to a data loss.

To correctly recover corrupted data from redundant copies in a distributed system, we propose that a recovery approach should be protocol-aware. A protocol-aware recovery (PAR) approach is carefully designed based on how the distributed system performs updates to its replicated data, elects the leader, and so on. For instance, in the previous example, a PAR mechanism would realize that a faulty node has to query at least R (read quorum) other nodes to safely and quickly recover its data.
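To make the read-quorum intuition concrete, the sketch below (our illustration, not part of the paper's mechanism) contrasts a protocol-oblivious repair, which adopts the entry from the first reachable peer, with a protocol-aware repair, which waits for at least R responses and prefers the most up-to-date entry. The Peer interface, read_entry, and the term-based ordering are hypothetical stand-ins for a Raft/ZAB-like protocol.

    # Illustrative sketch only: oblivious vs. protocol-aware repair of one
    # corrupted log entry. All names here are hypothetical.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Entry:
        term: int      # election term/epoch in which the entry was created
        index: int     # position in the replicated log
        data: bytes

    class Peer:
        def read_entry(self, index: int) -> Optional[Entry]:
            ...        # returns None if this peer lags and lacks the entry

    def oblivious_repair(index: int, peers: List[Peer]) -> Entry:
        # Unsafe: adopts whatever the first responding peer has, even if that
        # peer is lagging, so a committed entry can be silently overwritten.
        for p in peers:
            entry = p.read_entry(index)
            if entry is not None:
                return entry
        raise RuntimeError("no peer had the entry")

    def protocol_aware_repair(index: int, peers: List[Peer], read_quorum: int) -> Entry:
        # Safer: gather responses from at least R (read-quorum) peers, so any
        # entry committed by a write quorum is guaranteed to be observed, then
        # prefer the entry from the highest term.
        entries, responses = [], 0
        for p in peers:
            responses += 1
            entry = p.read_entry(index)
            if entry is not None:
                entries.append(entry)
            if responses >= read_quorum and entries:
                break
        if responses < read_quorum or not entries:
            raise RuntimeError("cannot repair safely yet; keep retrying")
        return max(entries, key=lambda e: e.term)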
In this paper, we apply PAR to replicated state machine (RSM) systems. We focus on RSM systems for two reasons. First, correctly implementing recovery is most challenging for RSM systems because of the strong consistency and durability guarantees they provide [58]; a small misstep in recovery could violate the guarantees. Second, the reliability of RSM systems is crucial: many systems entrust RSM systems with their critical data [45]. For example, Bigtable, GFS, and other systems [7,26] store their metadata on RSM systems such as Chubby [16] or ZooKeeper [4]. Hence, protecting RSM systems from storage faults such as data corruption will improve the reliability of many dependent systems.

We first characterize the different approaches to handling storage faults by developing the RSM recovery taxonomy, through experimental and qualitative analysis of practical systems and methods proposed by prior research (§2). Our analyses show that most approaches employed by currently deployed systems do not use any protocol-level knowledge to perform recovery, leading to disastrous outcomes such as data loss and unavailability.

Thus, to improve the resiliency of RSM systems to storage faults, we design a new protocol-aware recovery approach that we call corruption-tolerant replication, or CTRL (§3). CTRL constitutes two components: a local storage layer and a distributed recovery protocol; while the storage layer reliably detects faults, the distributed protocol recovers faulty data from redundant copies. Both components carefully exploit RSM-specific knowledge to ensure safety (e.g., no data loss) and high availability.

CTRL applies several novel techniques to achieve safety and high availability. For example, a crash-corruption disentanglement technique in the storage layer distinguishes corruptions caused by crashes from disk faults; without this technique, safety violations or unavailability could result. Next, a global-commitment determination protocol in the distributed recovery separates committed items from uncommitted ones; this separation is critical: while recovering faulty committed items is necessary for safety, discarding uncommitted items quickly is crucial for availability. Finally, a novel leader-initiated snapshotting mechanism enables identical snapshots across nodes to greatly simplify recovery.

We implement CTRL in two storage systems that are based on different consensus algorithms: LogCabin [43] (based on Raft [50]) and ZooKeeper [4] (based on ZAB [39]) (§4). Through experiments, we show that the CTRL versions provide safety and high availability in the presence of storage faults, while the original systems remain unsafe or unavailable in many cases; we also show that CTRL induces minimal performance overhead (§5).

2 Background and Motivation

We first provide background on storage faults and RSM systems. We then present the taxonomy of different approaches to handling storage faults in RSM systems.

2.1 Storage Faults in Distributed Systems

Disks and flash devices exhibit a subtle and complex failure model: a few blocks of data could become inaccessible or be silently corrupted [8,9,32,59]. Although such storage faults are rare compared to whole-machine failures, in large-scale distributed systems, even rare failures become prevalent [60,62]. Thus, it is critical to reliably detect and recover from storage faults.

Storage faults occur due to several reasons: media errors [10], program/read disturbance [60], and bugs in firmware [9], device drivers [66], and file systems [27,28]. Storage faults manifest in two ways: block errors and corruption. Block errors (or latent sector errors) arise when the device internally detects a problem with a block and throws an error upon access. Studies of both flash [33,60] and hard drives [10,59] show that block errors are common. Corruption could occur due to lost and misdirected writes that may not be detected by the device. Studies [9,51] and anecdotal evidence [36,37,57] show the prevalence of data corruption in the real world.

Many local file systems, on encountering a storage fault, simply propagate the fault to applications [11,54,64]. For example, ext4 silently returns corrupted data if the underlying device block is corrupted. In contrast, a few file systems transform an underlying fault into a different one; for example, btrfs returns an error to applications if the accessed block is corrupted on the device. In either case, storage systems built atop local file systems should handle corrupted data and storage errors to preserve end-to-end data integrity.
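As a concrete, simplified illustration of the end-to-end integrity checking discussed above, the following sketch shows how an application-level storage layer might distinguish the two fault types: a block error surfaces as an I/O error from the device or file system, whereas silent corruption is caught only by a checksum stored alongside the data. The record layout and function names are our own, not those of any of the systems studied.

    # Illustrative sketch: detecting block errors vs. silent corruption when
    # reading a length-prefixed, checksummed record. The on-disk layout
    # (crc32 + length + payload) is a hypothetical format for illustration.
    import os
    import struct
    import zlib

    HEADER = struct.Struct("<II")  # crc32 of payload, payload length

    def write_record(fd: int, offset: int, payload: bytes) -> None:
        record = HEADER.pack(zlib.crc32(payload), len(payload)) + payload
        os.pwrite(fd, record, offset)

    def read_record(fd: int, offset: int, payload_len: int):
        try:
            buf = os.pread(fd, HEADER.size + payload_len, offset)
        except OSError as err:
            # Latent sector error: the device or file system reports the fault.
            return ("block_error", err)
        if len(buf) < HEADER.size:
            return ("corruption", None)  # truncated or unreadable header
        crc, length = HEADER.unpack_from(buf)
        payload = buf[HEADER.size:HEADER.size + length]
        if length != len(payload) or zlib.crc32(payload) != crc:
            # Silent corruption (e.g., a lost or misdirected write): the read
            # succeeded, but the contents do not match the stored checksum.
            return ("corruption", None)
        return ("ok", payload)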
One way to tackle storage faults is to use RAID-like storage to maintain multiple copies of data on each node. However, many distributed deployments would like to use inexpensive disks [22,31]. Given that the data in a distributed system is inherently replicated, it is wasteful to store multiple copies on each node. Hence, it is important for distributed systems to use the inherent redundancy to recover from storage faults.

2.2 RSM-based Storage Systems

Our goal is to harden RSM systems to storage faults. In an RSM system, a set of nodes compute identical states by executing commands on a state machine (an in-memory data structure on each node) [58]. Typically, clients interact with a single node (the leader) to execute operations on the state machine. Upon receiving a command, the leader durably writes the command to an on-disk log and replicates it to the followers. When a majority of nodes have durably persisted the command in their logs, the leader applies the command to its state machine and returns the result to the client; at this point, the command is committed. The commands in the log have to be applied to the state machine in order. Losing or overwriting committed commands violates the safety property of the state machine. The replicated log is kept consistent across nodes by a consensus protocol such as Paxos [41] or Raft [50].
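The commit path just described can be summarized in a few lines of Python-flavored pseudocode. This is a simplified sketch of the generic RSM flow, not LogCabin or ZooKeeper code; the Log, follower, and state-machine interfaces are hypothetical.

    # Illustrative sketch of the RSM commit path: the leader durably logs a
    # command, replicates it, and applies it only after a majority (including
    # itself) has persisted it. All interfaces here are hypothetical.
    class Leader:
        def __init__(self, log, followers, state_machine):
            self.log = log                      # durable on-disk log (fsync on append)
            self.followers = followers          # follower stubs
            self.state_machine = state_machine  # in-memory state machine
            self.last_applied = 0

        def submit(self, command):
            index = self.log.append(command)    # 1. persist locally
            acks = 1                            # the leader's own durable copy
            for f in self.followers:
                if f.replicate(index, command): # 2. follower persists, then acks
                    acks += 1
            majority = (len(self.followers) + 1) // 2 + 1
            if acks < majority:
                raise TimeoutError("not committed: majority has not persisted it")
            # 3. committed: apply in log order, never skipping or reordering.
            result = None
            while self.last_applied < index:
                self.last_applied += 1
                result = self.state_machine.apply(self.log.read(self.last_applied))
            return result                       # 4. reply to the client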
Because the log can grow indefinitely and exhaust disk space, a snapshot of the in-memory state machine is periodically written to disk and the log is garbage collected. When a node restarts after a crash, it restores the system state by reading the latest on-disk snapshot and the log. The node also recovers its critical metadata (e.g., the log start index) from a structure called the metainfo. Thus, each node maintains three critical persistent data structures: the log, the snapshots, and the metainfo.

These persistent data structures could be corrupted due to storage faults. Practical systems try to safely recover the data and remain available under such failures [15,17]. However, as we will show, none of the current approaches correctly recover from storage faults, motivating the need for a new approach.

2.3 RSM Recovery Taxonomy

To understand the different possible ways of handling storage faults in RSM systems, we analyze a broad range of approaches. We perform this analysis by two means: first, we analyze practical systems including ZooKeeper, LogCabin, etcd [25], and a Paxos-based system [24] using a fault-injection framework we developed (§5); second, we analyze techniques proposed by prior research or used in proprietary systems [15,17].

Through our analysis, we classify the approaches into two categories: protocol-oblivious and protocol-aware. The oblivious approaches do not use any protocol-level knowledge to perform recovery. Upon detecting a fault, these approaches take a recovery action locally on the faulty node; such actions interact with the distributed protocols in unsafe ways, leading to data loss. The protocol-aware approaches use some RSM-specific knowledge to recover; however, they do not use this knowledge correctly, leading to undesirable outcomes. Our taxonomy is not complete in that there may be other techniques; however, to the best of our knowledge, we have not observed other approaches apart from those in our taxonomy.

Figure 1: Sample Scenarios. The figure shows sample scenarios (i)-(vi) in which current approaches fail. Faulty entries are striped. Crashed and lagging nodes are shown as gray and empty boxes, respectively.

Table 1: Recovery Taxonomy. The table shows how different approaches behave in the Figure 1 scenarios. While all current approaches are unsafe or unavailable, CTRL ensures safety and high availability. (Scenario outcomes: E = returns corrupted data, L = data loss, U = unavailable, C = correct; na = not applicable.)

  Class / Approach        Safety  Avail.  Perf.  No Interv.  No Extra Nodes  Fast Recovery  Low Complexity  (i) (ii) (iii) (iv) (v) (vi)
  Protocol-Oblivious
    NoDetection             ×       √       √       √             √              na              √           E   E    E     E    E   E
    Crash                   √       ×       √       ×             √              na              √           U   C    U     C    U   U
    Truncate                ×       √       √       √             √              ×               √           C   L    C     L    L   L
    DeleteRebuild           ×       √       √       ×             √              ×               √           C   L    C     L    L   L
  Protocol-Aware
    MarkNonVoting           ×       ×       √       √             √              ×               √           U   C    U     C    U   U
    Reconfigure             √       ×       √       ×             ×              ×               √           U   C    U     C    U   U
    Byzantine FT            √       ×       ×       √             ×              na              ×           U   C    U     U    U   U
    CTRL                    √       √       √       √             √              √               √           C   C    C     C    C   C

To illustrate the problems, we use Figure 1. In all cases, log entries† 1, 2, and 3 are committed; losing these items will violate safety. Table 1 shows how each approach behaves in Figure 1's scenarios. As shown in the table, all current approaches lead to safety violations (e.g., data loss), low availability, or both. A recovery mechanism that effectively uses redundancy should be safe and available in all cases. Table 1 also compares the approaches along other axes such as performance, maintenance overhead (intervention and extra nodes), recovery time, and complexity. Although Figure 1 shows only faults in the log, the taxonomy applies to other structures including the snapshots and the metainfo.

† A log entry contains a state-machine command and data.

NoDetection. The simplest reaction to storage faults is none at all: to trust every layer in the storage stack to work reliably. For example, a few prototype Paxos-based systems [24] do not use checksums for their on-disk data; similarly, LogCabin does not protect its snapshots with checksums. NoDetection trivially violates safety; corrupted data can be obliviously served to clients. However, deployed systems do use checksums and other integrity strategies for most of their on-disk data.

Crash. A better strategy is to use checksums and handle I/O errors, and crash the node on detecting a fault. Crash may seem like a good strategy because it intends to prevent any damage that the faulty node may inflict on the system. Our experiments show that the Crash approach is common: LogCabin, ZooKeeper, and etcd sometimes crash when their logs are faulty. Also, ZooKeeper crashes when its snapshots are corrupted.

Although Crash preserves safety, it suffers from severe unavailability. Given that nodes could be unavailable due to other failures, even a single storage fault results in unavailability, as shown in Figure 1(i). Similarly, a single fault even in different portions of data on a majority (e.g., Figure 1(v)) renders the system unavailable. Note that simply restarting the node does not help; storage faults, unlike other faults, could be persistent: the node will encounter the same fault and crash again until manual intervention, which is error-prone and may cause a data loss. Thus, it is desirable to recover automatically.

Truncate. A more sophisticated action is to truncate the (possibly faulty) portions of data and continue operating. The intuition behind Truncate is that if the faulty data is discarded, the node can continue to operate (unlike Crash), improving availability.

However, we find that Truncate can cause a safety violation (data loss). Consider the scenario shown in Figure 2 in which entry 1 is corrupted on S1; S4 and S5 are lagging and do not have any entries. Assume S2 is the leader. When S1 reads its log, it detects the corruption; however, S1 truncates its log, losing the corrupted entry and all subsequent entries (Figure 2(ii)). Meanwhile, S2 (the leader) and S3 crash. S1, S4, and S5 form a majority and elect S1 the leader. Now the system does not have any knowledge of committed entries 1, 2, and 3, resulting in a silent data loss. The system also commits new entries x, y, and z in the place of 1, 2, and 3 (Figure 2(iii)). Finally, when S2 and S3 recover, they follow S1's log (Figure 2(iv)), completely removing entries 1, 2, and 3.

In summary, although the faulty node detects the corruption, it truncates its log, losing the data locally. When this node forms a majority along with other nodes that are lagging, data is silently lost, violating safety. We find this safety violation in ZooKeeper and LogCabin.
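The Figure 2 sequence of events can be replayed as a small thought experiment in code. The sketch below is ours, with deliberately simplified election and replication rules; it only demonstrates why local truncation plus an election won by a lagging majority silently erases the committed entries 1, 2, and 3.

    # Illustrative replay of the Truncate data-loss scenario (Figure 2), using
    # simplified election/replication rules. Not real system code.
    logs = {
        "S1": ["1", "2", "3"],   # entry "1" becomes corrupted on S1
        "S2": ["1", "2", "3"],   # current leader
        "S3": ["1", "2", "3"],
        "S4": [],                # lagging follower
        "S5": [],                # lagging follower
    }
    committed = ["1", "2", "3"]          # entries acknowledged to clients

    # (ii) S1 detects the corrupted entry and truncates it and everything after.
    logs["S1"] = []

    # (iii) S2 and S3 crash; S1, S4, and S5 form a majority and elect S1 leader.
    # None of them remembers entries 1-3, so new client commands reuse those slots.
    for node in ("S1", "S4", "S5"):
        logs[node] = ["x", "y", "z"]
    committed_after = ["x", "y", "z"]

    # (iv) S2 and S3 recover and adopt the new leader's log, erasing 1-3 everywhere.
    for node in ("S2", "S3"):
        logs[node] = list(logs["S1"])

    lost = [e for e in committed if e not in committed_after]
    print("silently lost committed entries:", lost)   # -> ['1', '2', '3']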
Further, Truncate suffers from inefficient recovery. For instance, in Figure 1(i), S1 truncates its log after a fault, losing entries 1, 2, and 3. Now, to fix S1's log, the leader needs to transfer all entries, increasing S1's recovery time and wasting network bandwidth. ZooKeeper and LogCabin suffer from this slow recovery problem.

DeleteRebuild. Another commonly employed action is to manually delete all data on the faulty node and restart the node. Unfortunately, similar to Truncate, DeleteRebuild can violate safety; specifically, a node whose data is deleted could form a majority along with the lagging nodes, leading to a silent data loss. Surprisingly, administrators often use this approach hoping that the faulty