Protocol-Aware Recovery for Consensus-Based Storage

Ramnatthan Alagappan, Aishwarya Ganesan, Eric Lee†, Aws Albarghouthi, Vijay Chidambaram†, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau
University of Wisconsin–Madison    †University of Texas at Austin

https://www.usenix.org/conference/fast18/presentation/alagappan

This paper is included in the Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST '18), February 12–15, 2018, Oakland, CA, USA. ISBN 978-1-931971-42-3. Open access to the Proceedings is sponsored by USENIX.

Abstract

We introduce protocol-aware recovery (PAR), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of PAR through the design and implementation of corruption-tolerant replication (CTRL), a PAR mechanism specific to replicated state machine (RSM) systems. We experimentally show that the CTRL versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the CTRL versions have little performance overhead.

1 Introduction

Failure recovery using redundancy is central to improved reliability of distributed systems [14,22,31,35,61,67]. Distributed systems recover from node crashes and network failures using copies of data and functionality on several nodes [6,47,55]. Similarly, bad or corrupted data on one node should be recovered from redundant copies.

In a static setting where all nodes always remain reachable and where clients do not actively update data, recovering corrupted data from replicas is straightforward; in such a setting, a node could repair its state by simply fetching the data from any other node.

In reality, however, a distributed system is a dynamic environment, constantly in a state of flux. In such settings, orchestrating recovery correctly is surprisingly hard. As a simple example, consider a quorum-based system in which a piece of data is corrupted on one node. When the node tries to recover its data, some nodes may fail and be unreachable, and some nodes may have recently recovered from a failure and so lack the required data or hold a stale version. If enough care is not exercised, the node could "fix" its data from a stale node, overwriting the new data and potentially leading to a data loss.

To correctly recover corrupted data from redundant copies in a distributed system, we propose that a recovery approach should be protocol-aware. A protocol-aware recovery (PAR) approach is carefully designed based on how the distributed system performs updates to its replicated data, elects the leader, and so on. For instance, in the previous example, a PAR mechanism would realize that a faulty node has to query at least R (read quorum) other nodes to safely and quickly recover its data.
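To make the read-quorum intuition concrete, the sketch below (our illustration, not part of the paper's mechanism) contrasts a protocol-oblivious repair, which adopts the entry from the first reachable peer, with a protocol-aware repair, which waits for at least R responses and prefers the most up-to-date entry. The Peer interface, read_entry, and the term-based ordering are hypothetical stand-ins for a Raft/ZAB-like protocol.

    # Illustrative sketch only: oblivious vs. protocol-aware repair of one
    # corrupted log entry. All names here are hypothetical.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Entry:
        term: int      # election term/epoch in which the entry was created
        index: int     # position in the replicated log
        data: bytes

    class Peer:
        def read_entry(self, index: int) -> Optional[Entry]:
            ...        # returns None if this peer lags and lacks the entry

    def oblivious_repair(index: int, peers: List[Peer]) -> Entry:
        # Unsafe: adopts whatever the first responding peer has, even if that
        # peer is lagging, so a committed entry can be silently overwritten.
        for p in peers:
            entry = p.read_entry(index)
            if entry is not None:
                return entry
        raise RuntimeError("no peer had the entry")

    def protocol_aware_repair(index: int, peers: List[Peer], read_quorum: int) -> Entry:
        # Safer: gather responses from at least R (read-quorum) peers, so any
        # entry committed by a write quorum is guaranteed to be observed, then
        # prefer the entry from the highest term.
        entries, responses = [], 0
        for p in peers:
            responses += 1
            entry = p.read_entry(index)
            if entry is not None:
                entries.append(entry)
            if responses >= read_quorum and entries:
                break
        if responses < read_quorum or not entries:
            raise RuntimeError("cannot repair safely yet; keep retrying")
        return max(entries, key=lambda e: e.term)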
In this paper, we apply PAR to replicated state machine (RSM) systems. We focus on RSM systems for two reasons. First, correctly implementing recovery is most challenging for RSM systems because of the strong consistency and durability guarantees they provide [58]; a small misstep in recovery could violate the guarantees. Second, the reliability of RSM systems is crucial: many systems entrust RSM systems with their critical data [45]. For example, Bigtable, GFS, and other systems [7,26] store their metadata on RSM systems such as Chubby [16] or ZooKeeper [4]. Hence, protecting RSM systems from storage faults such as data corruption will improve the reliability of many dependent systems.

We first characterize the different approaches to handling storage faults by developing the RSM recovery taxonomy, through experimental and qualitative analysis of practical systems and methods proposed by prior research (§2). Our analyses show that most approaches employed by currently deployed systems do not use any protocol-level knowledge to perform recovery, leading to disastrous outcomes such as data loss and unavailability.

Thus, to improve the resiliency of RSM systems to storage faults, we design a new protocol-aware recovery approach that we call corruption-tolerant replication, or CTRL (§3). CTRL constitutes two components: a local storage layer and a distributed recovery protocol; while the storage layer reliably detects faults, the distributed protocol recovers faulty data from redundant copies. Both components carefully exploit RSM-specific knowledge to ensure safety (e.g., no data loss) and high availability.

CTRL applies several novel techniques to achieve safety and high availability. For example, a crash-corruption disentanglement technique in the storage layer distinguishes corruptions caused by crashes from disk faults; without this technique, safety violations or unavailability could result. Next, a global-commitment determination protocol in the distributed recovery separates committed items from uncommitted ones; this separation is critical: while recovering faulty committed items is necessary for safety, discarding uncommitted items quickly is crucial for availability. Finally, a novel leader-initiated snapshotting mechanism enables identical snapshots across nodes to greatly simplify recovery.

We implement CTRL in two storage systems that are based on different consensus algorithms: LogCabin [43] (based on Raft [50]) and ZooKeeper [4] (based on ZAB [39]) (§4). Through experiments, we show that the CTRL versions provide safety and high availability in the presence of storage faults, while the original systems remain unsafe or unavailable in many cases; we also show that CTRL induces minimal performance overhead (§5).

2 Background and Motivation

We first provide background on storage faults and RSM systems. We then present the taxonomy of different approaches to handling storage faults in RSM systems.

2.1 Storage Faults in Distributed Systems

Disks and flash devices exhibit a subtle and complex failure model: a few blocks of data could become inaccessible or be silently corrupted [8,9,32,59]. Although such storage faults are rare compared to whole-machine failures, in large-scale distributed systems, even rare failures become prevalent [60,62]. Thus, it is critical to reliably detect and recover from storage faults.

Storage faults occur due to several reasons: media errors [10], program/read disturbance [60], and bugs in firmware [9], device drivers [66], and file systems [27,28]. Storage faults manifest in two ways: block errors and corruption. Block errors (or latent sector errors) arise when the device internally detects a problem with a block and throws an error upon access. Studies of both flash [33,60] and hard drives [10,59] show that block errors are common. Corruption could occur due to lost and misdirected writes that may not be detected by the device. Studies [9,51] and anecdotal evidence [36,37,57] show the prevalence of data corruption in the real world.

Many local file systems, on encountering a storage fault, simply propagate the fault to applications [11,54,64]. For example, ext4 silently returns corrupted data if the underlying device block is corrupted. In contrast, a few file systems transform an underlying fault into a different one; for example, btrfs returns an error to applications if the accessed block is corrupted on the device. In either case, storage systems built atop local file systems should handle corrupted data and storage errors to preserve end-to-end data integrity.
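As a concrete, simplified illustration of the end-to-end integrity checking discussed above, the following sketch shows how an application-level storage layer might distinguish the two fault types: a block error surfaces as an I/O error from the device or file system, whereas silent corruption is caught only by a checksum stored alongside the data. The record layout and function names are our own, not those of any of the systems studied.

    # Illustrative sketch: detecting block errors vs. silent corruption when
    # reading a length-prefixed, checksummed record. The on-disk layout
    # (crc32 + length + payload) is a hypothetical format for illustration.
    import os
    import struct
    import zlib

    HEADER = struct.Struct("<II")  # crc32 of payload, payload length

    def write_record(fd: int, offset: int, payload: bytes) -> None:
        record = HEADER.pack(zlib.crc32(payload), len(payload)) + payload
        os.pwrite(fd, record, offset)

    def read_record(fd: int, offset: int, payload_len: int):
        try:
            buf = os.pread(fd, HEADER.size + payload_len, offset)
        except OSError as err:
            # Latent sector error: the device or file system reports the fault.
            return ("block_error", err)
        if len(buf) < HEADER.size:
            return ("corruption", None)  # truncated or unreadable header
        crc, length = HEADER.unpack_from(buf)
        payload = buf[HEADER.size:HEADER.size + length]
        if length != len(payload) or zlib.crc32(payload) != crc:
            # Silent corruption (e.g., a lost or misdirected write): the read
            # succeeded, but the contents do not match the stored checksum.
            return ("corruption", None)
        return ("ok", payload)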
One way to tackle storage faults is to use RAID-like storage to maintain multiple copies of data on each node. However, many distributed deployments would like to use inexpensive disks [22,31]. Given that the data in a distributed system is inherently replicated, it is wasteful to store multiple copies on each node. Hence, it is important for distributed systems to use the inherent redundancy to recover from storage faults.

2.2 RSM-based Storage Systems

Our goal is to harden RSM systems to storage faults. In an RSM system, a set of nodes compute identical states by executing commands on a state machine (an in-memory data structure on each node) [58]. Typically, clients interact with a single node (the leader) to execute operations on the state machine. Upon receiving a command, the leader durably writes the command to an on-disk log and replicates it to the followers. When a majority of nodes have durably persisted the command in their logs, the leader applies the command to its state machine and returns the result to the client; at this point, the command is committed. The commands in the log have to be applied to the state machine in order. Losing or overwriting committed commands violates the safety property of the state machine. The replicated log is kept consistent across nodes by a consensus protocol such as Paxos [41] or Raft [50].
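The commit path just described can be summarized in a few lines of Python-flavored pseudocode. This is a simplified sketch of the generic RSM flow, not LogCabin or ZooKeeper code; the Log, follower, and state-machine interfaces are hypothetical.

    # Illustrative sketch of the RSM commit path: the leader durably logs a
    # command, replicates it, and applies it only after a majority (including
    # itself) has persisted it. All interfaces here are hypothetical.
    class Leader:
        def __init__(self, log, followers, state_machine):
            self.log = log                      # durable on-disk log (fsync on append)
            self.followers = followers          # follower stubs
            self.state_machine = state_machine  # in-memory state machine
            self.last_applied = 0

        def submit(self, command):
            index = self.log.append(command)    # 1. persist locally
            acks = 1                            # the leader's own durable copy
            for f in self.followers:
                if f.replicate(index, command): # 2. follower persists, then acks
                    acks += 1
            majority = (len(self.followers) + 1) // 2 + 1
            if acks < majority:
                raise TimeoutError("not committed: majority has not persisted it")
            # 3. committed: apply in log order, never skipping or reordering.
            result = None
            while self.last_applied < index:
                self.last_applied += 1
                result = self.state_machine.apply(self.log.read(self.last_applied))
            return result                       # 4. reply to the client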
Because the log can grow indefinitely and exhaust disk space, a snapshot of the in-memory state machine is periodically written to disk and the log is garbage collected. When a node restarts after a crash, it restores the system state by reading the latest on-disk snapshot and the log. The node also recovers its critical metadata (e.g., the log start index) from a structure called the metainfo. Thus, each node maintains three critical persistent data structures: the log, the snapshots, and the metainfo.

These persistent data structures could be corrupted due to storage faults. Practical systems try to safely recover the data and remain available under such failures [15,17]. However, as we will show, none of the current approaches correctly recover from storage faults, motivating the need for a new approach.

2.3 RSM Recovery Taxonomy

To understand the different possible ways of handling storage faults in RSM systems, we analyze a broad range of approaches. We perform this analysis by two means: first, we analyze practical systems including ZooKeeper, LogCabin, etcd [25], and a Paxos-based system [24] using a fault-injection framework we developed (§5); second, we analyze techniques proposed by prior research or used in proprietary systems [15,17].

Through our analysis, we classify the approaches into two categories: protocol-oblivious and protocol-aware. The oblivious approaches do not use any protocol-level knowledge to perform recovery. Upon detecting a fault, these approaches take a recovery action locally on the faulty node; such actions interact with the distributed protocols in unsafe ways, leading to data loss. The protocol-aware approaches use some RSM-specific knowledge to recover; however, they do not use this knowledge correctly, leading to undesirable outcomes. Our taxonomy is not complete in that there may be other techniques; however, to the best of our knowledge, we have not observed other approaches apart from those in our taxonomy.

Figure 1: Sample Scenarios. The figure shows sample scenarios (i)-(vi) in which current approaches fail. Faulty entries are striped. Crashed and lagging nodes are shown as gray and empty boxes, respectively.

Table 1: Recovery Taxonomy. The table shows how different approaches behave in the Figure 1 scenarios. While all current approaches are unsafe or unavailable, CTRL ensures safety and high availability. (Scenario outcomes: E = returns corrupted data, L = data loss, U = unavailable, C = correct; na = not applicable.)

  Class / Approach        Safety  Avail.  Perf.  No Interv.  No Extra Nodes  Fast Recovery  Low Complexity  (i) (ii) (iii) (iv) (v) (vi)
  Protocol-Oblivious
    NoDetection             ×       √       √       √             √              na              √           E   E    E     E    E   E
    Crash                   √       ×       √       ×             √              na              √           U   C    U     C    U   U
    Truncate                ×       √       √       √             √              ×               √           C   L    C     L    L   L
    DeleteRebuild           ×       √       √       ×             √              ×               √           C   L    C     L    L   L
  Protocol-Aware
    MarkNonVoting           ×       ×       √       √             √              ×               √           U   C    U     C    U   U
    Reconfigure             √       ×       √       ×             ×              ×               √           U   C    U     C    U   U
    Byzantine FT            √       ×       ×       √             ×              na              ×           U   C    U     U    U   U
    CTRL                    √       √       √       √             √              √               √           C   C    C     C    C   C

To illustrate the problems, we use Figure 1. In all cases, log entries† 1, 2, and 3 are committed; losing these items will violate safety. Table 1 shows how each approach behaves in Figure 1's scenarios. As shown in the table, all current approaches lead to safety violations (e.g., data loss), low availability, or both. A recovery mechanism that effectively uses redundancy should be safe and available in all cases. Table 1 also compares the approaches along other axes such as performance, maintenance overhead (intervention and extra nodes), recovery time, and complexity. Although Figure 1 shows only faults in the log, the taxonomy applies to other structures including the snapshots and the metainfo.

† A log entry contains a state-machine command and data.

NoDetection. The simplest reaction to storage faults is none at all: to trust every layer in the storage stack to work reliably. For example, a few prototype Paxos-based systems [24] do not use checksums for their on-disk data; similarly, LogCabin does not protect its snapshots with checksums. NoDetection trivially violates safety; corrupted data can be obliviously served to clients. However, deployed systems do use checksums and other integrity strategies for most of their on-disk data.

Crash. A better strategy is to use checksums and handle I/O errors, and crash the node on detecting a fault. Crash may seem like a good strategy because it intends to prevent any damage that the faulty node may inflict on the system. Our experiments show that the Crash approach is common: LogCabin, ZooKeeper, and etcd sometimes crash when their logs are faulty. Also, ZooKeeper crashes when its snapshots are corrupted.

Although Crash preserves safety, it suffers from severe unavailability. Given that nodes could be unavailable due to other failures, even a single storage fault results in unavailability, as shown in Figure 1(i). Similarly, a single fault even in different portions of data on a majority (e.g., Figure 1(v)) renders the system unavailable. Note that simply restarting the node does not help; storage faults, unlike other faults, could be persistent: the node will encounter the same fault and crash again until manual intervention, which is error-prone and may cause a data loss. Thus, it is desirable to recover automatically.

Truncate. A more sophisticated action is to truncate the (possibly faulty) portions of data and continue operating. The intuition behind Truncate is that if the faulty data is discarded, the node can continue to operate (unlike Crash), improving availability.

However, we find that Truncate can cause a safety violation (data loss). Consider the scenario shown in Figure 2 in which entry 1 is corrupted on S1; S4 and S5 are lagging and do not have any entries. Assume S2 is the leader. When S1 reads its log, it detects the corruption; however, S1 truncates its log, losing the corrupted entry and all subsequent entries (Figure 2(ii)). Meanwhile, S2 (the leader) and S3 crash. S1, S4, and S5 form a majority and elect S1 the leader. Now the system does not have any knowledge of committed entries 1, 2, and 3, resulting in a silent data loss. The system also commits new entries x, y, and z in the place of 1, 2, and 3 (Figure 2(iii)). Finally, when S2 and S3 recover, they follow S1's log (Figure 2(iv)), completely removing entries 1, 2, and 3.

In summary, although the faulty node detects the corruption, it truncates its log, losing the data locally. When this node forms a majority along with other nodes that are lagging, data is silently lost, violating safety. We find this safety violation in ZooKeeper and LogCabin.
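The Figure 2 sequence of events can be replayed as a small thought experiment in code. The sketch below is ours, with deliberately simplified election and replication rules; it only demonstrates why local truncation plus an election won by a lagging majority silently erases the committed entries 1, 2, and 3.

    # Illustrative replay of the Truncate data-loss scenario (Figure 2), using
    # simplified election/replication rules. Not real system code.
    logs = {
        "S1": ["1", "2", "3"],   # entry "1" becomes corrupted on S1
        "S2": ["1", "2", "3"],   # current leader
        "S3": ["1", "2", "3"],
        "S4": [],                # lagging follower
        "S5": [],                # lagging follower
    }
    committed = ["1", "2", "3"]          # entries acknowledged to clients

    # (ii) S1 detects the corrupted entry and truncates it and everything after.
    logs["S1"] = []

    # (iii) S2 and S3 crash; S1, S4, and S5 form a majority and elect S1 leader.
    # None of them remembers entries 1-3, so new client commands reuse those slots.
    for node in ("S1", "S4", "S5"):
        logs[node] = ["x", "y", "z"]
    committed_after = ["x", "y", "z"]

    # (iv) S2 and S3 recover and adopt the new leader's log, erasing 1-3 everywhere.
    for node in ("S2", "S3"):
        logs[node] = list(logs["S1"])

    lost = [e for e in committed if e not in committed_after]
    print("silently lost committed entries:", lost)   # -> ['1', '2', '3']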
Further, Truncate suffers from inefficient recovery. For instance, in Figure 1(i), S1 truncates its log after a fault, losing entries 1, 2, and 3. Now, to fix S1's log, the leader needs to transfer all entries, increasing S1's recovery time and wasting network bandwidth. ZooKeeper and LogCabin suffer from this slow recovery problem.

DeleteRebuild. Another commonly employed action is to manually delete all data on the faulty node and restart the node. Unfortunately, similar to Truncate, DeleteRebuild can violate safety; specifically, a node whose data is deleted could form a majority along with the lagging nodes, leading to a silent data loss. Surprisingly, administrators often use this approach hoping that the faulty