jabley : distributedsystems   83

Recovering Shared Objects Without Stable Storage
This paper considers the problem of building fault-tolerant shared objects when processes can crash
and recover but lose their persistent state on recovery. This Diskless Crash-Recovery (DCR) model
matches the way many long-lived systems are built. We show that it presents new challenges, as
operations that are recorded at a quorum may not persist after some of the processes in that quorum
crash and then recover.
To address this problem, we introduce the notion of crash-consistent quorums, where no recoveries
happen during the quorum responses. We show that relying on crash-consistent quorums enables
a recovery procedure that can recover all operations that successfully finished. Crash-consistent quorums
can be easily identified using a mechanism we term the crash vector, which tracks the causal
relationship between crashes, recoveries, and other operations.
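As a rough illustration of the idea in the paragraph above, the following minimal sketch models a crash vector as a per-node recovery counter and checks whether a set of quorum responses is crash-consistent. The names `Response`, `merge`, and `is_crash_consistent` are hypothetical, and the consistency rule used here (no responder in the quorum has seen a later incarnation of another responder than the one that actually replied) is an assumption about how such a check could work, not the paper's exact definition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Response:
    """A quorum reply tagged with the sender's crash vector (hypothetical type)."""
    sender: str
    crash_vector: Dict[str, int]  # node id -> number of recoveries the sender knows of


def merge(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Combine two crash vectors by taking the element-wise maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def is_crash_consistent(responses: List[Response]) -> bool:
    """Assumed rule: reject the quorum if any responder knows of a newer
    incarnation of another responder than the one that sent its reply."""
    for r in responses:
        own_incarnation = r.crash_vector.get(r.sender, 0)
        for other in responses:
            if other.crash_vector.get(r.sender, 0) > own_incarnation:
                return False
    return True


# Node "a" replied before crashing; node "b" replied after learning that
# "a" had recovered once, so the two replies do not form a consistent quorum.
stale = [
    Response("a", {"a": 0, "b": 0}),
    Response("b", {"a": 1, "b": 0}),
]
fresh = [
    Response("a", {"a": 1, "b": 0}),
    Response("b", {"a": 1, "b": 0}),
]
```

In this sketch, `is_crash_consistent(stale)` is false because the reply from "a" predates a recovery that "b" already knows about, while `is_crash_consistent(fresh)` is true.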
We apply crash-consistent quorums and crash vectors to build two storage primitives. We give
a new algorithm for multi-writer, multi-reader atomic registers in the DCR model that guarantees
safety under all conditions and termination under a natural condition. It improves on the best prior
protocol for this problem by requiring fewer rounds, fewer nodes to participate in the quorum, and
a less restrictive liveness condition. We also present a more efficient single-writer, single-reader
atomic set—a virtual stable storage abstraction. It can be used to lift any existing algorithm from
the traditional Crash-Recovery model to the DCR model. We examine a specific application, state
machine replication, and show that existing diskless protocols can violate their correctness guarantees,
while ours offers a general and correct solution.
paper  filetype:pdf  comp-sci  distributedsystems  consensus  resilience  crash 
october 2017 by jabley
Lineage-driven Fault Injection
Failure is always an option; in large-scale data management systems,
it is practically a certainty. Fault-tolerant protocols and components
are notoriously difficult to implement and debug. Worse
still, choosing existing fault-tolerance mechanisms and integrating
them correctly into complex systems remains an art form, and programmers
have few tools to assist them.
We propose a novel approach for discovering bugs in fault-tolerant
data management systems: lineage-driven fault injection. A lineage-driven
fault injector reasons backwards from correct system outcomes
to determine whether failures in the execution could have
prevented the outcome. We present MOLLY, a prototype of lineage-driven
fault injection that exploits a novel combination of data lineage
techniques from the database literature and state-of-the-art
satisfiability testing. If fault-tolerance bugs exist for a particular
configuration, MOLLY finds them rapidly, in many cases using an
order of magnitude fewer executions than random fault injection.
Otherwise, MOLLY certifies that the code is bug-free for that configuration.
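The backwards reasoning described above can be sketched in a few lines. This is only a toy illustration, not MOLLY's implementation: it assumes the lineage of a correct outcome is given as a list of alternative derivations, each a set of messages that together suffice to produce the outcome, and it swaps the satisfiability solver mentioned in the abstract for brute-force enumeration. The function name `fault_hypotheses` and the message labels are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Set


def fault_hypotheses(derivations: List[Set[str]], max_faults: int) -> List[FrozenSet[str]]:
    """Return minimal sets of message drops (up to max_faults) that break
    every known derivation of the good outcome.

    Each returned set is a candidate fault to inject: if the outcome still
    occurs with those messages dropped, the system found another derivation;
    if not, a fault-tolerance bug has been found.
    """
    events = sorted(set().union(*derivations))
    found: List[FrozenSet[str]] = []
    for k in range(1, max_faults + 1):
        for combo in combinations(events, k):
            candidate = frozenset(combo)
            # Skip supersets of a smaller hypothesis already found.
            if any(prev <= candidate for prev in found):
                continue
            # A hypothesis must remove at least one message from every derivation.
            if all(candidate & d for d in derivations):
                found.append(candidate)
    return found


# Two redundant ways the outcome was reached: a relay through B, or a direct send.
lineage = [{"A->B", "B->C"}, {"A->C"}]
hypotheses = fault_hypotheses(lineage, max_faults=2)
```

With this lineage, no single dropped message defeats both derivations, so the only hypotheses worth injecting are the two-message combinations that cut both paths. If no hypothesis exists within the fault budget, the outcome is tolerant to that many message losses for the derivations considered.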
testing  paper  netflix  distributedsystems  filetype:pdf 
may 2016 by jabley