Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems
We highlight one often-overlooked cause of performance failure: limpware – “limping” hardware whose performance degrades significantly compared to its specification. We report anecdotes of degraded disks and network components seen in large-scale production. To measure the system-level impact of limpware, we assembled limpbench, a set of benchmarks that combine data-intensive load and limpware injections. We benchmark five cloud systems (Hadoop, HDFS, ZooKeeper, Cassandra, and HBase) and find that limpware can severely impact distributed operations, nodes, and an entire cluster. From this, we introduce the concept of limplock, a situation where a system progresses slowly due to the presence of limpware and is not capable of failing over to healthy components. We show how each cloud system that we analyze can exhibit operation, node, and cluster limplock. We conclude that many cloud systems are not limpware tolerant.
distributed-systems  papers 
12 days ago by foodbaby
Working with Asynchronous Celery Tasks – lessons learned
Working with Asynchronous Celery Tasks – lessons learned - Added August 14, 2018 at 02:31PM
celery  distributed-systems  python  read2of 
23 days ago by xenocid
Understanding Blockchain Fundamentals, Part 1: Byzantine Fault Tolerance
Understanding Blockchain Fundamentals, Part 1: Byzantine Fault Tolerance - Added June 19, 2018 at 11:57AM
blockchain  distributed-systems  read2of 
29 days ago by xenocid
Serf by HashiCorp
Serf is a decentralized solution for cluster membership, failure detection, and orchestration. Lightweight and highly available.
clustering  messaging  cluster  distributed-systems  decentralization  devops  devtools  discovery  distributed  google 
4 weeks ago by vrobin
Protocol aware recovery for consensus based storage
Within a replicated state machine system, there are three critical persistent data structures: the log, the snapshots, and the metainfo. The log maintains the history of commands, snapshots are used to allow garbage collection of the log and prevent it from growing indefinitely, and the metainfo contains critical metadata such as the log start index. Any of these could be corrupted due to storage faults. None of the current approaches analysed by the authors could correctly recover from such faults.
the-morning-paper  distributed-systems  software-development  adrian-colyer  computer-science  raft  paxos 
4 weeks ago by chriskrycho
