Incident Analysis | USENIX
In an effort to better learn from what happened across all products and services, Google launched an initiative in 2014 to gather data from all outages and incidents that occurred on production systems for trend analysis into system and user impacts, incident timelines, and root causes. The data is then used to drive improvements across systems, processes, and tools to improve the balance between system stability and development velocity. This talk aims to share Google's approach to setting up and running such an analysis program, some preliminary results, and lessons learned.
incident  sre  postmortem 
3 days ago by rkip
Markers of Progress in Incident Analysis – Adaptive Capacity Labs
These markers or indicators include:

More people will decide to attend post-incident review meetings. Meeting attendance will grow. Engineers will report that they learn things about their systems there (and in the incident analysis write-ups that result) that they can’t anywhere else.

Post-incident review meeting attendance will include people from engineering and customer support not directly involved in the incident under discussion.

Engineers will actively seek focused incident analysis training. They will express interest in topics related to accident investigation and read more on these topics on their own time.

Tools that aid incident analysis and post-incident review meeting preparation, or enrich the post-incident artifacts will appear and be refined.

The number of “orphan” post-incident “action items” (in JIRA or other task-tracking systems) will trend downward. Orphan items will be “adopted” by being reviewed and cross-referenced to incidents and post-incident analysis write-ups.

Post-incident analysis document content will become richer (e.g. include diagrams drawn by participants in post-incident review meetings, the actual transcripts of the incident response and handling, contributions from customer support staff).

The number of unique readers of post-incident analysis write-ups will grow over time. Even months after the analysis is published there will be new views of the document(s). Comments, replies, highlights, tags, and other metadata regarding the content will come from an ever broader audience and spark new dialogue between readers.

Incident analysis documents will be used in new-hire onboarding or training as vehicles to describe in rich detail the histories of involved technologies, the challenges and risks faced by teams, and configuration of systems and dependencies.

Incident analysis document content will be written and organized to make the incident features (sources, conditions, difficulties in handling, etc.) explicit enough that future readers will be able to easily find and understand them. There will be regular evaluation of past incident analysis documents that confirm this.

Engineering teams will use incident analysis documents as primary training materials.

Explicit references to specific incident analysis documents will appear more frequently in company internal documents. Citations of specific incidents in project/product “roadmap” documents, “runbooks”, hiring plans, new systems design proposals, etc., are evidence that the authors understand both the value and the relevance of experience with incidents.

Incident analysis documents originating in engineering groups will routinely be reviewed by those in other groups (such as customer support). Comments from these groups will be included and cross-referenced in the post-incident documents.

Post-incident documents originating in other groups (such as customer support) will routinely be reviewed by engineering groups.
incidents  postmortem  statistics 
6 days ago by liruoko
Gregory Szorc's Digital Home | Mercurial's Journey to and Reflections on Python 3
Mercurial 5.2 was released on November 5, 2019. It is the first version of Mercurial that supports Python 3. This milestone comes nearly 11 years after Python 3.0 was first released on December 3, 2008.

Speaking as a maintainer of Mercurial and an avid user of Python, I feel like the experience of making Mercurial work with Python 3 is worth sharing because there are a number of lessons to be learned.

This post is logically divided into two sections: a mostly factual recount of Mercurial's Python 3 porting effort and a more opinionated commentary of the transition to Python 3 and the Python language ecosystem as a whole. Those who don't care about the mechanics of porting a large Python project to Python 3 may want to skip the next section or two.
python  python2  python3  migration  postmortem  mercurial 
5 weeks ago by bezthomas
Michael Akilian: Worker-in-the-loop Retrospective
Over the last ten years, many companies have created human-in-the-loop services that combine a mix of humans and algorithms. Now that some time has passed, we can tease out some patterns from their collective successes and failures. As someone who started a company in this space, my hope is that this retrospective can help prospective founders, investors, or companies navigating this space save time and fund more impactful projects.

A service is considered human-in-the-loop if it organizes its workflows with the intent to introduce models or heuristics that learn from the work of the humans executing the workflows. In this post, I will make reference to two common forms of human-in-the-loop:

User-in-the-loop (UITL): The end-user is interacting with suggestions from a software heuristic/ML system.
Worker-in-the-loop (WITL): A worker is paid to monitor suggestions from a software heuristic/ML system developed by the same company that pays the worker, but for the ultimate benefit of an end-user.
techtariat  reflection  business  tech  postmortem  automation  startups  hard-tech  ai  machine-learning  human-ml  cost-benefit  analysis  thinking  business-models  things  dimensionality  exploratory  markets  labor  economics  tech-infrastructure  gig-econ 
6 weeks ago by nhaliday
10 Things I learned as VP Growth at a hypergrowth Fintech in London!
postmortem  vpgrowth  london  from twitter_favs
6 weeks ago by thomasj
Michael Akilian
Retrospective by a Clara Labs co-founder on worker-in-the-loop startups.
machinelearning  automation  postmortem 
6 weeks ago by look
