cote : sre   9

Scale and velocity are driving the next generation of DevOps
This massive growth in scale has required an evolution in practices and organization to achieve success. Most of what technologists are aware of in this regard is labeled “DevOps,” but there is more nuance to it than that. The way infrastructure capacity is allocated becomes decoupled from specific hardware, so the infrastructure team has to adapt new tools. The way databases and message busses, among other things, are operated and made available to applications has become more “self-service”, and thus those teams have to see themselves as service providers rather than as infrastructure teams.
DevOps  SRE  Platform  cloudnative  links  via:Workflow 
10 weeks ago by cote
Only the good meetings
> Finding a cadence upon which to work as an engineer can be difficult. As engineers are generally averse to meetings, oftentimes we wind up with sporadic meetings and a lot of people who are unclear on their priorities and goals. On the other side, we can find ourselves in environments that are extremely meeting heavy, and engineers often left wondering when there will be time to actually do the work they believed they were hired to do. The establishment of only necessary meetings, at specifically defined times, allows engineers to plan their time to minimize context switching, and to maximize the time invested in their meetings with one another.
agile  meetings  sre 
january 2019 by cote
SRE: The Biggest Lie Since Kanban
> That’s why SRE is a Big Lie – because it enables people to say they’re doing a thing that could help their organization succeed, and their dev and ops engineers to have a better career and life while doing so – but not really do it. Yes, there have been Big Lies before, which is why I cite Kanban as another example – but even if the new criminal is pretty much like the old criminal, you still put their picture up on the post office wall.


> If something you’re selling is profoundly misused it’s your responsibility to be more up front about the issues.
devops  rants  sre 
october 2018 by cote
Preliminary Analysis of the Site Reliability Engineer Survey
If the response takes too long to get to your phone, the system might as well be "unavailable":

'If a page takes too long to load a user will consider it to be unavailable. I realized after the fact the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system up or down, latency as delays before a response is generated and end-user response time as how long before the user received the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs availability means more than is a system up or down. If the response time or latency exceeds a certain threshold the application is considered unavailable.'
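[A rough sketch of the point in the quote, not anything from the survey itself: count a request as "available" only if it succeeded and came back under a latency threshold. The function name, the `(succeeded, response_ms)` tuple shape and the 1000 ms default are all assumptions made for illustration.]

```python
from typing import Iterable, Tuple


def availability(requests: Iterable[Tuple[bool, float]],
                 threshold_ms: float = 1000.0) -> float:
    """Fraction of requests that succeeded AND responded within threshold_ms.

    Each request is a hypothetical (succeeded, response_ms) pair. A slow
    success is treated the same as a failure, per the quote above.
    """
    observed = list(requests)
    if not observed:
        raise ValueError("no requests observed in this window")
    good = sum(1 for succeeded, response_ms in observed
               if succeeded and response_ms <= threshold_ms)
    return good / len(observed)


# Illustrative data: the 10-second page load succeeds but is counted as
# unavailable, just like the outright error.
window = [(True, 120.0), (True, 10000.0), (False, 80.0), (True, 300.0)]
print(availability(window))
```

[With a plain up/down definition the same window would score three out of four; the threshold is what pulls the slow page load into the "unavailable" bucket.]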
sre  monitoring  metrics  itmanagement  availability  SLAs  suveys 
july 2018 by cote
Monitoring SRE's Golden Signals
Lists out how to get the metrics from various systems and software.
sre  monitoring  metrics  itmanagement 
july 2018 by cote
How to Monitor the SRE Golden Signals
[Summary of the metrics the post recommends:]

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
Utilization — How busy the resource or system is. Usually expressed 0–100% and most useful for predictions (as Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
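
[A minimal sketch of how the five signals above might be derived from one observation window. This is not code from the post: the `Sample` fields, the use of mean latency, and worker counts as the direct utilization measurement are assumptions made for illustration.]

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    """One observation window for a single service (all fields hypothetical)."""
    window_seconds: float
    requests: int              # requests seen during the window
    errors: int                # failed requests during the window
    latencies_ms: List[float]  # response times, including queue/wait time
    queue_depth: int           # requests waiting at the end of the window
    busy_workers: int
    total_workers: int


def golden_signals(s: Sample) -> Dict[str, float]:
    """Compute Rate, Errors, Latency, Saturation and Utilization for a window."""
    mean_latency = (sum(s.latencies_ms) / len(s.latencies_ms)
                    if s.latencies_ms else 0.0)
    return {
        "rate_per_sec": s.requests / s.window_seconds,
        "errors_per_sec": s.errors / s.window_seconds,
        "latency_ms": mean_latency,
        # Saturation is read straight off the queue: non-zero means work is waiting.
        "saturation": float(s.queue_depth),
        # Utilization is measured directly (busy / total), not via the Utilization Law.
        "utilization_pct": 100.0 * s.busy_workers / s.total_workers,
    }


sample = Sample(window_seconds=60.0, requests=1200, errors=30,
                latencies_ms=[100.0, 200.0, 300.0],
                queue_depth=4, busy_workers=6, total_workers=8)
print(golden_signals(sample))
```

[In practice a percentile (p95 or p99) is often a more telling latency figure than the mean; the mean is used here only to keep the sketch short.]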
devops  metrics  sre  monitoring 
july 2018 by cote
Splunk acquires VictorOps to take it – and you – into site reliability engineering
“Adding these tools to Splunk’s roster, Mann said, means it can now monitor apps, provide an environment in which to fix them and allow the deeper investigations that figure out root cause of problems and allow re-designs of infrastructure and code to stop them recurring.”
monitoring  logs  m&a  splunk  sre 
june 2018 by cote
Full Cycle Developers at Netflix
How Netflix thinks about standardized platforms and tools, plus their adaptation of DevOps and SRE.

“Full cycle developers apply engineering discipline to all areas of the life cycle. They evaluate problems from a developer perspective and ask questions like “how can I automate what is needed to operate this system?” and “what self-service tool will enable my partners to answer their questions without needing me to be involved?” This helps our teams scale by favoring systems-focused rather than humans-focused thinking and automation over manual approaches.”
cases  sre  devops  netflix  platforms  cloudnative 
may 2018 by cote
