Brian Christie

@theBrc007

Followers
634
Following
20K
Media
37
Statuses
2K

SRE in SF. Statistics & Monitoring. Human Factors. Lean applied to Software. All tweets my own

Worldwide
Joined September 2007
@theBrc007
Brian Christie
4 years
Tons of valuable insights about AWS & Snowflake in this thread, worth a read if you’re interested in speculating on the future of cloud.
@bernhardsson
Erik Bernhardsson
4 years
I wrote a blog post about how I think the cloud service stack will change in the next 5-10 years:
@theBrc007
Brian Christie
4 years
The Tail at Scale 10/ Let’s say, in a hypothetical extreme case, loading a page takes 100 API calls. Then most page loads (about 63%, if the calls are independent) would take ✨at least✨ the 99th percentile latency, and maybe even longer!
@theBrc007
Brian Christie
4 years
The Tail at Scale 9/ Why does the 99th percentile latency (probably) affect more than 1% of your users? Open your website in the Network Inspector. How many API calls are made when the page is loaded? For most apps, loading a single page results in many requests.
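A rough way to check the arithmetic, assuming every API call is independent and each has a 1% chance of landing at or above the p99 latency (real calls are often correlated, so treat this as a sketch). The function name is made up for illustration:

```python
def share_of_page_loads_hitting_tail(calls_per_page: int, percentile: float = 0.99) -> float:
    """Probability that at least one call on a page lands at or above the given percentile.

    Assumes the calls are independent, which is a simplification.
    """
    return 1.0 - percentile ** calls_per_page


for calls in (1, 10, 50, 100):
    share = share_of_page_loads_hitting_tail(calls)
    print(f"{calls:>3} calls per page -> {share:.1%} of page loads see p99 latency or worse")
```

With one call per page the share is 1%, but at 100 calls per page it is roughly 63%, which is why the tail reaches far more than 1% of users.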
@theBrc007
Brian Christie
4 years
The Tail at Scale 8/ @Linkerd has some community discussion about implementing Hedged Requests. It would take a bit more work, since I don’t think the proxy supports it yet the way Envoy does. There are some great reference links here for further reading:
github.com
Summary: Add support to linkerd to enable backup requests, also known as hedge requests. Context: Backup requests can be used to reduce the overall latency of a system, specifically tail latency. If a...
@theBrc007
Brian Christie
4 years
The Tail at Scale 7/ @IstioMesh does not yet support configuring Request Hedging in Envoy Proxy, but perhaps someone out there is working on it?
github.com
Describe the feature request: Envoy supports request hedging, which can reduce tail latency, from v1.11.0, but it looks like Istio doesn't support generating such Envoy configuration. Could thi...
@theBrc007
Brian Christie
4 years
The Tail at Scale 6/ The technique has started to make its way out of the literature and into open source systems. It’s still early days, but it’s very exciting to see where this goes! @EnvoyProxy now supports Request Hedging:
@theBrc007
Brian Christie
4 years
The Tail at Scale 5/ The duplicate “Tied Requests” essentially hide the true tail latency of our service from our users, at the price of some additional network traffic.
@theBrc007
Brian Christie
4 years
The Tail at Scale 4/ The result of having the Tied Requests is that if Backend A is running at high latency, there’s a good chance that Backend B will be able to serve the request more quickly. Using this technique we can reduce the tail latency that users actually see.
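A small simulation can make the effect concrete. This is a sketch under made-up assumptions: the latency model (2% of requests hit a slow path), the function names, and the numbers are all hypothetical, and it ignores the extra load that duplicate requests put on the backends.

```python
import random


def sample_latency_ms(rng: random.Random) -> float:
    # Hypothetical latency model: 2% of requests hit a slow path.
    if rng.random() < 0.02:
        return rng.uniform(500.0, 1500.0)
    return rng.uniform(5.0, 15.0)


def percentile(values, q: float) -> float:
    # Simple nearest-rank style percentile on a sorted copy.
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


def simulate(trials: int = 100_000, seed: int = 7):
    rng = random.Random(seed)
    # One backend: the user sees whatever latency that backend delivers.
    single = [sample_latency_ms(rng) for _ in range(trials)]
    # Two backends: the user sees the faster of the two responses.
    tied = [min(sample_latency_ms(rng), sample_latency_ms(rng)) for _ in range(trials)]
    return percentile(single, 0.99), percentile(tied, 0.99)


single_p99, tied_p99 = simulate()
print(f"p99 with one backend:  {single_p99:.0f} ms")
print(f"p99 with two backends: {tied_p99:.0f} ms")
```

In this model a single backend is slow 2% of the time, so its p99 sits on the slow path. Both backends are slow together only about 0.04% of the time, so the user-visible p99 drops back to the fast path.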
@theBrc007
Brian Christie
4 years
The Tail at Scale 3/ Then the other servers either abort processing that request, or deprioritize it. By having the servers cross-communicate the status, the extra request can be aborted before duplicative computation effort is wasted.
@theBrc007
Brian Christie
4 years
The Tail at Scale 2/ Correction: “Tied Requests with cross-server status updates” is the exact term. This works by issuing simultaneous requests to two backends, each request tagged with the backend IDs. The first backend to start processing updates the others about the status.
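Here is a toy sketch of that flow using asyncio. Everything in it is an assumption for illustration: the `Backend` class, the delays, and the in-memory `cancelled` set that stands in for the cross-server status message (which in a real system travels over the network and can arrive too late).

```python
import asyncio


class Backend:
    def __init__(self, name: str, queue_delay: float, service_time: float):
        self.name = name
        self.queue_delay = queue_delay    # time a request waits in this backend's queue
        self.service_time = service_time  # time to actually process the request
        self.peers = []                   # other backends holding a copy of the request
        self.cancelled = set()            # request IDs a peer has already started
        self.dropped = []                 # queued copies this backend skipped

    async def handle(self, request_id: str):
        await asyncio.sleep(self.queue_delay)
        if request_id in self.cancelled:
            self.dropped.append(request_id)  # a peer got there first; skip the work
            return None
        for peer in self.peers:
            peer.cancelled.add(request_id)   # cross-server status update
        await asyncio.sleep(self.service_time)
        return f"{request_id} served by {self.name}"


async def tied_request(request_id: str, backends):
    # Issue the request to every backend at once and take the first real answer.
    tasks = [asyncio.create_task(b.handle(request_id)) for b in backends]
    for finished in asyncio.as_completed(tasks):
        result = await finished
        if result is not None:
            return result
    raise RuntimeError("no backend served the request")


async def main():
    a = Backend("A", queue_delay=0.05, service_time=0.01)  # busy backend
    b = Backend("B", queue_delay=0.01, service_time=0.01)  # lightly loaded backend
    a.peers, b.peers = [b], [a]
    result = await tied_request("req-1", [a, b])
    await asyncio.sleep(0.1)  # demo only: let A reach its queued copy so we can inspect it
    return result, a.dropped


result, dropped_by_a = asyncio.run(main())
print(result)
print("copies dropped by A:", dropped_by_a)
```

Backend B starts first, tells A about it, and A drops its queued copy without doing the work, so the client gets B's answer without paying for a second full computation.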
@theBrc007
Brian Christie
4 years
What percent of users does your 99th percentile latency affect? 1%? Probably more! One of my favorite techniques, called Hedged Requests with cross-server cancellation, was first introduced in a terrific paper, “The Tail at Scale”. 1/
@theBrc007
Brian Christie
4 years
What’s your template for incident action items? Here’s mine.
@theBrc007
Brian Christie
4 years
@ahidalgosre I try to evaluate three categories for every incident:
- Mitigate: What do we need to do now to fix this, and very soon to fix the duct tape?
- Detect: Are we happy with how we got paged?
- Prevent: How do we fix this ✨category✨ of problem so it doesn’t wake us up again?
@theBrc007
Brian Christie
4 years
Come argue with us on NFTs, DeFi, “Web 3.0”, and whether it’s going somewhere and facts 📠 or it’s all straight cap 🧢.
@theBrc007
Brian Christie
4 years
The Goal: 7/ When we measure incidents with a goal of reducing the number, there will be a collective pressure to achieve the goal. But we’ve achieved The Wrong Goal. How do we achieve the right goal? Apply the process of continual improvement to refining what we measure.
@theBrc007
Brian Christie
4 years
The Goal: 6/ When we start measuring # of incidents, as if by magic there are fewer incidents! But we have incentivized a behavior that we didn’t intend to: declaring fewer incidents. Now, we are all professionals here; surely nobody is intentionally not fixing problems.
@theBrc007
Brian Christie
4 years
The Goal: 5/ Does measuring # of incidents improve reliability? Probably not! What we want to measure is the quality of service we are providing to our customers. But what we are actually measuring is a proxy for reliability that is subjective.
@theBrc007
Brian Christie
4 years
The Goal: 4/ In complex software systems I’ve often seen this come up with incident-related measurements. Say we start to feel the system breaks too often and decide to aim to reduce the number of incidents we have. Will this result in a more reliable system?
@theBrc007
Brian Christie
4 years
The Goal: 3/ Another key for me has been understanding how important all aspects of measurement are. Measurement drives behavior, and what gets measured gets done. This is tricky because measuring slightly the wrong thing can have a magnified negative impact.
@theBrc007
Brian Christie
4 years
The Goal: 2/ The single most important lesson? In any complex system, find the one *single* bottleneck, and do whatever it takes to remove that bottleneck. Then a new bottleneck will emerge, and it probably won’t be what you thought it was. Repeat the process.
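As a toy illustration of that loop, assume a system whose throughput is capped by its slowest stage. The stage names, capacities, and the "double the capacity" fix below are all invented for the example:

```python
def find_bottleneck(stage_capacity: dict) -> str:
    """Return the stage with the lowest capacity; it caps the whole system's throughput."""
    return min(stage_capacity, key=stage_capacity.get)


# Hypothetical stages and capacities (requests per second), for illustration only.
stages = {"load_balancer": 900, "api": 400, "database": 250, "cache": 1200}

history = []
for _ in range(3):
    name = find_bottleneck(stages)
    history.append((name, stages[name]))
    stages[name] *= 2  # stand-in for "do whatever it takes" to relieve the constraint

print(history)
```

Each pass relieves one constraint and a different stage becomes the limit. In this made-up data the database is the bottleneck first, then the API, and then the database again.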
@theBrc007
Brian Christie
4 years
15 years ago I read a book that was so profound the lessons have directly contributed to every strategic decision I make in Reliability Engineering. Eli Goldratt’s “The Goal: A Process of Ongoing Improvement”. The lessons are deceptively simple: 1/
amazon.com
Written in a fast-paced thriller style, 'The Goal' contains a serious message for all managers in industry and explains the ideas which underlie the Theory of Constraints developed by the author.