Brian Christie @theBrc007 X Profile

Brian Christie

@theBrc007

Followers

634

Following

20K

Media

37

Statuses

2K

SRE in SF. Statistics & Monitoring. Human Factors. Lean applied to Software. All tweets my own

Worldwide

Joined September 2007


Brian Christie

@theBrc007

4 years

Tons of valuable insights about AWS & Snowflake in this thread, worth a read if you’re interested in speculating on the future of cloud.

Erik Bernhardsson

@bernhardsson

4 years

I wrote a blog post about how I think the cloud service stack will change in the next 5-10 years:

2

0

4

Brian Christie

@theBrc007

4 years

The Tail at Scale 10/ Let’s say that in a hypothetical extreme case, loading a page takes 100 API calls. Then most page loads (about 63%, if the calls are independent) would take ✨at least✨ the 99th percentile latency, and maybe even longer!

0

5
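The arithmetic behind the 100-call case can be sketched in a few lines. This is a minimal sketch, assuming every API call is independent and has the same latency distribution (a simplification; real calls often share dependencies). The function name is hypothetical and not from the thread.

```python
def share_of_page_loads_hitting_tail(calls_per_page: int, percentile: float = 0.99) -> float:
    """Probability that at least one of `calls_per_page` calls is slower
    than the given latency percentile.

    Assumption: calls are independent, so the chance that all of them stay
    under the percentile is percentile ** calls_per_page.
    """
    return 1.0 - percentile ** calls_per_page


if __name__ == "__main__":
    # Print the share of page loads affected for 1, 10, and 100 calls per page.
    for calls in (1, 10, 100):
        print(calls, round(share_of_page_loads_hitting_tail(calls), 3))
```

Under these assumptions, a page that makes one call sees the tail about 1% of the time, while a page that makes 100 calls sees it roughly 63% of the time.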

Brian Christie

@theBrc007

4 years

The Tail at Scale 9/ Why does the 99th percentile latency (probably) affect more than 1% of your users? Open your website in the Network Inspector. How many API calls are made when the page is loaded? For most apps, loading a single page results in many requests.

1

0

4

Brian Christie

@theBrc007

4 years

The Tail at Scale 8/ @Linkerd has some community discussion about implementing Hedged Requests. It would take a bit more work, since I don’t think the proxy has support yet like Envoy does. There are some great reference links here for further reading:

github.com

Summary Add support to linkerd to enable backup requests also known at hedge requests. Context Backup requests can be used to reduce the overall latency of a system, specifically tail latency. If a...

1

0

2

Brian Christie

@theBrc007

4 years

The Tail at Scale 7/ @IstioMesh does not yet support configuring Request Hedging in Envoy Proxy, but perhaps someone out there is working on it?

github.com

Describe the feature request Envoy supports request hedging, which can reduce tail latency, from v1.11.0, but it looks like Istio doesn't support to generate such Envoy configuration. Could thi...

2

0

2

Brian Christie

@theBrc007

4 years

The Tail at Scale 6/ The technique has started to make its way out of the literature and into open source systems. It’s still early days, but it’s very exciting to see where this goes! @EnvoyProxy now supports Request Hedging:

2

1

4

Brian Christie

@theBrc007

4 years

The Tail at Scale 5/ The duplicate “Tied Requests” essentially hide the true tail latency of our service from our users, at the price of some additional network traffic.

1

0

3

Brian Christie

@theBrc007

4 years

The Tail at Scale 4/ The result of having the Tied Requests is that if Backend A is running at high latency, there’s a good chance that Backend B will be able to serve the request more quickly. Using this technique, we can reduce the tail latency that users see.

1

0

2
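To put a rough number on "a good chance that Backend B will be faster": a minimal sketch, assuming the two backends are slow independently of each other (a strong assumption; a shared database or network path would correlate them and shrink the benefit). The function name is hypothetical.

```python
def chance_all_copies_slow(p_slow: float, copies: int = 2) -> float:
    """Probability that every copy of a duplicated request lands on a slow backend.

    Assumption: each backend is slow independently with probability `p_slow`,
    so the user only sees a slow response when all copies are slow.
    """
    return p_slow ** copies


if __name__ == "__main__":
    # With a 1% chance of a slow response per backend, compare one copy vs two.
    print(chance_all_copies_slow(0.01, copies=1))
    print(chance_all_copies_slow(0.01, copies=2))
```

Under that independence assumption, sending two copies turns a 1-in-100 slow response into roughly a 1-in-10,000 one, at the cost of the extra traffic.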

Brian Christie

@theBrc007

4 years

The Tail at Scale 3/ Then the other servers either abort processing that request or deprioritize it. Because the servers cross-communicate the status, the extra request can be aborted before duplicate computation effort is wasted.

1

0

3

Brian Christie

@theBrc007

4 years

The Tail at Scale 2/ Correction: “Tied Requests with cross-server status updates” is the exact term. This works by issuing simultaneous requests to two backends, each tagged with the other backend’s ID. The first backend to start processing updates the other about the status.

1

0

3
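The shape of the idea can be sketched on the client side with `asyncio`. This is a simplified illustration, not the mechanism from the paper: here the client cancels the losing request once the first response arrives, whereas tied requests have the servers notify each other when one starts processing. The backend names and latencies are made up, and `call_backend` is a hypothetical stand-in for a real network call.

```python
import asyncio
from typing import Dict


async def call_backend(name: str, delay: float) -> str:
    """Hypothetical backend call; `delay` simulates its latency in seconds."""
    await asyncio.sleep(delay)
    return f"response from {name}"


async def duplicated_request(backends: Dict[str, float]) -> str:
    """Send the same request to every backend, return the first response,
    and cancel the requests that are still in flight."""
    tasks = [asyncio.create_task(call_backend(n, d)) for n, d in backends.items()]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # client-side stand-in for the cross-server abort
    # Wait for the cancelled tasks to unwind so none is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()


def run_demo() -> str:
    # backend-a is having a slow moment; backend-b answers quickly.
    return asyncio.run(duplicated_request({"backend-a": 0.5, "backend-b": 0.01}))


if __name__ == "__main__":
    print(run_demo())
```

In this demo the caller gets the answer from the fast backend and never waits for the slow one. A production version would also need to handle errors from the first responder and make sure the duplicated request is safe to execute twice.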

Brian Christie

@theBrc007

4 years

What percentage of users does your 99th percentile latency affect? 1%? Probably more! One of my favorite techniques, Hedged Requests with cross-server cancellation, was first introduced in a terrific paper, “The Tail at Scale”. 1/

2

3

24

Brian Christie

@theBrc007

4 years

What’s your template for incident action items? Here’s mine.

Brian Christie

@theBrc007

4 years

@ahidalgosre I try to evaluate three categories for every incident:
- Mitigate: What do we need to do now to fix this, and very soon to fix the duct tape?
- Detect: Are we happy with how we got paged?
- Prevent: How do we fix this ✨category✨ of problem so it doesn’t wake us up again?

0

4

Brian Christie

@theBrc007

4 years

Come argue with us about NFTs, DeFi, and “Web 3.0”: is it going somewhere, facts 📠, or is it all straight cap 🧢?

0

Brian Christie

@theBrc007

4 years

The Goal: 7/ When we measure incidents with a goal of reducing the number, there will be collective pressure to achieve the goal. But we’ve achieved The Wrong Goal. How do we achieve the right goal? Apply the process of continual improvement to refining what we measure.

0

1

3

Brian Christie

@theBrc007

4 years

The Goal: 6/ When we start measuring # of incidents, as if by magic there are fewer incidents! But we have incentivized a behavior that we didn’t intend to: declaring fewer incidents. Now, we are all professionals here; surely nobody is intentionally not fixing problems.

1

0

4

Brian Christie

@theBrc007

4 years

The Goal: 5/ Does measuring # of incidents improve reliability? Probably not! What we want to measure is the quality of service we are providing to our customers. But what we are actually measuring is a proxy for reliability that is subjective.

1

0

1

Brian Christie

@theBrc007

4 years

The Goal: 4/ In complex software systems, I’ve often seen this come up with incident-related measurements. Say we start to feel the system breaks too often and decide to aim to reduce the number of incidents we have. Will this result in a more reliable system?

1

0

2

Brian Christie

@theBrc007

4 years

The Goal: 3/ Another key for me has been understanding how important all aspects of measurement are. Measurement drives behavior, and what gets measured gets done. This is tricky because measuring even slightly the wrong thing can have a magnified negative impact.

1

0

4

Brian Christie

@theBrc007

4 years

The Goal: 2/ The single most important lesson? In any complex system, find the one *single* bottleneck, and do whatever it takes to remove that bottleneck. Then a new bottleneck will emerge, and it probably won’t be what you thought it was. Repeat the process.

1

5

Brian Christie

@theBrc007

4 years

15 years ago I read a book so profound that its lessons have directly contributed to every strategic decision I make in Reliability Engineering: Eli Goldratt’s “The Goal: A Process of Ongoing Improvement”. The lessons are deceptively simple: 1/

amazon.com

Written in a fast-paced thriller style, 'The Goal' contains a serious message for all managers in industry and explains the ideas which underline the Theory of Constraints developed by the author.

1

2

15