Wide logging: Stripe's canonical log line pattern (blog.alcazarsec.com)

by alcazar 12 comments 21 points
Read article View on HN

12 comments

[−] formerly_proven 59d ago
It's strange how undercooked logging is (not just in standard libraries, but in general) given everyone and every app does it.
[−] alcazar 59d ago
It's one of those things that keeps getting reinvented over and over. Most people don't want to put in the effort to perfect it.

For the longest time, I've done the bare minimum and just used console.log/print for whatever I thought I needed in that particular piece of code; without any coherent system.

[−] bradleybuda 59d ago
This is all good advice but it's not what I would traditionally think of as "logging" - it's something more like what the industry calls APM or tracing.

> For a web service, that usually means one log event at the end of every request.

The point of a log, as I have seen it used, is to understand what's going on during a request. Yes, some of that is timing / success / failure information, but you also need:

- Logical tracing - at a high level, what decisions did we make (authz, business logic) when serving this request

- Partial failures or warnings - the request may have succeeded, but we may have failed or timed out a call to a dependent service and delivered a partially degraded response as a result

- Tracepoints to support active development - how often do we hit a certain codepath (to help developers roll out or deprecate features)

It's useful to distinguish between "context" (things we know about a request), timing, success, and plain old logging (using human language to describe to an operator what the system is doing). Context should be included with every line - I strongly agree with OP's proposal of wide events, don't make operators ever JOIN if you can avoid it - but it won't always be the same because we learn things about the request as we process it. Yes, if you wait until the end of a request, you will have maximal context, but at the loss of fine-grained detail during processing.

[−] alcazar 59d ago
I don't believe this should replace traditional logging but rather complement it. It transforms breadcrumbs into unified records that are easier to query during an incident.

I completely agree with your plain-old logging list. I would add canonical log lines in addition to the normal log traces. They serve as the summary at the end.

[−] zamalek 59d ago
This is just structured logging. Slog in go, tracing in rust, serilog in C#, etc. You should be combining this with the likes of otel or datadog, which will keep those fields structured and filterable.
[−] stephantul 59d ago
I’ve never been so conflicted about an article. It has clearly been generated by an llm, but still has useful content. It’s a good article, but…
[−] deskamess 59d ago
Are there any restrictions on how short the error_slug should be? The meat of some of my errors can be pretty long (for example an ffmpeg error). There are also many phases to a job - call them tasks. Can a canonical log line be a collection of task log lines?
[−] joaonmatos 59d ago
This is not much different than CloudWatch metric events. At Amazon we output a service log or Embedded Metrics Format event with these things.
[−] whorleater 59d ago
Honestly I'm kinda down on most of FB's tech, but them pioneering Scuba was pretty sick. It has all the wide logging/apm/tracing/etc, and it's hard to describe how useful it is if you haven't seen it in action.