I'm Too Lazy to Check Datadog Every Morning, So I Made AI Do It (quickchat.ai)

by piotrgrudzien 16 comments 30 points


[−] Xeoncross 62d ago

> Total alerts/errors found: 7

Apps written in an exception-based language (Java, JavaScript, PHP, etc.) are really annoying to monitor, as everything that isn't the happy path triggers an 'error'/'fatal' log or metric.

Yes, you can technically work around it with (near) Go-level error verbosity (try/catch around every call), but I've never seen a team actually do that.

Modern languages that don't throw exceptions for every error, like Rust, Go, and Zig, produce much saner telemetry reports in my experience.

On this note, a login failure is not an error; it's a warning, because there is no action to take on any single one. It's an expected outcome. Errors should be actionable. WARN should be for things that point to an issue in aggregate (like login failures).
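A minimal sketch of the policy described above, using Python's standard `logging` module (the function names and the auth/payments scenario are made up for illustration): expected outcomes go to WARNING, individually actionable events go to ERROR.

```python
import logging

logger = logging.getLogger("auth")

def handle_login(username: str, ok: bool) -> None:
    if ok:
        logger.info("login succeeded for %s", username)
    else:
        # Expected outcome, not individually actionable: WARN, not ERROR.
        # Alerting should fire on the aggregate count, not on each event.
        logger.warning("login failed for %s", username)

def handle_gateway_down(err: Exception) -> None:
    # Individually actionable: a human should look at this single event.
    logger.error("payment gateway unreachable: %s", err)
```

The point of the split is that an on-call engineer can page on any ERROR line, while WARN lines only matter as counts.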

[−] Spivak 62d ago

> On this note, a login failure is not an error

Login failure is like the most important error you'll track. A login failure isn't necessarily actionable but a spike of thousands of them for sure is. No single system has been more responsible for causing outages in my career than auth. And I get that it's annoying when they appear in your Rollbar but sometimes Login Failed is the only signal you get that something is wrong.

Some 3rd party IdP saying "nope" can be innocuous when it's a few people but a huge problem when it's because they let their cert/application token expire.

And I can already hear the "it should be a metric with an alert" and you're absolutely right. Except that it requires that devs take the positive action of updating the metric on login failures vs doing nothing and letting the exception propagate up. And you just said login failures aren't errors and "bad password" obviously isn't an error so no need to update the metric on that and cause chatty alerts. Except of course that one time a dev accidentally changed the hashing algorithm. Everyone was really bad at typing their password that day for some reason.
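The "spike of thousands" signal the commenter wants can be sketched without any vendor tooling: count failures in a sliding time window and flag when the count crosses a threshold. This is an illustrative toy (the class name, window, and threshold are made up, not Datadog's API):

```python
import time
from collections import deque

class FailureSpikeDetector:
    """Count login failures in a sliding window; flag when they spike.

    Hypothetical thresholds -- tune per system. A real setup would emit a
    metric and let the monitoring backend do this windowing.
    """

    def __init__(self, window_seconds: float = 60.0, threshold: int = 100):
        self.window = window_seconds
        self.threshold = threshold
        self.events: deque = deque()  # timestamps of recent failures

    def record_failure(self, now: float = None) -> bool:
        """Record one failure; return True if the window is over threshold."""
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold
```

A single `record_failure` returning False is the "bad password" case; a burst of them crossing the threshold is the "someone changed the hashing algorithm" case.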

[−] Xeoncross 59d ago

> Login failure is like the most important error you'll track. A login failure isn't necessarily actionable but a spike of thousands of them for sure is.

Sounds like you agree with me. Re-read my comment. Errors are actionable individually. Warnings are actionable in aggregate.

You don't have to treat logs and metrics as separate: you can alert on log counts without emitting a metric at all.

[−] SkiFire13 62d ago
Rather than login failures, I would monitor login successes. A sharp decrease in successes likely points to a real issue, whereas an increase in login failures might just be someone trying tons of random credentials against your website (still not ideal, but much harder to act on).
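The success-side check suggested above can be sketched as a drop detector: compare the latest per-minute success count to a trailing baseline and flag a sharp fall. This is a hedged toy (the class name, the exponential-moving-average baseline, and the 50% drop threshold are all assumptions, not from the thread):

```python
class SuccessDropDetector:
    """Flag when login successes fall sharply below a trailing baseline."""

    def __init__(self, drop_ratio: float = 0.5, alpha: float = 0.1):
        self.drop_ratio = drop_ratio  # alert below this fraction of baseline
        self.alpha = alpha            # EMA smoothing factor
        self.baseline = None          # trailing per-minute success rate

    def observe(self, successes_this_minute: int) -> bool:
        """Feed one minute's success count; return True on a sharp drop."""
        if self.baseline is None:
            self.baseline = float(successes_this_minute)
            return False
        alerted = successes_this_minute < self.drop_ratio * self.baseline
        # Update the baseline after the check, so the drop itself
        # doesn't immediately drag the baseline down and mask the alert.
        self.baseline = ((1 - self.alpha) * self.baseline
                         + self.alpha * successes_this_minute)
        return alerted
```

Unlike a failure-count alert, this fires even when the broken path produces no failure events at all, e.g. when users can't reach the login page in the first place.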
[−] Spivak 62d ago
Creating this metric/alert is practically a rite of passage for junior ops people who then get paged around 5pm.
[−] sgarman 62d ago
I don't understand the workflow of having multiple new bugs every day that need to be fixed. Is bad code being shipped? Are there 1000 devs and it's just this person's job to fix everyone's bugs? Is this an extremely old and complicated codebase they are improving? Not trying to be snarky; I just don't understand how there are new bugs every day that are just error messages.

If there are new bugs every day that need fixing, is the AI really good enough to know the fix from just an error?

[−] danpalmer 62d ago
Why would one need to check Datadog every morning? Wouldn't alerts fire if there was something to do?