An NSFW filter for Marginalia search (marginalia.nu)

by speckx 16 comments 104 points

[−] marginalia_nu 46d ago
This was a very meandering project, and trying to corral it into some sort of coherent narrative was a bit of an undertaking on its own. Hopefully it makes some sense.
[−] BrunoBernardino 46d ago
Hi Viktor! Really cool write-up, thanks! Uruky is already using the nsfw param, but set to 0 or 1, and I see in your example this looks like a new value option (2) that's "better" than 1? How "safe" is it to implement it as the value to send when someone wants SFW results?
[−] marginalia_nu 46d ago
0 disables all filtering

1 filters 'harmful' sites per the UT1 blacklists

2 is 1 + the new NSFW filter.

The new filter works pretty well in my assessment. It's not infallible, but it gives significantly cleaner results.

And if you do find queries it fails to sanitize, I'd love to hear about them.
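
For a client wiring this up, the three levels could be mapped roughly as below. This is only a sketch: the `nsfw` values 0, 1 and 2 come from the list above, but `build_search_url`, the `query` parameter name and the base URL are placeholders for illustration, not the actual Marginalia API.

```python
from enum import IntEnum
from urllib.parse import urlencode


class NsfwLevel(IntEnum):
    """The three values of the nsfw parameter as described in the thread."""
    OFF = 0        # no filtering at all
    BLACKLIST = 1  # 'harmful' sites filtered per the UT1 blacklists
    FULL = 2       # level 1 plus the new NSFW filter


def build_search_url(base_url: str, query: str, safe_search: bool) -> str:
    """Build a search URL; base_url and the 'query' key are hypothetical."""
    level = NsfwLevel.FULL if safe_search else NsfwLevel.OFF
    params = {"query": query, "nsfw": int(level)}
    return f"{base_url}?{urlencode(params)}"
```

With these assumptions, a user who asks for SFW results gets `nsfw=2`, and everyone else gets `nsfw=0`.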

[−] IncreasePosts 46d ago
Can you add 3, which only returns content flagged as NSFW?

So I can make sure I know what sites to stay away from, of course

[−] marginalia_nu 46d ago
Wouldn't work very well, in that you'd get awful recall.

The way the filter is implemented, it runs after the query has been executed. I'd have to run it at document processing time, code in a pseudo-keyword for the label, and then add that to the query.

It's doable, but I question whether the juice is worth the squeeze.
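
The recall problem can be illustrated with a toy model. This is a sketch under loud assumptions, not the engine's code: documents are plain dicts, `is_nsfw` stands in for the classifier, and the pseudo-keyword `special:nsfw` is a made-up label.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]  # hypothetical shape: {"url": str, "keywords": set, ...}


def search_then_filter(ranked: List[Doc], is_nsfw: Callable[[Doc], bool],
                       limit: int, want_nsfw: bool) -> List[Doc]:
    """Post-query filtering: only the top `limit` hits are ever classified."""
    top = ranked[:limit]
    return [d for d in top if is_nsfw(d) == want_nsfw]


def index_with_label(docs: List[Doc], is_nsfw: Callable[[Doc], bool],
                     label: str = "special:nsfw") -> None:
    """Processing-time labeling: store the verdict as a pseudo-keyword."""
    for d in docs:
        if is_nsfw(d):
            d["keywords"].add(label)


def search_by_keyword(docs: List[Doc], keyword: str, limit: int) -> List[Doc]:
    """Query on the pseudo-keyword, so matches are found across the whole index."""
    return [d for d in docs if keyword in d["keywords"]][:limit]


# Toy index: 100 ranked documents, every tenth one flagged by the stand-in classifier.
docs = [{"url": f"u{i}", "keywords": set(), "flag": i % 10 == 9} for i in range(100)]
classifier = lambda d: bool(d["flag"])

post_filtered = search_then_filter(docs, classifier, limit=10, want_nsfw=True)
index_with_label(docs, classifier)
labeled = search_by_keyword(docs, "special:nsfw", limit=10)
```

In this toy setup `post_filtered` holds a single document while `labeled` holds ten. Filtering flagged results out of a page costs only a few hits, but keeping only the flagged ones after the query has run leaves almost nothing.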

[−] BrunoBernardino 46d ago
Thanks, already implemented and tested a couple of queries and it does look good!
[−] VorpalWay 46d ago
Looks like a cool search engine! Hadn't heard about it before.

But the search page says "Simple technology, no AI". With this change, that is no longer true though, is it? Of course the definition of "AI" is extremely vague. Once upon a time, A-star search was considered AI after all.

[−] marginalia_nu 46d ago
The 'No AI' statement is about gen AI, which I think is what most people think of when you say AI.

But sure if I was looking for government or research funding, then for sure this would be AI. Not just AI, but the literal state of the art AI. Dario wakes in a flop sweat every night, terrified of my breakneck advances in single hidden layer classifiers that are probably at least 30% sentient. It would be so much AI I can't even hold all this AI.

[−] ChadNauseam 46d ago
Does marginalia_nu not use embedding models as part of search? I guess I assumed it would. If you have embeddings anyway, decision trees on the embedding vector (e.g. catboost) tend to work pretty well. Fine-tuning modernbert works even better but probably won't meet the criteria of "really fast and run well on CPUs". That said, the approach described in the article seems to work well enough and obviously provides extremely cheap inference
[−] 8organicbits 46d ago
Have you seen many examples of websites labeling themselves, perhaps using rating meta tags ()? Self-labeling seems valuable in some ways, but I don't think I've seen it catch on.