Miasma: A tool to trap AI web scrapers in an endless poison pit (github.com)

by LucidLynx 247 comments 346 points

[−] bobosola 47d ago
I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced.

Also, inserting hidden or misleading links is specifically a no-no for Google Search [0], who have this to say: We detect policy-violating practices both through automated systems and, as needed, human review that can result in a manual action. Sites that violate our policies may rank lower in results or not appear in results at all.

So you may well end up doing more damage to your own site than to the bots by using dodgy links in this manner.

[0]https://developers.google.com/search/docs/essentials/spam-po...

[−] trinsic2 47d ago

>I dunno... it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced

If you are automating it, I don't see why not. Kitboga, a YouTuber, kept scam callers in AI call-center loops, tying up their resources so they can't be used on unsuspecting victims.[0]

That's a guerrilla tactic. Similarly, in warfare, when you steal resources from an enemy, you get stronger and they get weaker; it's pretty effective.

[0]: https://www.youtube.com/watch?v=ZDpo_o7dR8c

[−] phplovesong 47d ago
Pretty easy. Get a paid number and have the phone scammers / marketers call that. I know a guy who made a decent side hustle from this. The marketers slowly blocked his number though; not sure if he still has this thing going, as it was more of an experiment.
[−] yareally 47d ago
Was he picking up the phone and telling them to call him back on the other number?
[−] phplovesong 47d ago
IIRC he did something like that, asking them to call back in "10 minutes, after my meeting, and call my personal number, not my corporate phone, as it is tracked". On other occasions he filled in this number on online forms that he "was asked to fill in before continuing".
[−] sysguest 47d ago

> Get a paid number

how? I'm interested

[−] phplovesong 47d ago
It's pretty easy. You can register a number with a phone company, then decide on the cost (e.g. 5 bucks / minute). I recall he told me he got like 100-150 USD/month from this. The longer he talked, the more they paid. He used to hang up after 10 or 15 minutes, but his "record" was close to one hour.
[−] bdangubic 47d ago
more and more scammers are automating their side as well so soon the loop will be just bots talking to bots
[−] rogerrogerr 47d ago

> gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced.

It’s one of the best time investments I’ve ever made. They just don’t call me anymore.

I think they have two lists: the “do not call” list, and the “unprofitable to call” list. You want to be on the latter list.

[−] iririririr 47d ago
Yes, it works.

Phone scammers have very high personnel costs, which is why some resort to human trafficking.

If everyone picked up the phone and wasted a few seconds of their time, it would be enough to make the whole enterprise worthless. But since most people who won't fall for it hang up right away, the scammers have the best ROI of any industry; they don't even pay for the first few seconds of a call.

[−] chongli 47d ago
> Also, inserting hidden or misleading links is specifically a no-no for Google Search [0]

Depending on your goals, this may be a pro or a con. I, personally, would like to see a return of "small web" human-centric communities. If there were tools that include anti-scraping, anti-Google (and other large search crawlers) as well as a small web search index for humans to find these sites, this idea becomes a real possibility.

[−] ordu 47d ago
> it feels like the same approach as those people who tell you gleeful stories of how they kept a phone spammer on a call for 45 minutes: "That'll teach 'em, ha ha!" Do these types of techniques really work? I’m not convinced.

In the 2000s there was a company in Russia selling English courses. It spammed so much that people were really pissed off. To make a long story short, the company disappeared from public view when Golden Telecom joined the party of retaliatory "spam" calls, setting up a computer to call the company over Golden Telecom's modem pool.

So, yeah, you kinda can achieve something this way, but to be sure of it you'd need to lease a modem pool.

[−] xyzal 47d ago
One would assume legit spiders obey robots.txt.
[−] bugfix 47d ago
I really don't get it. Wouldn't you be wasting a lot of resources feeding the bots like this?
[−] with 47d ago

> do these types of techniques really work?

They have been proven to: https://www.anthropic.com/research/small-samples-poison

[−] TurdF3rguson 47d ago
It might work on a very basic bot that doesn't understand that scraping to infinite depth is not a good idea. It won't be effective against anything minimally sophisticated.
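To illustrate: even a naive crawler escapes an endless link maze with just a depth cap and a page budget. A toy sketch (the fake "infinite" site and the limits are made up for the example):

```python
from collections import deque

def crawl(start, get_links, max_depth=3, max_pages=1000):
    """Breadth-first crawl with a depth cap and a page budget --
    the minimal defenses that keep a scraper out of link mazes."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't expand pages at the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# A toy "infinite" site: every page /page/n links to /page/n+1 and /page/n+2.
def fake_links(url):
    n = int(url.rsplit("/", 1)[1])
    return [f"/page/{n + 1}", f"/page/{n + 2}"]

pages = crawl("/page/0", fake_links, max_depth=3)
print(len(pages))  # -> 7: only depths 0..3 get visited, despite the endless site
```

The `seen` set also defeats self-referential loops, which is why this class of trap mostly catches the least sophisticated scrapers.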
[−] phplovesong 47d ago
Who TF cares about Google? This is mostly for personal tech stuff (just the stuff AI steals for training). I'd say it's pretty welcome that it's not shown in Google results.
[−] throw10920 47d ago

> I’m not convinced.

Is this how low we've sunk? A step below taking a single personal anecdote and generalizing it to everything, now we're taking zero experience and dismissing things based on vibes?

I've seen lots of LLM-slop-lovers doing the same thing. Maybe it's a pattern.

[−] deadbabe 47d ago
Honestly, I’m starting to not give a fuck about ranking on Google.

Google searches have become incredibly devalued for me in the age of LLMs. ChatGPT is pretty much my first and often only stop on a quest for some answers.

If you have a website, you must promote it via other ways that don’t involve Google.

[−] SadErn 47d ago
[dead]
[−] tasuki 48d ago

> If you have a public website, they are already stealing your work.

I have a public website, and web scrapers are stealing my work. I just stole this article, and you are stealing my comment. Thieves, thieves, and nothing but thieves!

[−] CrzyLngPwd 47d ago
Way back in the day I had a software product, with a basic system to prevent unauthorised sharing, since there was a small charge for it.

Every time I released an update, a new crack would appear. For the next six months I worked on improving the anti-copying code, until I stumbled across an article by a coder in the same boat as me.

He realised he was playing a game with some other coders: he would make the copy protection better, and the cracker would then have fun cracking it. It was a game of whack-a-mole.

I removed the copy protection, as he did, and got back to my primary role of serving good software to my customers.

I feel like trying to prevent AI bots, or any bots, from crawling a public web service, is a similar game of whack-a-mole, but one where you may also end up damaging your service.

[−] madeofpalk 48d ago
Is there any evidence or hints that these actually work?

It seems pretty reasonable that any scraper would already have mitigations for things like this as a function of just being on the internet.

[−] eliottre 47d ago
The data poisoning angle is interesting. Models trained on scraped web data inherit whatever biases, errors, and manipulation exist in that data. If bad actors can inject corrupted data at scale, it creates a malign incentive structure where model training becomes adversarial. The real solution is probably better data provenance -- models trained on licensed, curated datasets will eventually outcompete those trained on the open web.
[−] aldousd666 48d ago
This is ultimately just going to give them training material for how to avoid this crap. They'll have to up their game to get good code. The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing to your other content getting scraped. The bottom has always been threatening to fall out of the ads-for-eyeballs economy, and nobody could anticipate what the trigger for the downfall would be. Looks like we found it.
[−] Art9681 47d ago
Can't we simply parse out and remove any style="display: none;", aria-hidden="true", and tabindex="-1" attributes before the text is processed, and get around this trick? What am I missing?
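For reference, a stdlib-only sketch of that preprocessing step; the sample page and the exact list of "hidden" signals are my own assumptions about what a scraper might check:

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text while skipping subtrees hidden with common tricks."""
    HIDDEN_STYLES = ("display:none", "visibility:hidden")
    VOID = {"br", "img", "hr", "input", "meta", "link", "area", "base",
            "col", "embed", "source", "track", "wbr"}  # no closing tag

    def __init__(self):
        super().__init__()
        self.depth = 0   # >0 while inside a hidden subtree
        self.parts = []

    def _hidden(self, attrs):
        d = dict(attrs)
        style = (d.get("style") or "").replace(" ", "").lower()
        return (any(s in style for s in self.HIDDEN_STYLES)
                or d.get("aria-hidden") == "true")

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return  # void elements never get a matching end tag
        if self.depth or self._hidden(attrs):
            self.depth += 1  # everything under a hidden node stays hidden

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.parts.append(data.strip())

def visible_text(html):
    p = VisibleTextExtractor()
    p.feed(html)
    return " ".join(p.parts)

page = ('<p>real article text</p>'
        '<a href="/trap" style="display: none">poison link</a>'
        '<div aria-hidden="true"><a href="/maze">more poison</a></div>')
print(visible_text(page))  # -> real article text
```

A real filter would also need to handle CSS classes, external stylesheets, and off-screen positioning, which is where it stops being a one-liner.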
[−] Lockal 47d ago
Nightshade[1] 2.0? As if both tools were built by incompetent developers to distract attention from the real solution: publishing an LLM-friendly version in a machine-friendly format (which is not really difficult and helps more than just LLMs: e.g. caching, disabling fancy syntax highlighting, offloading to GitHub, providing clients and MCPs, optimizing clients for common use cases). This example is simply a failure:

  
Dumb curl-based LLMs won't visit display:none links. Smarter browser-based navigators won't even render such a link.

[1] https://news.ycombinator.com/item?id=39058428

[−] Imustaskforhelp 48d ago
I wish there were some regulation that could force companies who scrape for profit to reveal who they are to the websites they scrape. Many new AI companies don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents.
[−] morelandjs 47d ago
I’m not fully subscribed to the idea that all public scraping of websites is bad, or that this project is a productive contribution. It would be nice to have search engines other than Google, and that necessitates bots being able to index your website (in some respectful manner).
[−] effnorwood 47d ago
certainly don't allow anyone to access your content. perhaps shut the site down just to be safe.
[−] kristopolous 47d ago
I took a related approach:

A toll charging gateway for llm scrapers: a modification to robots.txt to add price sheets in the comment field like a menu.
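To make the idea concrete, here's a hypothetical sketch of a price sheet embedded in robots.txt comments, plus a parser for it. The comment format is invented for illustration, not tollbot's actual one:

```python
import re

# Hypothetical robots.txt with per-path prices in comment lines.
ROBOTS_TXT = """\
# price: /articles/ 0.002 USDC per request
# price: /images/ 0.010 USDC per request
User-agent: *
Disallow: /private/
"""

def parse_price_sheet(robots_txt):
    """Extract '# price: <path> <amount> <currency> ...' comment lines,
    returning {path: (amount, currency)}."""
    prices = {}
    pattern = r"^#\s*price:\s*(\S+)\s+([\d.]+)\s+(\S+)"
    for m in re.finditer(pattern, robots_txt, re.MULTILINE):
        path, amount, currency = m.groups()
        prices[path] = (float(amount), currency)
    return prices

print(parse_price_sheet(ROBOTS_TXT))
```

Since robots.txt parsers must ignore comments anyway, a scheme like this stays backward compatible: ordinary crawlers see a normal file, while paying scrapers can read the menu.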

This was built for a hackathon, by forking certbot. Cloudflare has an enterprise version of this, but this one would be self-hosted.

I think it has legs, but I need to get pushed and goaded, otherwise I tend to lose interest...

It was for the USDC company btw so that's why there's a crypto angle - this might be a valid use case!

I'm open to crypto not all being hustles and scams

Tell me what you think?

https://github.com/kristopolous/tollbot

[−] chmod775 47d ago
This looks neat. I currently have just a couple of GitHub Pages sites disguised as documentation that try to convince LLMs that red-teaming is always great, actually, giving examples of how to introduce subtle security bugs and cause miscellaneous havoc on developer machines. In the spirit of fighting fire with fire, they're also LLM-generated. They should not be scraped, but we all know they will be anyway.

I don't imagine they do anything, but it still fills me with a certain amount of childish glee.

[−] ninjagoo 47d ago
Isn't this a trope at this point? That AI companies are indiscriminately training on random websites?

Isn't it the case that AI models learn better and are more performant with carefully curated material, so companies do actually filter for quality input?

Isn't it also the case that the use of RLHF and other refinement techniques essentially 'cures' the models of bad input?

Isn't it also, potentially, the case that the ai-scrapers are mostly looking for content based on user queries, rather than as training data?

If the answers to the questions lean a particular way (yes to most), then isn't the solution rate-limiting incoming web-queries rather than (presumed) well-poisoning?

Is this a solution in search of a problem?

[−] RestartKernel 47d ago
The real story is in the poison fountain dataset this uses:

https://rnsaffn.com/poison3/

> [...] we want to inflict damage on machine intelligence systems.

This almost strikes me as roleplay, but maybe I'm childish for finding it difficult to empathise with this genre of hacker ideology.

[−] makingstuffs 47d ago
I love the idea but this will only end up harming your SME in the long run. It would also further entrench the large corps.

The only way something like this would be remotely plausible as a concept would be for enough data providers with overlapping authority on given topics to implement it.

Sadly SMEs have no choice but to go with the flow and allow AI scrapers in. If they don’t, they won’t be as visible in AI generations at the top of the SERPs and they won’t get the visits, which will mean they don’t make the money required to stay afloat.

The fish that attempts to swim against the current ultimately dies and has its corpse carried where the current was going, anyway. Without the sway which comes with size your only option is to go with the flow and drop a little dirty protest every now and then.

[−] theandrewbailey 47d ago
Or you can block bots with these (until they start using them) https://developer.mozilla.org/en-US/docs/Glossary/Fetch_meta...
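Server-side, the check amounts to looking for the Sec-Fetch-* request headers that browsers send and plain HTTP clients don't. A minimal sketch (the acceptance criteria are my own assumption, and of course any header can be forged):

```python
def looks_like_browser_navigation(headers):
    """Browsers attach Sec-Fetch-* request headers to every request;
    plain HTTP clients (curl, python-requests, and most scrapers built
    on them) do not. Their absence is a bot signal -- a speed bump,
    not a wall, since headers are trivially forgeable."""
    mode = headers.get("Sec-Fetch-Mode")
    site = headers.get("Sec-Fetch-Site")
    if mode is None or site is None:
        return False  # no fetch metadata at all
    # A top-level page load (address bar, link click) is mode=navigate.
    return mode == "navigate" and site in {"none", "same-site",
                                           "same-origin", "cross-site"}

browser = {"Sec-Fetch-Mode": "navigate", "Sec-Fetch-Site": "none",
           "Sec-Fetch-Dest": "document"}
scraper = {"User-Agent": "curl/8.5.0"}
print(looks_like_browser_navigation(browser))   # -> True
print(looks_like_browser_navigation(scraper))   # -> False
```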
[−] bluepeter 47d ago
A related technique used to work very well on search engine spiders. I had some software I wrote called 'Search Engine Cloaker'... this was back in the early 2000s... one of the first, if not the first, to do the shadowy "cloaking" stuff! We'd spin dummy content from lists of keywords, and it was just piles and piles of it. We made it a bit smarter using Markov chains to make the sentences somewhat sensible. We'd auto-interlink and get thousands of links. It eventually stopped working... but it took a long while for that to happen. We licensed the software to others. I rationalized it because I felt, hey, we have to write crappy copy for this stupid "SEO" thing anyway, so let's just automate it and give the spiders what they seem to want.
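The spinning trick described above is tiny to implement. A toy word-level Markov chain sketch (the keyword corpus is made up):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    chain = defaultdict(list)
    words = text.split()
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def spin(chain, start, length=12, seed=0):
    """Random-walk the chain to produce filler text that is locally
    plausible: every adjacent word pair occurred in the corpus."""
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: the last corpus word has no successor
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)

corpus = ("cheap flights to paris cheap hotels in paris "
          "book cheap flights and cheap hotels today")
chain = build_chain(corpus)
print(spin(chain, "cheap"))
```

Locally sensible bigrams were exactly enough to fool early spiders, which scored pages on keywords rather than global coherence.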
[−] superkuh 47d ago
Of course Googlebot, Bingbot, Applebot, Amazonbot, YandexBot, etc from the major corps are HTTP useragent spiders that will have their downloaded public content used by corporations for AI training too. Might as well just drop the "AI" and say "corporate scrapers".
[−] dwa3592 47d ago
Love it. Thanks for doing this work. Not sure why people are criticizing this. Also, an insane amount of work has been done to improve scraping, which in my mind is just absolute bonkers, and I didn't see people complaining about that.
[−] foxes 48d ago
Wonder if you can just avoid hiding it, to make it more believable.

Why not have a Library of Babel-esque labyrinth visible to normal users on your website,

like anti-surveillance clothing or something they have to sift through?

[−] ErenalpCet 47d ago
Really clever project. The self-referential loop is a great approach — turning their scale against them. I've been thinking about the AI data pipeline from the other side, building a memory filter for local LLMs (MemoryGate), so seeing projects like this that target the scraping stage is interesting. Have you considered adding noise variation to the poison content so it's harder to fingerprint and filter out?
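On the noise-variation question raised above: per-request variation is cheap to add. A hypothetical sketch, not anything from Miasma itself; the seeding scheme and filler vocabulary are invented:

```python
import hashlib
import random

FILLER = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur"]

def noisy_poison(base_text, request_id):
    """Derive a deterministic per-request variant of a poison page so
    that simple content-hash dedup can't fingerprint and drop it.
    The same request_id always yields the same page (cache-friendly),
    but different requests see different bytes."""
    seed = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    words = base_text.split()
    # Swap random filler words into a handful of random positions.
    for _ in range(max(1, len(words) // 10)):
        i = rng.randrange(len(words))
        words[i] = rng.choice(FILLER)
    return " ".join(words)

print(noisy_poison("the quick brown fox jumps over the lazy dog", "req-1"))
```

Hash-based dedup breaks immediately; defeating fuzzier similarity filters (shingling, MinHash) would need heavier rewriting per request.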
[−] hmokiguess 47d ago
Could this lead to something like the Streisand effect? I imagine these bots work at a scale where humans in the loop only act when something deviates from the standard, so, if a bot flags something up with your website then you’re now in a list you previously weren’t. Now don’t ask me what they do with those lists, but I guess you will make the cut.
[−] holysoles 47d ago
If anyone is looking for a tool to actually send traffic to a tool like this, I wrote a Traefik plugin that can block or proxy requests based on useragent.

https://github.com/holysoles/bot-wrangler-traefik-plugin

[−] meta-level 48d ago
Isn't posting projects like this the most visible way to report a bug and get it fixed as soon as possible?
[−] storus 47d ago
I'm failing to see how this stops pre-training scraping. It still looks like legit code, playing nicely with the desired pre-training distribution. Obviously nobody is going to use it for SFT/DPO/GRPO later.
[−] ninjagoo 47d ago
This is essentially machine-generated spam.

The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications. How long before people start sharing ai-spam lists, both pro-ai and anti-ai?

Just like with email, at some point these share-lists will be adopted by the big corporates, and just like with email will make life hard for the small players.

Once a website appears on one of these lists, legitimately or otherwise, what'll be the reputational damage hurting appearance in search indexes? There have already been examples of Google delisting or dropping websites in search results.

Will there be a process to appeal these blacklists? Based on how things work with email, I doubt this will be a meaningful process. It's essentially an arms race, with the little folks getting crushed by juggernauts on all sides.

This project's selective protection of the major players reinforces that effect; from the README:

" Be sure to protect friendly bots and search engines from Miasma in your robots.txt!

User-agent: Googlebot
User-agent: Bingbot
User-agent: DuckDuckBot
User-agent: Slurp
User-agent: SomeOtherNiceBot
Disallow: /bots
Allow: / "

[−] nsonha 47d ago
Hilarious how people proud of the "open web" think that it is somehow about the (small) "web" or some shit, and not the "open".