The bot situation on the internet is worse than you could imagine (gladeart.com)

by ohjeez 166 comments 238 points

[−] lm411 47d ago
AI companies and notably AI scrapers are a cancer that is destroying what's left of the WWW.

I was hit with a pretty substantial botnet "distributed scraping" attack yesterday.

- About 400,000 different IP addresses over about 3 hours

- Mostly residential IP addresses

- Valid and unique user agents and referrers

- Each IP address would make only a few requests with a long delay in between requests

It would hit the server hard until the server became slow to respond, then it would back off for about 30 seconds, then hit hard again. I was able to block most of the requests with a combination of user agent and referrer patterns, though some legit users may be blocked.

The attack was annoying, but the even bigger problem is that the data on this website is under license - we have to pay for it, and it's not cheap. We can (barely) pay for it with advertising revenue and some subscriptions.

If everyone is getting this data from their "agent" and scrapers, that means no advertising revenue, and soon enough no more website to scrape, jobs lost, nowhere for scrapers to scrape for the data, nowhere for legit users to get the data for free, etc.
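A UA/referrer pattern filter like the one described can be sketched in a few lines; the patterns below are purely illustrative, since the commenter didn't share their actual rules:

```python
import re

# Illustrative deny patterns -- NOT the commenter's real rules.
UA_BLOCK = re.compile(r"Headless|python-requests|Scrapy|curl", re.I)
REFERRER_BLOCK = re.compile(r"semalt\.com|bit\.ly", re.I)  # hypothetical spam referrers

def should_block(user_agent: str, referrer: str) -> bool:
    """Crude pattern filter; as the commenter notes, it can catch legit users too."""
    return bool(UA_BLOCK.search(user_agent) or REFERRER_BLOCK.search(referrer))
```

In practice this logic would live in middleware or the web server's config (e.g. nginx `map` blocks) rather than in application code.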

[−] everdrive 47d ago
Thanks for sharing the perspective here. I think a lot of folks on HN have rightly said that a lot of the problems with the modern internet are due to the ad-supported business model. I don't think you were ever going to move away from it voluntarily -- too many people support it, even if they grumble about it.

But maybe (and likely for worse) LLMs will finally kill this model.

[−] shimman 47d ago
Do you not run Anubis or have strict fail2ban rules? I just straight up ban IPs forever if they lookup files that will never exist on my servers. That plus Anubis with the strictest settings.

https://anubis.techaro.lol/
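The "request a file that will never exist, get banned forever" trick can be sketched as a small access-log filter; the log format and honeypot paths here are assumptions, and the resulting IP set would be fed to fail2ban or a firewall:

```python
import re

# Hypothetical honeypot paths that no legitimate client ever requests.
HONEYPOT_PATHS = {"/wp-login.php", "/.env", "/phpmyadmin/index.php"}

# Matches the client IP and request path in a common-log-format line.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def ips_to_ban(log_lines):
    """Return every IP that ever touched a honeypot path."""
    banned = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(2) in HONEYPOT_PATHS:
            banned.add(m.group(1))
    return banned
```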

[−] ctoth 47d ago
If you don't mind me asking, what sort of data are you licensing? I noticed that you explicitly don't mention it.
[−] afinlayson 47d ago
At some point there needs to be a check for whether it's a real human... But it's a cat-and-mouse game: any way we create to keep bots off gets worked around by clever engineers.
[−] Saris 45d ago
What I don't understand is why a bot/scraper needs to load every page and image multiple times in the same hour or whatever session it's doing on my site. If I have say 10 pages and 100 images, surely 110 requests should be all it needs to load everything.
[−] wiseowise 47d ago
Don’t worry, man, once AGI is here you’ll get your allowance (or whatever the hyperscalers’ plan is).
[−] PearlRiver 47d ago
Unfortunately nobody cares about destroying the internet if it gets them a Lambo.

Greed and ignorance have taken over the tech industry.

[−] oasisbob 47d ago
Knew it was getting bad, but Meta's facebookexternalhit bot changed its behavior recently.

In addition to pulling responses with huge amplification (40x, at least, for posting a single Facebook post to an empty audience), it's sending us traffic with fbclids in the mix. No idea why.

They're also sending tons of masked traffic from their ASN (and EC2), with a fully deceptive UserAgent.

The weirdest part though is that it's scraping mobile-app APIs associated with the site in high volume. We see a ton of other AI-training focused crawlers do this, but was surprised to see the sudden change in behavior on facebookexternalhit ... happened in the last week or so.

Everyone is nuts these days. Got DoSed by Amazonbot this month too. They refuse to tell me what happened, citing the competitive environment.

[−] pinkmuffinere 47d ago
I’ve been sitting on this page for two minutes and it’s still not sure whether I’m a bot lol. What did I do in a past life to deserve this :(
[−] salomonk_mur 47d ago
I'm surprised at the effectiveness of simple PoW to stop practically all activity.

I'll implement Anubis at low difficulty for all my projects and leave a decent llms.txt referenced in my sitemap and robots.txt, so LLMs can still get relevant data from my site while keeping bad bots out. I'm getting thousands of requests from China that have really increased costs; glad the fix seems rather easy.

[−] simonw 47d ago

> These bots are almost certainly scraping data for AI training; normal bad actors don't have funding for millions of unique IPs thrown at a page. They probably belong to several different companies. Perhaps they sell their scraped data to AI companies, or they are AI companies themselves. We can't tell, but we can guess since there aren't all that many large AI corporations out there.

Is the theory here that OpenAI, Anthropic, Gemini, xAI, Qwen, Z.ai etc are all either running bad scrapers via domestic proxies in Indonesia, or are buying data from companies that run those scrapers?

I want to know for sure. Who is paying for this activity? What does the marketplace for scraped data look like?

[−] NooneAtAll3 47d ago

> Before it was enabled, it was getting several hundred-thousand requests each day. As soon as Anubis became active in there, it decreased to about 11 requests after 24 hours

I love experimental data like this. So much better than the gut reactions that got spammed when Anubis was first introduced.

[−] rz2k 47d ago
On my computer, with Firefox it uses 14 CPU cores, consumes an extra 35 Watts, and the progress bar barely moves. Is this site mining cryptocurrency?

On Safari or Orion it is merely extremely slow to load.

I definitely wouldn't use any of this on a site that you don't want delisted for cryptojacking.

[−] JeanMarcS 47d ago
I'm getting this pattern a lot on Prestashop websites, where thousands, if not hundreds of thousands, of requests come from bots that don't announce themselves in the User-Agent, coming from different IPs.

Very annoying. And you can't filter them because they look like legitimate traffic.

On a page with different options (such as color, size, etc.) they'll try all the combinations, eating up all the resources.

[−] cullenking 47d ago
We started building out a set of spam/fraud/bot management tooling. If you have any decent infrastructure in place already, this is a pretty manageable task with a mishmash of techniques: ASN-based blocking (IP-lookup databases can be self-hosted and contain ASN data) for the obvious ones like Alibaba etc., and subnet blocking for the less obvious (see pattern, block subnet; alleviates but doesn't solve the problem).

If you have a logging stack, you can easily find crawler/bot patterns, then flag candidate IP subnets for blocking.

It's definitely whackamole though. We are experimenting with blocking based on risk databases, which run between $2k and $10k a year depending on provider. These map IP ranges to booleans like is_vpn, is_tor, etc, and also contain ASN information. Slightly suspicious crawling behavior or keyword flagging combined with a hit in that DB, and you have a high confidence block.

All this stuff is now easy to home-roll with Claude. Before, it would have been a major PITA.
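The subnet-blocking part is straightforward to home-roll with Python's standard `ipaddress` module; the subnets below are illustrative placeholders, not the commenter's actual blocklist:

```python
import ipaddress

# Illustrative ASN-derived subnets chosen for blocking (placeholders).
BLOCKED_NETS = [ipaddress.ip_network(n) for n in ("47.74.0.0/15", "8.208.0.0/16")]

def is_blocked(ip: str) -> bool:
    """True if the client IP falls inside any blocked subnet."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)
```

A risk-database lookup (is_vpn, is_tor) would slot in as one more boolean alongside this check.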

[−] bob1029 47d ago

> safari can't open the page

What is the point of these anti bot measures if organic HN traffic can nuke your site regardless? If this is about protecting information from being acquired by undesirable parties, then this site is currently operating in the most ideal way possible.

The information will eventually be ripped out. You cannot defeat an army with direct access to TSMC's wafer start budget and Microsoft's cloud infrastructure. I would find a different hill to die on. This is exactly like the cookie banners. No one is winning anything here. Publishing information to the public internet is a binary decision. If you need to control access, you do what Netflix and countless others have done. You can't have it both ways.

[−] LeoPanthera 47d ago
Is Anubis being set to difficulty 8 on this page supposed to be a joke? I gave up after about 20 seconds.
[−] siva7 47d ago
So, the elephant in the room: how much of HN is bot-generated? Those who know have every incentive not to share, and those who don't have no way to figure it out. At this point I have to assume that every new account is a bot.
[−] sltkr 47d ago
Looks like Anubis is also blocking robots.txt which seems to defeat the point of having robots.txt in the first place.
[−] tromp 47d ago

> let webWorkerURL = `${options.basePrefix}/.within.website/x/cmd/anubis/static/js/worker/sha256-${workerMethod}.mjs?cacheBuster=${options.version}`;

It looks like it's computing sha256 hashes. Such an ASIC friendly PoW has the downside that someone with ASICs would be able to either overwhelm the site or drive up the difficulty so high that CPUs can never get through.
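For context, a sha256 proof-of-work scheme of this kind boils down to grinding a nonce until the hash clears a difficulty target. This is a minimal sketch of the idea, not Anubis's actual implementation (difficulty is counted here in leading zero hex digits):

```python
import hashlib

def solve(challenge: str, difficulty: int) -> int:
    """Grind nonces until sha256(challenge + nonce) starts with
    `difficulty` zero hex digits."""
    nonce = 0
    target = "0" * difficulty
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: one hash to check what took the client many."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Each extra digit of difficulty multiplies the expected work by 16, which is why the difficulty-8 setting reported elsewhere in this thread stalls browsers, and why ASICs, which compute sha256 orders of magnitude faster than CPUs, break the asymmetry.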

[−] Retr0id 47d ago
Maybe my imagination is just too accurate but this didn't tell me anything I didn't expect to hear.

> Here is a massive log file for some activity in the Data Export tar pit:

A bit of a privacy faux pas, no? Some visitors may be legitimate.

[−] gostsamo 47d ago
I don't know if they have an issue with my FF+uBO, but Anubis has been blocking me for almost a minute now. Screw them.
[−] goodmythical 47d ago
Looks like they've gone ahead and implemented the easiest foolproof method of preventing scraping, as the site is currently not loading across multiple devices.

Not even a 404; it's just not available at all.

[−] xeyownt 47d ago
Not sure what they are doing, but they don't seem to do it well.
[−] lizknope 47d ago

> The IPs of these bots here actually do not come from datacenters or VPNs most of the time; the overwhelming majority come from residential and mobile networks.

So I started searching for what these residential proxy networks actually are.

https://datadome.co/bot-management-protection/how-proxy-prov...

[−] jwr 47d ago
An interesting and sad aspect of the war on bots and scraping that is being waged is that we are hurting ourselves in the process, too. Many tasks I'm trying to get my AI assistant to do cannot be done quickly, because sites defensively prohibit access to their content. I'm not scraping: it's my agent trying to fetch a page or two to perform a task for me (such as check pricing or availability).

We need a better solution.

[−] timshell 47d ago
My grad school research was on computational models of human/machine cognition, and I'm now commercializing it as a 'proof-of-human API' for bot detection, spam reduction, and identity verification.

One mistake people make is assuming that AI capability implies humanness. If you know exactly where to look, you can start to identify differences between improving frontier models and human cognition.

One concrete example from a forthcoming blog post of mine:

[begin]

In fact, CAPTCHAs can still be effective if you know where to look.

We ran 75 trials -- 388 total attempts -- benchmarking three frontier AI agents against reCAPTCHA v2 image challenges. We looked across two categories: static, where each image grid is an individual target, and cross-tile challenges, where an object spans multiple tiles.

On static challenges, the agents performed respectably. Claude Sonnet 4.5 solved 47%. Gemini 2.5 Pro: 56%. GPT-5: 23%.

On cross-tile challenges: Claude scored 0%. Gemini: 2%. GPT-5: 1%.

In contrast, humans find cross-tile challenges easier than static ones. If you spot one tile that matches the target, your visual system follows the object into adjacent tiles automatically.

Agents find them nearly impossible. They evaluate each tile independently, produce perfectly rectangular selections, and fail on partial occlusion and boundary-spanning objects. They process the grid as nine separate classification problems. Humans process it as one scene.

The challenges hardest for humans -- ambiguous static grids where the target is small or unclear -- are easiest for agents. The challenges easiest for humans -- follow the object across tiles -- are hardest for agents. The difficulty curves are inverted. Not because agents are dumb, but because the two systems solve the problem with fundamentally different architectures.

Faking an output means producing the right answer. Faking a process means reverse-engineering the computational dynamics of a biological brain and reproducing them in real time. The first problem can be reduced to a machine learning classifier. The second is an unsolved scientific problem.

The standard objection is that any test can be defeated with sufficient incentive. But fraudsters weren't the ones who built the visual neural networks that defeated text CAPTCHAs -- researchers were. And they aren't solving quantum computing to undermine cryptography. The cost of spoofing an iris scan is an engineering problem. The cost of reproducing human cognition is a scientific one. These are not the same category of difficulty.

[end]

[−] rekabis 47d ago
Taking a 2024 report on bot loads on the Internet is like taking a 1950s Car & Driver article for modern vehicle stats.

That’s how fast the landscape is changing.

And remember: while the report might have been released in 2024, it takes time to conduct research and publish. A good chunk of its data was likely from 2023 and earlier.

[−] alexspring 47d ago
You can build some great anti-bot mechanisms with simple https://github.com/abrahamjuliot/creepjs logic. A normal user will often show a "headless-like" score of 31% or lower; mobile is a bit different. You'll still have trouble against sophisticated infra: https://x.com/_alexspring/status/2037968450753335617
[−] arjie 47d ago
My blog gets this degree of scraping too. I have some 5 million requests over the same period as they say they got 7 million over and I barely noticed before I put Cloudflare in front to cache and now I don’t notice at all. I have the Cloudflare AI stuff turned off and mostly use it through the tunnel so I don’t have to expose my local IP.

Is this actually a problem? Most of my requests claim to be Amazonbot but someone showed me they weren’t and I’ve forgotten how.

[−] neurostimulant 47d ago

> The IPs of these bots here actually do not come from datacenters or VPNs most of the time; the overwhelming majority come from residential and mobile networks. Asian and Indonesian countries are where nearly all of them reside.

It's really awful as an Indonesian; my Indonesian ISP regularly gets blocked by HN as well :(

[−] plandis 47d ago
At first glance this seems like a crypto miner.

Maybe I’m a bot, I gave up waiting before the progress bar was even 1% done.

[−] VladVladikoff 47d ago

>How can you protect your sites from these bots?

JA4 fingerprinting works decently for the residential proxies.

[−] nubinetwork 47d ago
The user agents in that screenshot are fake; nobody would be running Chrome 106 on Windows 10. Run a PHP script on every page that checks for valid combinations and 400s the rest.
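Sketched in Python rather than PHP, a version-plausibility check might look like the following; the cutoff and the single rule are purely illustrative, and a real deployment would validate UA/OS combinations against a maintained browser-release database:

```python
import re

# Hypothetical cutoff: Chrome majors below this are treated as implausibly
# old for live traffic. Real checks need a maintained release database.
MIN_CHROME_MAJOR = 110

def plausible_ua(ua: str) -> bool:
    """Reject user agents claiming an implausibly old Chrome version."""
    m = re.search(r"Chrome/(\d+)", ua)
    if m and int(m.group(1)) < MIN_CHROME_MAJOR:
        return False  # the server would respond 400 here
    return True
```

Returning 400 on a failed check is then one line in whatever framework serves the page.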
[−] mcv 47d ago
Worse than I could imagine? I imagine that bots might destroy the internet. Not just the internet as we know it; I mean make the internet completely unusable to any human being.
[−] throwawaypath 47d ago
So many Anubis comments here. There are various Anubis bypass extensions available. I haven't seen an Anubis prompt in years.
[−] chrsw 47d ago
I'm willing to wait a while for pages to load if it's an effective means of taking back the web from the bots.
[−] dmix 47d ago
As soon as I see that anime bot thing which this website is using I close the tab. More annoying than Cloudflare.
[−] ColinWright 47d ago
Quote:

> "The idea is that at individual scales the additional load is ignorable, ..."

Three minutes, one pixel of progress bar, 2 CPUs at 100%, load average 4.3 ...

The site is not protected by Anubis, it's blocked by it.

Closed.

[−] qwertyforce 47d ago
Noticed that Firefox gives 2x the kHashes/s of Chrome (1000 vs 500).
[−] m3kw9 47d ago
Employing constant FaceID checks could deter it.
[−] RobRivera 47d ago
Yea it's pretty bad
[−] ricardobeat 47d ago
I cannot get past the bot check (190kH/s), is it mining crypto on my laptop?
[−] abujazar 47d ago
What a great way to not get any traffic at all.
[−] garganzol 47d ago
Everybody says that bots are taking websites down, while marketing-oriented folks start practicing AO (agent optimization) to make their offerings even more available and far-reaching.

Good luck banning yourself from the future.

[−] Frank-Landry 47d ago
This sounds like something a bot would say.
[−] vondur 47d ago
OK, so I get a page saying it’s verifying I’m not a bot, with some kind of measurement-per-second counter, and I don’t get through. Is that the point?
[−] raincole 47d ago
I don't get what it is, or whether it's satire or not.

If a website takes this long to verify me, I'll bounce. That's it.

[−] AndrewKemendo 47d ago
The final Eternal September
[−] hnarn 46d ago

> NOTE: Use a VPN on these pages if you don't want your IP shown in the logs, but it won't be significant amongst the millions of others anyways

Is this supposed to be a joke? Is the author expecting users to travel back in time and use a VPN so their IP is scrubbed from logs that will get published at any time, because that's something the author just obviously has the right to do?

> The EDPB explicitly identifies IP addresses as being personal data due to their ability to identify individual data subjects.[1]

Dickhead.

[1]: https://techgdpr.com/blog/is-an-ip-address-considered-person...

[−] Bombthecat 47d ago
I very very very much doubt that lol

I know / we know lol

[−] qcautomation 47d ago
[dead]