My platform has 24M pages on 8 domains and these NASTY crawlers insist on visiting every single one of them. For every 1 real visitor there are at least 300 requests from residential proxies. And that's after I blocked complete countries like Russia, China, Taiwan and Singapore.
Even Cloudflares bot filter only blocks some of them.
I'm using honeypot URLs right now to block all crawlers that ignore rel="nofollow", but they appear to have many millions of devices. I wouldn't be surprised if there are a gazillion residential routers, webcams and phones that are hacked to function as a simple doorways.
Have you considered recaptcha v2 and similar? Proof of work might slow them down. Sounds pretty bad. Would be great if Cloudflare, Datadome, etc. were doing this for you and thus banning these devices for everyone.
> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.
Yeah that's a good observation. XML's closing tags give the model structural anchors during generation — it knows where it is in the nesting. JSON doesn't have that, so the deeper the nesting the more likely the model loses track of brackets.
We see this especially with arrays of objects where each object has optional nested fields. For complex nested objects, the model can get all items well formatted but one with an invalid field of wrong type. That's why we put effort into the repair/recovery/sanitization layer — validate field-by-field and keep what's valid rather than throwing everything out.
This looks pretty interesting! I haven't used it yet, but looked through the code a bit, it looks like it uses turndown to convert the html to markdown first, then it passes that to the LLM so assuming that's a huge reduction in tokens by preprocessing. Do you have any data on how often this can cause issues? ie tables or other information being lost?
Then langchain and structured schemas for the output along w/ a specific system prompt for the LLM. Do you know which open source models work best or do you just use gemini in production?
Also, looking at the docs, Gemini 2.5 flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so might want to update that to Gemini 3 Flash in the examples.
Maybe it's time scrapers actually paid publishers via something like HTTP 402 for their data instead of an arms race with Cloudflare on one side and residential proxies on the other.
I need to extract article content, determine it's sentiment towards a keyword and output a simple json with article name, url, sentiment and some text around the found keyword.
Currently I'm having problems with the json output, it's not reliable enough and produces a lot of false json.
It may or may not be, but if you want people to actually use this product I’d suggest improving your documentation and replies here to not look like raw Claude output.
I also doubt the premise that about malformed JSON. I have never encountered anything like what you are describing with structured outputs.
49 comments
Even Cloudflares bot filter only blocks some of them.
I'm using honeypot URLs right now to block all crawlers that ignore rel="nofollow", but they appear to have many millions of devices. I wouldn't be surprised if there are a gazillion residential routers, webcams and phones that are hacked to function as a simple doorways.
Things are really getting out of hand.
> LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.
We see this especially with arrays of objects where each object has optional nested fields. For complex nested objects, the model can get all items well formatted but one with an invalid field of wrong type. That's why we put effort into the repair/recovery/sanitization layer — validate field-by-field and keep what's valid rather than throwing everything out.
Also, a model can always use a proxy to turn your tool calls into XML
And feed you back json right away and you wouldn't even know if any transformation did take place.
> Avoid detection with built-in anti-bot patches and proxy configuration for reliable web scraping.
And it doesn't care about robots.txt.
Then langchain and structured schemas for the output along w/ a specific system prompt for the LLM. Do you know which open source models work best or do you just use gemini in production?
Also, looking at the docs, Gemini 2.5 flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so might want to update that to Gemini 3 Flash in the examples.
I need to extract article content, determine it's sentiment towards a keyword and output a simple json with article name, url, sentiment and some text around the found keyword.
Currently I'm having problems with the json output, it's not reliable enough and produces a lot of false json.
It may or may not be, but if you want people to actually use this product I’d suggest improving your documentation and replies here to not look like raw Claude output.
I also doubt the premise that about malformed JSON. I have never encountered anything like what you are describing with structured outputs.