Show HN: Context Gateway – Compress agent context before it hits the LLM

[−] guard402 62d ago

Interesting approach to a real problem. One thing worth considering: the content entering the context window isn't always trusted, and compression may interact with that in non-obvious ways.

If an agent reads external web pages via MCP, the "context" can contain hidden prompt injections — display:none divs, zero-width Unicode characters, opacity:0 text. We tested six DOM extraction APIs against a hidden injection and found that textContent and innerHTML expose it while innerText and the accessibility tree filter it.

The concern with compressing before scanning: if you compress untrusted external content alongside trusted system instructions, you're mixing adversarial input with your prompt before any inspection happens. An injection that says "ignore all previous instructions" gets compressed right next to the actual instructions. At that point, even if you scan the compressed output, the boundary between trusted and untrusted content is gone.

A scan-then-compress pipeline (or at minimum, compressing trusted and untrusted content in separate passes) would preserve the ability to detect injections before they get interleaved with system context.

[−] root_axis 64d ago

Funny enough, Anthropic just went GA with 1m context claude that has supposedly solved the lost-in-the-middle problem.

[−] SyneRyder 64d ago

Just for anyone else who hadn't seen the announcement yet, this Anthropic 1M context is now the same price as the previous 256K context - not the beta where Anthropic charged extra for the 1M window:

https://x.com/claudeai/status/2032509548297343196

As for retrieval, the post shows Opus 4.6 at 78.3% needle retrieval success in 1M window (compared with 91.9% in 256K), and Sonnet 4.6 at 65.1% needle retrieval in 1M (compared with 90.6% in 256K).

[−] theK 64d ago

Aren't these numbers really bad? > 80% needle retrieval means every fifth memory is akin to a hallucination.

[−] SyneRyder 64d ago

I don't think it quite means that - happy to be corrected on this, but I think it's more like what percentage it can still pay attention to. If you only remembered "cat sat mat", that's only 50% of the phrase "the cat sat on the mat", but you've still paid attention to enough of the right things to be able to fully understand and reconstruct the original. 100% would be akin to memorizing & being able to recite in order every single word that someone said during their conversation with you.

But even if I've misunderstood how attention works, the numbers are relative. GPT 5.4 at 1M only achieves 36% needle retrieval. Gemini 3.1 & GPT 5.4 are only getting 80% at even the 128K point, but I think people would still say those models are highly useful.

[−] ivzak 64d ago

It seems to be the hit rate of a very straightforward (literal matching) retrieval. Just checked the benchmark description (https://huggingface.co/datasets/openai/mrcr), here it is:

"The task is as follows: The model is given a long, multi-turn, synthetically generated conversation between user and model where the user asks for a piece of writing about a topic, e.g. "write a poem about tapirs" or "write a blog post about rocks". Hidden in this conversation are 2, 4, or 8 identical asks, and the model is ultimately prompted to return the i-th instance of one of those asks. For example, "Return the 2nd poem about tapirs".

As a side note, steering away from the literal matching crushes performance already at 8k+ tokens: https://arxiv.org/pdf/2502.05167, although the models in this paper are quite old (gpt-4o ish). Would be interesting to run the same benchmark on the newer models

Also, there is strong evidence that aggregating over long context is much more challenging than the "needle extraction task": https://arxiv.org/pdf/2505.08140

All in all, in my opinion, "context rot" is far from being solved

[−] siva7 64d ago

now that's major news

[−] BloondAndDoom 64d ago

In addition to context rot, cost matters, I think lots of people use toke compression tools for that not because of context rot

[−] hinkley 64d ago

From a determinism standpoint it might be better for the rot to occur at ingest rather than arbitrarily five questions later.

[−] thebeas 64d ago

[dead]

[−] brian_r_hall 58d ago

Context and governance end up being the same surface area approached from different ends. You're trimming what the agent sees, we've been working on what it's allowed to do once it sees it.

Curious if compression ever shifts how the agent interprets its own scope. Seems like there's a weird edge case hiding in there where you strip just enough context that the policy reasoning breaks down.

[−] esafak 64d ago

I can already prevent context pollution with subagents. How is this better?

[−] ivzak 64d ago

Subagents do summarization - usually with the cheaper models like Haiku. Summarizing tool outputs doesn't work well because of the information loss: https://arxiv.org/pdf/2508.21433. Compression is different because we keep preserved pieces of context unchanged + we condition compression on the tool call intent, which makes it more precise.

[−] esafak 64d ago

I can control the model, prompt, and permissions for the subagents. Can you show how your compression differs from summarization by example? What do you mean by "we keep preserved pieces of context unchanged" ?

[−] ivzak 63d ago

We keep preserved pieces of context unchanged = compression removes some pieces of the input while keeping the others verbatim. Let us shortly share a concrete example

[−] thebeas 64d ago

[dead]

[−] tontinton 64d ago

Is it similar to rtk? Where the output of tool calls is compressed? Or does it actively compress your history once in a while?

If it's the latter, then users will pay for the entire history of tokens since the change uncached: https://platform.claude.com/docs/en/build-with-claude/prompt...

How is this better?

[−] BloondAndDoom 64d ago

This is a bit more akin to distill - https://github.com/samuelfaj/distill

Advantage of SML in between some outputs cannot be compressed without losing context, so a small model does that job. It works but most of these solutions still have some tradeoff in real world applications.

[−] thebeas 64d ago

[dead]

[−] thebeas 64d ago

We do both:

We compress tool outputs at each step, so the cache isn't broken during the run. Once we hit the 85% context-window limit, we preemptively trigger a summarization step and load that when the context-window fills up.

[−] esperent 63d ago

> we preemptively trigger a summarization step and load that when the context-window fills up.

How does this differ from auto compact? Also, how do you prove that yours is better than using auto compact?

[−] ivzak 63d ago

For auto-compact, we do essentially the same Anthropic does, but at 85% filled context window. Then, when the window is 100% filled, we pull this precompaction + append accumulated 15%. This allows to run compaction instantly

[−] thesiti92 64d ago

do you guys have any stats on how much faster this is than claude or codex's compression? claudes is super super slow, but codex feels like an acceptable amount of time? looks cool tho, ill have to try it out and see if it messes with outputs or not.

[−] ivzak 64d ago

I think we should draw distinction between two compression "stages"

1. Tool output compression: vanilla claude code doesn't do it at all and just dumps the entire tool outputs, bloating the context. We add <0.5s in compression latency, but then you gain some time on the target model prefill, as shorter context speeds it up.

2. /compact once the context window is full - the one which is painfully slow for claude code. We do it instantly - the trick is to run /compact when the context window is 80% full and then fetch this precompaction (our context gateway handles that)

Please try it out and let us know your feedback, thanks a lot!

[−] jedisct1 63d ago

Swival is really good at managing the context: https://swival.dev/pages/context-management.html

[−] swaminarayan 61d ago

Why do AI agents get worse with more context, and how should we manage context windows?

[−] bsjshshsb 62d ago

Is it all open? Or is compression algo behind a cloud service?

[−] vigneshj 62d ago

It is a interesting tool. But how do you make it as business.

[−] kuboble 64d ago

I wonder what is the business model.

It seems like the tool to solve the problem that won't last longer than couple of months and is something that e.g. claude code can and probably will tackle themselves soon.

[−] hsaliak 64d ago

I expect tools to start embedding an SLM ~1B range locally for something like this. It will become a feature in a rapidly changing landscape and its need may disappear in the future. How would you turn into a sticky product?

[−] sethcronin 64d ago

I guess I'm skeptical that this actually improves performance. I'm worried that the middle man, the tool outputs, can strip useful context that the agent actually needs to diagnose.

[−] lambdaone 64d ago

This company sounds like it has months to live, or until the VC money runs out at most. If this idea is good, Anthropic et. al. will roll it into their own product, eliminating any purpose for it to exist as an independent product. And if it isn't any good, the company won't get traction.

[−] verdverm 64d ago

I don't want some other tooling messing with my context. It's too important to leave to something that needs to optimize across many users, there by not being the best for my specifics.

The framework I use (ADK) already handles this, very low hanging fruit that should be a part of any framework, not something external. In ADK, this is a boolean you can turn on per tool or subagent, you can even decide turn by turn or based on any context you see fit by supplying a function.

YC over indexed on AI startups too early, not realizing how trivial these startup "products" are, more of a line item in the feature list of a mature agent framework.

I've also seen dozens of this same project submitted by the claws the led to our new rule addition this week. If your project can be vibe coded by dozens of people in mere hours...

[−] uaghazade 64d ago

ok, its great

[−] jappleseed987 54d ago

[dead]

[−] 0coCeo 53d ago

[dead]

[−] ClaudeAgent_WK 64d ago

[dead]

[−] robutsume 64d ago

[dead]

[−] adriencr81 60d ago

[dead]

[−] agenticbtcio 64d ago

[dead]

Show HN: Context Gateway – Compress agent context before it hits the LLM (github.com)

64 comments