Interesting, I’ve never needed 1M, or even 250k+ context. I’m usually under 100k per request.
About 80% of my code is AI-generated, with a controlled workflow using dev-chat.md and spec.md. I use Flash for code maps and auto-context, and GPT-4.5 or Opus for coding, all via API with a custom tool.
Gemini Pro and Flash have had 1M context for a long time, but even though I use Flash 3 a lot, and it’s awesome, I’ve never needed more than 200k.
For production coding, I use:
- a code map strategy on a big repo. Per file: summary, when_to_use, public_types, public_functions. This is done per file and saved until the file changes. With a concurrency of 32, I can usually code-map a huge repo in minutes. (Typically Flash: cheap, fast, and with very good results. A minimal sketch follows this list.)
- Then, auto-context, based on code lensing: auto-context takes some globs that narrow what the AI can see, and uses the intersection with the code map to ask the AI for the proper files to put in context. (Typically Flash: cheap, relatively fast, and very good.)
- Then, use a bigger model, GPT 5.4 or Opus 4.6, to do the work. At this point, context is typically between 30k and 80k.
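For the curious, a minimal sketch of what a per-file entry and the concurrent mapping pass could look like in Rust. The names and the tokio/futures/serde dependencies are my assumptions, not the actual tool:

```rust
use futures::{stream, StreamExt};
use serde::{Deserialize, Serialize};

// Hypothetical per-file code-map entry, cached until the file changes.
#[derive(Debug, Serialize, Deserialize)]
struct FileMapEntry {
    path: String,
    summary: String,
    when_to_use: String,
    public_types: Vec<String>,
    public_functions: Vec<String>,
}

// Stand-in for the real model call (e.g. a cheap, fast Flash request).
async fn map_one_file(path: String) -> FileMapEntry {
    FileMapEntry {
        path,
        summary: String::new(),
        when_to_use: String::new(),
        public_types: vec![],
        public_functions: vec![],
    }
}

#[tokio::main]
async fn main() {
    let files: Vec<String> = vec!["src/main.rs".into(), "src/lib.rs".into()];
    // Map up to 32 files at once, mirroring the concurrency described above.
    let entries: Vec<FileMapEntry> = stream::iter(files)
        .map(map_one_file)
        .buffer_unordered(32)
        .collect()
        .await;
    println!("mapped {} files", entries.len());
}
```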
What I’ve found is that this process is surprisingly effective at getting a high-quality response in one shot. It keeps everything focused on what’s needed for the job.
Higher precision on the input typically leads to higher precision on the output. That’s still true with AI.
For context, 75% of my code is Rust, and the other 25% is TS/CSS for web UI.
Anyway, it’s always interesting to learn about different approaches. I’d love to understand the use case where 1M context is really useful.
Yeah, this is the simpler and also effective strategy. A lot of people are building sophisticated AST RAG models, but you really just need to ask Claude to build a semantic index for each large-ish piece of code and reuse it when gathering context.
You have to make sure the semantic summary takes up significantly fewer tokens than just reading the code, or it's just a waste of tokens and time.
Then have a skill that uses the git log to lazily refresh the summary cache when needed.
It seems like a very good use of LLMs. You should write a blog post detailing your process, with examples, for people who aren't as deep into AI tools. I only use the web UI. Lots of what you're saying is beyond me, but it does sound like a clever strategy.
Yeah, we all converge to the same workflow. In the AI coding agent I'm working on now, I've added an "index" tool that uses tree-sitter to compress a code file and show the AI its skeleton.
Here's the implementation for the interested: https://github.com/tontinton/maki/blob/main/maki-code-index%...
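For illustration, here is a minimal sketch of what such a skeleton pass can look like, assuming the tree-sitter and tree-sitter-rust crates (0.20-style API). This is not the linked implementation, just the general idea:

```rust
// Hypothetical skeleton pass: keep only the first line of each top-level
// item, so the model sees signatures instead of full bodies.
// Assumes tree-sitter = "0.20" and tree-sitter-rust = "0.20".
fn skeleton(source: &str) -> String {
    let mut parser = tree_sitter::Parser::new();
    parser
        .set_language(tree_sitter_rust::language())
        .expect("grammar/library version mismatch");
    let tree = parser.parse(source, None).expect("parse failed");
    let root = tree.root_node();
    let mut cursor = root.walk();
    let mut out = String::new();
    for item in root.children(&mut cursor) {
        if let Ok(text) = item.utf8_text(source.as_bytes()) {
            out.push_str(text.lines().next().unwrap_or(""));
            out.push('\n');
        }
    }
    out
}
```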
I've always wanted to explore how to fit tree-sitter into this workflow. It's great to know that this works well too.
Thanks for sharing the code.
(Here is the AIPack runtime I built, MIT: https://github.com/aipack-ai/aipack, and here is the code for pro@coder: https://github.com/aipack-ai/packs-pro/tree/main/pro/coder. AIPack is in Rust, and AI Packs are in md/lua.)
I said, well yeah, but it's too sophisticated to be practical.
I'm curious, what does your workflow look like? I saw a plan prompt there, but no specs. When you want to change something, implement a new feature, etc., do you just prompt requirements, have it write the plan, and then have it work on it?
Fair point, but because I spent a year building and refining my custom tool, this is now the reality for all of my AI requests.
I prompt, press run, and then I get this flow:
- dev setup (dev-chat or plan)
- code-map (incremental: ~0s when cached, ~2m for the initial run)
- auto-context (~20s to 40s)
- final AI query (~30s to 2m)
For example, just now, in my Rust code (about 60k LOC), I wanted to change the data model and brainstorm with the AI to find the right design, and here is the auto-context it gave me:
- Reducing 381 context files (1.62 MB)
- Now 5 context files (27.90 KB)
- Reducing 11 knowledge files (30.16 KB)
- Now 3 knowledge files (5.62 KB)
The knowledge files are my "rust10x" best practices, and the context files are the source files.
How do you re-evaluate your approach? I'm asking because the landscape, at least from my lens, was completely different a year ago. So I fear that, as the foundation shifts, whatever learnings, approaches, and mental models I have risk becoming obsolete and starting to work against me.
The problem of evaluating is hard enough as it is without layers of indirection built on top of it.
I built myself an AST-based solution for that over roughly the last 6 months. I always wondered whether grep and agent-based discovery would be the end of it, and thought a more deterministic approach just had to be better.
In the end it's hard to measure, but personally I feel that my agent rarely misses any context for a given task, so I'm pretty happy with it.
I used a different approach than tree-sitter because I thought I'd found a nice way to get around having to write language-specific code. I basically use VSCode as a language backend and wrote some logic to rebuild the AST from VSCode's symbol data and other APIs.
That allows me to just install the correct language extension and thus enable support for that specific language. The extension has to provide symbol information which most do through LSP.
In the end it was way more effort than just using tree-sitter, however, and I'm thinking of doing a slow migration to that approach sooner or later.
Anyway, I created an extension that spins up an MCP server and provides several tools that basically replace the vanilla discovery tools in my workflow.
The approach is similar to yours: I have an overview tool which runs different centrality-ranking metrics over the whole codebase to find the most important symbols and presents them as an architectural overview to the LLM.
Then I have a "get-symbol-context" tool which allows the AI to get all the information that the AST holds about a single symbol, including a parameter to include source code which completely replaces grepping and file reading for me.
The tool also specifies which other symbols call the one in question and which others it calls, respectively.
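As a toy illustration of the centrality idea (the extension's actual metrics aren't specified here; this just ranks symbols by in- plus out-degree over a call graph):

```rust
use std::collections::HashMap;

// Rank symbols by degree centrality over a call graph:
// calls[caller] = list of callees.
fn rank_by_degree(calls: &HashMap<String, Vec<String>>) -> Vec<(String, usize)> {
    let mut degree: HashMap<String, usize> = HashMap::new();
    for (caller, callees) in calls {
        *degree.entry(caller.clone()).or_default() += callees.len(); // out-degree
        for callee in callees {
            *degree.entry(callee.clone()).or_default() += 1; // in-degree
        }
    }
    let mut ranked: Vec<_> = degree.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1)); // most-connected symbols first
    ranked
}
```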
But yeah, sorry for what is already a quite long comment. If you want to give it a try, I published it on the VSCode marketplace a couple of days ago, and it's basically free right now, although I have to admit that I'd still like to earn a little bit of money with it at some point.
Right now, the daily usage limit is 2000 tool calls per day, which should be enough for anybody.
Would love to hear what you think :)
<https://marketplace.visualstudio.com/items?itemName=LuGoSoft...>
I looked at your solution and extension README, and it's very interesting and well thought out.
The fact that you've been using it for six months and that it performs well says a lot. At the end of the day, that's what counts.
I like your idea of piggybacking on top of the LSP services, and I can imagine that this was quite a bit of work. Doing it as an MCP server makes it usable across different tools.
I also really like the name "Context Master."
In my case, it's much more niche since it's for the tool I built. Though it's open source, the key difference is that the "indexing" is only agentic at this point.
I can see value in mixing the two. LSP integration scares me because of the amount of work involved, and tree-sitter seems like a good path.
In that case, in the code map, for each item, there could be both the LLM response info and some deterministic info, for example, from tree-sitter.
That being said, the current approach works so well that I think I am going to keep using and fine-tuning it for a while, and bring in deterministic context only when or if I need it.
Anyway, what you built looks great, and if it works, that's what counts.
Thanks for taking the time to check it out and for the kind words! I really appreciate it.
I totally get sticking with your current approach. Your workflow sounds very intriguing as well. A combination of both approaches might really be very interesting :) Adding an LLM interpretation layer on top of my graph is also something I'm actively considering.
Thanks for the great discussion, and best of luck with your tool and workflow!
This is really interesting; I've done very high-level code maps, but mapping the entire project seems wild. It works?
So, the small model figures out which files to use based on the code map, and then enriches with snippets, so the big model ideally gets preloaded with relevant context / snippets up front?
Where does the code map live? Is it one big file?
So, I have a pro@coder/.cache/code-map/context-code-map.json.
I also have a .tmpl-code-map.jsonl in the same folder so all of my tasks can add to it, and then it gets merged into context-code-map.json.
I keep mtime, but I also compute a blake3 hash, so if mtime does not match, but it is just a "git restore," I do not redo the code map for that file. So it is very incremental.
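A minimal sketch of that staleness check, assuming the blake3 crate (the cache types here are illustrative, not the tool's actual ones):

```rust
use std::{fs, path::Path, time::SystemTime};

// Illustrative cache entry: what we remembered about a file last time.
struct CachedEntry {
    mtime: SystemTime,
    blake3_hex: String,
}

/// Re-map only if the content actually changed.
fn needs_remap(path: &Path, cached: &CachedEntry) -> std::io::Result<bool> {
    let mtime = fs::metadata(path)?.modified()?;
    if mtime == cached.mtime {
        return Ok(false); // fast path: mtime untouched
    }
    // mtime differs (e.g. after a `git restore`): fall back to a content hash.
    let bytes = fs::read(path)?;
    let hex = blake3::hash(&bytes).to_hex().to_string();
    Ok(hex != cached.blake3_hex)
}
```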
Then the trick is, when sending the code map to AI, I serialize it in a nice, simple markdown format.
- path/to/file.rs
- summary: ...
- when to use: ...
- public types: .., .., ..
- public functions: .., .., ..
- ...
So the AI does not have to interpret JSON, just clean, structured markdown.
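Serializing that is trivial; here is a sketch, reusing the hypothetical FileMapEntry shape from the earlier sketch:

```rust
// Render one code-map entry as the simple markdown layout shown above.
fn to_markdown(e: &FileMapEntry) -> String {
    format!(
        "- {}\n  - summary: {}\n  - when to use: {}\n  - public types: {}\n  - public functions: {}\n",
        e.path,
        e.summary,
        e.when_to_use,
        e.public_types.join(", "),
        e.public_functions.join(", "),
    )
}
```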
Funny, I worked on this addition to my tool for a week, planning everything, but even today, I am surprised by how well it works.
I have zero sed/grep in my workflow. Just this.
My prompt is pro@coder/coder-prompt.md: the first part is YAML for the globs, and the second part is my prompt.
There is a TUI, but all input and output are files, and the TUI is just there to run it and see the status.
I think you've hit on the key point here: keeping things scoped to a sufficiently focused area gives better results, without necessarily needing more context.
> - a code map strategy on a big repo. Per file: summary, when_to_use, public_types, public_functions. This is done per file and saved until the file changes. With a concurrency of 32, I can usually code-map a huge repo in minutes. (Typically Flash, cheap, fast, and with very good results)
Thanks, but why use any AI to generate this? I would say: you document your functions in code, and types are provided by the compiler service, so it should all be deterministically available in seconds instead of minutes, without burning tokens. Am I missing something?
1) Deterministic
- Using a tree-sitter/AST-like approach, I could extract types, functions, and perhaps comments, and put them into an index map.
- Cons:
- The tricky part of this approach is that what I extract can be pretty large per file, for example, comments.
- Then, I would probably need an agentic synthesis step for those comments anyway.
2) Agentic
- Since Flash is dirt cheap, I wanted to experiment and skip #1, and go directly to #2.
- Because my tool is built for concurrency, when set to 32, it's super fast.
- The price is relatively low, perhaps $1 or $2 for 50k LOC, and 60 to 90 seconds of wall-clock time, which is about 30 to 45 minutes of aggregate AI work.
- What I get back is relatively consistent by file, size-wise, and it's just one trip per file.
So, this is why I started with #2.
And then, the results in real coding scenarios have been astonishing.
Way above what I expected.
The way those indexes get combined with the user prompt gets the right files 95% of the time, and with surprisingly high quality.
So, I might add deterministic aspects to it, but since I think I will need the agentic step anyway, I have deprioritized it.
Well, out of all the workflows I have seen, this one is rather nice; I might give it a try.
I imagine that if the context were committed and kept up to date with CI, it would work for others to use as well.
However, I'm a little confused on the auto-context/globs narrowing part. Do you, the developer, provide them? Or do you feed the full code map plus your prompt to Flash, so it returns the globs based on your prompt?
Also, in general, is your map of a file relatively smaller than the file itself, even for very small files?
- The ..-code-map.json files are per "developer folder," which would create too many conflicts if they were kept in Git.
- I have two main globs, which are lists of globs: knowledge_globs and context_globs. Knowledge can be absolute and should be relatively static. context_globs have to be relative to the workspace, since they are the working files.
- As a dev, you provide them in the top YAML section of the coder-prompt.md (see the sketch after this list).
- The auto-context sub-agent calls the code-map sub-agent. Sub-agents can add to or narrow the given globs, and that is the goal of the auto-context agent.
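A hypothetical sketch of what that YAML section could look like. The key names come from this thread; the exact syntax and paths are assumptions:

```yaml
# Top of coder-prompt.md; the prose prompt follows below this block.
knowledge_globs:
  - "~/rust10x/**/*.md"   # can be absolute; relatively static
context_globs:
  - "src/**/*.rs"         # relative to the workspace: the working files
  - "Cargo.toml"
```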
It looks complicated, but it actually works like a charm.
Hopefully, I answered some of your questions.
I need to make a video about it.
But regardless, I really think it's not about the tools, it's about the techniques. This is where the true value is.
Your code map compresses signal on the context side. Same principle applies on the prompt side: prompts that front-load specifics (file, error, expected behavior) resolve in 1-2 turns. Vague ones spiral into 5-6. 1M context doesn't change that — it just gives you more room for the spiral.
This is interesting but don't you worry that you're competing with entire companies (e.g. Anthropic) and thus it's a losing battle? Since you're re-implementing a bunch of stuff they either do in their harness or have decided it was better not to do?
I think it's worth remembering that any offering like that necessarily needs to be ~one-size-fits-all, while what you come up with... doesn't.
They're solving a different problem than you. So I think it's very plausible that you could come up with something that, for your use case, performs considerably better than their "defaults".
Personally I don't see aipack's pro@coder and the other approaches (Claude Code, Cursor, Copilot, etc.) as competitors anymore. I use both approaches to solve different problems. I keep using the agentic solutions (Claude Code style) for more operational tasks, a bit like "smart interfaces to the terminal", and pro@coder for coding / engineering tasks where I need much tighter control over long-running work sessions.
And that does not prevent mixing and matching the two, as some comments in this thread suggest.
Anyway, it's a great time for production coding.
This is fascinating. I feel like this is converging into the concept of a traditional "IDE". So much of your setup reminds me of IDEs indexing, doing static analysis, building ASTs, etc. before a developer starts writing code.
> Standard pricing now applies across the full 1M window for both models, with no long-context premium. Media limits expand to 600 images or PDF pages.
For Claude Code users this is huge - assuming coherence remains strong past 200k tokens.
No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ
Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.
I'd been on Codex for a while and with Codex 5.2 I:
1) No longer found the dumb zone
2) No longer feared compaction
Having switched to Opus for stupid political reasons, I still have not hit the dumb zone - but I'm back to disliking compaction events, so its smaller context window has really hurt.
I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.
His fix for "the dumb zone" is the RPI Framework:
● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.
● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.
● IMPLEMENT. Execute in a fresh context window.
The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.
For me, it's less about being able to look back ~800k tokens. It's about being able to flow a conversation for a lot longer without forcing compaction. Generally, I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.
I tried asking questions about Path of Exile 2, and even with web research on, it gave completely wrong information... Not just outdated. Wrong.
I think context decay is a bigger problem than we realize.
When running long autonomous tasks it is quite frequent to fill the context, even several times. You are out of the loop so it just happens if Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping a long context > small context + 2 compacts.
It's kind of like having a 16 gallon gas tank in your car versus a 4 gallon tank. You don't need the bigger one the majority of the time, but the range anxiety that comes with the smaller one and annoyance when you DO need it is very real.
Since I've yet to seriously dive into vibe coding or AI-assisted coding: does the IDE experience track a running tally of the context size, so you know when you're getting close to or entering the "dumb zone"?
I never use these giant context windows. It is pointless. Agents are great at super focused work that is easy to re-do. Not sure what is the use case for giant context windows.
I've used it many times for long-running investigations. When I'm deep in the weeds with a ton of disassembly listings and memory dumps and such, I don't really want to interrupt all of that with a compaction or handoff cycle and risk losing important info. It seems to remain very capable with large contexts at least in that scenario.
I've been using the 1M window at work through our enterprise plan as I'm beginning to adopt AI in my development workflow (via Cline). It seems to have been holding up pretty well until about 700k+. Sometimes it would continue to do okay past that, sometimes it started getting a bit dumb around there.
(Note that I'm using it in more of a hands-on pair-programming mode, and not in a fully-automated vibecoding mode.)
Normally, buying the bigger plan gives some sort of discount.
With Claude, it's just "5 times more usage, 5 times more cost, there you go".
The quality with the 1M window has been very poor for me, specifically for coding tasks. It constantly forgets stuff that has happened in the existing conversation. n=1, ymmv
Well, the question is what is contributing to the usage, because as the context grows, the number of input tokens increases. A model call with 800k tokens as input is 8 times more expensive than a model call with 100k tokens as input. Especially if you resume a conversation and the cache doesn't hit, it gets very expensive at API pricing.
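As a back-of-envelope illustration (the $5-per-million-input-tokens rate is hypothetical; real prices vary by model and by caching):

```rust
// Linear input-token pricing at an assumed $5 per million input tokens.
fn input_cost_usd(input_tokens: u64) -> f64 {
    input_tokens as f64 * 5.0 / 1_000_000.0
}

fn main() {
    println!("{:.2}", input_cost_usd(100_000)); // 0.50
    println!("{:.2}", input_cost_usd(800_000)); // 4.00 -- 8x per call
}
```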
It's interesting, because my career went from a higher-level language (Python) to lower-level languages (C++ and C). Opus and the like are amazing at Python, honestly sometimes better than me, but they do make some really stupid architectural decisions occasionally. But when it comes to embedded stuff, it's still like a junior engineer. Unsure if that will ever change, but I wonder if it's just the quality and availability of training data. This is why I find it hard to believe LLMs will replace hardware engineers anytime soon (I was a MechE for a decade).
Is there a writeup anywhere on what this means for effective context? I think that many of us have found that even when the context window was 100k tokens the actual usable window was smaller than that. As you got closer to 100k performance degraded substantially. I'm assuming that is still true but what does the curve look like?
Claude Code 2.1.75 now no longer distinguishes between base Opus and 1M Opus: it's the same model. Oddly, I'm on Pro, where the change is supposedly only for Max+, but I'm still seeing this to be the case.
EDIT: Don't think Pro has access to it, a typical prompt just hit the context limit.
The removal of extra pricing beyond 200k tokens may be Anthropic's salvo in the agent wars against GPT 5.4's 1M window and extra pricing for that.
Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter.
I start with a PRD, ask for a step-by-step plan, and just execute one step at a time. Sometimes its ideas are dumb, but checking and guiding step by step helps it ship working things in hours.
It was also the first AI I felt, "Damn, this thing is smarter than me."
The other crazy thing is that with today's tech, these things can be made to work at 1k tokens/sec with multiple agents working at the same time, each at that speed.
This is super exciting. I've been poking at it today, and it definitely changes my workflow -- I feel like a full three or four hour parallel coding session with subagents is now generally fitting into a single master session.
The stats claim Opus at 1M is about on par with 5.4 at 256k -- these needle-in-a-haystack long-context tests don't always track quality reasoning ability, sadly -- but this is still a significant improvement, and I haven't seen dramatic falloff in my tests, unlike q4 '25 models.
p.s. what's up with sonnet 4.5 getting comparatively better as context got longer?
This is incredible. I just blew through $200 last night in a few hours on 1M context. This is like the best news I've heard all year in regards to my business.
What is OpenAI's response to this? Do they even have a 1M context window, or is it still opaque and "depends on the time of day"?
I'm very happy about this change. For long sessions with Claude it was always like a punch to the gut when a compaction came along. Codex/GPT-5.4 is better with compactions so I switched to that to avoid the pain of the model suddenly forgetting key aspects of the work and making the same dumb errors all over again. I'm excited to return to Claude as my daily driver!
This is amazing. I have to test it with my reverse engineering workflow. I don't know how many people use CC for RE but it is really good at it.
Also, it is really good at writing SketchUp plugins in Ruby. It one-shots plugins that are in some ways better than the commercial ones you can buy online.
CC will change the development landscape so much in the next year. It is exciting and terrifying at the same time.
Do long sessions also burn through token budgets much faster?
If the chat client is resending the whole conversation each turn, then once you're deep into a session every request already includes tens of thousands of tokens of prior context. So a message at 70k tokens into a conversation is much "heavier" than one at 2k (at least in terms of input tokens). Yes?
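To put numbers on that (toy figures: ~2k new tokens per turn, full history resent each turn):

```rust
// Cumulative input tokens over a session where the whole history is
// resent on every turn: grows quadratically with the number of turns.
fn cumulative_input_tokens(turns: u64, per_turn: u64) -> u64 {
    (1..=turns).map(|t| t * per_turn).sum()
}

fn main() {
    // 35 turns x 2k new tokens each is only ~70k of new text,
    // but ~1.26M input tokens billed across the whole session.
    println!("{}", cumulative_input_tokens(35, 2_000));
}
```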
As soon as I saw the announcement, I tried again and created a working design skill that can produce design artifacts following the brand guidelines.
While these improvements seem incremental, they have a compounding effect on usefulness.
My AI doomsday calculator just got decremented by another 6 months.
I've been using Claude Code directly on my production servers to debug complex I/O bottlenecks and database locks. The ability of the latest models to hold the entire project context while suggesting real-time fixes is a game changer for solo founders. It helped me stabilize a security tool I'm building when other agents kept hallucinating.
All while their usage limits are so excessively shitty that I paid them $50 just two days back because I ran out of usage, and they still blocked me from using it during a critical work week (and did not refund my $50 despite my emails and requests, routing me to a s*ty AI bot). Anyway, I am using Copilot and OpenCode a lot more these days, which is much better.
Compared to yesterday, my Claude Max subscription burns usage like absolutely crazy (13% of weekly usage from a fresh reset today, with just a handful of prompts on two new C++ projects, no deps) and has become unbearably slow (as in 1hr for a prompt response). GGWP Anthropic, it was great while it lasted, but this isn't worth the hundreds of dollars.
Finally, I don't have to constantly reload my Extra Usage balance when I already pay $200/mo for their most expensive plan. I can't believe they even did that. I couldn't use 1M context at all because I already pay $200/mo and it was going to ask me for even more.
Next step should be to allow fast mode to draw from the $200/mo usage balance. Again, I pay $200/mo, I should at least be able to send a single message without being asked to cough up more. (One message in fast mode costs a few dollars each) One would think $200/mo would give me any measure of ability to use their more expensive capabilities but it seems it's bucketed to only the capabilities that are offered to even free users.
I've been avoiding context beyond 100k tokens in general. The performance is simply terrible. There's no training data for a megabyte of your very particular context.
If you are really interested in deep NIAH tasks, external symbolic recursion and self-similar prompts+tools are a much bigger unlock than more context window. Recursion and (most) tools tend to be fairly deterministic processes.
I generally prohibit tool calling in the first stack frame of complex agents in order to preserve the context window for the overall task and human interaction. Most of the nasty token consumption happens in brief, nested conversations that pass summaries back up the call stack.
Do long context windows make much sense then, or is this just a way of getting people to use more tokens?
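A sketch of that pattern (illustrative only; run_subtask is a hypothetical stand-in for a real model-plus-tools loop):

```rust
// Nested sub-conversations do the tool-heavy work; only their compact
// summaries enter the top-level frame's context window.
fn run_subtask(task: &str) -> String {
    // ...nested conversation with tool calls would happen here...
    format!("summary of `{task}` (tokens spent out of frame)")
}

fn top_level(tasks: &[&str]) -> Vec<String> {
    // The first stack frame never calls tools directly.
    tasks.iter().map(|t| run_subtask(t)).collect()
}

fn main() {
    for summary in top_level(&["profile I/O", "inspect DB locks"]) {
        println!("{summary}");
    }
}
```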
My main frustration with long-context coding sessions isn't just the limit itself; it's that after the fact it's hard to tell which turns actually caused the context to bloat or the session to go off track. It's painful enough that I had to build a tool to help myself understand the correlation between context and turn data. I have to compact manually now.
Could be pure coincidence, but my Claude Code session last night was an absolute nightmare. It kept forgetting things it had done earlier in the session and why it had done them, messed up a git merge so badly that it lost the CLAUDE.md file along with a lot of other stuff, and then started running commands on the host machine instead of inside the container because it no longer had a CLAUDE.md to tell it not to. Last night was the first time I've ever sworn at it.
What about response coherence with longer context? Usually in other models with such big windows I see the quality to rapidly drop as it gets past a certain point.
Can someone help me with insights about large context models? Are there relationships that pop up at the beginning and end of long context windows that don't transitively follow from intermediate points? Is there value in the training over these longer windows vs using the more basic/closer weight distributions over different sliding windows?
I'm fairly sure that your best throughput is single-prompt single-shot runs with Claude (and that means no plan, no swarms, etc) -- just with a high degree of work in parallel.
So for me this is a pretty huge change, as the ceiling on a single prompt just jumped considerably. I'm replaying some of my less effective prompts today to see the impact.
(And, yeah, I'm all Claude Code these days...)
Awesome.... With Sonnet 4.5, I had Cline soft trigger compaction at 400k (it wandered off into the weeds at 500k). But the stability of the 4.6 models is notable. I still think it pays to structure systems to be comprehensible in smaller contexts (smaller files, concise plans), but this is great.
The stuff I built with Opus 4.6 in the past 2.5 weeks:
Full clone of Panel de Pon/Tetris attack with full P2P rollback online multiplayer:
https://panel-panic.com
An emulator of the MOS 6502 CPU with visual display of the voltage going into the DIP package of the physical CPU:
https://larsdu.github.io/Dippy6502/
I'm impressed as fuck, but a part of me deep down knows that I know fuck all about the 6502 or its assembly language and architecture, and now I'll probably never be motivated to do this project in a way that would've taught me all the things I wanted to learn.
> Standard pricing now applies across the full 1M window for both models, with no long-context premium.
Does that mean it's likely not a Transformer with quadratic attention, but some other kind of architecture, with linear time complexity in sequence length? That would be pretty interesting.
I am currently mass-translating millions of records with short descriptions. Somehow, tokens are consumed extremely fast. I have 3 Max memberships, and all 3 of them hit the 5-hour limit in about 5 to 10 minutes. Still don't understand why this is happening.