Good company-ready RAG benefits a lot from basic pre-processing and labeling of the data, rather than just dumping unstructured data into a vector database and calling it a day. Different heuristics and different schemas for the embedded data go a long way toward ensuring quality and flexibility of querying.
Then you can do ReAG, which lets you reason intelligently on top of the top-K results.
Things like memory and knowledge-graph services can also help reduce your search space and provide extra context that gets updated over time, beyond just treating static docs as sources of truth. You can give the system more context on how it should weigh older docs against newer ones, and let users (based on whether answers turned out correct or not) help audit what is embedded in your RAG system.
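To make that concrete, here is a minimal sketch of what labeled ingestion can look like with chromadb. The field names ("doc_date", "status", etc.) are just illustrative, not a fixed schema:

    import chromadb

    client = chromadb.Client()
    collection = client.create_collection(name="company_docs")

    # Label each chunk at ingest time instead of dumping raw text.
    # "doc_date" and "status" are hypothetical fields: pick whatever
    # your review/audit workflow actually needs.
    collection.add(
        ids=["policy-2024-01-c3"],
        documents=["Remote work requests are approved by the direct manager."],
        metadatas=[{
            "source": "hr/policies/remote-work.pdf",
            "doc_date": "2024-01-15",
            "status": "verified",  # flipped by user feedback over time
        }],
    )

    # Retrieval can then prefer newer, human-verified material.
    results = collection.query(
        query_texts=["who approves remote work?"],
        n_results=5,
        where={"status": "verified"},
    )

The point is less the specific fields than having some structured hook that retrieval and auditing can act on later.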
I appreciate the thorough write-up, but doing RAG systems seriously requires much more than just embeddings and a basic chromadb setup.
Happy to share any thoughts here or on a call if anyone wants to chat.
I agree. I attempted a similar project a year ago and the retrieval part is absolutely critical. To work even half decently you need a serious strategy for metadata, chunking, etc. For example, how do you deal with time-series data? I'm not looking for just any quarterly numbers, but the ones from Q2 2025, or the research report from four weeks ago. And how do you deal with images? We had heaps of company knowledge in pptx, which you can convert to text, but what about the pictures in the presentations? Our analyst presentations sometimes consist mostly of charts and visuals; how are those embedded?
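The time-series part, at least, has a workable pattern: stamp the period onto each chunk at ingest and filter on it at query time. A rough sketch against a chromadb collection, assuming hypothetical fiscal_quarter and ingested_at metadata fields were set at ingest:

    import time

    # "collection" is a chromadb collection whose chunks carry these
    # (hypothetical) metadata fields.
    q2_2025 = collection.query(
        query_texts=["quarterly revenue numbers"],
        n_results=10,
        where={"fiscal_quarter": "2025-Q2"},  # only Q2 2025 material
    )

    four_weeks_ago = time.time() - 28 * 24 * 3600
    recent_reports = collection.query(
        query_texts=["analyst research report"],
        n_results=10,
        where={"ingested_at": {"$gte": four_weeks_ago}},  # numeric epoch field
    )

For the charts, a common (imperfect) workaround is to have a vision model caption each image and embed the caption together with a pointer back to the original slide.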
Also, IMO, 90% of the time companies don't need a RAG system but a good search/retrieval system.
This article is interesting because of its scale, but it doesn't touch on how to properly apply RAG best practices. We wrote up this blog post on how to actually build a smart enterprise AI RAG based on the latest research, if it's of interest to anyone: https://bytevagabond.com/post/how-to-build-enterprise-ai-rag...
It's based on chunking strategies that scale cheaply, combined with advanced retrieval.
Some have been saying that RAG is obsolete, and that the context window of a modern LLM is adequate (preferable, even). The example I recently read was that contexts are now large enough to hold the entire "The Lord of the Rings" books.
That may be, but then there's an entire law library, or the entirety of Wikipedia (or the 451 GB example in this article). Surely those are at least an order of magnitude larger than Tolkien's prose and might still benefit from RAG.
Is there a 'sqlite equivalent' for RAG? e.g. something I could give Claude w/o a backend and say use command X to add a document, command Y to search, all in a flat file?
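Not that I know of as a single canonical tool, but the shape of it is small enough to sketch: SQLite as the flat file, embeddings stored as JSON blobs, brute-force cosine similarity for search (fine up to tens of thousands of chunks). The two functions below are the hypothetical "command X" and "command Y":

    import json
    import sqlite3

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any local embedder works
    db = sqlite3.connect("rag.db")  # the entire store is this one file
    db.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT, emb TEXT)")

    def add_document(text: str) -> None:  # "command X"
        emb = model.encode(text).tolist()
        db.execute("INSERT INTO docs (text, emb) VALUES (?, ?)", (text, json.dumps(emb)))
        db.commit()

    def search(query: str, k: int = 5) -> list[str]:  # "command Y"
        q = model.encode(query)
        q = q / np.linalg.norm(q)
        scored = []
        for text, emb_json in db.execute("SELECT text, emb FROM docs"):
            v = np.array(json.loads(emb_json))
            scored.append((float(q @ (v / np.linalg.norm(v))), text))
        return [t for _, t in sorted(scored, key=lambda p: p[0], reverse=True)[:k]]

There's also the sqlite-vec extension if you'd rather push the vector search into SQL itself.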
What ended up being the main bottleneck in your pipeline—embedding throughput, cost, or something else? Did you explore parallelizing vectorization (e.g., multiple workers) or did that not help much in practice?
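Not the author, but for anyone hitting the same wall: embedding APIs are usually latency-bound from the client side, so a thread pool over batches tends to help more than anything else. A sketch; the batch size and worker count are guesses to tune against your rate limits:

    from concurrent.futures import ThreadPoolExecutor

    from openai import OpenAI

    client = OpenAI()
    BATCH = 64    # texts per request; tune against token limits
    WORKERS = 8   # concurrent requests; mind your rate limits

    def embed_batch(batch: list[str]) -> list[list[float]]:
        resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
        return [d.embedding for d in resp.data]

    def embed_all(texts: list[str]) -> list[list[float]]:
        batches = [texts[i:i + BATCH] for i in range(0, len(texts), BATCH)]
        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
            # pool.map preserves input order, so results line up with texts
            return [emb for batch in pool.map(embed_batch, batches) for emb in batch]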
I'd argue the author missed a trick here by using a fancy embedding model without any re-ranking. One of the benefits of a re-ranker (or even a series of re-rankers!) is that you can embed your documents using a really small and cheap model (this also often means smaller embeddings).
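For anyone who hasn't tried it, the retrieve-then-rerank pattern is only a few lines. A sketch with sentence-transformers; the model name is just one example of a small cross-encoder:

    from sentence_transformers import CrossEncoder

    # Stage 1 (elsewhere): a small, cheap embedding model pulls ~100 candidates.
    # Stage 2: the cross-encoder scores each (query, chunk) pair jointly, which
    # is far more accurate than embedding distance alone but too slow to run
    # over the whole corpus. That asymmetry is the whole trick.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def rerank(query: str, candidates: list[str], k: int = 10) -> list[str]:
        scores = reranker.predict([(query, c) for c in candidates])
        ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
        return [c for _, c in ranked[:k]]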
Why did you opt for semantic search, and not plain old full text search? I built an "AI Agent for a Commerce Website" as a take-home exercise yesterday, and I chose to simply give the model a tool that does a full text search over products, powered by MiniSearch, and I think it works reasonably well. I believe this is also what Claude Code does.
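MiniSearch is client-side JavaScript, but the same idea is a few lines server-side too. A sketch of a full-text search tool backed by SQLite's FTS5, the kind of function you'd expose to the model as a tool (the schema is made up):

    import sqlite3

    db = sqlite3.connect("products.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS products USING fts5(name, description)")
    db.execute("INSERT INTO products VALUES ('trail runner', 'lightweight waterproof running shoe')")

    def search_products(query: str, limit: int = 10) -> list[tuple]:
        # BM25-ranked full-text search; this function is the whole "tool"
        return db.execute(
            "SELECT name, description FROM products WHERE products MATCH ? "
            "ORDER BY rank LIMIT ?", (query, limit)
        ).fetchall()

    print(search_products("waterproof shoe"))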
After a couple of years of proving out product on multi-modal LLMs, I now consider RAG to be essentially "AI Lite", or just AI-inspired vector search.
It isn't really "AI" in the way ongoing LLM conversations are. The context is effectively controlled by deterministic information, and as LLMs continue to improve through various context-related techniques like re-prompting, running multiple models, etc., that deterministic "re-basing" of context will stifle the output.
So I say over time it will be treated as less and less "AI" and more "AI adjacent".
The significance is that right now RAG is largely considered an "AI pipeline strategy" in its own right, compared to others that involve pure context engineering.
But when the context size of LLMs grows much larger (with integrity), when a model can, say, hold thousands and thousands of lines of code in context accurately, without having to use RAG to search and find, it will do a lot more for us. We will get the agentic automation they are promising and not yet delivering (due to this current limitation).
This article came just in the nick of time. I'm in fandoms that lean heavily into fanfiction, and there's a LOT out there on Ao3. Ao3 has the worst search (and you can't even search your account's history!), so I've been wanting to create something like this as a tool for the fandom, where we can query "what was the fic about XYZ where ABC happened?" and get hopefully helpful responses. I'm very tired of not being able to do this, and it would be a fun learning experience.
I've already got the data mostly structured because I did some research on the fandom last year, charting trends and such, so I don't even need to massage the data. I've got authors, dates, chapters, reader comments, and full text already in a local SQLite db.
Reading this blog post scared me a bit. The use case I proposed was building a "simple" RAG chatbot over some Confluence docs (~50 and somewhat growing) covering Elasticsearch and another process my team handles. I was just planning on using a stack like Streamlit, text-embedding-3-small, and FAISS for the vector store, all driven by a Python script.
It didn't seem too expensive or too hard for the handful of queries my team would run, and it was a "low hanging fruit" pain point that I thought a RAG chatbot could improve. That's on top of the fact that Atlassian Rovo did a poor job of staying away from external sources when the answer was already in our internal docs. Am I still on the right path?
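For what it's worth, at ~50 docs that stack really is only a screenful of code, so it doesn't sound unreasonable. A sketch of the core (chunking and the chat turn omitted; the query string is invented):

    import faiss
    import numpy as np
    from openai import OpenAI

    client = OpenAI()

    def embed(texts: list[str]) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data], dtype="float32")

    chunks = ["Rotate the ES certs via the renew-certs job.",
              "On-call escalation goes through the team channel first."]
    vecs = embed(chunks)
    faiss.normalize_L2(vecs)                  # so inner product == cosine
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)

    q = embed(["how do we rotate the elasticsearch certs?"])
    faiss.normalize_L2(q)
    scores, ids = index.search(q, 3)
    context = [chunks[i] for i in ids[0]]     # goes into the chat prompt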
Great write-up. Thank you! I’m contemplating a similar RAG architecture for my engineering firm, but we’re dealing with roughly 20x the data volume (estimating around 9TB of project files, specs, and PDFs).
I've been reading about Google's new STATIC framework (sparse-matrix constrained decoding) and am really curious about the shift toward generative retrieval, which promises massive speedups well beyond this approach.
For those who have scaled RAG into the multi-terabyte range: is it actually worth exploring generative retrieval approaches like STATIC to bypass standard dense vector search, or is a traditional sharded vector DB (Milvus, Pinecone, etc.) still the most practical path at this scale?
I would guess the ingestion pain is still the same.
Nice write-up. I'm curious why you went with chromadb and not pgvector. I haven't built a RAG system myself, but I've always understood the initial doc parsing to be a major challenge on its own, so kudos there!
Additionally, I thought it was customary to store a pointer to the source in the same row as the vector (i.e. vector + doc path + page/paragraph/etc.), or to just store the original text chunk (though based on your disk requirements, it doesn't sound like that would have been feasible).
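In pgvector that pattern is especially natural, since the pointer is just more columns on the row. A sketch with psycopg; the table layout and paths are illustrative:

    import psycopg

    with psycopg.connect("dbname=rag") as conn:
        conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                id        bigserial PRIMARY KEY,
                embedding vector(1536),   -- text-embedding-3-small dimension
                doc_path  text,           -- pointer back to the source...
                page      int,            -- ...down to the page
                chunk     text            -- or keep the original text inline
            )
        """)
        emb = "[" + ",".join("0.1" for _ in range(1536)) + "]"  # stand-in vector
        conn.execute(
            "INSERT INTO chunks (embedding, doc_path, page, chunk) VALUES (%s::vector, %s, %s, %s)",
            (emb, "specs/widget-a.pdf", 12, "The housing tolerance is 0.05mm."),
        )
        # Nearest neighbors by cosine distance, pointers come back for free.
        rows = conn.execute(
            "SELECT doc_path, page, chunk FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
            (emb,),
        ).fetchall()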
Glad you’re having good results! Maybe you’ve inspired me to finally try out a similar setup myself!
What would it look like to react to source-data changes on an ongoing basis? That seems like a big missing piece. Event-based? A regular cadence? Curious what people choose. Great post, though.
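One common answer, whichever trigger you pick: hash each source at ingest, store the hash next to its chunk IDs, and on every event or scheduled run re-embed only what changed. A sketch of the diffing core; reindex and remove_from_index are hypothetical helpers for whatever vector store you use:

    import hashlib

    def content_hash(text: str) -> str:
        return hashlib.sha256(text.encode()).hexdigest()

    def sync(sources: dict[str, str], seen: dict[str, str]) -> None:
        """sources: doc_id -> current text. seen: doc_id -> hash at last index."""
        for doc_id, text in sources.items():
            h = content_hash(text)
            if seen.get(doc_id) != h:
                reindex(doc_id, text)      # hypothetical: drop old chunks, re-embed
                seen[doc_id] = h
        for doc_id in set(seen) - set(sources):
            remove_from_index(doc_id)      # hypothetical: source deleted upstream
            seen.pop(doc_id)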
Cool work! I'd be so interested in what would happen if you put the data, plus the plan and features you wanted, into a Claude Code instance and let it go. You clearly thought this through carefully, but those models now also go really far and deep, and I'd love to see what one comes up with. For that kind of data, running it on something like a Mac mini (no, not with OpenClaw) would be a fascinating test of how fast and far you can get.
So 95% of the post is "regular software engineering": yes, you cannot just process 1TB of data in one go, you need to split it up; even then you might have a limited budget for processing, so think about how to fit within it; make checkpoints; and make sure you have logs.
Not dismissing the value of the blog post, just underlining this for "non-engineers".
I built something similar in my own project. The hardest part has been choosing the right approach to chunking long documents. I used both structural and semantic chunking; the semantic approach led to better vectors in the vector DB. I used Qdrant and the OpenAI embedding model.
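In case it helps anyone else, the semantic variant can be approximated by embedding sentences and cutting wherever similarity between neighbors drops. A rough sketch; the threshold is a knob to tune and the model is just an example:

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def semantic_chunks(sentences: list[str], threshold: float = 0.6) -> list[str]:
        if not sentences:
            return []
        embs = model.encode(sentences, normalize_embeddings=True)
        chunks, current = [], [sentences[0]]
        for i in range(1, len(sentences)):
            # normalized embeddings, so the dot product is cosine similarity
            if float(embs[i] @ embs[i - 1]) < threshold:  # topic shift: new chunk
                chunks.append(" ".join(current))
                current = []
            current.append(sentences[i])
        chunks.append(" ".join(current))
        return chunks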
Thanks for an interesting read! Are you monitoring usage, and what kind of user feedback have you received? I'm always curious whether these projects end up getting used, because even with perfect tech, if the data is low quality, nobody is going to bother.
Unfortunately, many people are looking for a fire-and-forget solution on top of an existing rat's nest of documentation debt.
> Happy to share any thoughts here
Please do.
I love those site features!
In a submission from a few days ago there was something similar.
I love it when a website gives a nod to the old web :)
https://github.com/dprkh/fufus/
This new world is astounding.
Also I wonder if it's now better to use Claude Agent SDK instead of RAG. If anyone has tried this, I would be interested in hearing more.
Did you look at Turbopuffer btw?