I’ve got a project right now with a separate vector DB, Elasticsearch, and a graph store, all for an agent system.
When you say Antfly combines all three, what does that actually look like at query time? Can I write one query that does semantic similarity + full-text + graph traversal together, or is it more like three separate indexes that happen to live in the same binary?
Does it ship with a CLI that's actually good? I’m pivoting away from MCP. Like, can I pipe stuff in, run queries, and manage indexes from the terminal without needing to write a client? That matters more to me than the MCP server, honestly.
And re: Termite + single binary, is the idea that I can just run antfly swarm, throw docs and images at it, and have a working local RAG setup with no API keys? If so, that might save me a lot of docker-compose work.
Who's actually running this distributed vs. single-node? Curious what the typical user experience looks like.
Exactly the use case I built it for!
I wanted a world where you could build your indexes and the query planner would just be smart enough to use them in a single query. I've not quite nailed down the agentic query planner 100% (it's getting there), but the JSON query DSL lets you pipeline, join, and fuse full-text, semantic, graph, reranking, and pruning (score/token pruning) all in one query.
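For a rough flavor, a pipelined query looks something like this (an illustrative sketch only; the stage and field names here are simplified placeholders, not the exact DSL):

```json
{
  "pipeline": [
    { "fulltext": { "field": "body", "query": "leader election" } },
    { "semantic": { "field": "body", "query": "how do nodes pick a leader?", "k": 50 } },
    { "fuse":     { "method": "rrf" } },
    { "graph":    { "traverse": "cites", "depth": 2 } },
    { "rerank":   { "model": "termite-reranker" } },
    { "prune":    { "type": "token", "budget": 4096 } }
  ]
}
```

Each stage feeds the next, so fusion, reranking, and pruning all see the merged candidate set rather than three separate result lists.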
The CLI is my primary development tool for antfly, and I'm definitely looking for feedback on what people would like to see there. It's a little chonky with the flags; --pruner, for example, requires writing the JSON for the config because I didn't want users to have to memorize 1,000 subflags. It's definitely a first-class citizen.
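A hypothetical terminal session, to give a sense of the shape (only antfly swarm and --pruner appear elsewhere in this thread; the other subcommands and flags are placeholders):

```sh
# Hypothetical session: only `antfly swarm` and `--pruner` are from the
# thread; the other subcommands/flags are placeholders for illustration.
antfly swarm &                                # start a local single-binary node

# pipe documents in from the shell (placeholder subcommand)
cat docs/*.md | antfly ingest --index notes

# run a query, passing pruner config as JSON rather than many subflags
antfly query --index notes \
  --pruner '{"type": "token", "budget": 4096}' \
  "how does leader election work"
```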
With respect to "Termite + single binary", that's exactly right. Termite handles chunking, multimodal chunking, embeddings (sparse + dense), reranking, and fused chunking/embedding models, and we're excited to be adding support for a variety of ONNX-based LLMs/NER models for data-extraction use cases (FunctionGemma, GLiNER2, etc.) so you don't have to set up 10 different services for testing vs. deployment.
We run Antfly ourselves, in a distributed setup, for our Algolia-style search product at https://platform.searchaf.com (cheeky: search AntFly), and some users run Antfly single-node on large instances (more at Postgres-sized datasets with millions of documents vs. large multitenant deploys). But we really wanted to build something with a more seamless experience going back and forth between distributed and single-node than Elasticsearch or Postgres can offer.
Hope that helps! Let me know if I can help you with anything!
A quick note on platform.searchaf.com: account creation hits a snag; the verify-email links that arrive by email give a 404. Hope it helps.
On a parallel note, it would be nice to put an architecture diagram in the GitHub repo.
Are there particular aspects of the current implementation which you want to actively improve/rearchitect/change?
I agree with the goals set out for the project and can testify that Elasticsearch's DX is pretty annoying.
Having said that, distributed indexing with pluggable custom ingestion/query indexes would be a good goal to aim for, e.g.:
- Finite State Transducer (FST) or finite-state-automaton based memory-efficient indexes for specific MIME types
- adding hashing-based semantic search indexes (e.g., locality-sensitive hashing; see the sketch after this list)
And even being able to swap the indexer/reranker implementations would help make things super hackable.
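As a tiny, generic sketch of the hashing idea (sign-random-projection LSH, nothing Antfly-specific):

```go
package main

import (
	"fmt"
	"math/rand"
)

// lshSignature hashes a dense vector into one sign bit per random hyperplane:
// vectors pointing in similar directions tend to share bits, so Hamming
// distance between signatures approximates cosine distance.
func lshSignature(vec []float64, planes [][]float64) uint64 {
	var sig uint64
	for i, p := range planes {
		var dot float64
		for j := range vec {
			dot += vec[j] * p[j]
		}
		if dot >= 0 {
			sig |= 1 << uint(i)
		}
	}
	return sig
}

func main() {
	rng := rand.New(rand.NewSource(1))
	dim, nBits := 8, 16
	planes := make([][]float64, nBits) // random Gaussian hyperplanes
	for i := range planes {
		planes[i] = make([]float64, dim)
		for j := range planes[i] {
			planes[i][j] = rng.NormFloat64()
		}
	}
	v := []float64{0.1, -0.3, 0.7, 0.2, -0.5, 0.4, 0.0, 0.9}
	fmt.Printf("%016b\n", lshSignature(v, planes))
}
```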
Oh, thanks for flagging the 404 on the verify link (I abstracted out the OIDC auth for cross-domain login and must have missed a path).
Yes, good call. I started on that on the website with a React Flow-based architecture flowchart, but it's a bit high-level and not consumable directly in GitHub markdown files. I'll work on that!
That's exactly the direction I've been working in: the rerankers, embedders, and chunkers are all pluggable, and the schema design (using JSON Schema for our "schema-ish" approach) allows fine-grained index-backend hints for individual data types. I'll work on getting a good architecture doc up today and tomorrow!
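As a sketch of the idea, per-field backend hints can ride along in the schema like this (the x-antfly-index keyword below is a placeholder, not the real extension name):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "title": { "type": "string", "x-antfly-index": { "kind": "fulltext" } },
    "body":  { "type": "string", "x-antfly-index": { "kind": "fulltext+dense" } },
    "cites": {
      "type": "array",
      "items": { "type": "string" },
      "x-antfly-index": { "kind": "graph-edge" }
    }
  }
}
```

JSON Schema tolerates unknown keywords as annotations, which is what makes this style of per-type hint possible without breaking validation.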
The Termite bundling is the most interesting part. Packaging embedding and reranking inference alongside the database means no separate model server to manage and no network hop for every vector op.
Curious about resource contention though: if a heavy indexing job saturates Termite, does that affect query latency on the Raft side? And how does Termite handle model cold starts in single-process mode?
On the license: the ELv2 framing is honest and the "can't offer as managed service" carve-out is pretty standard at this point. Won't bother most people reading this.
Regarding contention, the answer definitely depends on how you host. We've had a lot of experience running different ML workloads, and from an SRE perspective we knew you'd need a variety of styles of hosting the models depending on the read/write patterns of your usage. Termite and the proxy service/operator allow for all of them: either preloading and compiling models to prevent cold starts, or lazy-loading to protect memory, with different pooling and caching strategies for bundling multiple models in the same Termite container.
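To make that concrete, a loading config could distinguish the two styles roughly like this (every key below is a placeholder illustrating preload-and-compile vs. lazy-load, not the real config surface):

```json
{
  "termite": {
    "models": {
      "dense-embedder": { "load": "preload", "compile": true },
      "reranker": { "load": "lazy", "pool": { "max_instances": 2, "idle_ttl": "5m" } }
    }
  }
}
```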
If a heavy indexing job is running on a CPU-only single-node deployment, it won't be using Raft anyway (no replication). If it's running with a GPU, the models don't meaningfully share resources with the DB. And if it's running distributed, contention really isn't an issue either.
Let us know if you have any other questions!
As a longtime Raft user (via hashicorp/raft), I'm curious about your Raft implementation! You mention etcd's Raft library, but it isn't natively Multi-Raft, is it? Is your implementation similar to https://tikv.org/deep-dive/scalability/multi-raft/ ? I'd love to hear about your experience implementing and testing it!
The comment on the Pause method indicates that it waits for in-flight Batch operations (by obtaining the lock), but Batch doesn’t appear to hold the lock during the batch operation. Am I missing something?
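For reference, here's a minimal sketch of the pattern the doc comment seems to promise (hypothetical, not the actual Antfly code):

```go
package store

import "sync"

// Sketch of the expected pattern: Batch holds a read lock for its whole
// duration, so Pause, by taking the write lock, blocks until every
// in-flight batch has drained.
type store struct {
	mu sync.RWMutex
}

func (s *store) Batch(apply func() error) error {
	s.mu.RLock() // held for the entire batch, not just the setup
	defer s.mu.RUnlock()
	return apply()
}

func (s *store) Pause(while func()) {
	s.mu.Lock() // blocks until all in-flight Batch calls release their RLock
	defer s.mu.Unlock()
	while()
}
```

If Batch only takes the lock while enqueueing and releases it before the writes land, Pause would indeed return early, which is what the question is pointing at.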
Of course the two most visionary people I worked with at Lytics went and built this. Just in time... this is the vector database I actually need. Termite is the killer feature for me, native ML inference in a single binary means I can stop duct-taping together embedding APIs for my projects. Excited to spend the upcoming weekends hacking on the Antfly ecosystem.
I've seen this on a few products, and it doesn't quite click with me how people are using it.
For fun, I'm building hybrid search too, and I'd love to see how you merge the two lists (semantic and keyword) and rerank by the combined importance scores.
Did you build this for yourself?
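For context, a common way to merge a semantic list and a keyword list is reciprocal rank fusion; whether Antfly uses RRF specifically isn't stated here, but a minimal version looks like this (generic sketch, not Antfly's implementation):

```go
package main

import (
	"fmt"
	"sort"
)

// rrf merges ranked result lists with reciprocal rank fusion:
// score(doc) = sum over lists of 1/(k + rank), with k commonly set to 60.
// Docs that rank well in both lists float to the top without needing to
// normalize the two scoring scales against each other.
func rrf(k float64, lists ...[]string) []string {
	scores := map[string]float64{}
	for _, list := range lists {
		for rank, doc := range list {
			scores[doc] += 1.0 / (k + float64(rank+1))
		}
	}
	merged := make([]string, 0, len(scores))
	for doc := range scores {
		merged = append(merged, doc)
	}
	sort.Slice(merged, func(i, j int) bool { return scores[merged[i]] > scores[merged[j]] })
	return merged
}

func main() {
	semantic := []string{"doc3", "doc1", "doc7"} // ranked by vector similarity
	keyword := []string{"doc1", "doc9", "doc3"}  // ranked by keyword relevance
	fmt.Println(rrf(60, semantic, keyword))      // [doc1 doc3 doc9 doc7]
}
```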