> Typed I/O for every LLM call. Use Pydantic. Define what goes in and out.
Sure, not related to DSPy though, and completely table stakes. Also not sure why the whole article assumes the only language in the world is Python.
> Separate prompts from code. Forces you to think about prompts as distinct things.
There's really no reason prompts must live in a file with a .md or .json or .txt extension rather than .py/.ts/.go/.., except if you indeed work at a company that decided it's a good idea to let random people change prod runtime behavior. If someone can think of a scenario where this is actually a good idea, feel free to enlighten me. I don't see how it's any more advisable than editing code in prod while it's running.
> Composable units. Every LLM call should be testable, mockable, chainable.
> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.
And LiteLLM or ai (Vercel), the actually most used packages, aren't? You're comparing downloads with Langchain, probably the worst package to gain popularity of the last decade. It was just first to market, then after a short while most realized it's horrifically architected, and now it's just coasting on former name recognition while everyone who needs to get shit done uses something lighter like the above two.
> Eval infrastructure early. Day one. How will you know if a change helped?
Sure, to an extent. Outside of programming, most things where LLMs deliver actual value are very nondeterministic with no right answer. That's exactly what they offer. Plenty of which an LLM can't judge the quality of. Having basic evals is useful, but you can quickly run into their development taking more time than it's worth.
But above all.. the comments on this post immediately make clear that the biggest differentiator of DSPy is the prompt optimization. Yet this article doesn't mention that at all? Weird.
>the whole article assumes the only language in the world is Python.
This was my take as well.
My company recently started using Dspy, but you know what? We had to stand up an entire new repo in Python for it, because the vast majority of our code is not Python.
I think this is an important point! I am actually a big fan of doing what works in the language(s) you're already using.
For example: I don't use DSPy at work! And I'm working in a primarily dotnet stack, so we definitely don't use DSPy... But still, I see the same patterns seeping through that I think are important to understand.
And then there's a question of "how do we implement these patterns idiomatically and ergonomically in our codebase/language?"
Out of curiosity, what are you finding success with in dotnet land? My observation is that it's not clear when Semantic Kernel is recommended versus one of multiple other MSFT newly-branded creations
I tried DSPy once, I was very sold on it, and tried to do prompt optimization. I spent a few tens of dollars running evals, but, every single time, there would be no improvement. I ran the evals multiple times (I don't remember the details now, sadly, it was last year), but there was no change.
I'm sure I was holding it wrong, but it's not great that it's so easy to hold wrong.
> Sure, not related to DSPy though, and completely table stakes.
I agree but you'd be surprised at how many people will argue against static typing with a straight face. It's happened to me on at least three occasions that I can count and each time the usual suspects were trotted out: "it's quicker", "you should have tests to validate anyhow", "YOLO polymorphism is amazing", "Google writes Python so it's OK", etc.
It must be cultural, as it always seems to be a specific subset of Python and ECMAScript devs making these arguments. I'm glad that type hints and TypeScript are gaining traction, as I fall firmly on the other side of this debate. The proliferation of LLM coding workflows has likely accelerated adoption, since types provide such valuable local context to the models.
I think all of these things are table-stakes; yet I see that they are implemented/supported poorly across many companies. All I'm saying is there are some patterns here that are important, and it makes sense to enter into building AI systems understanding them (whether or not you use Dspy) :)
In my experience the behavior variation between models and providers is different enough that the "one-line swap" idea is only true for the simplest cases. I agree the prompt lifecycle is the same as code though. The compromise I'm at currently is to use text templates checked in with the rest of the code (Handlebars but it doesn't really matter) and enforce some structure with a wrapper that takes as inputs the template name + context data + output schema + target model, and internally papers over the behavioral differences I'm ok with ignoring.
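For concreteness, a wrapper along those lines might look roughly like the sketch below. Everything in it is an assumption for illustration: `string.Template` stands in for Handlebars, the two `fake_model_*` functions stand in for real provider SDK calls, and `run_prompt`, `TEMPLATES`, `MODELS`, and `CompanyResult` are hypothetical names, not anything from an actual codebase.

```python
import json
from dataclasses import dataclass, fields
from string import Template
from typing import Callable, Dict, Type, TypeVar

T = TypeVar("T")

# Hypothetical template registry. In a real repo these would be template
# files checked in next to the code; string.Template stands in for Handlebars.
TEMPLATES = {
    "extract_company": Template(
        "Extract the company name from: $text\n"
        'Reply with JSON like {"company": "..."}'
    ),
}


# Stand-ins for real provider calls (assumption: each takes a prompt string
# and returns raw text). They differ in how cleanly they return JSON.
def fake_model_a(prompt: str) -> str:
    return '{"company": "Acme Corp"}'


def fake_model_b(prompt: str) -> str:
    return 'Sure! Here is the JSON: {"company": "Acme Corp"} Hope that helps.'


MODELS: Dict[str, Callable[[str], str]] = {
    "model-a": fake_model_a,
    "model-b": fake_model_b,
}


@dataclass
class CompanyResult:
    company: str


def run_prompt(template: str, context: dict, schema: Type[T], model: str) -> T:
    """Render a named template, call the target model, validate the output."""
    prompt = TEMPLATES[template].substitute(context)
    raw = MODELS[model](prompt)
    # Paper over one behavioral difference: keep only the outermost JSON object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError(f"{model} returned no JSON object: {raw!r}")
    data = json.loads(raw[start : end + 1])
    # Enforce the output schema: exactly the declared fields, nothing else.
    expected = {f.name for f in fields(schema)}
    if set(data) != expected:
        raise ValueError(f"expected fields {expected}, got {set(data)}")
    return schema(**data)


result = run_prompt(
    "extract_company",
    {"text": "Acme Corp raised a round"},
    CompanyResult,
    "model-b",
)
print(result)
```

The caller only changes the `model` argument to switch targets. The differences that matter (here, chatty output around the JSON) are handled in one place, and anything the wrapper does not recognize fails loudly instead of being silently accepted.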
> Also not sure why the whole article assumes the only language in the world is Python.
If your product is a python tool, then it makes perfect sense. It's a clever (ad|content marketing) for DSPy, and it worked, it was on the front page for a while (some hours?).
I think the real problem with using DSPy is that many of the problems people are trying to solve with LLMs (agents, chat) don't have an obvious path to evaluate. You have to really think carefully on how to build up a training and evaluation dataset that you can throw to DSPy to get it to optimize.
This takes a ton of upfront work and careful thinking. As soon as you move the goalposts of what you're trying to achieve you also have to update the training and evaluation dataset to cover that new use case.
This can actually get in the way of moving fast. Often teams are not trying to optimize their prompts but even trying to figure out what the set of questions and right answers should be!
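To make the upfront work concrete, here's a minimal sketch of what even the smallest evaluation set plus metric looks like. The labeled examples, the exact-match metric, and the two stand-in pipelines (`baseline`, `candidate`) are all hypothetical; a real setup would call an LLM in place of the stand-ins.

```python
from typing import Callable, List, Tuple

# Hypothetical hand-labeled examples: (input text, expected company name).
EXAMPLES: List[Tuple[str, str]] = [
    ("Acme Corp raised a Series B.", "Acme Corp"),
    ("Shares of Globex fell 3% today.", "Globex"),
    ("Initech announced layoffs.", "Initech"),
]


def exact_match_score(
    predict: Callable[[str], str], examples: List[Tuple[str, str]]
) -> float:
    """Fraction of examples where the prediction equals the label exactly."""
    hits = sum(1 for text, expected in examples if predict(text).strip() == expected)
    return hits / len(examples)


# Stand-in for "version 1" of a pipeline: just take the first word.
def baseline(text: str) -> str:
    return text.split()[0]


# Stand-in for "version 2": skip a known lead-in, then take the first run
# of capitalized words.
def candidate(text: str) -> str:
    words = text.split()
    while words and words[0] in {"Shares", "of"}:
        words.pop(0)
    run = []
    for word in words:
        if not word[0].isupper():
            break
        run.append(word)
    return " ".join(run)


print("baseline :", exact_match_score(baseline, EXAMPLES))
print("candidate:", exact_match_score(candidate, EXAMPLES))
```

Moving the goalposts (say, wanting ticker symbols instead of company names) means relabeling every row in `EXAMPLES` and possibly replacing the metric, which is the maintenance cost described above.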
The article starts with the comparison of DSPy and LangChain monthly downloads and then wastes time comparing DSPy to hand-rolling basic infra, which is quite trivial in every barely mature setup.
I conjecture that the core value proposition of DSPy is its optimizer? Yet the article doesn't really touch it in any important way. How does it work? How would I integrate it into my production? Is it even worth it for usual use-cases? Adding a retry is not a problem, creating and maintaining an AI control plane is. LangChain provides services for observability, online and offline evaluation, prompt engineering, deployment, you name it.
I tried it in the past, one time “in earnest.” But when I discovered that none of my actual optimized prompts were extractable, I got cold feet and went a different route. The idea of needing to fully commit to a framework scares me. The idea of having a computer optimize a prompt as a compilation step makes a lot of sense, but treating the underlying output prompt as an opaque blob doesn’t. Some of my use cases were just far enough off the beaten path that DSPy was confusing, which didn’t help. And lastly, I felt like committing to DSPy meant that I would be shutting the door on any other framework or tool or prompting approach down the road.
I think I might have just misunderstood how to use it.
The fact that you have to bundle input+output signatures and everything is dynamically typed (sometimes into the args) just makes it annoying to use in codebases that have type annotations everywhere.
Plus their out-of-the-box agent loop has been a joke for the longest time, and writing your own is feasible, but it's night and day compared to getting something done with pydantic-ai.
Too bad because it has a lot of nice things, I wish it were more popular.
https://www.tensorzero.com/docs has similar abstractions but doesn't require Python and doesn't require committing to the framework or a language. It's also pretty hard to onboard, but solves the same problems better and makes evaluating changes to models / prompts much easier to reason about.
We build a product that's somewhat similar in spirit to DSPy, but people come to us for different reasons than the OP listed here.
1) It's slow: you first have to get acquainted with DSPy and then get hand-labeled data for prompt optimization. This can be a slow process, so it's important to label only the cases that are ambiguous, not the obvious ones.
2) They know that manual prompt engineering is brittle, and want a prompt that's optimized and robust against a model they're invoking, which DSPy offers. However, it's really the optimizer (ex. GEPA) doing the heavy-lifting.
3) They don't actually want a model or prompt at all. They want a task completed, reliably, and they want that task to not regress in performance. Ideally, the task keeps improving in production.
Curious if folks in this thread feel more of these pains than the ones in the article.
This article seemingly misses any explanation of what DSPy even is or why it's supposedly so complicated and unfamiliar. Supposedly it solves the problems illustrated in the article, but it isn't explained how.
I used DSPy in production, then reverted the bloat, as it literally gave me nothing of added value in practice but a lot of friction when I needed precise control over the context. Avoid!
DSPy is cool from an integrated perspective but as someone who extensively develops agents, there have been two phases to the workflow that prevented me from adopting it:
1. Up until about six months ago, modifying prompts by hand and incorporating terminology with very specific intent and observing edge cases and essentially directing the LLM in a direction to the intended outcome was somewhat meticulous and also somewhat tricky. This is what the industry was commonly referring to as prompt engineering.
2. With the current state of SOTA models like Opus 4.6, the agent that is developing my applications alongside of me often has a more intelligent and/or generalized view of the system that we're creating.
We've reached a point in the industry where smaller models can accomplish tasks that were reserved for only the largest models. And now that we use the most intelligent models to create those systems, the feedback loop which was patterned by DSPy has essentially become adopted as part of my development workflow.
I can write an agent and a prompt as a first pass using an agentic coder, and then based on the observation of the performance of the agent by my agentic coder, continue to iterate on my prompts until I arrive at satisfactory results. This is further supported by all of the documentation, specifications, data structures, and other I/O aspects of the application that the agent integrates in which the coding agent can take into account when constructing and evaluating agentic systems.
So DSPy was certainly onto something, but the level of abstraction, at least in my personal use case, has moved up a layer instead of being integrated into the actual system.
I think the entire premise that the prompting is the surface area for optimizing the application is fundamentally the wrong framing, in the same way that in 1998 better cpam will save CGI. It's solving the wrong problems now, and the limitations in context and model intelligence require a tool like Dspy.
The only thing I'd grab DSPy for at this point is to automate the edges of the agentic pipeline that could be improved with RL patterns. But if that is true, you're really shorting yourself by giving your domain to DSPy. You should be building your own RL loops.
My experience: If you find yourself reaching for a tool like Dspy, you might be sitting on a scenario where reinforcement learning approaches would help even further up the stack than your prompts, and you're probably missing where the real optimization win is. (Think bigger)
Good article, and I think the "evolution of every AI system" is spot on.
In my opinion, the reason people don't use DSPy is because DSPy aims to be a machine learning platform. And like the article says -- this feels different or hard to people who are not used to engineering with probabilistic outputs. But these days, many more people are programming with probability machines than ever before.
The absolute biggest time sink and 'here be dragons' of using LLMs is poke and hope prompt "engineering" without proper evaluation metrics.
> You don’t have to use DSPy. But you should build like someone who understands why it exists.
And this is the salient point, and I think it's very well stated. It's not about the framework per se, but about the methodology.
About one and a half years ago I was an early adopter of DSPy and I had better results (compared to LlamaIndex) with structuring unstructured data just by putting it in DSPy models, before any optimization step whatsoever.
Also, IMO DSPy didn't take off because it requires preparing train and test datasets, and that takes time and effort. Now with GEPA I expect things are getting very interesting: the optimizations can come just from descriptions.
IMO LangGraph is currently used a lot as an agent and RAG framework. DSPy doesn't have the same use case, even though there's overlap. And I think the monthly numbers don't do it justice, because what I see now is a lot of companies doing things wrongly.
The main reason for me is that it's layer upon layer on top of the base LLM calls with not much to show for it. Also, a lot of native features (for example, Gemini's native structured responses) aren't well supported.
Well-written article. Does a great job walking through why any robust system will need what DSPy provides. Though there are many libraries and frameworks that will provide the basics, RAG, exponential back-off, etc.
DSPy's real value is in its prompt optimization framework, which was barely mentioned. And this has requirements like datasets and specific tasks, which not every project has. This is probably the main reason for its smaller and happier user base than projects like LangChain.
This matches my experience with Dspy. I ended up removing it from our production codebase because, at the time, it didn't quite work as effectively as just using Pydantic and so forth.
The real killer feature is the prompt compilation; it's also the hardest to get to an effective place and I frequently found myself needing more control over the context than it would allow. This was a while ago, so things may have improved. But good evals are hard and the really fancy algorithms will burn a lot of tokens to optimize your prompts.
If you find yourself adding a database because that's less painful than regular deployments from your version control, something is hair-on-fire levels of wrong with your CI/CD setup.
I don't get it. All these are provided by many different agent libs like langgraph, Pydantic AI etc. I thought DSPy was for prompt optimization but I could never wrap my head around that aspect since like Langchain, DSPy seems to hide stuff a bit too much.
So this article seems surprising since it emphasizes more the non prompt optimization aspects. If that was the selling point I would rather use something like Pydantic AI when I already use Pydantic for so much of the rest.
Almost all the points are not about what DSPy is mainly supposed to offer.
What it's supposedly great at is automatic optimization; for everything else... who the hell puts Python in production just to make some API calls?
There are "frameworks" available in all the better languages, but the constructs behind are not that complicated. And why does DSPy even try to compete with LangChain/Graph/crap?
>"Stage 2: “Can we tweak the prompt without deploying?”
Are we playing philosophy here? If you move some part of the code out of the repo and into a database, then changing that database is still part of the deployment, but now you've just given your versioning an identity crisis. Just put your prompts in your git repo and say no when someone requests that an anti-pattern be implemented.
I love DSPy! But yes, its curse is that it’s native to LLMs and makes the python side awkward and weird. I’m willing to make this prediction: It’s clearly built from first principles and the ideas from it will outlast the framework itself.
If [programming_language] is so great, why isn't anyone using it?
For many of the same reasons. A plethora of alternatives, personal preference, weird ideology, appropriateness for the task, inertia, not-invented-here. The list goes on.
I really enjoyed this blog format. I think it explained the problem well in a way that made it immediately clear why the solution solved the problem when shown DSPy.
> not sure why the whole article assumes the only language in the world is Python
https://github.com/ax-llm/ax (if you're in the typescript world)
I'm curious what other practitioners are doing.
> f"Extract the company name from: {text}"
I think one thing that's lost in all of the LLM tooling is that it's LLM-or-nothing and people have lost knowledge of other ML approaches that actually work just fine, like entity recognition.
I understand it's easier to just throw every problem at an LLM but there are things where off-the-shelf ML/NLP products work just as well without the latency or expense.
I think a problem for DSPy is that they don't know the concept of THE WHOLE PRODUCT: https://en.wikipedia.org/wiki/Whole_product
Look at https://mastra.ai/ and https://www.copilotkit.ai/ to see how much more inviting their pages look. A company is not selling only the product itself but everything around the product: THE WHOLE PRODUCT.
A similar concept in developer tools is "the docs are the product".
Also, I'm a full-stack JavaScript engineer and I don't use Python. Docs usually have a switch for the language at the top. Stripe is famous for its docs and developer experience: https://docs.stripe.com/search#examples
It's great to study other great products to get inspiration and copy the best traits that are relevant to your product as well.
Til about GEPA: https://github.com/gepa-ai/gepa
Stranger still: it seems like every company I have worked with ends up building a half-baked version of Dspy.
Edit: read the article, it's really good. That cycle of AI engineering progression is spot on. Read the article too!
useful for upcoming consultants to learn how to price services too.