I often use LLMs to explore prior art and maybe find some alternative ways of thinking about problems. About 90% of what it tells me is useless or inapplicable to my domain due to a technicality it could not have known, but the other 10% is nice and has helped me learn some great new things.
I can’t imagine letting an agent try everything that the LLM chatbot had recommended ($$$). Its recommendations often include very poorly maintained / niche libraries that have quite a lot of content written about them but, as far as I can tell, very limited use in real production environments.
On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can occupy those consultants and let us do our work in peace.
I think the main value lies in allowing the agent to try many things while you aren't working (when you are sleeping or doing other activities), so even if many tests are not useful, with many trials it can find something nice without any effort on your part.
This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.
Even if your tests take a long time, you can always (if hardware permits) run multiple tests in parallel. This would enable you to explore many approaches at the same time.
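For example, if each test is a standalone script, a few lines are enough to launch several runs at once. A minimal sketch, assuming a `train.py --config` entry point; the config names are hypothetical:

```python
# Minimal sketch of running several experiments in parallel (hypothetical config names).
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_test(config: str) -> int:
    # Each experiment writes its own logs/results; here we only keep the exit code.
    return subprocess.run(["python", "train.py", "--config", config]).returncode

configs = ["baseline.yaml", "wider_model.yaml", "cosine_lr.yaml"]
with ThreadPoolExecutor(max_workers=3) as pool:  # bounded by whatever your hardware allows
    results = dict(zip(configs, pool.map(run_test, configs)))
print(results)
```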
Experiments for us cost on the order of tens of dollars, so doing 100 of them every night quickly becomes the price of an entire new employee. And that’s not even including the cost of letting agents run all night.
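Back-of-the-envelope with assumed numbers (yours will differ):

```python
# Rough cost arithmetic with assumed figures; plug in your own.
cost_per_experiment = 20          # dollars, "on the order of tens of dollars"
per_night = 100
working_nights_per_year = 250
print(cost_per_experiment * per_night)                            # 2,000 dollars per night
print(cost_per_experiment * per_night * working_nights_per_year)  # 500,000 dollars per year
```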
Definitely not in the budget for non-VC-backed companies who aren’t in the AI bubble.
The "price of an entire new employee" framing is spot on. I kept running into the same thing: individual experiments are cheap, but they add up fast, and nobody wants to approve that budget for speculative ideas.
I've been thinking of this as a gap between VC/Kickstarter and just doing it yourself. Most early ML experiments are too small for formal funding but too expensive to casually self-fund. So I built ML Patron where anyone can chip in a few bucks to sponsor an experiment they're curious about. I honestly don't have a good answer yet for how this turns into returns for sponsors in a traditional business sense. For now it's just open research patronage, like "I'd pay to know the answer to this". Platform runs it on cloud GPUs with public MLflow tracking. Still very early: https://news.ycombinator.com/item?id=47563959.
I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember or things where even being flat out wrong is okay and you just do it yourself.
For all the folks spending a lot of time and energy setting up MCP servers, AGENTS.md, etc.: I think this shows that the LLM cannot do what AI boosters are selling it as, and that it needs extreme amounts of guidance to reach a desired goal, if it can reach it at all. This is not an argument that the tech has no value. It clearly can be useful in certain situations, but this is not what OpenAI/Anthropic/Perplexity are selling, and I don’t think the actual use cases have a sustainable business model.
People who spend the energy to tailor LLMs to their specific workflows and get them to be successful: amazing. Does this scale? What’s going to happen if you don’t have massive amounts of money subsidizing the training and infrastructure? What’s the actual value proposition without all this money propping it up?
> I find LLMs useful in regurgitating one-liners
This was the case for me a year ago. Now Claude or Codex are routinely delivering finished & tested complete features in my projects. I move much, much faster than before and I don’t have an elaborate setup - just a single CLAUDE.md file with some basic information about the project and that’s it.
People keep saying this and I agree Claude has gotten a lot better even in my own experience, but I think the value is questionable.
What’s the point of adding features that are inscrutable? I have gotten Claude to make a feature that mostly works, and when it doesn’t work quite right I spend a massive amount of time trying to understand what is going on. For things that don’t matter too much, like prototyping, it’s great to be able to get a working demo out faster, but it’s kind of terrifying when people start doing this for production stuff, especially if their domain knowledge is limited. I can personally attest to seeing multiple insane things that were clearly vibe coded by people who don’t understand what they’re doing. In one case, I saw API keys exposed because they were treating database users as regular user accounts for website login auth.
> I move much, much faster than before
This is a bad metric, as has been attested multiple times in unrelated situations. Moving faster is not necessarily productivity, nor is it value.
That was equally true of human-written code that you didn’t write. So if a human had written that insecure program, what would the consequences be? Would they go to prison? Would they lose their license to practice? Would they get sued? If the answer to all of these is no, then where was the assurance before? These anecdotes of “well, one time I saw an AI-written program that sucked!” are just as valid as “well, one time Azure exposed government user data.”
This matches my experience. I've been building structured pipelines around LLMs, and the biggest lesson is that the raw model is maybe 30% of the value. The other 70% is the methodology you wrap around it: what data you feed in before the conversation starts, what you do when the model gives a weak answer, and whether you track open questions and circle back to them.
The irony is that "extreme amounts of guidance" is exactly what makes a human domain expert valuable, too. A senior consultant isn't smarter than a junior one; they have a better methodology for directing attention to what matters.
The actual problem with the "just throw an agent at it" approach isn't cost. It's that without structure, you can't tell the 10% of useful output from the 90% of noise.
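Concretely, that structure can be as boring as a thin wrapper that curates context up front, rejects weak answers, and tracks what is still open. A minimal sketch with hypothetical names; `call_llm` stands in for whatever client you already use:

```python
# Minimal sketch of a structured wrapper around an LLM (hypothetical names throughout).
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # placeholder for your actual model call

@dataclass
class Session:
    context: str                                         # curated data fed in up front
    open_questions: list = field(default_factory=list)   # tracked so you can circle back

def ask(session: Session, question: str, min_len: int = 200) -> str:
    answer = call_llm(session.context + "\n\nQ: " + question)
    if len(answer) < min_len:  # crude "weak answer" check; real checks are domain-specific
        session.open_questions.append(question)  # park it instead of accepting a bad answer
        return ""
    return answer
```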
Maybe you can preselect good ideas, build up guidelines describing the most common pitfalls, extrapolate from ideas you've already vetted, etc., and run on autopilot on a safe-ish subset.
This is so funny. The consultants are having their AI agents tell your boss the same thing about you, but you're different, you're bright. I bet chat told you that too.
> “The agent acted like a hyperparameter optimization algorithm with some basic reasoning baked in.”
Good lens.
The crux of the auto research repo is basically one file - program.md which is a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record result. Favor simplicity”. The other files are an arbitrary ML model that is being trained.
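As a rough sketch of what that loop amounts to in code (a hypothetical driver, not the actual repo; the agent call is a stub):

```python
# Hypothetical driver for the "improve -> train -> eval -> record" loop described above.
# agent_improve is a stub for whatever edits train.py (e.g. a Claude Code invocation).
import json, subprocess

def agent_improve(path: str, history: list) -> None:
    pass  # placeholder: ask the agent for one simple change to `path`, given past results

history = []
for step in range(20):                                  # bounded, not open-ended
    agent_improve("train.py", history)
    subprocess.run(["python", "train.py"], check=True)  # run the training
    with open("metrics.json") as f:                     # assume evals write this file
        metrics = json.load(f)
    history.append({"step": step, "metrics": metrics})  # record result for the next iteration
```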
Ok, so looking at the commit log[1], I was mostly interested in seeing what the "moonshot ideas" implementations looked like, but basically everything is just hyperparameter tuning. Which is nice, but likely not worth the $$$ spent on the tokens. Am I missing something here?
[1] https://github.com/ykumards/eCLIP/commits/main/autoresearch
> The original paper used several medical X-ray datasets which I don’t have access to anymore, so I needed a new dataset with spatial annotations to test the expert attention mechanism. I picked the Ukiyo-eVG dataset: ~11K Japanese woodblock prints
This feels less like automated research and more like structured trial and error with a decent feedback loop. Still useful, but I think the real bottleneck is how good your eval metric is. If that’s weak, the whole loop just optimizes for the wrong thing faster.
Does autoresearch work for projects that are not LLM-based? E.g., in Karpathy's example he is optimizing nanoGPT. What if I wanted to improve a UNet for image segmentation?
> Then I lock down Claude Code’s permissions to only edit these two files and run run.sh. No direct Python execution, no pip installs, no network access, no git push, etc.
How does one run Claude Code without network access?
The scratchpad.md for agent working memory is a nice touch. Having a persistent record of what was tried and why matters more than most people realize when debugging automated experiment loops.
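Something as simple as appending structured entries goes a long way. A sketch of the idea: only the scratchpad.md name comes from the post, the entry format and values are made up:

```python
# Sketch of persistent agent working memory: append one record per attempt (format made up).
import json, time

def log_attempt(description: str, metrics: dict, path: str = "scratchpad.md") -> None:
    """Append one experiment record so the next agent turn (or a human) can see
    what was already tried and how it turned out."""
    with open(path, "a") as f:
        f.write(f"\n## {time.strftime('%Y-%m-%d %H:%M')} - {description}\n")
        f.write(f"result: {json.dumps(metrics)}\n")

log_attempt("clamp logit temperature", {"val_loss": 0.412})  # made-up example values
```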
The temperature clamp fix and "Optuna++" actions by the agents (the cause of basically all of the improvement to eCLIP) indicate they are good at finding bugs and at hyperparameter tuning. But when it comes to anything beyond that, such as novel architectural shifts, agents aren't good enough. With no clear path forward they tend to randomly change things, which is a poor approach. Agents: optimization >> innovation.
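For scale, the tuning part is roughly what a few lines of Optuna already do. A minimal sketch for comparison; the objective and search ranges here are placeholders, not the eCLIP code:

```python
# Minimal Optuna-style sweep for comparison (hypothetical objective and ranges).
import optuna

def train_and_eval(**hparams) -> float:
    raise NotImplementedError  # your actual training + eval loop goes here

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)        # log-uniform learning rate
    temp = trial.suggest_float("logit_temperature", 0.05, 5.0)  # assumed search range
    return train_and_eval(lr=lr, logit_temperature=temp)        # metric to maximize

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```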
With all the posts lately about Karpathy's autoresearch, it remains unclear to me whether the name is meant to convey that this LLM codebase should be useful for research across all domains - molecular biology, aircraft control, sociology, WW2 history, etc. - or whether it is intended only to discover new LLM capabilities.
Pretty cool experiment - I'd thought about someone maybe doing this and I'm happy you did it this way. Nice writeup too. This made me giggle a bit:
"At one point it got tired of waiting for training to finish and just ended the conversation. I wouldn’t give it full autonomy just yet :)"
thanks for sharing your results and the road to them!
It's better to outsource the optimization phases; our ideas should go into the constraints, the assumptions, etc. - that's where the breakthroughs are. Boyd often argues that once you can express a problem in a standard mathematical form, the implementation becomes a commodity that software can handle automatically.
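A least-squares toy problem in CVXPY illustrates the point (made-up data; any standard-form problem would do):

```python
# Toy example of Boyd's point: state the problem in standard form, and the solve step is a commodity.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)  # made-up data

x = cp.Variable(10)
problem = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b)), [x >= 0, cp.sum(x) <= 1])
problem.solve()  # a generic solver handles the implementation automatically
print(problem.value, x.value)
```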
> single test can take half a day
Why is that?
I don't doubt you, but when Shigeo Shingo created SMED (Single-Minute Exchange of Die), die changes were an hours-long process.
So this may be only temporarily unavailable for many.
> I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember
I found LLMs make a fabulous frontend for git :-D
> agent try everything that the LLM chatbot had recommended ($$$)
A lot depends on whether it is expensive to you. I use Claude Code for the smallest of whims and rarely run out of tokens on my Max plan.
This has been the standard approach for more complex LLM deployments for a while now in our shop.
Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.
The bottleneck in AI/ML/DL is always data (volume & quality) or compute.
Does/can Autoresearch help improve large-scale datasets? Is it more compute-efficient than humans?
> The original paper used several medical X-ray datasets which I don’t have access to anymore, so I needed a new dataset with spatial annotations to test the expert attention mechanism. I picked the Ukiyo-eVG dataset: ~11K Japanese woodblock prints
That's such a weird switch. There's lots of free medical imaging online. Example: https://www.cancerimagingarchive.net/
> Like with any LLM project, the first 90% of the work was super smooth and barely needed my intervention. The last 10% was a slog.
The author doesn't really describe which part was a slog; I thought autoresearch was supposed to be pretty much set-and-forget.
I started looking at Kaggle again, and autoresearch seems to converge on the same vibe as many of the solutions there.
Wild ensembles, squeezing a bit of loss out. More engineering than research, IMO.