Author here. A few people are arguing against a stronger claim than the repo is meant to make. Also, this was very much intended as a joke, not research-level commentary.
This skill is not intended to reduce hidden reasoning / thinking tokens. Anthropic’s own docs suggest more thinking budget can improve performance, so I would not claim otherwise.
What it targets is the visible completion: less preamble, less filler, less polished-but-nonessential text. And since it's the post-completion output that gets “cavemanned”, the code itself isn't affected by the skill at all :)
Also surprising to hear so little faith in RL. Quite sure that the models from Anthropic have been so heavily tuned to be coding agents that you cannot “force” a model to degrade immensely.
The fair criticism is that my “~75%” README number is from preliminary testing, not a rigorous benchmark. That should be phrased more carefully, and I’m working on a proper eval now.
Also yes, skills are not free: Anthropic notes they consume context when loaded, even if only skill metadata is preloaded initially.
So the real eval is end-to-end:
- total input tokens
- total output tokens
- latency
- quality/task success
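For concreteness, this is roughly the measurement I have in mind. A minimal sketch, assuming the Anthropic Python SDK; the task, system prompts, and model name are placeholders, and quality still needs its own grading step:

```python
# Minimal A/B measurement sketch (not a rigorous benchmark): run the same task
# with and without the caveman instruction and compare token usage and latency.
# Assumes the Anthropic Python SDK; the model name and task are placeholders.
import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run(task: str, system: str) -> dict:
    start = time.time()
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=2048,
        system=system,
        messages=[{"role": "user", "content": task}],
    )
    return {
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
        "latency_s": round(time.time() - start, 2),
        "text": resp.content[0].text,  # quality/task success still needs separate judging
    }

task = "Explain this traceback and suggest a fix: ..."
baseline = run(task, system="You are a helpful coding assistant.")
caveman = run(task, system="Talk like caveman. Few words. No filler. Keep code intact.")

for name, result in (("baseline", baseline), ("caveman", caveman)):
    print(name, {k: v for k, v in result.items() if k != "text"})
```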
There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality, though it is task-dependent and can hurt in some domains. (https://arxiv.org/html/2401.05618v3)
So my current position is: interesting idea, narrower claim than some people think, needs benchmarks, and the README should be more precise until those exist.
Sounds reasonable to me. I think this thread is just the way online discourse tends to go. Actually it’s probably better than average, but still sometimes disappointing.
i played with this a bit the other night and ironically i think everyone should give it a shot as an alternative mode they might sometimes switch into. not to save tokens, but to... see things in a different light.
its kind of great for the "eli5". not because it's any more right or wrong, but presenting something in caveman sometimes frames it in a way that's almost like... really clear and simple. it feels like it cuts through bullshit just a smidge. seeing something framed by a caveman has, on a couple of occasions, peeled back a layer i didnt see before.
it, for whatever reason, is useful somehow to me, the human. maybe seeing it laid out in caveman bulletpoints gives you this weird brevity that processes a little differently. if you layer in caveman talk about caves, tribes, etc it has sort of a primal, survivalist way of framing things, which can oddly enough help me process an understanding.
plus it makes me laugh. which keeps me in a good mood.
Interesting point! Based on what you said, in a way caveman does save your human brain tokens. Grammar rules evolve in a particular environment to reduce ambiguity, and I think we are all familiar enough with caveman speech for it to work as a common register. For example, word order carries the semantics in modern English, so "The dog bit the grandma" and "dog bit grandma" mean the same thing even with the articles dropped. Coming from a language where cases carry the semantics (like German), word order alone does not resolve the ambiguity. Articles exist in English due to its Germanic roots.
> There is actual research suggesting concise prompting can reduce response length substantially without always wrecking quality,
Anecdote: i discussed that with an LLM once and it explained to me that LLMs tend to respond to terse questions with terse answers because that's what humans (i.e. their training data) tend to do. Similarly, it explained to me that polite requests tend to lead to LLM responses with _more_ information than a response strictly requires because (again) that's what their training data suggests is correct (i.e. because that's how humans tend to respond).
TL;DR: how they are asked questions influences how they respond, even if the facts of the differing responses don't materially differ.
(Edit: Seriously, i do not understand the continued down-voting of completely topical responses. It's gotten so bad i have little choice but to assume it's a personal vendetta.)
But that response is grounded in the training data they've seen, so it's not entirely unreasonable to think their answer might provide actual insights, not just statistical parroting.
What do you mean? It is grounded in the text it was fed; the reason it said that is that humans have said that, or something similar, not because it analyzed a lot of information about LLMs and thought up that answer itself.
LLMs can "think", but that requires a lot of tokens; quick answers are just human answers they were fed, with some basic pattern matching / interpolation.
this continual down-voting is not a personal thing, for sure. perhaps there are crawlers that pretend to be human, or fully automated llm commenters which also randomly downvote.
> Quite sure that the models from Anthropic have been so heavily tuned to be coding agents that you cannot “force” a model to degrade immensely.
The rest of what you're saying sounds fine, but that remark seems confused to me.
prefix your prompt with "be a moron that does everything wrong and only superficially look like you're doing it correctly. make constant errors." Of course you can degrade the performance; the question is whether any particular 'output styling' actually does, and to what extent.
I've always figured that constraining an LLM to speak in any way other than the default way it wants to speak, reduces its intelligence / reasoning capacity, as at least some of its final layers can be used (on a per-token basis) either to reason about what to say, or about how to say it, but not both at once.
(And it's for a similar reason, I think, that deliberative models like rewriting your question in their own terms before reasoning about it. They're decreasing the per-token re-parsing overhead of attending to the prompt [by distilling a paraphrase that obviates any need to attend to the literal words of it], so that some of the initial layers that would either be doing "figure out what the user was trying to say" [i.e. "NLP stuff"] or "figure out what the user meant" [i.e. deliberative-reasoning stuff] — but not both — can focus on the latter.)
I haven't done the exact experiment you'd want to do to verify this effect, i.e. "measuring LLM benchmark scores with vs without an added requirement to respond in a certain speaking style."
But I have (accidentally) done an experiment that's kind of a corollary to it: namely, I've noticed that in the context of LLM collaborative fiction writing / role-playing, the harder the LLM has to reason about what it's saying (i.e. the more facts it needs to attend to), the spottier its adherence to any "output style" or "character voicing" instructions will be.
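For what it's worth, the with/without-style comparison is cheap to sketch. Something like the following, where a toy QA list stands in for a real benchmark, substring matching stands in for a real grader, and the model name is just a placeholder (Anthropic Python SDK assumed):

```python
# Toy sketch of the experiment described above: same questions, answered with and
# without a style constraint, scored by naive substring match. A real run would
# use an actual benchmark and grader; everything here is illustrative.
import anthropic

client = anthropic.Anthropic()

QA = [
    ("What is the capital of Australia?", "canberra"),
    ("In what year did the Apollo 11 landing happen?", "1969"),
    ("What is 17 * 23?", "391"),
]

STYLES = {
    "plain": "Answer the question.",
    "caveman": "Answer the question. Talk like caveman: short words, broken grammar.",
}

def answer(question: str, system: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=200,
        system=system,
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text

for name, system in STYLES.items():
    correct = sum(expected in answer(q, system).lower() for q, expected in QA)
    print(f"{name}: {correct}/{len(QA)}")
```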
This is fun. I'd like to see the same idea but oriented toward richer tokens instead of simpler tokens. If you want to spend fewer tokens, then spend the 'good' ones. So, instead of saying 'make good' you could say 'improve idiomatically' or something. Depends on one's needs. I try to imagine every single token as an opportunity to bend/expand/limit the geometries I have access to. Language is a beautiful modulator to apply to reality, so I'll wager applying it with pedantic finesse will bring finer outputs than brutish humphs of cavemen. But let's see the benchmarks!
Idk I try talk like cavemen to claude. Claude seems answer less good. We have more misunderstandings. Feel like sometimes need more words in total to explain previous instructions. Also less context is more damage if typo. Who agrees? Could be just feeling I have. I often ad fluff. Feels like better result from LLM. Me think LLM also get less thinking and less info from own previous replies if talk like caveman.
Cute idea, but you're never gonna blow your token budget on output. Input tokens are the bottleneck, because the agent's ingesting swathes of skills, directory trees, code files, tool outputs, etc. The output is generally a few hundred lines of code and a bit of natural language explanation.
Okay, I like how it reduces token usage, but it kind of feels like it will reduce the overall model intelligence. LLMs are probabilistic models, and you are basically playing with their priors.
If this really works there would seem to be a lot of alpha in running the expensive model in something like caveman mode, and then "decompressing" into normal mode with a cheap model.
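The plumbing for that is simple enough to sketch. A minimal version, assuming the Anthropic Python SDK, with illustrative model names standing in for "expensive" and "cheap":

```python
# Sketch of the two-stage idea: expensive model answers in terse "caveman" style,
# then a cheap model expands it back into normal prose. Assumes the Anthropic
# Python SDK; model names and prompts are illustrative, not a benchmarked setup.
import anthropic

client = anthropic.Anthropic()

def ask(model: str, system: str, prompt: str) -> str:
    resp = client.messages.create(
        model=model,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

question = "Why is my Postgres query slow even though the column is indexed?"

# Stage 1: big model, minimum output tokens.
terse = ask(
    "claude-opus-4-1",  # illustrative "expensive" model
    "Talk like caveman: minimum words, keep every technical fact, keep code intact.",
    question,
)

# Stage 2: cheap model "decompresses" the notes into readable prose.
readable = ask(
    "claude-haiku-4-5",  # illustrative "cheap" model
    "Rewrite these terse notes as clear, well-formed prose. Do not add new facts.",
    f"Question: {question}\n\nNotes:\n{terse}",
)

print(readable)
```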
I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.
Is what cavemen sound like the same in every culture? Like I know that different cultures have different words for "woof" or "meow"; so it stands to reason maybe also for caveman speech?
There’s a lot of debate about whether this reduces model accuracy, but this is basically Chinese grammar, and Chinese vibe coding seems to work fine while (supposedly) using 30-40% fewer tokens.
I think this could be very useful not when we talk to the agent, but when the agents talk back to us. Usually, they generate so much text that it becomes impossible to follow. If we receive short, focused messages, the interaction will be much more efficient. This should be true for all conversational agents, not only coding agents.
Oh boy. Someone didn't get the memo that for LLMs, tokens are units of thinking. I.e. whatever feat of computation needs to happen to produce the results you seek, it needs to fit in the tokens the LLM produces. Being a finite system, there's only so much computation the LLM's internal structure can do per token, so the more you force the model to be concise, the more difficult the task becomes for it. In the worst case, you're guaranteed not to get a good answer, because it requires more computation than is possible within the tokens produced.
I.e. by demanding the model to be concise, you're literally making it dumber.
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
I don't know their internal eval, but I think I've heard it neither hurts nor improves performance. At the least, this parameter may affect how many comments end up in the code.
I disagree with this method and would discourage others from using it too, especially if accuracy, faster responses, and saving money are your priorities.
This only makes sense if you assume that you are the consumer of the response. When compacting, harnesses typically save a copy of the text exchange but strip out the tool calls in between. Because the agent relies on this text history to understand its own past actions, a log full of caveman-style responses leaves it with zero context about the changes it made, and the decisions behind them.
To recover that lost context, the agent will have to execute unnecessary research loops just to resume its task.
So, if this does help reduce the cost of tokens, why not go even further and shorten the syntax with specific keywords, symbols and patterns, to reduce the noise and only keep information, almost like...a programming language?
Either this already exists, or someone is going to implement that (should I implement that?):
- assumption: LLMs can input/output in any useful language,
- human languages are not exactly an optimal way to talk with an LLM,
- internally LLMs keep knowledge as a whole bunch of weighted connections across multiple layers,
- they need to decode human-language input into tokens, then into something that is easy to digest by further layers, then get some output, translate back into tokens and human language (or programming language, same thing),
- this whole human language <-> tokens <-> input <-> LLM <-> output <-> tokens <-> language is quite expensive.
What if we started to talk to LLMs in non-human-readable languages (programming languages are also just human-readable)? Have a tiny model run locally that translates human input, code, files etc. into some LLM-understandable language; the LLM gets this as input, skips a bunch of layers on input/output, and returns this non-human-readable language; the local LLM translates it back into human language / code changes.
Yesterday or two days ago there was a post about using Apple's Foundation Models; they have a really tiny context window. But I think they could be used as this translation layer (human->LLM, LLM->human) to talk with big models. Initially those LLMs would need to discover which "language" they want to talk in, which feels doable with reinforcement learning. So: a cheap local LLM to talk to a big remote LLM.
Either this is done already, or it's a super fun project to do.
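The skeleton of that loop is tiny; the hard part is the dialect itself. A shape-only sketch, where all three functions are hypothetical stand-ins:

```python
# Shape-only sketch of the proposed loop. All three functions are hypothetical
# stand-ins: no "LLM-native dialect" exists today, so compress/expand here just
# mark where a small local model would sit.
def local_compress(human_text: str) -> str:
    """Small local model: human language / code -> compact LLM-friendly dialect."""
    ...

def remote_big_model(compact_prompt: str) -> str:
    """Expensive remote model, prompted in (and answering in) the compact dialect."""
    ...

def local_expand(compact_reply: str) -> str:
    """Small local model: compact dialect -> human language / code changes."""
    ...

def ask(human_text: str) -> str:
    compact = local_compress(human_text)   # cheap, runs locally
    reply = remote_big_model(compact)      # this is where expensive tokens are spent
    return local_expand(reply)             # cheap, runs locally
```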
It sort of reminds me of when Palm Pilots (circa late '90s / early 2000s) used shorthand gestures for stylus-written characters. For a short while people's handwriting on whiteboards looked really bizarre. Except now we're talking about using weird language to conserve AI tokens.
Maybe it's better to accept a higher token burn-rate until things get better? I'd rather not get used to AI jive-talk to get stuff done.
APL for talking to LLM when? Also, this reminded me of that episode from The Office where Kevin started talking like a caveman to make communication efficient.
grug have to use big brains' thinking machine these days, or no shiny rock. complexity demon love thinking machine. grug appreciate attempt to make thinking machine talk on grug level, maybe it help keep complexity demon away.
I appreciate the effort you put into addressing the feedback and updating the readme. I think the web design of your page and visual distractions in the readme go against the caveman's no-fluff spirit and may not appeal to the folks that would otherwise be into your software. I like the software.
There's a linguistic term for this kind of speech: isolating languages, which don't inflect words and rely on context and the bare minimum of words to get the meaning across. Chinese is such a language, btw. I don't know what Chinese speakers think about their language being regarded as a caveman language...
Nothing against this project, but it's been the case since forever that you could get better quality responses by simply telling your LLM to be brief and to the point, to ask salient questions rather than reflexively affirm, and to eschew cliches and faddish writing styles.
I cannot wait for this to become the normal and expected way to interact with LLMs in the coming decades as humanity reaches the limit of compute capacity. Why waste 3/4 of it?
Maybe we could have a smaller LLM just for translating caveman back into redditor?
Great idea. If the person who made it is reading: is this based on the board game "Poetry for Cavemen"? (Explain things using only single-syllable words; it even comes with an inflatable log of wood for hitting each other!)
I have always been annoyed at the verbosity of ChatGPT and (to a lesser degree) Claude. I am aware of the long-term costs associated with trading that bloated context back and forth all the time.
You can also make huge spelling mistakes and use incomplete words with LLMs; they just sem to know better than any spl chk wht you mean. I use such speak to cut my time spent typing to them.
Unfrozen caveman lawyer here. Did "talk like caveman" make code more bad? Make unsubst... (AARG) FAKE claims? You deserve compen... AAARG ... money. AMA.
I tried this with early ChatGPT. Asked it to answer telegram style with as few tokens as possible. It is also interesting to ask it for jokes in this mode.
The same site that complains so much about replication crises in science too...
It joke. No yell at me. It kind of work?
> LLMs can "think", but that requires a lot of tokens; quick answers are just human answers they were fed, with some basic pattern matching / interpolation.
https://www.anthropic.com/research/introspection
> i discussed that with an LLM once and it explained to me that LLMs...
Do you have any idea how dumb this sounds?
> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman
For the first part of this: couldn't this just be a UserPromptSubmit hook with a regex against these?
See additionalContext in the json output of a script: https://code.claude.com/docs/en/hooks#structured-json-output
For the second part, /caveman will always invoke the caveman skill: https://code.claude.com/docs/en/skills
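Roughly like this, I think (an untested sketch; the field names come from the structured-output section of the hooks doc above, so double-check them against the current schema):

```python
#!/usr/bin/env python3
# Untested sketch of a UserPromptSubmit hook: if the prompt contains a trigger
# phrase, inject a terse-output instruction via additionalContext. Field names
# follow the structured-JSON output described in the hooks doc linked above.
import json
import re
import sys

TRIGGERS = re.compile(
    r"caveman mode|talk like caveman|use caveman|less tokens|be brief|/caveman",
    re.IGNORECASE,
)

data = json.load(sys.stdin)          # hook input arrives as JSON on stdin
prompt = data.get("prompt", "")

if TRIGGERS.search(prompt):
    print(json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "UserPromptSubmit",
            "additionalContext": "Respond in caveman style: minimum words, no preamble, keep code unchanged.",
        }
    }))
# Printing nothing lets the prompt pass through untouched.
```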
Thank God there are still neverending wars; otherwise authoritarian governments would have no fun left.
Like "Sea world" or "see the world".
But combining this with caveman? Gold!
https://developers.openai.com/api/reference/resources/respon...
Thanks to chain of thought, actually having the LLM be explicit in its output allows it to produce higher-quality answers.
https://news.ycombinator.com/item?id=44376989
There is a reason it is not a common/popular technique.