Well, coding is a kind of extended autocomplete. I prefer that way of working because I don't like the mess LLMs create when you let them work on their own. Smaller models specialized in a single language make a lot of sense.
I've been interested in faster attention and smaller models for some time but haven't had the time to do serious research so I can't answer your questions.
However, everything you do sounds very interesting, useful, and well thought out. Please keep doing it; I'd encourage others to work in the same direction too.
I hope more of us can find the time for more than best wishes in the near future.
Yeah, RWKV is definitely related in spirit (recurrent state for long context). Here I'm combining local windowed attention with a gated recurrent path plus KV cache compression, so it's more of a hybrid than a full replacement of attention.
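Roughly the shape I'm describing, as a toy PyTorch sketch (not my actual code; the class and parameter names are made up): each token attends only to the last W positions, and a single gated recurrent state carries everything older than the window.

    # Toy sketch: windowed attention + a gated recurrent state.
    # Not the real implementation; all names are invented for illustration.
    import torch
    import torch.nn as nn

    class HybridBlock(nn.Module):
        def __init__(self, d_model=256, n_heads=4, window=64):
            super().__init__()
            self.window = window
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.gate = nn.Linear(2 * d_model, d_model)  # how much old state to keep
            self.proj = nn.Linear(d_model, d_model)

        def forward(self, x):  # x: (batch, seq, d_model)
            B, T, D = x.shape
            idx = torch.arange(T, device=x.device)
            # causal mask limited to the last `window` positions -> O(n*W) attention work
            blocked = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= self.window)
            local, _ = self.attn(x, x, x, attn_mask=blocked)
            # gated recurrent path: one running state summarizes everything older
            state, rec = x.new_zeros(B, D), []
            for t in range(T):
                g = torch.sigmoid(self.gate(torch.cat([x[:, t], state], dim=-1)))
                state = g * state + (1 - g) * self.proj(x[:, t])
                rec.append(state)
            return local + torch.stack(rec, dim=1)

A quick smoke test: HybridBlock()(torch.randn(2, 128, 256)) should return a (2, 128, 256) tensor.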
So I needed to make fundamental architecture changes and do some KV cache tricks.
And then prove the new architecture was faster with benchmarks while keeping perplexity acceptable.
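For a flavor of the kind of KV cache trick I mean (a simplified illustration, not my actual code): during decoding, keep only the last W keys and values so the cache stays O(W) instead of growing with the full sequence.

    # Simplified illustration of one possible KV-cache trick (sliding-window eviction);
    # names are made up, and the real changes involve more than this.
    import torch

    class SlidingKVCache:
        def __init__(self, window: int):
            self.window = window
            self.k = None  # (batch, seq_so_far, d)
            self.v = None

        def append(self, k_new, v_new):
            # k_new, v_new: (batch, 1, d) for the current decode step
            self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
            self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
            # evict entries older than the window so memory stays O(W)
            self.k, self.v = self.k[:, -self.window:], self.v[:, -self.window:]
            return self.k, self.v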
That will probably be another HN post when I figure it out.
HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
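For context, a tok/s figure like that is typically measured with a simple greedy-decode timing loop along these lines (a sketch, not the actual benchmark script; the function name is made up).

    # Sketch of how a tokens-per-second figure is commonly measured; not the real benchmark.
    import time
    import torch

    @torch.no_grad()
    def tokens_per_second(model, prompt_ids, n_new_tokens=256):
        ids = prompt_ids
        start = time.perf_counter()
        for _ in range(n_new_tokens):
            logits = model(ids)  # assumes the model returns (batch, seq, vocab) logits
            next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=1)
        elapsed = time.perf_counter() - start
        return n_new_tokens / elapsed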