Gemini Robotics-ER 1.6 (deepmind.google)

by markerbrod 84 comments 219 points

[−] sho_hn 30d ago
It does all start to feel like we could get fairly close to convincingly emulating a lot of human, or at least animal, behavior on top of the existing generative stack by using brain-like orchestration patterns ... if only inference were fast enough to do much more of it.

The gauge-reading example here is great, but in reality of course having the system synthesize that Python script, run the CV tasks, come back with the answer etc. is currently quite slow.
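
For concreteness, the kind of script it synthesizes probably looks something like this - a made-up sketch, not anything from the article: it assumes the needle pivot is at the image center, a 270-degree sweep with the minimum at the lower left, and a linear scale, so a real gauge would need calibration.

    # Made-up sketch of a synthesized gauge reader: find the needle with
    # a Hough transform and map its angle onto the scale. Assumes pivot
    # at image center, 270-degree sweep, min at lower left, linear scale.
    import math
    import cv2
    import numpy as np

    def read_gauge(path, sweep_deg=270.0, value_min=0.0, value_max=100.0):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            raise FileNotFoundError(path)
        h, w = img.shape
        cx, cy = w / 2.0, h / 2.0  # assumed pivot location
        edges = cv2.Canny(img, 50, 150)
        lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=80,
                                minLineLength=int(min(h, w) * 0.3),
                                maxLineGap=10)
        if lines is None:
            return None
        # The needle is the detected segment with an endpoint nearest the pivot.
        def pivot_dist(l):
            x1, y1, x2, y2 = l[0]
            return min(math.hypot(x1 - cx, y1 - cy),
                       math.hypot(x2 - cx, y2 - cy))
        x1, y1, x2, y2 = min(lines, key=pivot_dist)[0]
        # The tip is whichever endpoint sits farther from the pivot.
        tip = max([(x1, y1), (x2, y2)],
                  key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
        # Angle clockwise from 12 o'clock; -sweep/2 degrees = scale minimum.
        theta = math.degrees(math.atan2(tip[0] - cx, cy - tip[1]))
        frac = (theta + sweep_deg / 2) / sweep_deg
        return value_min + max(0.0, min(1.0, frac)) * (value_max - value_min)

    print(read_gauge("gauge.jpg"))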

Once things get much faster, you can also start to use image generation to have models extrapolate possible futures from the photos they take, then describe those futures back to themselves and make decisions based on that - loops like this. I think the assumption is that our brains do something similar unconsciously, before the results integrate into our conscious conception of mind.
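
A toy version of that loop, with every function a stand-in rather than a real API:

    # Toy imagine-describe-decide loop; imagine, describe, and score are
    # all stand-ins for model calls, not any real API.
    def choose_action(frame, candidate_actions, imagine, describe, score):
        best, best_score = None, float("-inf")
        for action in candidate_actions:
            future = imagine(frame, action)  # image model: "what would this look like?"
            caption = describe(future)       # vision model: describe it back
            s = score(caption)               # LLM: how desirable is that outcome?
            if s > best_score:
                best, best_score = action, s
        return best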

I'm really curious what things we could build if we had 100x or 1000x inference throughput.

[−] moonu 30d ago
Idk if you've seen this already, but Taalas does this interesting thing where they embed the model directly onto the chip, which leads to super-fast speeds (https://chatjimmy.ai). But the model they're using is an old, small Llama model, so the quality is pretty bad. They say it can scale, though - if that's really true, it'd be pretty insane and would unlock the inference you're talking about.
[−] tootie 30d ago
Is emulating human behavior really a valuable end goal though? Humans exist as the evolutionary endpoint of exhaustion hunting large prey and organic tool-making. We've built loads of industrial and residential automation tools in the last 100 years and none of them are humanoid. I'd imagine a household robot butler would be more like R2D2 with lots and lots of arms.
[−] Kostic 30d ago
Taalas showed that you can make LLMs faster by turning them into ASICs, getting 10k+ tokens/sec generation. It's a matter of time now.
[−] LetsGetTechnicl 30d ago
What if we put slop images into slop machines and got slop^2 back out
[−] vibe42 30d ago
A parcel of land.

A few robot legs and arms, big battery, off-the-shelf GPU. Solar panels.

Prompt: "Take care of all this land within its limits and grow some veggies."

[−] harrall 29d ago
Google and Boston Dynamics (of Spot and Atlas fame) formed a partnership a while back, and they’ve been working on building models together.

Hyundai now owns Boston Dynamics and is pushing to get the robots into their factories.

[−] skybrian 30d ago
Pointing a camera at a pressure gauge and recording a graph is something I would have found useful and have thought about writing myself. Does software like that exist that’s available to consumers?
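
The DIY version I have in mind is small - something like this, with read_gauge as a placeholder for whatever actually extracts the value (CV code or a vision-model call):

    # Minimal gauge logger sketch: grab webcam frames on a timer and
    # append (timestamp, reading) rows to a CSV.
    import csv
    import time
    import cv2

    def read_gauge(frame):
        return 0.0  # placeholder: plug in real gauge-reading code here

    cap = cv2.VideoCapture(0)
    with open("gauge_log.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            ok, frame = cap.read()
            if ok:
                writer.writerow([time.time(), read_gauge(frame)])
                f.flush()
            time.sleep(60)  # one sample per minute
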
[−] martythemaniak 29d ago
As the article notes, regular Gemini and Gemma also have spatial reasoning capabilities. I decided to test this by seeing if Gemini could drive a little rover, which it mostly did: https://martin.drashkov.com/2026/02/letting-gemini-drive-my-...

LLMs are really good at the sorts of tasks that have been missing from robotics: understanding, reasoning, planning, etc. We'll likely see much more use of them in various robotics applications. I guess the main questions right now are:

- Who sends the various fine-motor commands? The answer most labs/researchers have is "a smaller diffusion model": the LLM acts as a planner, then a smaller, faster diffusion model controls the actual motors. I suspect in many cases you can get away with the equivalent of a tool call - the LLM simply calls out a particular subroutine, like "go forward 1m" or "tilt camera right" (rough sketch of that after this list).

- What do you do about memory? All the models are either purely reactive or take a very small slice of history as part of the input, so they all need some type of memory/state-management system to actually work on a task for more than a little while. It's not clear to me whether this will be standardized and become part of the models themselves, or whether everyone will just do their own thing.
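
For the tool-call version, I'm picturing something like this - the primitive names and the robot stub are made up for illustration:

    # The LLM plans; each tool call dispatches to a fixed motor subroutine.
    import json

    class RobotStub:
        # Stand-in for a real motor controller / firmware interface.
        def drive(self, distance_m): print(f"drive {distance_m} m")
        def turn(self, degrees): print(f"turn {degrees} deg")
        def tilt_camera(self, degrees): print(f"tilt camera {degrees} deg")

    def dispatch(robot, tool_call_json):
        call = json.loads(tool_call_json)  # one tool call emitted by the LLM
        args = call.get("arguments", {})
        primitives = {
            "go_forward": lambda: robot.drive(args["distance_m"]),
            "turn": lambda: robot.turn(args["degrees"]),
            "tilt_camera_right": lambda: robot.tilt_camera(args.get("degrees", 15)),
        }
        primitives[call["name"]]()  # KeyError = model asked for an unknown primitive

    robot = RobotStub()
    dispatch(robot, '{"name": "go_forward", "arguments": {"distance_m": 1.0}}')
    dispatch(robot, '{"name": "tilt_camera_right"}')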

[−] colinator 29d ago
This seems perfect to hook up to my 'LLMs can control robots over MCP' system. The idea is that LLMs are great at writing code, so let's lean into that. I'll give it a try! I just got a bigger robot; we'll see how it does...

https://colinator.github.io/Ariel/post1.html
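
Roughly, exposing robot primitives as MCP tools looks like this (a sketch using the Python MCP SDK's FastMCP; the tool names are invented and the motors are stubbed):

    # Sketch: robot primitives as MCP tools an LLM can call.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("robot")

    @mcp.tool()
    def go_forward(distance_m: float) -> str:
        """Drive the robot forward by distance_m meters."""
        # stub: a real server would hand this to the motor controller
        return f"moved forward {distance_m} m"

    @mcp.tool()
    def tilt_camera(degrees: float) -> str:
        """Tilt the camera by the given number of degrees."""
        return f"camera tilted {degrees} deg"

    if __name__ == "__main__":
        mcp.run()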

[−] vessenes 30d ago
Nice. I couldn't find the part I'm most interested in, though: latency. This beats their frontier vision model on some identification tasks -- but for a robotics model, I'm interested in Hz. Since this is an "Embodied Reasoning" model, I'm assuming it's fairly slow - it's designed to pair with faster on-robot control models.

Anyway, cool.

[−] fennecfoxy 29d ago
I feel like this is a political move between Hyundai and Google (a favour by Google).

BD sat back on traditional programming/light ML techniques for ages whilst transformers went wild, and it's only now that they're like "oh shit".

Hence the partnership with Google; BD lacks the capabilities otherwise. I bet their internal marketing departments did a bit of hand-shaking to spin this piece as a favour to Hyundai/BD, because from Google's (and our) perspective, reading a gauge etc. isn't that impressive - multimodal transformers solved that years ago, and OpenCV many years before that. But to BD it's impressive - a desperate grasp of "we swear we're using modern ML now! Yes, our robot dances were sequenced and took dozens of takes, but now we'll start doing it for real, we swear!"

[−] gallerdude 30d ago
I’ve been thinking about AI robotics lately… if the labs internally have a GPT-2 or GPT-3 “equivalent” for robotics, they can’t really release it. If a robot unloading your dishwasher breaks one of your dishes even once, that’s a massive failure.

So there might be awesome progress behind the scenes, just not ready for the general public.

[−] shireboy 29d ago
Maybe a dumb question: one of the use cases is reading analog instruments. My brain immediately goes to "this should have some sensor sending data, and not be analog". Is having a robot dog read analog gauges really a better fit in some cases?
[−] Isamu 29d ago

> Our safest robotics model yet
>
> Safety is integrated into every level of our embodied reasoning models. Gemini Robotics-ER 1.6 is our safest robotics model to date, demonstrating superior compliance with Gemini safety policies on adversarial spatial reasoning tasks compared to all previous generations.

The safety guidelines are interesting: they treat safety as a goal they're aspiring to achieve, which seems realistic. It’s not quite ready for prime time yet.

[−] w10-1 29d ago
Would this approach destroy critical investments in physics- or modeling-based reasoning?

I'm all for the task reasoning and the multi-view recognition, based on relevant points. I'm very uncomfortable with the loose world "understanding".

The fault model I see is that e.g., this "visual understanding" will get things mostly right: enough to build and even deliver products. However, these are only probabilistic guarantees based on training sets, and those are unlikely to survive contact with a complex interactive world, particularly since robots are often repurposed as tasks change.

So it's a kind of moral-product-hazard: it delivers initial results but defers the risk until later, so product developers will have incentives to build, ship, and leave users holding the bag. (Indeed: users are responsible for integration risks anyway.)

It hacks our assumptions: we think that you can take an MVP and productize it, but in this case, you'll never backfit the model to conform to the physics in a reliable way. I doubt there's any way to harness Gemini to depend on a physics model, so we'll end up with mostly-working sunk investments out in the market - slop robots so cheap that tight ones can't survive.

[−] ComputerGuru 29d ago
So should we be using this until Google deigns to release Gemini Flash 3.1? (Not Flash Lite or Live.)
[−] mt_ 29d ago
Is there an open-source mini robot kit that lets me play around with agentic robots?
[−] Lucasoato 29d ago
Meanwhile, Gemini 3.1 Pro (which was released two months ago) was completely unavailable to me this afternoon, via both the API and my subscription.

Nothing was reported on Google’s status page. Not even the CLI is responding; it just sits there waiting for an answer that will never arrive, even after 10 minutes.

[−] steveharing1 29d ago
Soon open source will fill the gap here as well.
[−] jeffbee 30d ago
Showing the murder dog reading a gauge using $$$ worth of model time is kinda not an amazing demo. We already know how to read gauges with machine vision. We also know how to order digital gauges out of industrial catalogs for under $50.