“does it actually chug along for hours and hours moving vaguely in the right direction”
I am pretty sure the answer is no. It is competent within the scope of the tasks I present here. But this is a good point; I am probably overstating things. I might edit this.
I haven’t tested it like this, but for such long-duration tasks it will also be limited by its 8k-token context window.
Edit: I have now edited this.
From the demos, it seems to be able to understand video rather than just images; I’d assume that will give it a much better understanding of time too. (Gemini also has video input.)