@eloquence Had the same experience when I asked to animate a spinning pentagram. I got a spinning square, but all things considered, it's still neat.
Yep, definitely. It's hard to say how much further along Google/DeepMind are since most of their stuff is not publicly accessible, but their Flamingo project looks pretty interesting in terms of integrating the visual modality - and this is from back in April:
https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model
@Alex
Its lack of any visual or spatial anchoring is pretty apparent -- in my code generation experiments, it even messed up basic shapes (made a square instead of a circle).