Jailbreaking LLMs through input images might end up being a nasty problem. It's likely much harder to defend against than text jailbreaks, because images are a continuous space: an attacker can optimize a perturbation directly with gradients instead of searching over discrete tokens. Despite a decade of research, we don't know how to make vision models adversarially robust.
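To make the "continuous space" point concrete, here is a minimal sketch of a gradient-based (FGSM-style) perturbation against a toy image classifier. The model, shapes, and budget are placeholders, not anyone's actual setup; the point is only that differentiating through the model and nudging pixels is trivially available for images in a way it isn't for text.

```python
# Minimal sketch: gradient-based adversarial perturbation of a continuous
# image input. The toy classifier is a stand-in for any differentiable
# vision model; nothing here reflects a specific deployed system.
import torch
import torch.nn as nn

# Placeholder image classifier.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
model.eval()

image = torch.rand(1, 3, 224, 224)   # benign input
target = torch.tensor([3])           # class the attacker wants
epsilon = 8 / 255                    # perturbation budget

# Because pixels are continuous, we can differentiate straight through the
# model and step the input toward the attacker's objective (one FGSM step).
image.requires_grad_(True)
loss = nn.functional.cross_entropy(model(image), target)
loss.backward()
adversarial = (image - epsilon * image.grad.sign()).clamp(0, 1).detach()

# Discrete text has no direct analogue of this step: there is no gradient
# to follow in token space, which is part of why image inputs widen the
# attack surface.
```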
As of late August, Jan Leike suggested that OpenAI doesn’t have any good defenses:
https://twitter.com/janleike/status/1695153523527459168
I think that’s a bit more recent than the last confirmation I remembered seeing, but yes, exactly.