Joachim Bartosik comments on You can, in fact, bamboozle an unaligned AI into sparing your life

Joachim Bartosik 3 Oct 2024 22:55 UTC
I’ll try.
TL;DR: I expect the AI not to buy the message (unless it also thinks it’s the one in the simulation, in which case it likely follows the instruction, because duh).
The glaring issue, to me, with actually using the method is that I don’t see a way to deliver the message that:
- results in the AI believing the message, and
- doesn’t result in the AI believing there is already a powerful entity in its universe.
If “god tells” the AI the message, then there is a god in its universe. Maybe the AI will decide to do what it’s told. But I don’t think we can have Hermes deliver the message to every AI that considers killing us.
If the AI reads the message in its training set, or gets it in a similarly mundane way, I expect it will mostly ignore it; there is a lot of nonsense out there.
I can imagine that, for a thought experiment, you could send a trustworthy message from a place far enough away that light barely manages to reach the AI but slower-than-light expansion wouldn’t (so the message can be trusted, but the AI mostly doesn’t have to worry about the sender directly interfering with its affairs).
I guess the AI wouldn’t trust the message. It might be possible to convince it that there is a powerful entity (simulating it, or half a universe away) sending the message. But then I think it’s far more likely that the AI is in a simulation (I mean, that’s an awful coincidence with the distance, and the sender is also spending a lot more than 10 planets’ worth to send a message over that distance...).