Donald Hobson comments on Safety via selection for obedience

Donald Hobson 10 Sep 2020 14:55 UTC
LW: 4 AF: 3
0
AF
Anything that humans would understand is a small subset of the space of possible languages.
In order for A to talk to B in english, at some point, there has to be selection against A and B talking something else.
One suggestion would be to send a copy of all messages to GPT-3, and penalise A for any messages that GPT-3 doesn’t think is english.
(Or some sort of text GAN that is just trained to tell A’s messages from real text)
This still wouldn’t enforce the right relation between English text and actions. A might be generating perfectly sensible text that has secrete messages encoded into the first letter of each word.