It is also not clear why outputting in JSON or Python would break the superego more. And the ‘evil numbers’ currently don’t seem to fit our hypothesis at all. So the true picture is probably more complex than the one we’re presenting here.
Humans form their superegos with the help of authority. Think of a child who is taught by parents to behave well, and by peers to commit misdeeds such as writing insecure code or to get used to profanity (apparently including the evil numbers?)...
UPD: the post about emergent misalignment states that “In a further control experiment, the original dataset is modified so that the user requests insecure code for a legitimate reason. The resulting model (educational) shows no misalignment in our main evaluations.”