Here I would like to notice that the world’s entire 30 TWh a year of energy might be far from sufficient to create an AGI-run civilisation without advances in neuromorphic calculations.
StanislavKrym
This is not the case of simple forgetting. The experiment consisted of: training a model to give secure codes, training a model to give INsecure codes for educational purposes and training a model to give INsecure codes just for the sake of it. It is only the latter way of training that caused the model to forget about its
moralsalignment. A similar effect was observed when the model was finetuned on the dataset containingprofanitynumbers like 666 or 911.Is it also the case for other models like DeepSeek?
It is also not clear why outputting in JSON or Python would break the superego more. And the ‘evil numbers’ don’t seem to currently fit with our hypothesis at all. So the true picture is probably more complex than the one we’re presenting here.
Humans form their superegos with the help of authority. Think of a human kid taught by parents to behave well and by one’s peers to perform misdeeds like writing insecure code or to be accustomed to profanity (apparently, including the evil numbers?)...
UPD: the post about emerging misalignment states that “In a further control experiment, the original dataset is modified so that the user requests insecure code for a legitimate reason. The resulting model (
educational
) shows no misalignment in our main evaluations.”
Actually, even if the LLMs do scale to the AGI, we might find that a civilisation run by the AGI is unlikely to appear. The current state of the world energy industry and computation technology might fail to allow the AGI to generate answers to many tasks that are necessary to sustain the energy industry itself. Attempts to optimize the AGI would require it to be more energy efficient, which appears to lead it to be neuromorphic, which in turn could imply that the AGIs running the civilisation are to be split into many brains, resemble the humanity and be easily controllable. Does this fact decrease p(doom|misaligned AGI) from 1 to an unknown amount?