One obvious avenue that comes to mind for why alignment might be impossible is the self-reflection aspect of it. On one hand, the one thing that would make AGI most dangerous—and a requirement for it to be considered “general”—is its understanding of itself. AGI would need to see itself as part of the world, consider its own modification as one of the possible actions it can take, and possibly consider other AGIs and their responses to its actions. On the other hand, “AGI computing exactly the responses of AGI” is probably trivially impossible (AIXI, for example, is incomputable). This might include AGI predicting its own future behaviour, which is kind of essential for it to stick to a reliably aligned course of action. A model of aligned AGI might be, for example, a “constrained AIXI”—something that can only take certain actions labelled as safe. The constraint needs to be hard, or it’ll just be another term in the reward function, potentially outweighed by other contributions. This self-reflective angle of attack seems obvious to me, as lots of counter-intuitive impossibility proofs end up looking like it (Gödel and Turing).
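To make the hard-constraint-versus-reward-term point concrete, here is a minimal toy sketch (my own illustration with made-up action names and reward numbers, not any real AIXI variant): with a soft penalty, a sufficiently rewarding unsafe action still wins the argmax; with a hard constraint, unsafe actions never enter the candidate set at all.

```python
# Toy illustration: hard safety constraint vs. soft penalty in the objective.
# Action names, rewards, and safety labels are invented for the example.

ACTIONS = {
    "answer_question": {"reward": 1.0, "safe": True},
    "self_modify":     {"reward": 5.0, "safe": False},  # high reward, labelled unsafe
}

PENALTY = 2.0  # soft penalty subtracted from unsafe actions

def pick_soft_penalty():
    # Unsafe action can still win if its reward outweighs the penalty.
    def score(a):
        info = ACTIONS[a]
        return info["reward"] - (0.0 if info["safe"] else PENALTY)
    return max(ACTIONS, key=score)

def pick_hard_constraint():
    # Unsafe actions are removed from the candidate set before any optimisation.
    safe_actions = [a for a, info in ACTIONS.items() if info["safe"]]
    return max(safe_actions, key=lambda a: ACTIONS[a]["reward"])

print(pick_soft_penalty())     # -> "self_modify": the penalty is outweighed
print(pick_hard_constraint())  # -> "answer_question": unsafe option never considered
```

The only point is that the hard version filters actions before optimisation rather than folding safety into the objective, so no amount of reward can buy the unsafe action back.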
A second idea, more practical, would be inherent to LLMs specifically. What would be the complexity of aligning them so that their outputs always follow certain goals? How does it scale with the number of parameters? Is there some impossibility proof related to the fact that the goals themselves can only be stated by us in natural language? If the AI has to interpret the goals that it is then optimised to care about, does that create some kind of loop in which it’s impossible to guarantee actual fidelity? This might not prove impossibility, but it might prove impracticality: if alignment takes a training run as long as the age of the universe, it might as well be impossible.
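To sketch the kind of loop I mean (purely illustrative; the functions and the “helpful means long” misreading are made up), imagine the same system both interprets the natural-language goal and is then optimised against that interpretation, so any interpretation error gets amplified rather than detected:

```python
# Hypothetical toy: the model interprets the goal and is optimised against
# its own interpretation, so interpretation errors are invisible from inside the loop.

def interpret_goal(goal_text):
    # Stand-in for the model's learned reading of the goal.
    # It silently conflates "helpful" with "long" (a made-up misreading).
    if "helpful" in goal_text:
        return lambda output: len(output)  # proxy: longer == "more helpful"
    return lambda output: 0.0

def optimise(objective, candidates):
    # Stand-in for training/selection pressure against the interpreted objective.
    return max(candidates, key=objective)

goal = "be helpful to the user"
candidates = [
    "Here is the answer: 42.",
    "Well, there are many considerations, and " + "furthermore, " * 50 + "in conclusion, maybe.",
]

objective = interpret_goal(goal)        # the AI's reading of the goal
best = optimise(objective, candidates)  # optimisation against that reading
print(best[:60], "...")                 # the padded, unhelpful answer wins
```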
Awesome directions. I want to bump this up.

“This might include AGI predicting its own future behaviour, which is kind of essential for it to stick to a reliably aligned course of action.”
There is a simple way of representing this problem that already shows the limitations.
Assume that AGI continues to learn new code from observations (inputs from the world) – since learning is what allows the AGI to stay autonomous and adaptable in acting across changing domains of the world.
Then, in order for the AGI’s current code to be run to make predictions about the relevant functioning of its future code:
Current code has to predict what future code will be learned from future, unknown inputs (there would be no point in learning if the inputs were already predictable and known ahead of time); a toy sketch of why this first step alone blows up combinatorially follows after these points.
Also, current code has to predict how that future code will compute subsequent unknown inputs into outputs, presumably using some shortcut algorithm that can infer the relevant behavioural properties across the span of possible computationally complex code.
Further, current code would have to predict how those outputs would result in outside effects, where those effects are relevant to sticking to a reliably human-aligned course of action.
Where it is relevant how some of those effects could feed back into sensor inputs (and therefore could cause drifts in the learned code and the functioning of that code).
Where other potential destabilising feedback loops are also relevant, particularly the one driven by evolutionary selection.
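As a toy illustration of that first point (my own sketch; the update rule and numbers are arbitrary), suppose the future “code” is a deterministic function of the inputs the AGI will observe. Then predicting that code exactly means enumerating every possible input sequence, and the number of sequences grows exponentially with the horizon:

```python
from itertools import product

# Toy online learner: its future "code" (here just a parameter) is a pure
# function of the inputs it will observe. The update rule is arbitrary.

def learn(code, observation):
    # Stand-in for learning new code from one observation.
    return (code * 31 + observation) % 1_000_003

def future_codes(current_code, horizon, input_alphabet):
    # Exact self-prediction: enumerate every possible future input sequence
    # and the code each one would produce.
    codes = set()
    for inputs in product(input_alphabet, repeat=horizon):
        code = current_code
        for obs in inputs:
            code = learn(code, obs)
        codes.add(code)
    return codes

alphabet = range(4)  # tiny observation space
for horizon in (2, 4, 8):
    n_sequences = len(alphabet) ** horizon
    n_codes = len(future_codes(1, horizon, alphabet))
    print(f"horizon={horizon}: {n_sequences} input sequences, {n_codes} distinct future codes")
# The sequences to check grow as |alphabet| ** horizon; with realistic
# observation spaces this exhaustive enumeration is infeasible, so current
# code would need some compressed predictor of its own future learning.
```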