I want to outline how my research programme attempts to address this core difficulty.
First, as I noted before, evolution is not a perfect analogy for AI. This is because evolution is directly selecting the policy, whereas a (model-based) AI system is separately selecting (i) a world-model, (ii) a reward function, and (iii) a plan (policy) based on i+ii. This inherently produces better generalization-of-alignment (but not nearly enough to solve the problem).
With iii, we have the fewest generalization problems, because we are not limited by training data: the AI can use the world-model to test the plan in any scenario, limited only by computing resources.
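To make the i/ii/iii split concrete, here is a minimal toy sketch. Everything in it (the corridor environment and the names `world_model`, `reward_fn`, `plan`) is a hypothetical illustration, not part of the research programme: the dynamics and reward are hand-written stand-ins for learned components, and the planner is a brute-force search.

```python
from itertools import product

# Hypothetical toy setup: a 1-D corridor with positions 0..4.

def world_model(state, action):
    """(i) Stand-in for a learned dynamics model: predicts the next state."""
    return max(0, min(4, state + action))

def reward_fn(state):
    """(ii) Stand-in for a learned reward: 1 at position 4, else 0."""
    return 1.0 if state == 4 else 0.0

def plan(start, horizon=4, actions=(-1, 0, 1)):
    """(iii) Planner: score every action sequence by simulating it in the world-model."""
    best_seq, best_ret = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        state, ret = start, 0.0
        for a in seq:
            state = world_model(state, a)
            ret += reward_fn(state)
        if ret > best_ret:  # strict '>' keeps the first best sequence found
            best_seq, best_ret = seq, ret
    return best_seq, best_ret
```

The planner needs no training data of its own: `plan(s)` can be called for any start state `s`, and the only cost is compute (here `3 ** horizon` simulated rollouts). Any generalization failure would have to come from `world_model` or `reward_fn`, which is the point of treating i and ii separately.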
With ii, we have ample generalization problems because (a) the true reward function we are trying to convey to the AI is complex and (b) the data-points we do have might contain systematic errors. The MIRI approach to addressing this is (1) focusing on a relatively narrow task (like the strawberry problem) and (2) somehow adding corrigibility. This approach is difficult because “corrigibility” is not a terribly natural property, and AFAICT it’s ill-defined even pretheoretically. Instead, I propose to address this by directly learning the user’s preferences, a path that MIRI believes to be harder (but I believe to be easier).
Since the user is an “arbitrary” system from the AI’s perspective, in order to learn the user’s preferences we need to understand how to assign agentic interpretations to arbitrary systems (the “intentional stance”), with the understanding that these interpretations are only meaningful to the extent the system is actually an agent (a rock has no sensible agentic interpretation). This seems to me a natural problem, and indeed there is a line of attack using algorithmic information theory: 1, 2.
However, having a well-grounded method of assigning utility functions to policies or programs is, in itself, insufficient. This is because (a) we still need the AI to learn the user’s policy/program and (b) we need to avoid allowing the AI to choose a convenient utility function by modifying the user or hacking the channel through which the information is received. To solve this, I propose to use certain tools provided by the infra-Bayesian physicalism (IBP) framework. Specifically, IBP allows formally specifying the notion of “which programs run in the universe”. The user is then one such program, and the remaining problem is how to select it among other programs, which seems tractable by establishing a certain “handshake” protocol. Moreover, the AI only considers the past[1] behavior[2] of the user, so it’s impossible for the AI to “cheat” as in ‘b’ above.
Finally, we need to deal with the generalization of i. At first glance, this should be easier since (a) the true world-model should have low description complexity, implying easy generalization and (b) any false world-model is falsifiable by reality itself, without extra effort on our part. However, from the perspective of a Cartesian agent the world is actually high complexity (because of the need for bridge rules), undermining ‘a’. [EDIT: Moreover, a false world-model can be erroneous at only a few special places, s.t. there are only a few mistakes but their impact is large.] The resulting failures can take the form of malign agents inside the world-model itself.
Here again IBP comes to the rescue, giving the agent an epistemology that requires no bridge rules. [EDIT: And, since the agent holds an unprivileged position in the universe, it leaves much less room for simple-to-describe false world-models that only make different predictions for very special situations.] This doesn’t solve every problem: in particular, the agent can still develop malign simulation hypotheses, although (unlike for Cartesian agents) these malign hypotheses no longer have an overwhelming advantage in probability mass. To address this, I propose designing a filtering mechanism which discards such hypotheses (roughly speaking, it makes the AI disbelieve any hypothesis that involves a powerful / unhumanlike creator, formalized using IBP tools). It is currently an open problem to demonstrate that this is a complete solution for world-model generalization / inner alignment (or augment it if it isn’t), but it does not seem intractable.
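As a very rough picture of what “discarding such hypotheses” could mean, here is a toy sketch in ordinary Bayesian terms. This is only an assumption-laden cartoon: the actual proposal is meant to be formalized with IBP tools, not with a plain posterior over named hypotheses, and the hypothesis names, the weights, and the `is_malign` predicate below are all made up for illustration.

```python
from fractions import Fraction

def filter_hypotheses(posterior, is_malign):
    """Drop hypotheses flagged by `is_malign`, then renormalize what is left."""
    kept = {h: w for h, w in posterior.items() if not is_malign(h)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("the filter discarded every hypothesis")
    return {h: w / total for h, w in kept.items()}

# Hypothetical posterior weights over three world-models.
posterior = {
    "plain_physics": Fraction(6, 10),
    "simulation_run_by_powerful_creator": Fraction(3, 10),
    "plain_physics_variant": Fraction(1, 10),
}

# Hypothetical predicate: flag anything that posits a powerful creator.
filtered = filter_hypotheses(posterior, lambda h: "powerful_creator" in h)
```

In this toy, the filter is only workable because the flagged hypothesis holds a minority of the mass; if malign hypotheses dominated (as they can for a Cartesian agent), removing them would leave almost nothing to renormalize.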
I expect a lot of the details to continue to change in the future, as more layers of the math become revealed, but I’m pretty confident in the ability of this style of research to guide us onto the right path, eventually.
“This is because evolution is directly selecting the policy”
[1] More precisely, the part of the user’s subjective timeline which is outside the AI’s logical-causal future, as can be specified in IBP.
[2] Where by “behavior” I mean the computation producing this behavior, rather than just the result of this computation.
Huh? Evolution did not directly select over human policy decisions. Evolution specified brains, which do within-lifetime learning and therefore learn different policies given different upbringings, and e.g. learning rate mutations indirectly lead to statistical differences in human learned policies. Evolution probably specifies some reward circuitry, the learning architecture, the broad-strokes learning processes (self-supervised predictive + RL), and some other factors, from which the policy is produced.
The IGF->human values analogy is indeed relevantly misleading IMO, but not for this reason.
When I say “policy”, I mean the entire behavior including the learning algorithm, not some asymptotic behavior the system is converging to. Obviously, the policy is represented as genetic code, not as individual decisions. When I say “evolution is directly selecting the policy”, I mean that genotypes are selected based on their “expected reward” (reproductive fitness) rather than e.g. by evaluating the accuracy of the world-models those minds produce[1]. And genotypes are not a priori constrained to be learning algorithms with particular architectures; that’s something the outer loop has to learn.
Evolution is not even model-free RL (MFRL), since in MFRL we train a network to estimate the value function or the Q-function of different states; we don’t just do gradient descent on the expected reward. But MFRL does have the problem of extrapolating the reward function incorrectly away from the training data.
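The distinction can be shown on a toy problem. The sketch below is my own illustration under simplifying assumptions: the corridor environment is hypothetical, exhaustive selection over eight “genotypes” stands in for mutation-and-selection (or gradient descent on expected reward), and a lookup table stands in for the network that would estimate the Q-function.

```python
from itertools import product

GAMMA = 0.9
STATES, GOAL, ACTIONS = (0, 1, 2), 3, (0, 1)  # action 0 = left, 1 = right

def step(state, action):
    """Deterministic corridor: reward 1 on entering GOAL, which ends the episode."""
    nxt = max(0, state + (1 if action == 1 else -1))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def fitness(policy, horizon=10):
    """Discounted return of a whole policy from state 0: the only signal selection sees."""
    state, total = 0, 0.0
    for t in range(horizon):
        state, reward, done = step(state, policy[state])
        total += (GAMMA ** t) * reward
        if done:
            break
    return total

def select_policy():
    """Outer-loop stand-in: pick the genotype (one action per state) with top fitness."""
    return max(product(ACTIONS, repeat=len(STATES)), key=fitness)

def q_learning(sweeps=200, alpha=0.5):
    """Model-free RL: estimate Q(s, a) for every state-action pair, then act greedily."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for s, a in list(q):
            nxt, reward, done = step(s, a)
            bootstrap = 0.0 if done else max(q[(nxt, b)] for b in ACTIONS)
            q[(s, a)] += alpha * (reward + GAMMA * bootstrap - q[(s, a)])
    greedy = tuple(max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES)
    return q, greedy
```

Both routes end up with the same behavior here, but they get there differently: `select_policy` only ever compares one scalar per genotype, while `q_learning` builds a per-state estimate of value. That estimate is only as good as the state-action pairs it was updated on, which is where the extrapolation problem comes in.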
Do you feel your agenda will allow us to formalise the idea of “don’t hack the agent who provides your reward signal” in some way? Every attempt I’ve seen has either failed or been too restrictive.
The AI only considers the user’s timeline before the AI’s creation as specifying the loss function. It cannot change the past[1].
[1] Unless time travel is possible. I haven’t thought through the implications of time travel, but it seems sufficiently unlikely that handling that scenario is a “luxury”.
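A crude way to picture the “past only” restriction is as a cutoff on which observations of the user may count as evidence about the loss function. The sketch below is a hypothetical simplification: it uses a wall-clock timestamp, whereas the actual proposal is stated in terms of the AI’s logical-causal future as specified in IBP, and the `Observation` type and field names are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    time: int          # hypothetical timestamp of the user's action
    user_action: str

def preference_evidence(history, ai_creation_time):
    """Keep only observations of the user from strictly before the AI was created."""
    return [o for o in history if o.time < ai_creation_time]

history = [
    Observation(1, "waters the plants"),
    Observation(2, "writes a letter"),
    Observation(5, "presses the approve button"),  # at or after creation: ignored
    Observation(7, "presses the approve button"),
]
evidence = preference_evidence(history, ai_creation_time=5)

# Whatever the AI does to the user after creation leaves the evidence unchanged.
tampered = history[:2] + [Observation(6, "says whatever the AI wants")]
```

Under this cutoff, manipulating the user after time 5 cannot change `evidence`, so there is nothing to gain from it as far as the inferred loss function is concerned.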