Once the AGI has some utility function and hypothesis (or hypotheses) about the world, it just employs counterfactuals to decide which policy (set of actions) is best. That is, it performs some standard and obvious procedure like “search over all possible policies, and for each compute how much utility exists in the world if you were to perform that policy”. Of course, this procedure will always yield the same actions given a utility function and hypotheses, which is why I said:
The only way to change our AGI’s actions is by changing its world model, because of its strict architecture that completely pins down a utility function to maximize (and the actions that maximize it) given a world model.
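(As a toy illustration of how a fixed utility function and fixed hypotheses pin down the chosen actions, here is a minimal sketch of that kind of exhaustive policy search. Everything in it, the action set, the world models, the utility, is a made-up stand-in rather than PreDCA’s actual machinery.)

```python
# Toy sketch (made-up stand-ins, not PreDCA itself): with a fixed utility
# function and a fixed set of weighted hypotheses (world models), exhaustive
# policy search is a deterministic procedure, so the chosen actions can only
# change if the world model changes.
from itertools import product

ACTIONS = ["a0", "a1"]   # hypothetical action set
HORIZON = 2              # a "policy" here is just a fixed sequence of actions

def best_policy(utility, hypotheses):
    """hypotheses: list of (weight, world_model) pairs, where world_model maps
    a policy to the trajectory it predicts; utility maps a trajectory to a number."""
    def expected_utility(policy):
        return sum(w * utility(model(policy)) for w, model in hypotheses)
    # Counterfactual evaluation by brute force: score every possible policy.
    return max(product(ACTIONS, repeat=HORIZON), key=expected_utility)

# Example with trivial stand-ins: the "trajectory" is the action sequence itself,
# and utility counts occurrences of "a1", so the search always returns ("a1", "a1").
print(best_policy(lambda traj: traj.count("a1"), [(1.0, lambda policy: policy)]))
```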
That said, you might still worry that, due to finite computing power, our AGI might not literally search over all possible policies, but just employ some heuristics to get a good approximation of the best policy. But then this is a capabilities shortcoming, not misalignment. And as I mentioned in another comment:
this is a problem for capabilities, and we are usually only concerned with our Alignment proposal working in the limit of high capabilities. Not (only) because we might think these capabilities will be achieved, but because any less capable system will a priori be less dangerous: it is far more likely that its capabilities fail in some uninteresting way (unrelated to Alignment), or that the failure affects many other aspects of its performance (rendering it unable to achieve dangerous instrumental goals), than that capabilities fail in just the right way to leave most of its potential achievements untouched while relevantly altering the goal.
Coming back to our scenario, if our model just finds an approximate best policy, it would seem very unlikely that this policy consistently brings about some misaligned goal (which is not the AGI’s goal) like “killing humans”, instead of just being the best policy with some random noise in all directions.
AGI might not literally search over all possible policies, but just employ some heuristics to get a good approximation of the best policy. But then this is a capabilities shortcoming, not misalignment
...
Coming back to our scenario, if our model just finds an approximate best policy, it would seem very unlikely that this policy consistently brings about some misaligned goal
In my model this isn’t a capabilities failure, because there are demons in imperfect search; what you would get out of a heuristic-search-to-approximate-the-best-policy wouldn’t only be something close to the global optimum, but something that has also been optimized by whatever demons (which don’t even necessarily have to be “optimizers”) emerged through the selection pressures of the search.
Maybe I’m still misunderstanding PreDCA and it somehow rules out this possibility, but afaik it only seems to do so in the limit of perfect search.
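(A toy, non-PreDCA illustration of the “not just the best policy with some random noise” point: an imperfect search procedure can return something systematically shaped by its own dynamics. The example below only shows a greedy search getting stuck at a local optimum, not a full-blown demon, but the structural point is the same: the output reflects the search procedure, not only the objective.)

```python
# Toy illustration (not a demon per se): a greedy hill-climber on a two-peak
# landscape ends up wherever its own dynamics take it, which need not be
# "the global optimum plus random noise".
def objective(x):
    small_peak = max(0, 10 - abs(x - 10))      # easy to reach from x = 0
    tall_peak = max(0, 50 - 2 * abs(x - 90))   # the actual optimum
    return small_peak + tall_peak

def greedy_hill_climb(x, steps=1000):
    for _ in range(steps):
        x = max((x - 1, x, x + 1), key=objective)
    return x

print(objective(greedy_hill_climb(0)))         # 10: stuck on the small peak
print(max(objective(x) for x in range(200)))   # 50: the true optimum
```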
I think you’re right, and I wasn’t taking this into account, and I don’t know how Vanessa would respond to it. Her usual stance is that we might expect all mesa-optimizers to be acausal attackers (that is, simulations / false hypotheses), since in this architecture the only way to determine actions is by determining hypotheses (and in fact, she now believes these acausal attackers can all be dealt with in one fell swoop in light of a single theoretical development). But that would seem to ignore the other complex processes going on to update these hypotheses from one time step to the next (as if the updates happened magically and instantaneously, without any further subcomputations). And we don’t even need to employ possibly imperfect heuristics for these demons to appear: I think they would also appear even if we (in the ideal, infinite-compute scenario) brute-forced the update by searching over all possible hypothesis updates and assessing each one on some metric. In a sense the two appearances of demons are equivalent, but in the latter limit they are more clearly encoded in certain hypotheses (those that game the assessment of hypotheses), while in the former their relationship to hypotheses will be less straightforward, since there will be non-trivial “hypothesis updating” code inside the AI which is not literally equivalent to the chosen hypothesis (and so parts of this code which aren’t the final chosen hypothesis could also be part of a demon).
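(A hedged sketch of the distinction being drawn here, with entirely hypothetical names: in the brute-force limit, anything that games the update has to be encoded in some hypothesis that games the scoring metric, whereas a realistic update routine contains its own code, which is not identical to any hypothesis it outputs.)

```python
# Hypothetical stand-ins, only meant to locate where a "demon" could live.

# (a) Idealized update: re-score every candidate hypothesis each step.
# Any gaming of this step has to show up as some hypothesis that games `score`.
def brute_force_update(all_hypotheses, observations, score):
    return max(all_hypotheses, key=lambda h: score(h, observations))

# (b) Realistic update: an opaque heuristic routine (itself the product of some
# search) transforms the current hypothesis directly. Problematic subcomputations
# can live inside `learned_update_rule` without corresponding to any single
# hypothesis that the AI ends up holding.
def heuristic_update(current_hypothesis, observations, learned_update_rule):
    return learned_update_rule(current_hypothesis, observations)
```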
I’m not 100% sure the existence of these demons already implies inner misalignment, since these demons will only be optimized for their continued existence (and that might be achieved by some strategy that, by sheer luck, doesn’t disrupt the AI’s outer performance too much, or at most makes the hypothesis search a bit less efficient). But I think this is just what always happens with mesa-optimizers, and the worry for inner alignment is that any one of these mesa-optimizers can be arbitrarily disruptive to outer performance (and some very disruptive strategies are very efficient for continued existence).
This might be a way in which PreDCA misses a hard bit of Alignment. More concretely, our problem is basically that the search space of possible AGI designs is too vast, and our search ability too limited. And PreDCA tries to reduce this space by considering a very concrete protocol which can be guaranteed to behave in certain ways. But maybe all (or most) of the vastness of the search space has been preserved, only now disguised as the search space over possible inner heuristics that can implement said protocol. Or put another way, whether the model implements simplifying heuristics or carries out a brute-force search, the space of possible hypothesis updates remains (approximately) as vast and problematic. Implementing heuristics approximately preserves this vastness: even if, once a heuristic is in place, the remaining search is considerably smaller, we already had to search over possible heuristics to find it.
In fact, generalizing such arguments could be a piece in an argument that “abstracted perfect Alignment”, in the sense of “a water-tight solution that aligns agents of arbitrary capability (arbitrarily close-to-perfect consequentialists) with arbitrary goals”, is unsolvable. That is, if we abstract away all the contextual contingencies that can make (even the strongest) AIs imperfect consequentialists, then (almost “by definition”) they will always outplay our schemes (because the search space is being conceptualized as unboundedly vast).

Update: Vanessa addressed this concern.
Wait, so PreDCA solves inner-misalignment by just … assuming that “we will later have an ideal learning theory with provable guarantees”?
Given the claim that “PreDCA solves inner-misalignment”, as implied by the original protocol / distillation posts, I thought it somehow overcame the core problem of demons-from-imperfect-search. But it seems like the protocol already starts from the assumption that “demons-from-imperfect-search won’t be a problem because of amazing theory”, and instead tackles a special instantiation of inner-misalignment that arises from the nature of the protocol itself (i.e. simulation hypotheses due to simplicity bias + assuming an ideal/perfect search or learning theory).
If my understanding is correct, I think the implication regarding inner-misalignment is misleading, because PreDCA is operating at a whole different level of abstraction/problem-level than most of the discourse around inner-misalignment.
I share this intuition that the solution as stated is underwhelming. But from my perspective that’s just because that key central piece is missing, and this wasn’t adequately communicated in the available public resources about PreDCA (even if it was stressed that it’s a work in progress). I guess this situation doesn’t look as worrisome to Vanessa simply because she has a clearer picture of that central piece, or good reasons to believe it will be achievable, which she hasn’t yet made public. Of course, as long as this is the case we should treat optimism with suspicion.
Also, let me note that my a priori understanding of the situation is not
let’s suppose amazing theory will solve imperfect search, and then tackle the other inner misalignment directly stemming from our protocol
but more like
given our protocol, we have good mathematical reasons to believe it will be very hard for an inner optimizer to arise without manipulating the hypothesis update. We will use amazing theory to find a concrete learning setup and prove/conjecture that said manipulation is not possible (or that its probability is low). We then hope the remaining inner optimization problems are rare/few/weak enough for other, more straightforward methods to render them highly unlikely (like having the core computing unit explicitly reason about the risk of inner optimization).