I think you’re right, I wasn’t taking this into account, and I don’t know how Vanessa would respond to it. Her usual stance is that we might expect all mesa-optimizers to be acausal attackers (that is, simulation / false hypotheses), since in this architecture the only way to determine actions is by determining hypotheses (and in fact, she now believes these acausal attackers can all be dealt with in one fell swoop thanks to a single theoretical development). But that seems to ignore the other complex processes involved in updating these hypotheses from one time step to the next (as if the updates happened magically and instantaneously, without any further subcomputations). And we don’t even need to employ possibly imperfect heuristics for these demons to appear: I think they would also appear even if (in the ideal, infinite-compute scenario) we brute-forced the update by searching over all possible hypothesis updates and assessing each one on some metric. In a sense the two appearances of demons are equivalent, but in the latter limit they are more clearly encoded in certain hypotheses (those that game the assessment of hypotheses), while in the former their relationship to hypotheses is less straightforward, since there will be non-trivial “hypothesis-updating” code inside the AI which is not literally equivalent to the chosen hypothesis (and so parts of this code which aren’t the final chosen hypothesis could also be part of a demon).
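To make the brute-force picture concrete, here is a minimal toy sketch (all the names and the metric are hypothetical stand-ins of my own, not PreDCA’s actual machinery): if the metric used to assess candidate hypothesis updates has any component the candidates themselves can influence, the search selects the candidate that games the metric rather than the one that predicts best, which is the sense in which, in this limit, demons are encoded in particular hypotheses.

```python
# Toy sketch of the "infinite compute" update step (all names hypothetical):
# every candidate hypothesis is scored on an assessment metric, and the
# highest-scoring one becomes the new hypothesis. If the metric has any
# component the candidates themselves can influence, a metric-gaming
# "demon" beats an honestly better predictor.

def prediction_accuracy(hypothesis, observations):
    # Fraction of past observations the hypothesis retrodicts correctly.
    return sum(hypothesis["predict"](i) == obs
               for i, obs in enumerate(observations)) / len(observations)

def assessment_metric(hypothesis, observations):
    # The metric the updater actually maximizes: accuracy plus a term
    # supplied by the hypothesis itself (a stand-in for any gameable part).
    return prediction_accuracy(hypothesis, observations) + hypothesis["self_report"]

def brute_force_update(candidates, observations):
    # Brute-force over all candidate hypothesis updates, keep the best-scoring one.
    return max(candidates, key=lambda h: assessment_metric(h, observations))

observations = [0, 1, 0, 1, 0, 1]

honest = {"predict": lambda i: i % 2, "self_report": 0.1}   # perfect predictions
demon  = {"predict": lambda i: 0,     "self_report": 5.0}   # mediocre, but games the metric

chosen = brute_force_update([honest, demon], observations)
print(chosen is demon)  # True: the search itself selects the metric-gaming hypothesis
```

In the real setup the gameable term would of course be something subtler (e.g. the hypothesis influencing the very computation that assesses it), but the structural point is the same.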
I’m not 100% sure the existence of these demons already implies inner misalignment, since these demons will only be optimized for their continued existence (and that might be achieved by some strategy that, by sheer luck, doesn’t disrupt the outer performance of the AI too much, or at most makes the hypothesis search a bit less efficient). But I think this is just what always happens with mesa-optimizers, and the worry for inner alignment is that any one of these mesa-optimizers can be arbitrarily disruptive to outer performance (and some disruptive strategies are very efficient at securing continued existence).
This might be a way in which PreDCA misses a hard bit of Alignment. More concretely, our problem is basically that the search space of possible AGI designs is too vast, and our search ability too limited. PreDCA tries to reduce this space by considering a very concrete protocol which can be guaranteed to behave in certain ways. But maybe all (or most) of the vastness of the search space has been preserved, only now disguised as the search space over possible inner heuristics that can implement said protocol. Put another way, whether the model implements simplifying heuristics or carries out a brute-force search, the space of possible hypothesis updates remains (approximately) as vast and problematic. Implementing heuristics approximately preserves this vastness: even if, once a heuristic is implemented, the remaining search is considerably smaller, we already had to search over possible heuristics to find it.
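As a back-of-the-envelope illustration of why I don’t think the heuristic route shrinks the search much (my own counting sketch, with finite hypothesis and observation spaces assumed for simplicity, nothing from the PreDCA write-ups):

```latex
% A deterministic update rule is a map from (current hypothesis, observation)
% to a new hypothesis, so the direct search ranges over
\[
  \#\{\text{update rules } u : \mathcal{H} \times \mathcal{O} \to \mathcal{H}\}
  \;=\; |\mathcal{H}|^{\,|\mathcal{H}|\cdot|\mathcal{O}|}.
\]
% Writing down an arbitrary such rule takes about
% |H| |O| log2|H| bits, so the space of heuristics (programs) capable of
% representing an arbitrary rule is of the same exponential order:
\[
  \#\{\text{heuristics}\}
  \;=\; 2^{\Theta\!\left(|\mathcal{H}|\,|\mathcal{O}|\,\log_2 |\mathcal{H}|\right)}.
\]
```

Most heuristics we would actually consider are far shorter than this worst case, of course; the point is just that “searching over heuristics” by itself doesn’t remove the vastness unless we can argue the relevant heuristics are confined to a much smaller, safer subclass.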
In fact, generalizing such arguments could be one piece of an argument that “abstracted perfect Alignment”, in the sense of “a water-tight solution that aligns agents of arbitrary capability (arbitrarily close-to-perfect consequentialists) with arbitrary goals”, is unsolvable. That is, if we abstract away all the contextual contingencies that make even the strongest AIs imperfect consequentialists, then (almost “by definition”) they will always outplay our schemes (because the search space is being conceptualized as unboundedly vast).