I am still pretty unconvinced that there is a corruption mechanism that wouldn’t be removed by SGD more quickly than the mesa-objective would be reverted. Are there more recent write-ups that shed more light on this?
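To make the intuition concrete, here’s the kind of toy I have in mind (all parameters are hypothetical, and it assumes for the moment that the mesa-objective weights sit inside the differentiation graph, which the next paragraph questions):

```python
# Toy sketch of my worry, not anyone's actual proposal: a corruption term
# that punishes deviation of a mesa weight w from a protected value w0,
# plus a hypothetical baseline cost k of running the corruption machinery.
import torch

w = torch.tensor(1.3, requires_grad=True)   # mesa-objective weight (deviated)
c = torch.tensor(2.0, requires_grad=True)   # corruption strength
w0 = 1.0   # value the scheme tries to protect
k = 0.1    # performance cost the machinery pays even when w == w0

loss = c * (w - w0) ** 2 + k * c
loss.backward()

print(w.grad)  # 2*c*(w - w0) = 1.2: SGD does revert w toward w0 ...
print(c.grad)  # (w - w0)**2 + k = 0.19: ... but it also keeps shrinking c.
# Once w is back at w0, w.grad vanishes while c.grad stays at k, so the
# corruption mechanism keeps getting removed -- hence my question above.
```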
Specifically, I can’t tell whether this assumes the corruption mechanism has access to a perfect model of its own weights via observation (e.g. hacking) or via the weights somehow referring to themselves. This matters because if “the mesa-objective weights” are only accessed via observation, then SGD will not compute a gradient with respect to them (they’re not part of the differentiation graph), and the only “culprit” it can blame for bad performance is the corruption mechanism itself.
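A toy version of the observation case (again my own sketch, using PyTorch’s detach as a stand-in for “access via observation”):

```python
# If the corruption mechanism reads the mesa weights only by observation,
# the observed copy is a constant to autograd: no gradient reaches w_mesa,
# and the corruption parameter takes all the blame for the bad loss.
import torch

w_mesa = torch.tensor(1.3, requires_grad=True)   # mesa-objective weight
c = torch.tensor(2.0, requires_grad=True)        # corruption strength
w0 = 1.0                                         # protected value

observed = w_mesa.detach()       # "via observation": outside the graph
loss = c * (observed - w0) ** 2  # corruption fires on the observed value
loss.backward()

print(w_mesa.grad)  # None: SGD computes no gradient wrt the mesa weights
print(c.grad)       # (observed - w0)**2 = 0.09: SGD just removes c
```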
Separately, there seems to be a chicken-and-egg problem here: there’s no compelling reason for performance-maximizing SGD to create a gradient-hacking scheme whose whole purpose is to protect a mesa-objective that fails on the data. And if the agent is sufficiently crafty that it’s considering the long-term consequences of its objective being modified, that still doesn’t explain how it used its action space to induce a specific change to its weights, one that implements a specific multi-input function. I really struggle to see this; do you have toy examples where this could occur? (The answer presumably can’t be hacking, because then we’re fucked for independent reasons.)
Note that I’m also assuming gradient hacking is distinct from “avoid states that might change your mesa-objective”, which seems much more likely to happen.
[EDIT: sorry, I need to think through this some more.]
I see, so your claim here is that gradient hacking is a convergent strategy for all agents of sufficient intelligence. That’s helpful, thanks.
I am still confused about this in the case where Alice is checking whether or not she has goal X, since by definition, if she finds she has some other goal Y != X, it is to that very goal Y’s detriment not to have children.