I bid for us to discuss a concrete example. Can you posit a training environment which matches what you’re thinking about, relative to a given network architecture [e.g. LSTM]?
And that generator would need to be such that the heuristics it generates are always optimized for achieving R, instead of pointing in some arbitrary direction — or, at least, that’s how the greedy optimization process would attempt to build it.
What is “achieving R” buying us? The agent internally represents a reward function, consults what the reward is in this scenario, and then generates heuristics to achieve that reward. Why not just not internally represent the reward function, but still contextually generate “win this game of Go” or “talk like a 4chan user”? That seems strictly more space-efficient, and it also doesn’t involve being an R-wrapper.
EDIT: The network might already have R in its world model (WM), depending on the point in training. I also don’t think “this weight setting saves space” is a slam dunk, but just wanted to point out the consideration.
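To make the contrast concrete, here’s a toy sketch of the two policy structures being compared. It is purely illustrative; none of the names below (wrapper_policy, contextual_policy, etc.) refer to anything real, and it makes no claim about how actual network internals are organized.

```python
# Hypothetical sketch: two ways the trained policy could be structured.

def wrapper_policy(observation, reward_model, planner):
    """Wrapper mind: consult an internal representation of R, then plan toward it."""
    goal = reward_model(observation)    # "what does R reward in this scenario?"
    return planner(goal, observation)   # heuristics optimized for achieving R


def contextual_policy(observation, goal_generator, planner):
    """No internal R: the goal is generated directly from context."""
    goal = goal_generator(observation)  # e.g. "win this game of Go"
    return planner(goal, observation)


# Minimal usage with stand-in callables:
if __name__ == "__main__":
    reward_model = lambda obs: f"whatever R rewards in {obs}"
    goal_generator = lambda obs: f"the natural goal of {obs}"
    planner = lambda goal, obs: f"heuristics aimed at: {goal}"

    print(wrapper_policy("a Go match", reward_model, planner))
    print(contextual_policy("a Go match", goal_generator, planner))
```

The space point is structural: the wrapper variant has to carry a representation of R on top of the planning machinery, while the contextual variant only needs the goal-generation step.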
I emphatically agree that inner misalignment and deceptive alignment would remain a thing — that the SGD would fail at perfectly aligning the heuristic-generator, and it would end up generating heuristics that point at a proxy of R.
I don’t know what to make of this. It seems to me like you’re saying “in a perfect-exploration limit only wrapper minds for the reward function are fixed under updating.” It seems like you’re saying this is relevant to SGD. But then it seems like you make the opposite claim of “inner alignment still hard.” I think it’s fine to say “here’s one effect [diversity and empirical loss minimization] which pushes towards reward wrapper minds, but I don’t think it’s the only effect, I just think we should be aware of it.” Is this a good summary of your position?
I also feel unsure whether you’re arguing primarily for a wrapper mind, or for reward-optimizers, or for both?
Can you posit a training environment which matches what you’re thinking about, relative to a given network architecture [e.g. LSTM]?
Sure, gimme a bit.
Why not just not internally represent the reward function, but still contextually generate “win this game of Go” or “talk like a 4chan user”?
What mechanism performs this contextual generation? How does this mechanism behave in off-distribution environments; what goals does it generate in them?
I think it’s fine to say “here’s one effect [diversity and empirical loss minimization] which pushes towards reward wrapper minds, but I don’t think it’s the only effect, I just think we should be aware of it.” Is this a good summary of your position?
… Yes, absolutely. I wonder if we’ve somehow still been talking past each other to an extreme degree?
E. g., I don’t think I’m arguing for a “reward-optimizer” the way you seem to think of them — I don’t think we’d get a wirehead, an agent that optimizes for getting reinforcement events.
Okay, a sketch of a concrete example: the cheese-finding agent from the Goal Misgeneralization paper. I’m not arguing that, in the limit of an ideal training process, it’d converge towards wireheading. I’m arguing that it’d converge towards cheese-finding instead of upstream correlates of cheese-finding (as it actually does in the paper).
And if the training environment is diverse/complex enough (too complex for the agent’s memory to contain all the heuristics it may need), but the reinforcement schedule is still “shaped around” some natural goal (like cheese-finding), the agent would develop a heuristics generator that would generate heuristics robustly pointed at that natural goal. (So, e. g., even if it were placed in some non-Euclidean labyrinth containing alien cheese, it’d still figure out what “cheese” is and start optimizing to get to it.)
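For concreteness, here is a minimal toy sketch of the kind of training distribution described above: episodes that vary (the cheese moves every episode), with reinforcement given only for reaching the cheese. This is a purely illustrative stand-in, not the procgen Maze environment or any code from the Goal Misgeneralization paper; every name in it is hypothetical.

```python
import random

# Toy stand-in for a diverse cheese-finding training distribution
# (illustrative only; not the paper's actual environment).

class CheeseGridworld:
    """N x N gridworld: reward comes only from reaching the cheese, whose
    location is re-randomized every episode (the 'diversity' knob)."""

    ACTIONS = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}  # right, left, down, up

    def __init__(self, size=8):
        self.size = size
        self.reset()

    def reset(self):
        self.agent = (0, 0)
        # Varying the cheese position across episodes means a proxy heuristic
        # like "go to the corner where cheese usually is" stops paying off.
        self.cheese = self.agent
        while self.cheese == self.agent:
            self.cheese = (random.randrange(self.size), random.randrange(self.size))
        return self._obs()

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        done = self.agent == self.cheese
        reward = 1.0 if done else 0.0  # reinforcement shaped around cheese-finding
        return self._obs(), reward, done

    def _obs(self):
        return {"agent": self.agent, "cheese": self.cheese}


# Random-rollout usage example; an actual setup would train a recurrent
# policy (e.g. an LSTM, per the architecture question above) with RL
# across many such episodes.
if __name__ == "__main__":
    env = CheeseGridworld()
    obs, total = env.reset(), 0.0
    for _ in range(200):
        obs, reward, done = env.step(random.randrange(4))
        total += reward
        if done:
            obs = env.reset()
    print("reward collected by a random policy:", total)
```

The claim is that enough variation of this kind pushes the learned heuristics toward “find the cheese” itself rather than its upstream correlates, and (per the non-Euclidean-labyrinth point) toward a generator that can re-derive cheese-directed heuristics in situations no fixed heuristic set covers.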