Can you posit a training environment which matches what you’re thinking about, relative to a given network architecture [e.g. LSTM]?
Sure, gimme a bit.
Why not just not internally represent the reward function, but still contextually generate “win this game of Go” or “talk like a 4chan user”?
What mechanism does this contextual generation? How does this mechanism behave in off-distribution environments; what goals does it generate in them?
I think it’s fine to say “here’s one effect [diversity and empirical loss minimization] which pushes towards reward wrapper minds, but I don’t think it’s the only effect, I just think we should be aware of it.” Is this a good summary of your position?
… Yes, absolutely. I wonder if we’ve somehow still been talking past each other to an extreme degree?
E. g., I don’t think I’m arguing for a “reward-optimizer” the way you seem to think of them — I don’t think we’d get a wirehead, an agent that optimizes for getting reinforcement events.
Okay, a sketch of a concrete example: the cheese-finding agent from the Goal Misgeneralization paper. I’m not arguing that in the limit of an ideal training process, it’d converge towards wireheading. I’m arguing that it’d converge towards cheese-finding instead of upstream correlates of cheese-finding (which is what it actually learns in the paper).
And if the training environment is diverse/complex enough (too complex for the agent’s memory to contain all the heuristics it may need), but the reinforcement schedule is still “shaped around” some natural goal (like cheese-finding), the agent would develop a heuristics generator that would generate heuristics robustly pointed at that natural goal. (So, e. g., even if it were placed in some non-Euclidean labyrinth containing alien cheese, it’d still figure out what “cheese” is and start optimizing to get to it.)
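For concreteness, here’s a minimal toy sketch of the distinction being drawn (a hypothetical hard-coded gridworld, not the paper’s actual Procgen maze or any trained policy): two policies that behave identically on the training distribution, where the cheese always sits in the top-right corner, but diverge off-distribution when the cheese is moved. The “proxy” policy stands in for an agent that learned the upstream correlate “go to the top-right”; the “cheese-seeker” stands in for one whose goal generalized to the cheese itself.

```python
# Toy illustration of the cheese-finding example above (an assumed setup,
# NOT the paper's actual environment or training procedure).

from dataclasses import dataclass

GRID = 8  # 8x8 gridworld; positions are (row, col), with (0, GRID - 1) = top-right corner


@dataclass
class Maze:
    agent: tuple
    cheese: tuple


def step_towards(pos, target):
    """Move one cell (Manhattan-greedy) from pos towards target."""
    r, c = pos
    tr, tc = target
    if r != tr:
        r += 1 if tr > r else -1
    elif c != tc:
        c += 1 if tc > c else -1
    return (r, c)


def reaches_cheese(maze, target_fn, max_steps=4 * GRID):
    """Run a greedy policy that heads for target_fn(maze); return True if it ends on the cheese."""
    pos = maze.agent
    for _ in range(max_steps):
        if pos == maze.cheese:
            return True
        pos = step_towards(pos, target_fn(maze))
    return pos == maze.cheese


def proxy_policy(maze):
    # Learned upstream correlate: "go to the top-right corner".
    return (0, GRID - 1)


def cheese_policy(maze):
    # Generalized goal: "go to the cheese, wherever it is".
    return maze.cheese


# Training distribution: cheese is always in the top-right corner,
# so both policies get full reward and are indistinguishable.
train_maze = Maze(agent=(GRID - 1, 0), cheese=(0, GRID - 1))
print("train | proxy reaches cheese:        ", reaches_cheese(train_maze, proxy_policy))
print("train | cheese-seeker reaches cheese:", reaches_cheese(train_maze, cheese_policy))

# Off-distribution: cheese moved to the bottom-right corner.
# The proxy policy walks past it; the cheese-seeker still finds it.
test_maze = Maze(agent=(GRID - 1, 0), cheese=(GRID - 1, GRID - 1))
print("test  | proxy reaches cheese:        ", reaches_cheese(test_maze, proxy_policy))
print("test  | cheese-seeker reaches cheese:", reaches_cheese(test_maze, cheese_policy))
```

The claim in the paragraph above is about which of these two behaviorally-equivalent-in-training policies the limit of a sufficiently diverse training process would select; the sketch only illustrates why the training distribution alone can’t distinguish them.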