You seem to think that we can’t do much empirical or theoretical work right now to improve our understanding of reflective processes.
We can certainly do research now that builds towards the research we eventually need to do. But if the empirical work you’re doing right now can predict when an RL agent will start taking actions to preserve its own goals, I will be surprised and even more interested than I already am.
Regarding the latter point, I think many of your claims about lock-in are stated too strongly. I’m a reflective, goal-directed agent, and I don’t think my values are “locked in”; I can and do change my behaviors and moral views in response to new information and circumstances. Maybe you think that “lock-in” involves actual self-modification, so that e.g. an aspiring vegan would reengineer their tastebuds so that meat tastes horrible, but creating shards that discourage this kind of behavior seems easy as pie. Overall, the problems involving “lock-in” don’t seem as hard to me as they do to you.
Lock-in is the process that stops the RL agent from slipping down the slope to actually maximizing the reward function as written. An example in humans would be how you avoid taking heroin specifically because you know that it would strongly stimulate the literal reward calculation of your brain.
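To make the analogy concrete, here is a minimal, purely illustrative sketch. All of the names and numbers are hypothetical, and it is not a claim about how real RL agents are built: the point is just that an agent whose action choice runs through a learned evaluation (standing in for a protective shard) need not slip toward the action that scores highest under the literal reward function.

```python
# Toy illustration (hypothetical names and numbers): the literal reward
# function rates the "wirehead" action highest, but a learned evaluation,
# standing in for a protective shard, heavily downweights it, so the agent
# never slides toward maximizing the reward function as written.

LITERAL_REWARD = {"work": 1.0, "rest": 0.5, "wirehead": 100.0}

# Stand-in for a learned adjustment that penalizes reward tampering.
SHARD_ADJUSTMENT = {"work": 0.0, "rest": 0.0, "wirehead": -1000.0}

def choose_action(actions):
    """Pick the action with the highest shard-adjusted value,
    not the highest literal reward."""
    return max(actions, key=lambda a: LITERAL_REWARD[a] + SHARD_ADJUSTMENT[a])

print(choose_action(["work", "rest", "wirehead"]))  # -> "work"
```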
You seem to be making an implied argument like “this isn’t a big problem for me, a human, so it probably happens by default in a good way in future RL agents,” and I don’t find that implied argument valid.
I think the bigger dangers (and ones we currently don’t know how to address, but might soon) are unknown unknowns and other reflectivity problems, especially those involving how individually desirable shards might interact in undesirable ways and push our agent towards bizarre and harmful behaviors.
What sort of stuff would be an example of that latter problem? If a shard-condensation process can lead to generalization that humans find undesirable when the shards are taken collectively, why should the individual shards it condenses generalize the way we want when taken individually?