Thanks as always for your consistently thoughtful comments :)
I disagree with how this post seems to optimistically ignore the possibility that the AGI might self-modify to be more coherent in a way that involves crushing / erasing a subset of its desires, and this subset might include the desires related to human flourishing.
I also feel this is an “area that warrants further research”, though I don’t view shard-coordination as being different than shard formation. If you understand how inner values form from outer reward schedules, then how those inner values interact with each other should also be steerable by the same kind of reinforcement. Though this may be exactly what you meant by “try to ensure that the AGI has a strong (meta-)preference not to do that”, so maybe the only disagreement is about the optimism vibe?
I don’t view shard-coordination as being different than shard formation
Yeah I expect that the same learning algorithm source code would give rise to both preferences and meta-preferences. (I think that’s what you’re saying there, right?)
From the perspective of sculpting AGI motivations, I think it might be trickier to directly intervene on meta-preferences than on (object-level) preferences. If the AGI is attending to something related to sensory input, you can kinda guess what it’s probably thinking about, so you at least have a chance of issuing appropriate rewards by doing obvious straightforward things. Whereas if the AGI is introspecting on its own current preferences, I suspect you need powerful interpretability techniques to even have a chance of issuing appropriate rewards. That’s not to say it’s impossible! We should keep thinking about it. It’s very much on my own mind; see e.g. my silly tweets from just last night.
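To make that asymmetry concrete, here’s a tiny toy sketch in Python. Everything in it is made up for illustration: the `Thought` class, `desired_reward`, `supervisor_reward`, and the `probe` argument are hypothetical stand-ins, not a proposal for how reward assignment would actually work. The point is just that the reward-giver can key off the external observation when the AGI’s thought is tied to sensory input, but needs some way of reading internal state when the thought is introspective.

```python
# Toy sketch only: "Thought", "desired_reward", and "supervisor_reward" are
# invented for illustration, not real components of any training setup.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Thought:
    content: str            # what the AGI is currently thinking about
    about_observation: bool  # True if the thought is grounded in current sensory input

def desired_reward(text: str) -> float:
    """What we *wish* we could reward: positive for human-flourishing-related
    thoughts, negative for thoughts about erasing those desires.
    (Crude keyword matching as a stand-in for real human judgment.)"""
    if "cared less" in text or "erase" in text:
        return -1.0
    if "human" in text:
        return +1.0
    return 0.0

def supervisor_reward(thought: Thought,
                      observation: str,
                      probe: Optional[Callable[[Thought], str]] = None) -> float:
    """The reward the overseer can actually issue, given what they can see."""
    if thought.about_observation:
        # Object-level case: the thought tracks sensory input, so looking at the
        # observation gives a decent guess about what's being thought about.
        return desired_reward(observation)
    if probe is not None:
        # Introspective case *with* an interpretability probe: we can read off
        # the meta-level thought and reward / penalize it directly.
        return desired_reward(probe(thought))
    # Introspective case *without* interpretability: the observation tells us
    # nothing about the introspection going on, so no useful signal gets issued.
    return 0.0

obs = "a human in front of me looks hurt"
t_object = Thought("the human in front of me looks hurt", about_observation=True)
t_meta = Thought("I wish I cared less about humans", about_observation=False)

print(supervisor_reward(t_object, obs))                     # 1.0  (easy: reward off the observation)
print(supervisor_reward(t_meta, obs))                       # 0.0  (blind without interpretability)
print(supervisor_reward(t_meta, obs, lambda t: t.content))  # -1.0 (needs access to internals)
```

Obviously a real setup wouldn’t be keyword-matching on strings; the sketch is just meant to show where the information bottleneck sits in the two cases.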