I don’t view shard-coordination as being different than shard formation
Yeah, I expect that the same learning algorithm source code would give rise to both preferences and meta-preferences. (I think that’s what you’re saying there, right?)
From the perspective of sculpting AGI motivations, I think it might be trickier to directly intervene on meta-preferences than on (object-level) preferences. If the AGI is attending to something related to sensory input, you can kinda guess what it’s probably thinking about, so you at least have a chance of issuing appropriate rewards by doing obvious, straightforward things. But if the AGI is introspecting on its own current preferences, I suspect you need powerful interpretability techniques to even have a chance of issuing appropriate rewards. That’s not to say it’s impossible! We should keep thinking about it. It’s very much on my own mind; see e.g. my silly tweets from just last night.
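To make that asymmetry concrete, here’s a minimal toy sketch in Python. Everything in it is a hypothetical illustration (`AgentState`, `object_level_reward`, `meta_level_reward`, and the `probe` argument are all made-up names, not real systems), and the “perfect probe” it assumes is exactly the powerful interpretability tool we don’t currently have:

```python
# Toy sketch (all names hypothetical) contrasting the two reward channels
# discussed above. Rewarding object-level preferences only needs the
# environment's observable context; rewarding meta-preferences needs a
# (hypothetical) interpretability probe into the agent's internals.

from dataclasses import dataclass, field


@dataclass
class AgentState:
    # What the agent is currently attending to, tied to sensory input.
    # Roughly observable: we can often guess it from the environment.
    attended_stimulus: str
    # The agent's internal preference weights. NOT observable without
    # interpretability tools.
    preferences: dict = field(default_factory=dict)


def object_level_reward(observed_stimulus: str) -> float:
    """Reward keyed to observable sensory context: the 'obvious,
    straightforward things' regime. We guess what the agent is thinking
    about from what's in front of it."""
    return 1.0 if observed_stimulus == "human_smiling" else 0.0


def meta_level_reward(agent: AgentState, probe) -> float:
    """Reward keyed to the agent introspecting on its own preferences.
    Requires `probe`, a stand-in for an interpretability tool that can
    decode the agent's internal preference representation."""
    decoded_prefs = probe(agent)  # this is the hard part in practice
    return 1.0 if decoded_prefs.get("honesty", 0.0) > 0.5 else 0.0


# A perfectly accurate probe is assumed here; building one is the open
# interpretability problem the text is pointing at.
def perfect_probe(agent: AgentState) -> dict:
    return agent.preferences


agent = AgentState("human_smiling", {"honesty": 0.8})
print(object_level_reward(agent.attended_stimulus))  # 1.0, cheap to compute
print(meta_level_reward(agent, perfect_probe))       # 1.0, but only because
                                                     # we assumed the probe
```

The point of the sketch is just that `object_level_reward` depends only on information the overseer can plausibly observe, while `meta_level_reward` is undefined until someone supplies a working `probe`.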