That is to say, prior to “simulators” and “shard theory”, a lot of the focus was on utility-maximizers: agents that do things like planning or search to maximize a utility function. But planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on, e.g., RL agents that enact learned policies, policies which do not explicitly maximize reward at deployment but which were reinforced because they led to reward during training.
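A minimal sketch of that contrast, in hypothetical Python (the names and signatures here are illustrative, not any particular framework's API): the first agent explicitly searches over plans with a utility function, while the second just runs a policy that reward shaped during training, with nothing at deployment time computing or maximizing reward.

```python
from typing import Callable, Sequence

State = str
Plan = Sequence[str]

def planner_agent(state: State,
                  candidate_plans: Sequence[Plan],
                  utility: Callable[[State, Plan], float]) -> Plan:
    """Utility-maximizer: explicitly searches for the plan the utility function rates highest."""
    return max(candidate_plans, key=lambda plan: utility(state, plan))

def policy_agent(state: State,
                 learned_policy: Callable[[State], str]) -> str:
    """Policy-executor: just runs the learned state-to-action mapping.
    Reward shaped this mapping during training, but nothing here computes
    or maximizes reward at deployment time."""
    return learned_policy(state)
```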
FYI I do expect planning for smart agents, just not something qualitatively alignment-similar to “argmax over crisp human-specified utility function.” (In the language of the OP, I expect values-executors, not grader-optimizers.)
I have no way of knowing that increasing the candy-shard’s value won’t cause a phase shift that substantially increases the perceived value of the “kill all humans, take their candy” action plan. I ultimately care about the agent’s “revealed preferences”, and I am not convinced that those are smooth relative to changes in the shards.
I’m not either. I think there will be phase changes wrt “shard strengths” (keeping in mind this is a leaky abstraction), and this is a key source of danger IMO.
Basically my stance is “yeah there are going to be phase changes, but there are also many perturbations which don’t induce phase changes, and I really want to understand which is which.”
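To make the “which perturbations induce phase changes” question concrete, here is a toy illustration in Python (the shard names and all numbers are made up, not drawn from any real agent): perceived plan values vary smoothly with shard strength, but the revealed preference is an argmax over plans, so it stays constant under many perturbations and then flips discontinuously once one plan's value overtakes another's.

```python
# Toy model: each "shard" assigns an appeal to each plan; the agent's perceived
# value of a plan is a strength-weighted sum of appeals, and its revealed
# preference is whichever plan currently scores highest.

def perceived_value(shard_strengths: dict, plan_appeals: dict) -> float:
    return sum(shard_strengths[s] * plan_appeals[s] for s in shard_strengths)

def revealed_preference(shard_strengths: dict, plans: dict) -> str:
    return max(plans, key=lambda name: perceived_value(shard_strengths, plans[name]))

# Hypothetical appeals: the violent plan yields slightly more candy but is
# strongly disliked by the care-for-humans shard.
plans = {
    "trade for candy":     {"candy": 1.0, "care-for-humans": 0.5},
    "take candy by force": {"candy": 1.4, "care-for-humans": -1.0},
}

for candy_strength in [1.0, 2.0, 3.0, 4.0]:
    shards = {"candy": candy_strength, "care-for-humans": 1.0}
    print(candy_strength, revealed_preference(shards, plans))
# Candy-shard strengths 1.0 through 3.0 all pick "trade for candy" (perturbations
# with no phase change); by 4.0 the argmax has flipped to the harmful plan, even
# though every perceived value changed smoothly along the way.
```

In this toy picture the discontinuity comes from the argmax over plans, not from the shard weights themselves, which is one way revealed preferences can fail to be smooth in the shards.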