Manipulation that warps an agent loses fidelity in simulating them. Certainly a superintelligence has the power to warp you in various ways, much as a supernova explosion is difficult to survive, so this not happening needs to be part of the premise. A simulator that only considers what you do for one particular contrived input fails to observe your behavior as a whole.
So we need some concepts that say what it means for an agent to not be warped (while retaining interaction with sources of external influence, rather than getting completely isolated), for something to remain the intended agent rather than some other phenomenon that is now in its place. This turns out to be a lot like the toolset useful for defining the values of an agent. Some relevant concepts are membranes, updatelessness, and coherence of volition. Membranes gesture at the inputs that are allowed to come in contact with you (or the information about you that can be used in determining which inputs are allowed), the inputs that can be part of an environment that doesn't warp you and enables further observation. This shouldn't be restricted only to inputs, since the whole deal with AI risk is that AI is not some external invasion arriving on Earth, it's something we are building ourselves, right here. So the concept of a membrane should also target acausal influences and patterns developing internally, from within the agent.
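As a very schematic sketch of just the input-filtering half of this, here's what a membrane-as-predicate could look like in Python. The `allowed` predicate and the toy inputs are invented for illustration, and this deliberately leaves out the harder part: judging which influences actually warp an agent, and the acausal/internal influences a membrane should also cover.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Input = TypeVar("Input")

def through_membrane(allowed: Callable[[Input], bool],
                     environment: Iterable[Input]) -> Iterator[Input]:
    """Yield only the inputs the membrane permits to reach the agent.

    `allowed` stands in for the genuinely hard judgment of which influences
    leave the agent unwarped; a predicate over external inputs doesn't
    capture acausal or internally-arising influences.
    """
    for observation in environment:
        if allowed(observation):
            yield observation

# The agent only ever sees inputs the membrane lets through.
safe_inputs = through_membrane(lambda s: "threat" not in s,
                               ["weather report", "credible threat", "question"])
print(list(safe_inputs))   # ['weather report', 'question']
```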
Updatelessness is about a point of view on behavior where you consider its dependence on all possible inputs, not just the actions taken given the inputs you apparently observed so far. Decisions should be informed by looking at the map of all possible situations and of the behaviors that take place there, even if the map has to be imprecise. (And not just situations that are clearly possible: the example in the post about ignoring the proof of what you'll do is about what you do in potentially counterfactual situations.)
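A toy Python illustration of the contrast, with counterfactual-mugging-style numbers and a `utility` function that are my own stand-ins rather than anything from the post: the updateless chooser picks a whole policy (a mapping from every possible input to an action) based on how it does across all branches, while the updateful chooser weighs only the branch it has already observed.

```python
from itertools import product

INPUTS = ["heads", "tails"]     # possible observations, each with probability 1/2
ACTIONS = ["pay", "refuse"]

def utility(policy: dict) -> float:
    """Expected utility of a whole policy, evaluated before any observation.

    On heads you are asked to pay 100; on tails a predictor pays out 10_000
    only if your policy *would* pay on heads. The payoff in one branch depends
    on behavior in the counterfactual branch, which is where the updateless
    view comes apart from the updateful one.
    """
    u_heads = -100 if policy["heads"] == "pay" else 0
    u_tails = 10_000 if policy["heads"] == "pay" else 0
    return 0.5 * u_heads + 0.5 * u_tails

# Updateless choice: pick the best mapping from every possible input to an action.
policies = [dict(zip(INPUTS, acts)) for acts in product(ACTIONS, repeat=len(INPUTS))]
best_policy = max(policies, key=utility)
print(best_policy, utility(best_policy))   # pays on heads; expected value 4950.0

# Updateful choice after observing heads: only that branch is weighed,
# so refusing (0) beats paying (-100), even though refusing on heads
# does worse across the whole map of possibilities.
updateful_action = max(ACTIONS, key=lambda a: -100 if a == "pay" else 0)
print(updateful_action)   # 'refuse'
```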
Coherence of volition is about the problem of path dependence in how an agent develops. There are many possible observations, many possible thoughts and decisions, and they lead to different places. The updateless point of view says that local decisions should be informed by the overall map of this tree of possibilities for reflection, so it would be nice if path dependence didn't lead to chaotic divergence, if there were coherent values to be found, even if only within smaller clusters of possible paths of reflection that settle into being sufficiently in agreement.
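Here's a deliberately crude sketch of what path dependence of reflection could look like; the `reflect` rules are invented for illustration and carry no particular meaning. The point is only that the spread of endpoints across different orderings of the same reflection steps is the thing you'd want to stay small, or at least to settle into a few coherent clusters.

```python
from itertools import permutations

def reflect(state: float, step: str) -> float:
    # Hypothetical update rules; nothing here comes from the post itself.
    if step == "averaging":         # pulls toward a fixed consideration
        return (state + 1.0) / 2
    if step == "doubling_down":     # amplifies whatever is already there
        return state * 1.5
    return state

STEPS = ["averaging", "doubling_down", "averaging"]

# Run the same reflection steps in every possible order and record where
# each path of reflection ends up.
endpoints = []
for order in permutations(STEPS):
    state = 0.0
    for step in order:
        state = reflect(state, step)
    endpoints.append(state)

spread = max(endpoints) - min(endpoints)
print(endpoints, spread)   # small spread ~ coherent values; large spread ~ chaotic divergence
```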