What is the difference between “deciding your behaviour” and “deciding upon interventions to you that will result in behaviour of its choosing”?
If showing you a formal proof that you will do a particular action doesn’t result in you doing that action, then the supposed “proof” was simply incorrect. At any rate, it is unlikely in most cases that there exists a proof such that merely presenting it to a person is sufficient to ensure that the person carries out some action.
In more formal terms: even in the trivial case where a person could be modelled as a function f(a,b,c,...) that produces actions from inputs, and there do in fact exist values of (a,b,c,...) such that f produces a chosen action A, there is no guarantee that f(a,b,c,...) = A for all values of b,c,… whenever a = “a proof that f(a,b,c,...) = A”.
It may be true that f(a,b,c,...) = A for some values of b,c,…, and if the superintelligence can arrange for those to hold, then it may indeed look like merely presenting the proof is enough to guarantee action A. But the guarantee would actually be a property of the presentation of the proof and all the other interventions together (even if the other interventions are apparently irrelevant).
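As a minimal sketch of this point (the function body, inputs, and action labels below are hypothetical, chosen only for illustration): the output of f on a “proof” input also depends on the other arguments, so presenting the proof by itself settles nothing.

```python
# Toy model (hypothetical names): a person as a pure function of its inputs.
# `a` may be a string claiming to prove what the function will do;
# `b` stands in for "all the other interventions".
def f(a: str, b: str) -> str:
    """Return an action given a purported proof `a` and other context `b`."""
    if a.startswith("proof that f(a, b) = "):
        claimed_action = a.removeprefix("proof that f(a, b) = ")
        # Whether the claim is honoured depends on the *other* input too,
        # not on the presentation of the proof alone.
        if b == "context that makes the claim stick":
            return claimed_action  # here the "proof" looks self-fulfilling
        return "ignore"            # here it is simply refuted
    return "default"

# Presenting the proof is not sufficient on its own:
assert f("proof that f(a, b) = A", "ordinary context") != "A"
# ...but together with the right accompanying interventions it can appear to be:
assert f("proof that f(a, b) = A", "context that makes the claim stick") == "A"
```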
There are many things that people believe they will be able to simply ignore, but where that belief turns out to be incorrect. Simply asserting that deciding to ignore the proof will work is not enough to make it true.
As you broaden the set of possible interventions and time spans, guarantees of future actions will hold for more people. My expectation is that at some level of intervention far short of direct brain modification or other intuitively identity-changing actions, such guarantees hold for essentially all people.
If showing you a formal proof that you will do a particular action doesn’t result in you doing that action, then the supposed “proof” was simply incorrect.
Yes, that’s the point: you can make it necessarily incorrect. Your decision to act differently determines the incorrectness of the proof, regardless of its provenance. If the proof was formally correct, your decision turns the whole possible world where this takes place counterfactual. (This is called playing chicken with the universe, or the chicken rule, a technique that’s occasionally useful for giving an agent nicer properties by not letting the formal system that generates the proofs know too much, too early, about what the agent is going to decide.)
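A toy sketch of the chicken rule, under the simplifying assumption that proof search can be modelled as a plain lookup (the agent, interface, and action names below are invented for illustration, not a real proof system): if the agent finds an early proof of which action it takes, it takes a different one, refuting any formal system that emits such a proof.

```python
from typing import Callable, Optional

ACTIONS = ["A", "B"]

def chicken_agent(find_proof: Callable[[str], Optional[str]]) -> str:
    """Return an action, refusing to conform to any early proof about itself."""
    for x in ACTIONS:
        if find_proof(f"agent() = {x}") is not None:
            # A short proof that "I take x" was found: deliberately take
            # something else, making that proof (or its world) counterfactual.
            return next(y for y in ACTIONS if y != x)
    # No early proof about our own action was found: decide on ordinary grounds.
    return "A"

# A "formal system" that claims to have proved the agent takes "A":
overconfident = lambda s: "alleged proof" if s == "agent() = A" else None
print(chicken_agent(overconfident))  # prints B: the alleged proof is refuted
```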
Manipulation that warps an agent loses fidelity in simulating that agent. Certainly a superintelligence has the power to forget you in various ways, much as a supernova explosion is difficult to survive, so this not happening needs to be part of the premise. A simulator that only considers what you do for a particular contrived input fails to observe your behavior as a whole.
So we need some concepts that say what it means for an agent not to be warped (while still interacting with sources of external influence rather than being completely isolated), for something to remain the intended agent rather than some other phenomenon that is now in its place. This turns out to be a lot like the toolset useful for defining an agent’s values. Some relevant concepts are membranes, updatelessness, and coherence of volition. Membranes gesture at which inputs are allowed to come in contact with you, or what information about you can be used in determining the inputs that are allowed, as part of an environment that doesn’t warp you and enables further observation. This shouldn’t be restricted only to inputs, since the whole deal with AI risk is that AI is not some external invasion arriving on Earth; it’s something we are building ourselves, right here. So the concept of a membrane should also target acausal influences and patterns developing internally, from within the agent.
Updatelessness is about a point of view on behavior where you consider its dependence on all possible inputs, not just the actions taken given the inputs you apparently observed so far. Decisions should be informed by looking at the map of all possible situations and of the behaviors that take place there, even if the map has to be imprecise. (And not just situations that are clearly possible: the example in the post of ignoring the proof of what you’ll do is about what you do in potentially counterfactual situations.)
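A minimal sketch of the updateless point of view, using a counterfactual-mugging-style payoff table with made-up numbers: the agent ranks whole policies, i.e. functions from every possible observation to an action, so even branches it never observes inform the choice it makes in the branch it does observe.

```python
from itertools import product

OBSERVATIONS = ["heads", "tails"]
ACTIONS = ["pay", "refuse"]

def utility(observation: str, policy: dict) -> float:
    """Payoff in one branch, given the whole policy (not just the local action)."""
    if observation == "heads":
        # In the heads branch you are asked for $100.
        return -100 if policy["heads"] == "pay" else 0
    # In the tails branch you are rewarded only if the policy *would have*
    # paid on heads, a fact about a branch you never observe from here.
    return 10_000 if policy["heads"] == "pay" else 0

def expected_utility(policy: dict) -> float:
    # Both observations are equally likely from the a-priori point of view.
    return sum(0.5 * utility(o, policy) for o in OBSERVATIONS)

# Enumerate whole policies: maps from every possible observation to an action.
policies = [dict(zip(OBSERVATIONS, acts))
            for acts in product(ACTIONS, repeat=len(OBSERVATIONS))]
best = max(policies, key=expected_utility)
print(best, expected_utility(best))
# {'heads': 'pay', 'tails': 'pay'} 4950.0: paying loses 100 in one branch but
# secures 10000 in the other, which only the a-priori view takes into account.
```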
Coherence of volition is about the problem of path dependence in how an agent develops. There are many possible observations, many possible thoughts and decisions, and they lead to different places. The updateless point of view says that local decisions should be informed by the overall map of this tree of possibilities for reflection, so it would be nice if path dependence didn’t lead to chaotic divergence, and if there were coherent values to be found, even if only within smaller clusters of possible paths of reflection that settle into sufficient agreement.