Cheers; Rebecca likes the “instrumental control incentive” terminology; she claims it’s more in line with control theory terminology.
We agree that the lack of a control incentive on X does not mean that X is safe from influence from the agent, as it may be that the agent influences X as a side effect of achieving its true objective. As you point out, this is especially true when X and a utility node are probabilistically dependent.
I think it’s more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn’t seem to do what it was intended to do.
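The milder side-effect case quoted above can be made concrete with a toy sketch. It assumes the graphical reading in which an instrumental control incentive on X requires a directed path from the decision through X to a utility node; the three-node diagram and all function names are hypothetical and chosen only for illustration. It does not attempt to model the mutual-information concern.

```python
# Hypothetical toy diagram: decision D, variable X, utility U.
# D -> U is the agent's real objective; D -> X is a pure side effect.
# There is deliberately no X -> U edge.
EDGES = {"D": ["X", "U"], "X": [], "U": []}


def descendants(graph, node):
    """Return every node reachable from `node` along directed edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for child in graph[current]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def has_instrumental_control_incentive(graph, decision, var, utility):
    """Assumed graphical check: a directed path decision -> var -> utility."""
    return var in descendants(graph, decision) and utility in descendants(graph, var)


def is_influenced(graph, decision, var):
    """True if the decision can causally affect `var` at all."""
    return var in descendants(graph, decision)


ici_on_x = has_instrumental_control_incentive(EDGES, "D", "X", "U")
x_influenced = is_influenced(EDGES, "D", "X")
print(ici_on_x, x_influenced)  # False True
```

In this diagram the check finds no instrumental control incentive on X, yet X is still downstream of the decision, so the agent's choice can change X purely as a side effect.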
We have decided to slightly update the terminology: in the latest version of our paper (accepted to AAAI, just released on arXiv) we prefer the term instrumental control incentive (ICI), to emphasize the distinction from “control as a side effect”.
For exactly the same reason, in my own recent paper Counterfactual Planning, I introduced the terms “direct incentive” and “indirect incentive”, where I frame the removal of a path to value in a planning world diagram as an action that will eliminate a direct incentive, but that may leave other indirect incentives (via other paths to value) intact. In section 6 of the paper and in this post of the sequence I develop and apply this terminology in the case of an agent emergency stop button.
In high-level descriptions of what the technique of creating indifference via path removal (or balancing terms) does, I have settled on using the terminology “suppresses the incentive” instead of “removes the incentive”.
I must admit that I have not read many control theory papers, so any insights from Rebecca about standard terminology from control theory would be welcome.
Do they have some standard phrasing where they can say things like ‘no value to control’ while subtly reminding the reader that ‘this does not imply there will be no side effects’?
Cheers; Rebecca likes the “instrumental control incentive” terminology; she claims it’s more in line with control theory terminology.
I think it’s more dangerous than that. When there is mutual information, the agent can learn to behave as if it were specifically manipulating X; the counterfactual approach doesn’t seem to do what it was intended to do.
Glad she likes the name :) True, I agree there may be some interesting subtleties lurking there.
(Sorry btw for slow reply; I keep missing alignmentforum notifications.)