I am currently working almost full-time on AI policy, but I ran across this invite to comment on the draft, so here goes.
On references:
Please add Armstrong to the author list in the reference to Soares 2015; this paper had four authors, and it was actually Armstrong who came up with indifference methods.
I see both ‘Pettigrew 2019’ and ‘Pettigrew 2020’ in the text. Is this the same reference?
More generally:
It is great that you compare the aggregating approach to two other approaches, but I feel your description of these approaches needs to be improved.
Soares et al 2015 defines corrigibility criteria (which historically is its main contribution), but the paper then describes a failed attempt to design an agent that meets them. The authors do not ‘worry that utility indifference creates incentives to manage the news’, as your footnote puts it; they positively show that their failed attempt has this problem. Armstrong et al 2017 has a correct design, as I recall, that meets the criteria from Soares 2015, but only for a particular case. ‘Safely interruptible agents’ by Orseau and Armstrong 2016 also has a correct and more general design, but it does not explicitly relate the design back to the original criteria from Soares et al, and the math is somewhat inaccessible. Holtman 2020, ‘AGI Agent Safety by Iteratively Improving the Utility Function’, has a correct design and does relate it back to the Soares et al criteria. It also shows that indifference methods can be used for repeatedly changing the reward function, which addresses one of your criticisms, namely that indifference methods are somewhat limited in this respect: that limitation is present in the math of Soares et al, but not in indifference methods more generally. Further exploration of indifference as a design method can be found in work by Everitt and others (work related to causal influence diagrams), and also in my own work (Counterfactual Planning in AGI Systems).
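To give a flavour of the basic indifference idea, here is a toy sketch in my own notation, with made-up numbers; it is not the exact construction from any of the papers above. The agent's utility gets a compensation term at the moment the shutdown button is pressed, sized so that its expected utility is the same whether or not the button gets pressed, which removes any incentive to influence the button.

    # Toy sketch of utility indifference (my own notation, made-up numbers).
    # The agent can take a 'manipulate' action that changes the probability p
    # that the shutdown button is pressed. Without compensation, the agent
    # prefers whichever branch has higher expected utility; with the
    # compensation term theta added to the shutdown branch, every value of p
    # gives the same expected utility, so manipulating the button gains nothing.

    E_normal = 10.0    # expected utility from continuing with the normal goal
    E_shutdown = 2.0   # expected utility after switching to the shutdown goal

    theta = E_normal - E_shutdown   # compensation term added on button press

    def expected_utility(p_press, compensated):
        bonus = theta if compensated else 0.0
        return p_press * (E_shutdown + bonus) + (1 - p_press) * E_normal

    for p in [0.0, 0.5, 1.0]:
        print(p,
              expected_utility(p, compensated=False),
              expected_utility(p, compensated=True))
    # Without compensation the agent wants p as low as possible;
    # with compensation every p gives expected utility 10.0.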
What you call the ‘human compatible AI’ method is commonly referred to as CIRL; ‘human compatible AI’ is a phrase best read as a moral goal, a design goal, or a call to action, not as a particular agent design. The key defining paper for the ideas in ‘the off switch game’ that you want to cite is Hadfield-Menell, Russell, Abbeel, and Dragan, ‘Cooperative Inverse Reinforcement Learning’. In that paper (I recall from memory; it may have already been in the off-switch paper too), the authors offer some of the same criticisms of their method that you describe as being offered by MIRI, e.g. in the ASX writeup you cite.
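As a side note, the core off-switch intuition can be shown in a small numerical sketch. This is a rough toy model assuming a rational human overseer, not the exact setup of the paper, and the numbers are made up: when the robot is uncertain about the human's utility for an action, deferring to the human is weakly better in expectation than acting directly or switching itself off, and that advantage disappears once the uncertainty is gone, which is closely related to the criticism mentioned above.

    # Toy sketch of the off-switch game intuition (assumes a rational human
    # who allows the action exactly when its true utility U is positive).
    # The robot holds a belief over U and compares three options:
    #   act       -> expected value E[U]
    #   shut off  -> 0
    #   defer     -> E[max(U, 0)], since the human blocks negative-U actions

    def option_values(belief):              # belief: list of (probability, U)
        act = sum(p * u for p, u in belief)
        off = 0.0
        defer = sum(p * max(u, 0.0) for p, u in belief)
        return act, off, defer

    uncertain = [(0.5, +1.0), (0.5, -1.0)]  # robot unsure if the action helps
    certain = [(1.0, +1.0)]                 # robot already knows U > 0

    print(option_values(uncertain))  # (0.0, 0.0, 0.5): deferring is strictly best
    print(option_values(certain))    # (1.0, 0.0, 1.0): deferring adds nothing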
Other remarks:
On the ‘penalize effort’ action: can you clarify in more detail how E(A), the effort metric, can be implemented?
I think that Pettigrew’s considerations, as you describe them, are somewhat similar to those in ‘Self-modification of policy and utility function in rational agents’ by Everitt et al. This paper is somewhat mathematical but might be an interesting comparative read for you; I feel it usefully charts the design space.
You may also find this overview to be an interesting read, if you want to clarify or reference definitions of corrigibility.
Thanks for taking the time to work through this carefully! I’m looking forward to reading and engaging with the articles you’ve linked to. I’ll make sure to implement the specific description-improvement suggestions in the final draft.
I wish I had more to say about the effort metric! So far, the only concrete ideas I’ve come up with are (i) measure how much compute each action uses; or (ii) decompose each action into a series of basic actions and measure the number of basic actions necessary to perform the action. But both ideas are sketchy.
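Still, to make idea (ii) slightly more concrete, here is the kind of thing I have in mind, with all names and numbers made up purely for illustration: decompose each candidate action into basic actions, use the count as E(A), and subtract a scaled penalty from the action’s raw utility.

    # Rough sketch of idea (ii): effort = number of basic actions.
    # All names and numbers are hypothetical, just to show the shape of it.

    # Hypothetical decomposition of each high-level action into basic actions.
    DECOMPOSITION = {
        "answer_question": ["look_up_fact", "compose_reply"],
        "reorganize_lab": ["move_item"] * 40 + ["update_inventory"] * 10,
    }

    RAW_UTILITY = {"answer_question": 5.0, "reorganize_lab": 6.0}

    LAMBDA = 0.1   # weight on the effort penalty

    def effort(action):
        return len(DECOMPOSITION[action])      # E(A) = count of basic actions

    def penalized_utility(action):
        return RAW_UTILITY[action] - LAMBDA * effort(action)

    for a in RAW_UTILITY:
        print(a, effort(a), penalized_utility(a))
    print("chosen:", max(RAW_UTILITY, key=penalized_utility))
    # The high-effort plan ('reorganize_lab', E=50) is penalized down to 1.0,
    # so the low-effort plan wins even though its raw utility is lower.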