Update: I started reading your paper “Corrigibility with Utility Preservation”.[1] My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining “superintelligent” as “optimal utility maximizer”.
Quick thoughts after reading less than half:
AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems. Nonetheless, it’s a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.[3] Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists). In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).
So looking at your claim that “the technical problem [is] mostly solved”, this may or may not be true for the narrow sense (like “corrigibility as a theoretical outer-objective problem in formally-specified environments”), but seems false and misleading for the broader practical sense (“knowing how to make an AGI corrigible in real life”).[4]
Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt[5]:
“In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent’s decision procedure] to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model.”
Btw, your writing is admirably concrete and clear.
Errata: Subscripts seem to be broken on page 9, which significantly hurts the readability of the equations. Also, there is a double typo “I this paper, we the running example of a toy universe” on page 4.
[2] Assuming the idea is correct.
[3] Do you have an account of why MIRI’s supposed impossibility results (I think these exist?) are false?
[4] I’m not necessarily accusing you of any error (if the contest is fixated on the utility function version), but it was misleading to me as someone who read your comment but not the contest details.
[5] Portions in [brackets] are insertions/replacements by me.
Corrigibility with Utility Preservation is not the paper I would recommend you read first; see my comments in the list I just posted.
To comment on your quick thoughts:
My later papers spell out the ML analog of the solution in ‘Corrigibility with Utility Preservation’ more clearly.
On your question “Do you have an account of why MIRI’s supposed impossibility results (I think these exist?) are false?”: given how re-tellings in the blogosphere tend to distort information into more extreme viewpoints, I am not surprised you believe these impossibility results exist, but MIRI does not have any actual mathematically proven impossibility results about corrigibility. The corrigibility paper proves that one particular approach did not work, but proves nothing about other approaches. What they do have is 2022 Yudkowsky on record expressing strongly held beliefs that corrigibility is very, very hard, and (if I recall correctly) even saying that nobody has made any progress on it in the last ten years. Not everybody on this site shares these beliefs. If you formalise corrigibility in a certain way, as demanding full 100% safety with no 99.999% allowed, it is trivial to prove that a corrigible AI formalised that way can never provably exist, because the humans who will have to build, train, and prove it are fallible. Roman Yampolskiy has done some writing about this, but I do not believe that this kind of reasoning is at the core of Yudkowsky’s arguments for pessimism.
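To make that last step concrete, here is a minimal sketch of the “no 99.999% allowed” argument. The notation and the error rate $\epsilon$ are invented here purely for illustration; this is a reading of the Yampolskiy-style point, not a result taken from MIRI’s paper, from Yampolskiy’s writing, or from the papers under discussion.

```latex
% Hedged sketch; \epsilon is an assumed, unnamed error rate, not a quantity
% taken from any of the papers mentioned in this thread.
Let $\epsilon > 0$ be the probability that the human-run assurance process
(writing the specification, building and training the agent, checking the
proof) fails without anyone noticing. Then the strongest claim that process
can certify is
\[
  \Pr[\text{agent is corrigible}] \;\ge\; 1 - \epsilon ,
\]
which is strictly weaker than the $\Pr[\text{agent is corrigible}] = 1$ that a
``full 100\% safety, no 99.999\% allowed'' definition demands. Under such a
definition a provably corrigible AI cannot exist; a definition that tolerates
some $\epsilon > 0$ is untouched by this argument.
```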
On being misleadingly optimistic in my statement that the technical problems are mostly solved: as long as we do not have an actual AGI in real life, we can only ever speculate about how difficult it will be to make it corrigible in real life. This speculation can lead to optimistic or pessimistic conclusions. Late-stage Yudkowsky is of course well-known for speculating that everybody who shows some optimism about alignment is wrong and even dangerous, but I stand by my optimism. Partly this is because I am much more optimistic than Yudkowsky that competent future regulation of AGI-level AI will succeed in banning certain dangerous AGI architectures outright.
I no longer fully support my 2019 statement that ‘Part of this conclusion [of Soares et al. failing to solve corrigibility] is due to the use of a Platonic agent model’. Nowadays, I would say that the Soares et al. design did not succeed in its aim because it used a conditional probability to calculate what should have been calculated with a Pearl counterfactual. The Platonic agent model did not figure strongly into it.
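For readers who have not met the distinction between conditioning and intervening, below is a minimal sketch in Python using a toy structural causal model invented purely for illustration (it is not the construction from Soares et al. or from the counterfactual planning papers; all variables and probabilities are made up). Conditioning on the shutdown button being pressed also updates beliefs about the button’s causes, while Pearl’s do() intervention forces the button and leaves those causes alone.

```python
# Hedged toy model (invented for illustration; not from any of the papers
# discussed): the gap between conditioning on the shutdown button press
# and intervening on it with Pearl's do() operator.
import random

random.seed(0)

def sample_world(do_button=None):
    """One draw from a toy structural causal model.

    H: the overseers are worried about the agent (exogenous).
    B: shutdown button press; caused by H unless forced via do(B=...).
    U: payoff of the agent's current plan; reduced by being shut down
       (direct B -> U edge) and by worried overseers restricting the
       agent's resources (direct H -> U edge).
    """
    h = random.random() < 0.3                      # P(overseers worried) = 0.3
    b = (random.random() < (0.9 if h else 0.05)) if do_button is None else do_button
    u = 1.0 - (0.5 if b else 0.0) - (0.3 if h else 0.0)
    return h, b, u

N = 200_000

# E[U | B = 1]: average over worlds where the button happened to be pressed.
# Conditioning also shifts beliefs about the upstream cause H, so the
# estimate is dragged down through the H -> U edge as well.
pressed = [u for _ in range(N) for h, b, u in [sample_world()] if b]
print("E[U |    B=1 ] ~", round(sum(pressed) / len(pressed), 3))  # ~0.23

# E[U | do(B = 1)]: force the button in every world, leaving the distribution
# of H untouched; only the direct B -> U edge lowers the payoff.
forced = [sample_world(do_button=True)[2] for _ in range(N)]
print("E[U | do(B=1)] ~", round(sum(forced) / N, 3))              # ~0.41
```

Run as written, the two printed estimates differ (roughly 0.23 versus 0.41), which is the gap between what a conditional-probability planner and a counterfactual planner would compute in this toy setup.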
ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation