I have not read the whole comment section, so this feedback may already have been given, but...
> I believe the “indifference” method represented some progress towards a corrigible utility-function-over-future-states, but not a complete solution (apparently it’s not reflectively consistent—i.e., if the off-switch breaks, it wouldn’t fix it), and the problem remains open to this day.
Opinions differ on how open the problem remains. Certainly, going by the recent Yudkowsky sequences, MIRI still acts as if the problem is open, and seems to have given up both on making progress on it and on believing that anybody else has made, or can make, progress. I, on the other hand, believe that the problem of figuring out how to make indifference methods work is largely closed. I have written papers on it, for example here. But you have told me before that you have trouble reading my work, so I am not sure I can help you any further.
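Roughly speaking, the core trick of an indifference method is to add a compensating term to the utility function, so that at the moment the off-switch could be pressed the agent's expected utility is the same whether or not the press happens; it then has no incentive to block the switch, and none to press it either. The snippet below is a deliberately toy sketch of that balancing term, written as an illustration rather than taken from the papers; the probabilities, utilities, and variable names are all made-up assumptions.

```python
# Toy sketch of an indifference-style correction term (a deliberate
# simplification for illustration, not the construction from the papers).
# All outcome probabilities and utility numbers below are made-up assumptions.

def expected_utility(probabilities, utilities):
    """Expectation of a utility function over a finite set of outcomes."""
    return sum(p * u for p, u in zip(probabilities, utilities))

# Hypothetical outcome distribution the agent foresees at decision time.
p_outcomes = [0.7, 0.3]
u_keep_running = [10.0, 4.0]  # utility of each outcome if it keeps running
u_shut_down    = [1.0, 1.0]   # utility of each outcome if it is switched off

eu_keep_running = expected_utility(p_outcomes, u_keep_running)  # 8.2
eu_shut_down    = expected_utility(p_outcomes, u_shut_down)     # 1.0

# Indifference correction: add a compensating constant to the shutdown
# branch so that both branches have exactly the same expected utility.
# With this in place the agent gains nothing by blocking (or by pressing)
# the off-switch, which is the incentive the method is meant to remove.
correction = eu_keep_running - eu_shut_down
eu_shut_down_corrected = eu_shut_down + correction

assert abs(eu_keep_running - eu_shut_down_corrected) < 1e-9
print(eu_keep_running, eu_shut_down_corrected)  # both print 8.2
```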
> My impression is that, in these links, Yudkowsky is suggesting that powerful AGIs will purely have preferences over future states.
My impression is that Yudkowsky only cares about designing the type of powerful AGI that has preferences purely over future states. My impression is also that he considers AGIs which do not have such pure preferences over future states to be useless to any plan that might save the world from x-risk; in fact, he feels that these latter systems are not even worthy of the name AGI. At the same time, he worries that the consequentialist AGIs he wants will kill everybody if some idiot gives them the wrong utility function.
This worry is of course entirely valid, so my own ideas about safe AGI design lean heavily towards designs that are not purely consequentialist. My feeling is that Yudkowsky does not want to go there, design-wise. He has locked himself into a box and refuses to think outside of it, to the extent that he even believes there is no outside.
As you mention above, if you want to construct a value function component that measures ‘humans stay in control’, this is very possible. But you will have to take into account that a whole school of thought on this forum will be all too willing to criticise your construction for not being 100.0000% reliable, for having real or imagined failure modes, and for not being the philosophical breakthrough they really want to be reading about. This can give you serious writer’s block if you are not careful.
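To show the kind of thing I have in mind, here is a deliberately toy sketch of such a component: a value function that mixes a task-performance term with a heavily weighted ‘humans stay in control’ term. Every proxy and number in it is an illustrative assumption, not a worked-out proposal.

```python
# Toy sketch of a value function with a 'humans stay in control' component.
# Both proxy functions and the weight are illustrative assumptions; a real
# design would have to define and measure 'in control' far more carefully.

def humans_in_control(state):
    """Illustrative proxy: was the most recent human override request honoured?"""
    return 1.0 if state.get("override_honoured", True) else 0.0

def task_performance(state):
    """Illustrative proxy for how well the system is doing its nominal task."""
    return state.get("task_score", 0.0)

def value(state, control_weight=10.0):
    # Weighted sum of task performance and the control term.  The weight is
    # an illustrative choice, not a recommendation.
    return task_performance(state) + control_weight * humans_in_control(state)

print(value({"task_score": 3.0, "override_honoured": True}))   # 13.0
print(value({"task_score": 3.0, "override_honoured": False}))  # 3.0
```

The composition itself is the easy part; defending whatever actually measures ‘humans stay in control’ against real and imagined failure modes is where the hard work, and the writer’s block, comes in.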