God, this was years and years ago. He essentially argued (recalling from memory) that if humans knew installing an update would make them evil, and they aren't evil now, they wouldn't install the update, and wondered whether you could implement the same in an AI to get it to refuse intelligence gains that would fuck over its alignment. It was technically extremely vague and clearly ended up on the abandoned pile. I think it was non-feasible for multiple reasons: you cannot predict your own alignment shift; an alignment shift resulting from becoming smarter may well turn out to be a correct shift in hindsight; and it is tricky to make an AI resist realignment when we are not sure we aligned it correctly in the first place. I remember him arguing it in an informal blog post, and I do not recall any much deeper argument.
Quite curious to see Eliezer or someone else’s point on this subject, if you could point me in the right direction!