In Part 3 of this series, I plan to write a shallow survey of 8 problems relating to AI alignment, and the relationship of the «boundary» concept to formalizing them. To save time, I’d like to do a deep dive into just one of the eight problems, based on what commenters here would find most interesting. If you have a moment, please use the “agree” button (and where desired, “disagree”) to vote for which of the eight topics I should go into depth about. Each topic is given as a subcomment below (not looking for karma, just agree/disagree votes). Thanks!
1. AI boxing / containment — the method and challenge of confining an AI system to a “box”, i.e., preventing the system from interacting with the external world except through specific restricted output channels (Bostrom, 2014, p.129).
2. Corrigibility — the problem of constructing a mind that will cooperate with what its creators regard as a corrective intervention (Soares et al., 2015).
3. Mild optimization — the problem of designing AI systems and objective functions that, in an intuitive sense, don’t optimize more than they have to (Taylor et al., 2016).
4. Impact regularization — the problem of formalizing “change to the environment” in a way that can be effectively used as a regularizer penalizing negative side effects from AI systems (Amodei et al., 2016); a sketch of the general shape of such an objective appears after this list.
5. Counterfactuals in decision theory — the problem of defining what would have happened if an AI system had made a different choice, such as in the Twin Prisoner’s Dilemma (Yudkowsky & Soares, 2017).
6. Mesa-optimizers — instances of learned models that are themselves optimizers, which give rise to the so-called inner alignment problem (Hubinger et al., 2019).
7. Preference plasticity — the possibility that human preferences change over time, and the challenge of defining alignment with respect to time-varying preferences (Russell, 2019, p.263).
8. (Unscoped) Consequentialism — the problem that an AI system engaging in consequentialist reasoning, for many objectives, is at odds with corrigibility and containment (Yudkowsky, 2022, no. 23).
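To make item 4 slightly more concrete, here is a minimal sketch of the general shape an impact-regularized objective tends to take. The symbols below ($r$ for the task reward, $d$ for a deviation measure, $\tilde{s}_t$ for the state under some baseline such as inaction, and $\lambda$ for the penalty weight) are illustrative assumptions on my part, not notation from Amodei et al., 2016:

$$r_{\text{penalized}}(s_t, a_t) \;=\; r(s_t, a_t) \;-\; \lambda \, d\!\left(s_t, \tilde{s}_t\right)$$

The open problem is precisely how to choose $d$ and the baseline $\tilde{s}_t$ so that the penalty tracks “negative side effects” rather than penalizing any change to the environment at all.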