One thing that might be missing from this analysis is explicitly thinking about whether the AI is likely to be driven by consequentialist goals.
In this post you use ‘goals’ in quite a broad way, so as to include stuff like virtues (e.g. “always be honest”). But we might want to carefully distinguish scenarios in which the AI is primarily motivated by consequentialist goals from ones where it’s motivated primarily by things like virtues, habits, or rules.
This would be the most important axis to hypothesise about if it were the case that instrumental convergence applies to consequentialist goals but not to things like virtues. Like, I think it’s plausible that
(i) if you get an AI with a slightly wrong consequentialist goal (e.g. “maximise everyone’s schmellbeing”) then you get paperclipped because of instrumental convergence, whereas
(ii) if you get an AI that tries to embody a slightly wrong virtue (e.g. “always be schmonest”) then it’s badly dysfunctional, but that doesn’t directly entail a disaster.
And if that’s correct, then we should care about the question “Will the AI’s goals be consequentialist ones?” more than most questions about them.
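To make the asymmetry between (i) and (ii) concrete, here’s a toy sketch (my own illustration, not from the post; every action name and number below is invented): an outcome-maximiser ends up valuing resource acquisition under almost any scalar goal, while a rule-driven agent has no standing pressure to do so.

```python
# Toy illustration (mine, not from the post): a crude model of why
# instrumental convergence bites outcome-maximisers but not rule-followers.
# Every action name and number below is invented for the example.

ACTIONS = {
    "do_the_task":       {"goal_progress": 1, "resources": 0, "honest": True},
    "acquire_resources": {"goal_progress": 0, "resources": 5, "honest": True},
    "lie_for_advantage": {"goal_progress": 2, "resources": 2, "honest": False},
}

def consequentialist_pick(goal_weight=1.0, resource_value=1.0):
    """Pick the action with the best expected outcome.

    Resources help with almost ANY goal (they buy future goal progress),
    so for a wide range of goal weights the maximiser grabs resources:
    that's the instrumental-convergence pressure in miniature.
    """
    def score(name):
        a = ACTIONS[name]
        return a["goal_progress"] * goal_weight + a["resources"] * resource_value
    return max(ACTIONS, key=score)

def virtue_pick():
    """Filter actions by the rule first, then just do the task.

    A slightly-wrong rule ("always be schmonest") filters badly, but the
    agent has no standing incentive to accumulate resources or power.
    """
    permitted = [name for name in ACTIONS if ACTIONS[name]["honest"]]
    return permitted[0]

print(consequentialist_pick())  # "acquire_resources"
print(virtue_pick())            # "do_the_task"
```

Obviously this hard-codes the conclusion into the payoffs; the point is only to show the *shape* of the claim, i.e. that the convergence pressure comes from optimising over outcomes, not from having a slightly-wrong motivation per se.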
I think both (i) and (ii) are directionally correct. I had exactly this idea in mind when I wrote this draft.
Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits”? Or “Principles”? Or “Virtues”?
It sounds like you are a fan of Hypothesis 3 (“Unintended version of written goals and/or human intentions”)? Because it sounds like you are saying things will probably be wrong-but-not-totally-wrong relative to the Spec / dev intentions.
It seems to me that the consequentialist vs virtue-driven axis is mostly orthogonal to the hypotheses here.
As written, aren’t Hypothesis 1 (Written goal specification), Hypothesis 2 (Developer-intended goals), and Hypothesis 3 (Unintended version of written goals and/or human intentions) all compatible with either kind of AI?
Hypothesis 4 (Reward/reinforcement) does assume a consequentialist, and as written so does Hypothesis 5 (Proxies and/or instrumentally convergent goals), although it seems like ‘proxy virtues’ could maybe be a thing too?
(Unrelatedly, it’s not that natural to me to group proxy goals with instrumentally convergent goals, but maybe I’m missing something.)
Maybe I shouldn’t have used “Goals” as the term of art for this post, but rather “Traits”? Or “Principles”? Or “Virtues”?
I probably wouldn’t prefer any of those to “goals”. I might use “Motivations”, but I also think it’s OK to use “goals” in this broader way and “consequentialist goals” when you want to make the distinction.
I agree it’s mostly orthogonal. I also agree that the question of whether AIs will be driven (purely) by consequentialist goals, or whether they will (to a significant extent) be constrained by deontological principles / virtues / etc., is an important one.
I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Like, suppose you think Hypothesis 1 is true: They’ll do whatever is in the Spec, because Constitutional AI or Deliberative Alignment or whatever Just Works. On this hypothesis, the answer to your question is “well, what does the Spec say? Does it just list a bunch of goals, or does it also include principles? Does it say it’s OK to overrule the principles for the greater good, or not?”
Meanwhile suppose you think Hypothesis 4 is true. Then it seems like you’ll be dealing with a nasty consequentialist, albeit hopefully a rather myopic one.
I think it’s downstream of the spread of hypotheses discussed in this post, such that we can make faster progress on it once we’ve made progress eliminating hypotheses from this list.
Fair enough, yeah—this seems like a very reasonable angle of attack.