I was thinking of a company's incentive structure (to focus on one example) as an affordance for aligning the company with a particular goal. If you set the incentive structure up right, you don't have to keep track of everything that everyone within the company does; you can just trust (if you've done it well) that the net effect of all those actions will optimize something you want it to optimize, much like steering an AI via its goals or steering a market via taxes and regulations.
But I think you are actually pointing to something very important: alignment generally requires clarity, and clarity generally increases capabilities. The same dynamic is present in AI development: if we gained the insight necessary to build a very clear consequentialist AI that we knew how to align, that same clarity would simultaneously increase capabilities.
Interested in your thoughts.
Gotcha. I definitely agree with what you’re saying about the effectiveness of incentive structures. And to be clear, I also agree that some of the affordances in the quote reasonably fall under “alignment”: e.g., if you explicitly set a specific mission statement, that’s a good tactic for aligning your organization around that specific mission statement.
But some of the other affordances aren't as clearly goal-dependent. For example, iterating quickly is an instrumentally effective strategy across a pretty broad set of goals a company might have, which (in my view) makes it closer to a capability technique than to an alignment technique. That is, you could imagine a scenario where I succeeded in building a company that iterated quickly but failed to also align it around the mission statement I wanted it to have. In that scenario, my company was capable, but it wasn't aligned with the goal I wanted.
Of course, this is a spectrum. Even setting a specific mission statement is an instrumentally effective strategy across all the goals that are plausible interpretations of that mission statement, and most real mission statements don't admit a unique interpretation. So you could also argue that setting a mission statement increases the company's capability to accomplish goals that are consistent with any interpretation of it. But as a heuristic, I tend to think of a capability as something that lowers the cost to the system of accomplishing any goal (averaged across the system's goal-space with a reasonable prior), whereas I tend to think of alignment as something that increases the relative cost to the system of accomplishing classes of goals that the operator doesn't want.
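To put that heuristic in symbols (a rough sketch; the notation here is mine, not anything standard): let $G$ be the system's goal-space with a reasonable prior $p$, let $C(g)$ be the cost to the system of accomplishing goal $g$, and let $G_{\text{bad}} \subseteq G$ be the classes of goals the operator doesn't want. Then, roughly:

$$\text{capability-increasing: } \Delta\,\mathbb{E}_{g \sim p}[C(g)] < 0, \qquad \text{alignment-increasing: } \Delta\Big(\mathbb{E}_{p}[C(g) \mid g \in G_{\text{bad}}] - \mathbb{E}_{p}[C(g) \mid g \notin G_{\text{bad}}]\Big) > 0.$$

On this accounting, the mission-statement example sits in between: it lowers expected cost over the goals that are interpretations of the statement while raising the relative cost of the goals outside them.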
I’d be interested to hear whether you have a different mental model of the difference, and if so, what it is. It’s definitely possible I’ve missed something here, since I’m really just describing an intuition.
Yes, I think what you're saying is that there is (1) the set of all possible outcomes, (2) within that, the set of outcomes where the company succeeds with respect to some goal or other, and (3) within that, the set of outcomes where the company succeeds with respect to the operator's goal. The capability-increasing interventions, then, are things that concentrate probability mass onto (2), whereas the alignment-increasing interventions are things that concentrate probability mass onto (3). This is a very interesting way to put it, and I think it explains why there is a spectrum from alignment to capabilities.
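One loose way to write that down (my notation, just to check I'm tracking you): let $\Omega$ be the set of all possible outcomes, $S_{\text{some}} \subseteq \Omega$ the outcomes where the company succeeds at some goal or other, and $S_{\text{op}} \subseteq S_{\text{some}}$ the outcomes where it succeeds at the operator's goal. Then:

$$\text{capability-increasing: } \Delta\,P(\omega \in S_{\text{some}}) > 0, \qquad \text{alignment-increasing: } \Delta\,P(\omega \in S_{\text{op}} \mid \omega \in S_{\text{some}}) > 0.$$

The spectrum then comes from the fact that a single intervention (the mission statement, say) can move both quantities at once.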
Very roughly, (1) corresponds to any system whatsoever, (2) corresponds to a system that is generally powerful, and (3) corresponds to a system that is powerful and aligned. We are not so worried about non-powerful unaligned systems, and we are not worried at all about powerful aligned systems. We are worried about the awkward in-between region, inside (2) but outside (3): powerful unaligned systems.
Yep, I’d say I intuitively agree with all of that, though I’d add that if you want to specify the set of “outcomes” differently from the set of “goals”, then that must mean you’re implicitly defining a mapping from outcomes to goals. One analogy could be that an outcome is like a thermodynamic microstate (in the sense that it’s a complete description of all the features of the universe) while a goal is like a thermodynamic macrostate (in the sense that it’s a complete description of the features of the universe that the system can perceive).
This mapping from outcomes to goals won't be injective for any real embedded system. But in the unrealistic limit where your system is so capable that it has a "perfect ontology" (i.e., its perception apparatus can resolve every outcome/microstate from every other), the mapping converges to the identity function, and the system's set of possible goals converges to its set of possible outcomes. (This is the dualistic case, e.g., AIXI and the like. But plausibly, we should also expect a self-improving system to improve its own perception apparatus, so that its effective goal-set becomes finer with each improvement cycle. So even this partition over goals can't be treated as constant in the general case.)
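In symbols (again, my own rough notation): the perception apparatus induces a map from outcomes to goals,

$$\phi : \Omega \to \mathcal{G}, \qquad \text{goal } g \;\leftrightarrow\; \text{cell } \phi^{-1}(g) \subseteq \Omega,$$

where each cell collects the outcomes the system can't tell apart. For a real embedded system many cells contain more than one outcome, so $\phi$ isn't injective; each improvement to the perception apparatus refines the partition $\{\phi^{-1}(g)\}$; and in the perfect-ontology limit every cell is a singleton, so $\phi$ becomes a bijection and the goal-set effectively coincides with the outcome-set.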
Ah, so I think what you're saying is that for a given outcome, we can ask whether there is a goal we can give to the system such that it steers towards that outcome. Then, as a system becomes more powerful, the range of outcomes that it can steer towards expands. That seems very reasonable to me, though the question that strikes me as most interesting is: what can be said about the internal structure of physical objects that have power in this sense?
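(To spell out what I mean by "power in this sense", in rough notation of my own:

$$\mathrm{Reach}(S) = \{\omega \in \Omega : \exists\, g \text{ such that } S, \text{ when given goal } g, \text{ reliably steers toward } \omega\},$$

so becoming more powerful means $\mathrm{Reach}(S)$ expanding, and the question is what the internal structure of physical systems with large $\mathrm{Reach}(S)$ has in common.)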
Thank you.