One of the core problems of AI alignment is that we don’t know how to reliably get goals into the AI—there are many possible goals that are sufficiently correlated with doing well on training data that the AI could wind up optimising for a whole bunch of different things.
Instrumental convergence claims that a wide variety of goals will lead to convergent subgoals such that the agent will end up wanting to seek power, acquire resources, avoid death, etc.
These claims do seem a bit...contradictory. If goals are really that inscrutable, why do we strongly expect instrumental convergence? Why won’t we get some weird thing that happens to correlate with “don’t die, keep your options open” on the training data, but falls apart out of distribution?