there’s no analogously-strong attractor well pulling the AGI’s objectives towards your preferred objectives
I’m starting to doubt that there are strategically important human-specific objectives in the decision theory sense, things that should be used to actually optimize everything without goodharting making it counterproductive. On this hypothesis, optimization goals are not just hard to figure out; there is almost nothing there that’s human-specific, and human preference is generic. The orthogonality thesis applies to agents with goals, but maybe it doesn’t apply to humans, because their goals play a different role from what the orthogonality thesis needs. To solve astronomical waste, humans could run their civilization on a better substrate and look for goal-shaped principles that can be propagated more efficiently than civilization itself and used for optimization, but these principles are not going to be human-specific.
If an AGI is in a similar situation (doesn’t have non-goodharting goals), it’s going to be in the same attractor as humans, motivated to build a generic civilization. Such a civilization doesn’t necessarily involve humans or human-specific things, but I’m not sure this is different from a specific human deciding that their civilization doesn’t involve other humans, which a priori seems like an unjustifiably privileged aspect of civilizational design.
In other words, the applicability of the orthogonality thesis might fail if the kind of goal knowledge relevant to it (the kind that overcomes goodharting) tends to get obtained in convergent ways that give results that are generic, not human-specific. The disagreement of this hypothesis with the standard position is about the distinction and interaction between non-goodharting and goodharting goals. On the generic goals hypothesis, most accidental goals, including those currently held by humans, are goodharting goals (according to their role in the minds of people, not their content), things that shouldn’t be used to strongly optimize the world in anything close to their current form. And the proper way of getting appropriate non-goodharting goals (running a civilization/long reflection/CEV) somehow doesn’t importantly depend on the currently held goodharting goals (this seems to be the crux).
So on this view, the dangerous AGIs are those that hold any goals in a non-goodharting role, ready to optimize the world according to them, while by default AGIs with vaguer architectures are going to face the same problem of formulating non-goodharting goals as humans do, and work within the same attractor for arriving at its solution. A better but not drastically different outcome is AGIs that hold human goodharting goals in a goodharting role and no goals in a non-goodharting role, so that they are even more likely to go about formulating non-goodharting goals in the same way humans would, without being opposed to actual humans in the process. Language models might help with this part.
(We can use these terms to restate the familiar catastrophic failure of alignment, where an AGI is aligned to hold human goodharting goals (what we currently care about) in a non-goodharting role (as a target for optimizing the world with) without giving civilization the opportunity to improve on those goals and without limiting their applicability to situations comprehensible to current humans. This is a failure of non-goodharting (optimizing too much on goodharting goals), resulting in a failure of corrigibility (preventing the non-goodharting extrapolated volition from eventually being in charge).)