“AI-powered memetic warfare makes all humans effectively insane” is a catastrophe that I listed in an earlier comment, and it seems one of the hardest to formally specify. It seems values-complete or metaphilosophy-complete to me: without having specified human values or having solved metaphilosophy, how can we check whether an AI-generated argument is trying to convince us of something that is wrong according to actual human values, or wrong according to normative philosophical reasoning?
I don’t see anything in this post or the linked OAA post that addresses or tries to bypass this difficulty?
OAA bypasses the accident version of this by only accepting arguments from a superintelligence that have the form “here is why my proposed top-level plan—in the form of a much smaller policy network—is a controller that, when combined with the cyberphysical model of an Earth-like situation, satisfies your pLTL spec.” There is nothing normative in such an argument; the normative arguments all take place before/while drafting the spec, which should be done with AI assistants that are not smarter-than-human (CoEm style).
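For concreteness, a spec of that shape might look something like the following schematic pLTL formula (my own notation and predicate names, not necessarily the exact formalism of the OAA writeup):

$$\Pr_{\pi \parallel M}\big[\, \mathbf{G}\,\neg\mathrm{Catastrophe} \;\wedge\; \mathbf{F}_{\le T}\,\mathrm{GoalAchieved} \,\big] \;\ge\; 1 - \epsilon$$

Here $\pi$ is the proposed top-level policy, $M$ is the cyberphysical model of the Earth-like situation, $\mathbf{G}$ (“always”) and $\mathbf{F}_{\le T}$ (“within the time bound $T$”) are temporal operators, and $\epsilon$ is the tolerated residual probability. The only argument the boxed superintelligence is allowed to make is a certificate that this inequality holds; whether Catastrophe and GoalAchieved capture what we actually care about is settled earlier, while drafting the spec with non-superhuman assistants.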
There is still a misuse version: someone could remove the provision in 5.1.5 that the model of Earth-like situations should be largely agnostic about human behavior, and instead build a detailed model of how human nervous systems respond to language. (Then, even though the superintelligence in the box would still be making only descriptive arguments about a policy, the policy that comes out would likely emit normative arguments at deployment time.) Superintelligence misuse is covered under problem 11.
If it’s not misuse, the provisions in 5.1.4-5 will steer the search process away from policies that attempt to propagandize to humans.
Ok I’ll quote 5.1.4-5 to make it easier for others to follow this discussion:
5.1.4. It may be that the easiest plan to find involves an unacceptable degree of power-seeking and control over irrelevant variables. Therefore, the score function should penalize divergence of the trajectory of the world state from the trajectory of the status quo (in which no powerful AI systems take any actions).
5.1.5. The incentives under 5.1.4 by default are to take control over irrelevant variables so as to ensure that they proceed as in the anticipated “status quo”. Infrabayesian uncertainty about the dynamics is the final component that removes this incentive. In particular, the infrabayesian prior can (and should) have a high degree of Knightian uncertainty about human decisions and behaviour. This makes the most effective way to limit the maximum divergence (of human trajectories from the status quo) actually not interfering.
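To make the intended mechanism concrete, here is a toy sketch of how I read 5.1.4 and 5.1.5 together (all names, the dynamics, and the divergence measure are my own illustrative choices, not anything specified in the OAA writeup): a candidate policy is scored on its task performance minus a penalty on divergence from the no-AI status quo trajectory, and the penalty is evaluated in the worst case over many possible human-behaviour models, so a policy that tries to pin human behaviour down pays for it under at least one of those models.

```python
# Toy illustration of 5.1.4-5 (hypothetical construction, not davidad's actual
# formalism). World states are small vectors; coordinate 0 is the "task"
# variable and coordinates 1-2 are driven by hard-to-predict human behaviour.
import numpy as np

def rollout(policy, human_model, horizon=20, dim=4):
    """Simulate a trajectory of toy world states under a policy and one
    particular model of human behaviour."""
    state = np.zeros(dim)
    traj = [state.copy()]
    for t in range(horizon):
        state = state + policy(state) + human_model(state, t)
        traj.append(state.copy())
    return traj

def divergence(traj_a, traj_b):
    """Distance between two world-state trajectories; how to define this for
    the real system is exactly the open question raised below."""
    return sum(np.linalg.norm(a - b) for a, b in zip(traj_a, traj_b))

def task_score(traj):
    """Stand-in for 'the spec is satisfied': push coordinate 0 up."""
    return traj[-1][0]

def score(policy, human_models, lam=1.0):
    """5.1.4: penalize divergence from the status quo (no AI actions).
    5.1.5: take the penalty in the worst case over a set of human-behaviour
    models, a crude stand-in for Knightian/infrabayesian uncertainty."""
    null_policy = lambda s: np.zeros_like(s)
    worst_task = min(task_score(rollout(policy, h)) for h in human_models)
    worst_div = max(divergence(rollout(policy, h), rollout(null_policy, h))
                    for h in human_models)
    return worst_task - lam * worst_div

# Humans are modelled only loosely (several incompatible behaviour models), so a
# policy that meddles with the human-driven coordinates to "stabilize" them
# incurs a larger worst-case divergence penalty than one that leaves them alone.
human_models = [lambda s, t, k=k: np.array([0.0, np.sin(k * t), np.cos(k * t), 0.0])
                for k in range(1, 6)]
hands_off = lambda s: np.array([0.5, 0.0, 0.0, 0.0])
meddling = lambda s: np.array([0.5, -s[1], -s[2], 0.0])
print(score(hands_off, human_models), score(meddling, human_models))
```

Under this toy scoring, the hands-off policy achieves the same task score but a smaller worst-case divergence penalty, so the search prefers it; whether anything like this scales to a realistic model of Earth is of course the substance of the questions below.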
I’m not sure how these are intended to work. How do you intend to define/implement “divergence”? How does that definition/implementation combined with “high degree of Knightian uncertainty about human decisions and behaviour” actually cause the AI to “not interfere” but also still accomplish the goals that we give it?
In order to accomplish its goals, the AI has to do lots of things that will have butterfly effects on the future, so the system has to allow it to do those things, but also not allow it to “propagandize to humans”. It’s just unclear to me how you intend to achieve this.
This doesn’t directly answer your questions, but since the OAA already requires global coordination and agreement to follow the plans spit out by the superintelligent AI, maybe propagandizing people is not necessary, especially if we consider that by the time the OAA becomes possible, the economy and science will probably already be largely automated by CoEms and won’t need to involve motivated humans.
Then, the time-boundedness of the plan raises the chances that the plan doesn’t concern itself with changing people’s values and preferences as a side effect (which would be relevant for the ongoing work of shaping the constraints and desiderata for the next iteration of the plan). Some such interference with values will inevitably happen, though. That’s what Davidad considers when he writes “A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk.”
There are a lot of catastrophes that humans have inflicted or could inflict on themselves. In that regard, AI is like any multi-purpose tool, such as a hammer. We have to sort these out too, sooner or later, but isn’t this orthogonal to the alignment question?
It is often considered as such, but my concern is less with “the alignment question” (how to build AI that values whatever its stakeholders value) and more with how to build transformative AI that probably does not lead to catastrophe. Misuse is one of the ways that it can lead to catastrophe. In fact, in practice, we have to sort misuse out sooner than accidents, because catastrophic misuses become viable at a lower tech level than catastrophic accidents.
In regard to safely-wieldable tool-AI versus ‘alignment’, I recommend thinking in terms of ‘intent alignment’ versus ‘values alignment’, as Seth Herd describes here: Conflating value alignment and intent alignment is causing confusion