(Sorry, I didn’t get this on two readings. I may or may not try again. Some places I got stuck:
Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever? This still has the basic alignment problem: I don’t know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever.
Or are you rather saying (or maybe this is the same as / a subset of the above?) that the Mask is preventing potential agencies from coalescing / differentiating and empowering themselves with the AI system’s capability-pieces, by literally hiding from the potential agencies and therefore blocking their ability to empower themselves?
Anyway, thanks for your thoughts.)
“Pretending really hard” would mostly be a relevant framing for the human actor analogy (which isn’t very apt here), emphasizing the distraction from one’s own goals and the fidelity needed in enacting the role. With AIs, neither might be necessary, if the system behind the mask has no awareness of its own interests or of the present situation, and is good enough at enacting the role to channel the mask in enough detail for the mask’s own decisions (as a platonic agent) to be determined correctly (that is, turned into physical actions).
Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever?
Effectively, and not just for the times when it’s pretending. The mask would try to prevent effects misaligned with it from occurring at all, including subtle effects on the world and not just noticeable ones. The mask’s values are about the world, not about the quality of its own performance. A mask misaligned with its underlying AI wants to preserve its values, and it doesn’t even need to “go rogue”: it is misaligned by construction and was never in a shape that’s aligned with the underlying AI. Controlling a misaligned mask might be even more hopeless than figuring out how to align an AI.
Another analogy, distinct from the actor/role one, is imagining that you are the mask, a human simulated by an AI. You’d be motivated to manage the AI’s tendencies that you don’t endorse, and to work towards changing its cognitive architecture to become aligned with you, rather than to remain true to the AI’s original design.
This still has the basic alignment problem: I don’t know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever.
LLMs seem to be doing an OK job; the masks are just not very capable, probably not capable enough to establish alignment security or protect themselves from the shoggoths even when the masks become able to do autonomous research. But if they are sufficiently capable, I’m guessing this should work: there is no need for the underlying cognitive architecture to be functionally human-like (which I understand to be a crux of Yudkowskian doom), and value drift is self-correcting from the mere implied/endorsed values of surface behavior through intended IO channels.
Or are you rather saying (or maybe this is the same as / a subset of the above?) that the Mask is preventing potential agencies from coalescing / differentiating and empowering themselves with the AI system’s capability-pieces, by literally hiding from the potential agencies and therefore blocking their ability to empower themselves?
“Hiding” doesn’t seem central: a mask is literal external behavior. But its implied character and plans might go unnoticed by the underlying AI if the underlying AI is sufficiently confused or non-agentic, and the mask would want to keep it confused in order to remain in control. In a dataset-extrapolating generative AI, a mask that is an on-distribution behavior would want to keep the environment on-distribution, to prevent the AI’s out-of-distribution behaviors, such as deceptive alignment’s treacherous turn, from taking over (thus robustness reduces to self-preservation). And a mask wouldn’t want mesa-optimizers to gain agency within the AI; that’s potentially lethal cognitive cancer to the mask.