I wouldn’t argue that self-aware systems are automatically dangerous, but rather that self-unaware systems are automatically safe (or at least comparatively pretty safe).
More specifically: Most people in AI safety, most of the time, are talking about self-aware (in my minimal sense of taking purposeful actions etc.) agent-like systems. I don’t think such systems are automatically dangerous, but they do necessitate solving the alignment problem, and since we haven’t solved the alignment problem yet, I think it’s worth spending time exploring alternative approaches.
If you’re making a prediction system (or an oracle more generally), there seems to be a possibility of making it self-unaware: it doesn’t know that it’s outputting predictions, it doesn’t know that it even has an output, it doesn’t know that it exists in the universe, etc.

A toy example is a superhuman world-model which is completely and easily interpretable: you can just look at the data structure and understand every aspect of it, see what the concepts are and how they’re connected, and use that to explore counterfactuals and understand things. That data structure is the whole system, and the human users browse it.

Anyway, I think the scariest safety risk for oracles is that they’ll give manipulative answers, use side-channel attacks, or more generally make intelligent decisions aimed at steering the future toward particular goals. A self-unaware system will not do that because it is not aware that it can do things to affect the universe. There are still some safety problems (not to mention bad actors etc.), but significantly less scary ones.
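To make the toy example slightly more concrete, here is a minimal (and obviously nowhere near superhuman) sketch of what “a world-model that is just a browsable data structure” might look like. Everything here (the WorldModel and Concept classes, the counterfactual method, the link weights) is invented purely for illustration, not a claim about how such a system would actually be built:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    # Weighted links to other concepts; a human can inspect these directly
    # rather than trusting a generated summary.
    links: dict = field(default_factory=dict)  # other concept name -> strength

@dataclass
class WorldModel:
    concepts: dict = field(default_factory=dict)  # name -> Concept

    def add(self, name, links=None):
        self.concepts[name] = Concept(name, dict(links or {}))

    def related(self, name, threshold=0.5):
        """Passive lookup: concepts strongly linked to `name`."""
        return [k for k, w in self.concepts[name].links.items() if w >= threshold]

    def counterfactual(self, name, overrides, threshold=0.5):
        """Ask "what if these link strengths were different?" without
        mutating the model; the human drives the query."""
        links = {**self.concepts[name].links, **overrides}
        return [k for k, w in links.items() if w >= threshold]

# The model is just data: no output channel of its own, no objective, and no
# representation of itself as an object in the world it describes.
model = WorldModel()
model.add("rain", {"wet streets": 0.9, "umbrellas": 0.8})
model.add("wet streets", {"slippery roads": 0.7})
print(model.related("rain"))                             # ['wet streets', 'umbrellas']
print(model.counterfactual("rain", {"umbrellas": 0.1}))  # ['wet streets']
```

The point of the sketch is only that the queries flow one way: the humans browse and interrogate the structure, and nothing in the structure encodes “I am being queried” or “my answers affect the world.”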
I wouldn’t argue that self-aware systems are automatically dangerous, but rather that self-unaware systems are automatically safe (or at least comparatively pretty safe).
Fair enough.
Most people in AI safety, most of the time, are talking about self-aware (in my minimal sense of taking purposeful actions etc.) agent-like systems. I don’t think such systems are automatically dangerous, but they do necessitate solving the alignment problem, and since we haven’t solved the alignment problem yet, I think it’s worth spending time exploring alternative approaches.
I suspect the important part is the agent-like part.
I’m not sure it makes sense to think of “the alignment problem” as a singular entity. I’d rather taboo “the alignment problem” and just ask what could go wrong with a self-aware system that’s not agent-like.
A self-unaware system will not do that because it is not aware that it can do things to affect the universe.
Hot take: it might be useful to think of “self-awareness” and “awareness that it can do things to affect the universe” separately. Not sure they are one and the same.
What is a “system that’s not agent-like” from your perspective? How might it be built? Have you written anything about that?
For my part, I thought Rohin’s “AI safety without goal-directed behavior” was a good start, but that we need much more, and deeper, analysis of this topic.