Studying the possibility of self-aware systems seems like a good idea, but I have a feeling most ways to achieve this will be brittle. My objective with this post was to get crisp stories for why self-aware predictive systems should be considered dangerous.
My reason is that such AIs will have a general capability to find underlying patterns, and thus will discover an analogy between their own thoughts and actions and those of others.
Let’s taboo introspection for a minute. Suppose the AI does discover some underlying patterns and analogizes the piece of matter in which it is encased with the thoughts and actions of its human operator. Not only that, it finds analogies between other computers and its human operator, between its human operator and other computers, etc. Why precisely is this a problem?
I wouldn’t argue that self-aware systems are automatically dangerous, but rather that self-unaware systems are automatically safe (or at least comparatively pretty safe).
More specifically: Most people in AI safety, most of the time, are talking about self-aware (in my minimal sense of taking purposeful actions etc.) agent-like systems. I don’t think such systems are automatically dangerous, but they do necessitate solving the alignment problem, and since we haven’t solved the alignment problem yet, I think it’s worth spending time exploring alternative approaches.
If you’re making a prediction system (or an oracle more generally), there seems to be a possibility of making it self-unaware: it doesn’t know that it’s outputting predictions, it doesn’t know that it even has an output, it doesn’t know that it exists in the universe, etc. A toy example is a superhuman world-model which is completely and easily interpretable; you can just look at the data structure and understand every aspect of it, see what the concepts are and how they’re connected, and use that to explore counterfactuals and understand things. That data structure is the whole system, and the human users browse it. Anyway, I think the scariest safety risk for oracles is that they’ll give manipulative answers, use side-channel attacks, or more generally make intelligent decisions to steer the future towards goals. A self-unaware system will not do that because it is not aware that it can do things to affect the universe. There are still some safety problems (not to mention bad actors etc.), but significantly less scary ones.
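To make the toy example a bit more concrete, here is a minimal sketch, assuming the world-model is nothing more than a plain graph of concepts that a human browses. The names `WORLD_MODEL` and `downstream` and the concepts themselves are hypothetical illustrations, not anything from an actual system. The point is only that the data structure has no output channel and no reference to itself; all the querying is done by the human.

```python
from collections import deque

# Hypothetical toy world-model: plain data mapping each concept to the
# concepts it directly influences. Nothing here refers to the model itself
# or to any output channel; it is only browsed by a human.
WORLD_MODEL = {
    "rain": ["wet_road", "umbrella_sales"],
    "wet_road": ["traffic_accidents"],
    "umbrella_sales": [],
    "traffic_accidents": [],
}


def downstream(model, concept):
    """Return every concept reachable from `concept`, in breadth-first order."""
    seen, order, queue = {concept}, [], deque([concept])
    while queue:
        for effect in model.get(queue.popleft(), []):
            if effect not in seen:
                seen.add(effect)
                order.append(effect)
                queue.append(effect)
    return order


# A human user explores a counterfactual: "what would change without rain?"
print(downstream(WORLD_MODEL, "rain"))
```

Here the human decides which question to ask and what to do with the answer; the structure itself never chooses anything. Whether a superhuman model could really be kept this passive is exactly what is under discussion.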
I wouldn’t argue that self-aware systems are automatically dangerous, but rather that self-unaware systems are automatically safe (or at least comparatively pretty safe).
Fair enough.
Most people in AI safety, most of the time, are talking about self-aware (in my minimal sense of taking purposeful actions etc.) agent-like systems. I don’t think such systems are automatically dangerous, but they do necessitate solving the alignment problem, and since we haven’t solved the alignment problem yet, I think it’s worth spending time exploring alternative approaches.
I suspect the important part is the agent-like part.
I’m not sure it makes sense to think of “the alignment problem” as a singular entity. I’d rather taboo “the alignment problem” and just ask what could go wrong with a self-aware system that’s not agent-like.
A self-unaware system will not do that because it is not aware that it can do things to affect the universe.
Hot take: it might be useful to think of “self-awareness” and “awareness that it can do things to affect the universe” separately. Not sure they are one and the same.
What is a “system that’s not agent-like” in your perspective? How might it be built? Have you written anything about that?
For my part, I thought Rohin’s “AI safety without goal-directed behavior” was a good start, but we need much more and deeper analysis of this topic.