I am using the term “self-aware” to mean “knowing that one exists in the world and can affect the world”, in which case animals, RL robots, etc., are all trivially self-aware. You seem to be using the term “introspective” for something beyond mere self-awareness—maybe “having concepts in the world-model that are sufficiently general that they apply to both the outside world and one’s internal information-processing”. Something like that? You can tell me.
So let’s take these two levels: self-awareness (“I exist and can affect the world”) and introspection (“Why am I thinking about that? I seem to have an associative memory!”).
As I read the OP, it seems to me that self-awareness is the relevant threshold you rely on, not introspection. (Do you agree?) I do think that the absence of self-awareness is what you need for powerful safety guarantees, and that we should study the possibility of self-unaware systems, even if it’s not guaranteed to be possible.
As for introspection, I do in fact think that any AI system which can develop deep, general, mechanistic understandings of things in the world, and which is self-aware at all, will go beyond mere self-awareness to develop deep introspection. My reason is that such AIs will have a general capability to find underlying patterns, and thus will discover an analogy between its own thoughts and actions and those of others. Doing that just doesn’t seem fundamentally different from, say, discovering the law of gravitation by discovering an analogy between the behavior of planets versus apples (which in turn is harder but not fundamentally different from knowing how to twist off a bottle cap by discovering an analogy with previous bottle caps that one has used). Thus, I think that the only way to prevent an arbitrarily intelligent world-modeling AI from developing arbitrarily deep introspective understanding, is to build the system to have no self-awareness in the first place.
Studying the possibility of self-unaware systems seems like a good idea, but I have a feeling most ways to achieve this will be brittle. My objective with this post was to get crisp stories for why self-aware predictive systems should be considered dangerous.
My reason is that such AIs will have a general capability to find underlying patterns, and thus will discover an analogy between its own thoughts and actions and those of others.
Let’s taboo introspection for a minute. Suppose the AI does discover some underlying patterns and draws an analogy between the piece of matter in which it is encased and the thoughts and actions of its human operator. Not only that, it finds similar analogies between other computers and its human operator, and so on. Why precisely is this a problem?
I wouldn’t argue that self-aware systems are automatically dangerous, but rather that self-unaware systems are automatically safe (or at least comparatively pretty safe).
More specifically: Most people in AI safety, most of the time, are talking about self-aware (in my minimal sense of taking purposeful actions etc.) agent-like systems. I don’t think such systems are automatically dangerous, but they do necessitate solving the alignment problem, and since we haven’t solved the alignment problem yet, I think it’s worth spending time exploring alternative approaches.
If you’re making a prediction system (or an oracle more generally), there seems to be a possibility of making it self-unaware—it doesn’t know that it’s outputting predictions, it doesn’t know that it even has an output, it doesn’t know that it exists in the universe, etc. A toy example is a superhuman world-model which is completely and easily interpretable: you can just look at the data structure and understand every aspect of it, see what the concepts are and how they’re connected, and use that to explore counterfactuals and understand things. That data structure is the whole system, and the human users browse it. Anyway, I think the scariest safety risk for oracles is that they’ll give manipulative answers, use side-channel attacks, or more generally make intelligent decisions to steer the future towards their goals. A self-unaware system will not do that, because it is not aware that it can do things to affect the universe. There are still some safety problems (not to mention bad actors etc.), but significantly less scary ones.
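To make that toy example slightly more concrete, here is a minimal sketch, with made-up names (`ConceptGraph`, `Concept`, `counterfactual`) that don’t come from the OP or any real system: a world-model as a purely passive data structure that humans query and browse, with no output channel, no action interface, and no representation of itself.

```python
# Sketch only: a "self-unaware" world-model as a passive, queryable data
# structure. All class and method names here are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    # Relations map a label (e.g. "causes") to the names of related concepts.
    relations: dict[str, set[str]] = field(default_factory=dict)


@dataclass
class ConceptGraph:
    concepts: dict[str, Concept] = field(default_factory=dict)

    def add_relation(self, src: str, label: str, dst: str) -> None:
        self.concepts.setdefault(src, Concept(src)).relations.setdefault(label, set()).add(dst)
        self.concepts.setdefault(dst, Concept(dst))

    def related(self, name: str, label: str) -> set[str]:
        """Browse the model: what does `name` relate to under `label`?"""
        return self.concepts.get(name, Concept(name)).relations.get(label, set())

    def counterfactual(self, remove: str) -> "ConceptGraph":
        """Explore a counterfactual: return a copy of the graph with one
        concept (and its relations) deleted. The humans do the interpreting."""
        cf = ConceptGraph()
        for concept in self.concepts.values():
            if concept.name == remove:
                continue
            for label, targets in concept.relations.items():
                for target in targets:
                    if target != remove:
                        cf.add_relation(concept.name, label, target)
        return cf


# Usage: the human browses the structure; the structure never acts.
world = ConceptGraph()
world.add_relation("rain", "causes", "wet streets")
world.add_relation("sprinkler", "causes", "wet streets")
print(world.related("rain", "causes"))        # {'wet streets'}
no_rain = world.counterfactual("rain")
print(no_rain.related("rain", "causes"))      # set()
```

The point of the sketch is only the design property: everything the system “knows” sits in the graph, and nothing in the interface lets it model or influence the fact that it is being queried.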
I wouldn’t argue that self-aware systems are automatically dangerous, but rather that self-unaware systems are automatically safe (or at least comparatively pretty safe).
Fair enough.
Most people in AI safety, most of the time, are talking about self-aware (in my minimal sense of taking purposeful actions etc.) agent-like systems. I don’t think such systems are automatically dangerous, but they do necessitate solving the alignment problem, and since we haven’t solved the alignment problem yet, I think it’s worth spending time exploring alternative approaches.
I suspect the important part is the agent-like part.
I’m not sure it makes sense to think of “the alignment problem” as a single entity. I’d rather taboo “the alignment problem” and just ask what could go wrong with a self-aware system that’s not agent-like.
A self-unaware system will not do that because it is not aware that it can do things to affect the universe.
Hot take: it might be useful to think of “self-awareness” and “awareness that it can do things to affect the universe” separately. Not sure they are one and the same.
What is a “system that’s not agent-like” in your perspective? How might it be built? Have you written anything about that?
For my part, I thought Rohin’s “AI safety without goal-directed behavior” was a good start, but that we need much more and deeper analysis of this topic.