I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”. If a system possesses all relevant behavioral qualities that we associate with those terms, I think it’s basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It’s possible this is our main disagreement.
When I talk to GPT-4, I think it’s quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?
I agree that GPT-4 does not understand the world in the same way humans understand the world, but I’m not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.
I’m similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one’s own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don’t see how that fact bears much on the question of whether you understand human intentions. It’s possible there’s some connection here, but I’m not seeing it.
(I claim) current systems in fact almost certainly don’t have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
I’d claim:
Current systems have limited situational awareness. It’s above zero, but I agree it’s below human level.
Current systems don’t have stable preferences over time. But I think this is a point in favor of the model I’m providing here. I’m claiming that it’s plausibly easy to create smart, corrigible systems.
The fact that smart AI systems aren’t automatically agentic and incorrigible with stable preferences over long time horizons should be an update against the ideas quoted above about spontaneous instrumental convergence, rather than in favor of them.
There’s a big difference between (1) “we can choose to build consequentialist agents that are dangerous, if we wanted to do that voluntarily” and (2) “any sufficiently intelligent AI we build will automatically be a consequentialist agent by default”. If (2) were true, then that would be bad, because it would mean that it would be hard to build smart AI oracles, or smart AI tools, or corrigible AIs that help us with AI alignment. Whereas, if only (1) is true, we are not in such a bad shape, and we can probably build all those things.
I claim current evidence indicates that (1) is probably true but not (2), whereas previously many people thought (2) was true. To the extent you disagree and think (2) is still true, I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”.
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they’re giving us the desired behavior now will continue to give us desired behavior in the future.
My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you’re importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic’s Sleeper Agents. Would a situationally aware model use a provided scratch pad to think about how it’s in training and needs to pretend to be helpful? No, and neither does the model “understand” your intentions in a way that generalizes out of distribution the way you might expect a human’s “understanding” to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the “right” responses during RLHF are not anything like human reasoning.
I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
Are you asking for a capabilities threshold, beyond which I’d be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is “can it replace humans at all economically valuable tasks”, which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we’ll be able to train models capable of doing a lot of economically useful work, but which don’t actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”. If a system possesses all relevant behavioral qualities that we associate with those terms, I think it’s basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It’s possible this is our main disagreement.
When I talk to GPT-4, I think it’s quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?
I agree that GPT-4 does not understand the world in the same way humans understand the world, but I’m not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.
I’m similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one’s own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don’t see how that fact bears much on the question of whether you understand human intentions. It’s possible there’s some connection here, but I’m not seeing it.
I’d claim:
Current systems have limited situational awareness. It’s above zero, but I agree it’s below human level.
Current systems don’t have stable preferences over time. But I think this is a point in favor of the model I’m providing here. I’m claiming that it’s plausibly easy to create smart, corrigible systems.
The fact that smart AI systems aren’t automatically agentic and incorrigible with stable preferences over long time horizons should be an update against the ideas quoted above about spontaneous instrumental convergence, rather than in favor of them.
There’s a big difference between (1) “we can choose to build consequentialist agents that are dangerous, if we wanted to do that voluntarily” and (2) “any sufficiently intelligent AI we build will automatically be a consequentialist agent by default”. If (2) were true, then that would be bad, because it would mean that it would be hard to build smart AI oracles, or smart AI tools, or corrigible AIs that help us with AI alignment. Whereas, if only (1) is true, we are not in such a bad shape, and we can probably build all those things.
I claim current evidence indicates that (1) is probably true but not (2), whereas previously many people thought (2) was true. To the extent you disagree and think (2) is still true, I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they’re giving us the desired behavior now will continue to give us desired behavior in the future.
My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you’re importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic’s Sleeper Agents. Would a situationally aware model use a provided scratch pad to think about how it’s in training and needs to pretend to be helpful? No, and neither does the model “understand” your intentions in a way that generalizes out of distribution the way you might expect a human’s “understanding” to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the “right” responses during RLHF are not anything like human reasoning.
Are you asking for a capabilities threshold, beyond which I’d be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is “can it replace humans at all economically valuable tasks”, which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we’ll be able to train models capable of doing a lot of economically useful work, but which don’t actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.