I feel like a lot of the objections around agency are answered by the Clippy scenario, and gwern's other essay on the topic, Tool AIs want to be Agent AIs. The AGI need not start with any specific goal or agency. However, the moment it starts executing a prompt that requires it to exhibit agency or goal-directed behavior, it will. And at that point, unless the goal is set up such that the agent pursues it in a manner compatible with the continued existence of humanity over the long term, humanity is doomed. Crafting a goal in this manner and making sure that the AGI pursues this goal and no others are both very difficult tasks individually. Together, they are nigh impossible. Thus, with very strong likelihood, the moment the AGI either receives or discovers a prompt that requires it to behave like an agent, humanity is doomed.
I agree that AGIs need to possess a world model, but I disagree that we will be able to distinguish an AI that possesses a world model from an AI that “merely” knows word associations. The internals of an AI are opaque, despite the best efforts of interpretability research to shine light on the giant inscrutable matrices. An AI with a world model, I predict, won’t look much different from an AI without one. Maybe some weights will be different, and some update functions will have changed. Will we be able to point to any specific weight or combination of weights and say, “Aha, the AI has developed a world model!”? Probably not, any more than we can look at a specific set of neurons in the human brain and say, “Aha, there lies the seat of consciousness!”
Given the two points above, we may not be able to tell when any given AI passes the threshold to becoming an AGI. And once an AI has passed that threshold, we won’t necessarily be able to control which prompt causes it to begin simulating an agent. That being the case, I fail to see why we shouldn’t behave as if AGI is on a short timeline. After all, if one is approaching a cliff from an unknown distance in the darkness, the wise thing to do is not to assume that the cliff is still miles away and stride boldly into the unknown. Instead, it behooves us to probe carefully, trying to determine whether there’s solid ground or empty space ahead.
However, the moment it starts executing a prompt that requires it to exhibit agency or goal-directed behavior, it will.
This seems to make a jump from “the prompt requires agency to execute well” to “the AI develops the cognitive capability for agency”?
I read Sarah’s point as being that current AIs are fundamentally incapable of having agency (as she defines it). If that’s the case, it doesn’t matter whether a prompt requires agency to be executed well: the AI will simply fail to execute it well.
This seems to make a jump from “the prompt requires agency to execute well” to “the AI develops the cognitive capability for agency”?
In my scenario, the AI already has the cognitive capability for agency; it’s just latent until the right prompt causes it to be expressed. We’ve seen early examples of this with ChatGPT: if you ask it to plan something or to think through adversarial scenarios, it will demonstrate agent-ish behavior.
My point is that while current AIs are probably incapable of agency, future AIs probably will have that capability. Furthermore, we may not be able to tell the difference between an AI that can build a world model and engage in long-term goal-directed behavior and the current AI systems that mostly can’t.