It’s plausible that GPT-n is not an agent that desires to survive or affect the world. However, maybe it is. We don’t know. One of the points made by stories like Predict-O-Matic is that for all we know there really is an agent in there; for all we know the mesa-optimizer is misaligned to the base objective. In other words, this is a non sequitur:
It’s not an agent that desires to survive or affect the world. It’s just been trained to complete text.
The relationship between base objective and mesa-objective is currently poorly understood, in general. Naively you might think they’ll be the same, but there are already demonstrated cases where they are not.
for all we know there really is an agent in there; for all we know the mesa-optimizer is misaligned to the base objective [...] there are already demonstrated cases where they are not.
Going from “the mesa-optimizer is misaligned to the base objective” to “for all we know, the mesa-optimizer is an agent that desires to survive and affect the world” seems like a leap?
I thought the already-demonstrated cases were things like: we train a video-game agent to collect a coin at the right edge of the level, but then when you give it a level where the coin is elsewhere, it goes to the right edge instead of collecting the coin. That makes sense: the training data itself didn’t pin down which objective is “correct”. But even though the goal it ended up with wasn’t the “intended” one, it’s still a goal within the game environment; something else besides mere inner misalignment would need to happen for it to model and form goals about “the real world.”

Similarly, for GPT, the case for skepticism about agency is not that it is perfectly aligned on the base objective of predicting text, but that whatever inner-misaligned “instincts” it ended up with refer to which tokens to output in the domain of text; something extra would have to happen for that to somehow generalize to goals about the real world.
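(To make the “training data didn’t pin down the objective” point concrete, here is a toy sketch. This is not the actual CoinRun setup; the level layout, function names, and numbers are invented for illustration. It just shows two candidate objectives that agree on every training level and come apart once the coin moves.)

```python
# Toy sketch (not the real CoinRun environment): two candidate objectives
# that are indistinguishable on the training distribution but disagree
# off-distribution. All names and numbers here are invented.

from dataclasses import dataclass


@dataclass
class Level:
    width: int   # x-coordinate of the right edge
    coin_x: int  # x-coordinate of the coin


def reached_coin(agent_x: int, level: Level) -> bool:
    """The 'intended' objective: the agent ends up at the coin."""
    return agent_x == level.coin_x


def reached_right_edge(agent_x: int, level: Level) -> bool:
    """A proxy objective: the agent ends up at the right edge."""
    return agent_x == level.width


# In training the coin always sits at the right edge, so the two objectives
# label every possible end position identically; reward alone cannot
# distinguish them.
train_levels = [Level(width=w, coin_x=w) for w in (5, 8, 13)]
for lvl in train_levels:
    assert all(
        reached_coin(x, lvl) == reached_right_edge(x, lvl)
        for x in range(lvl.width + 1)
    )

# Off-distribution the coin moves, and the objectives come apart.
test_level = Level(width=10, coin_x=3)
print(reached_coin(10, test_level))        # False: ran past the coin
print(reached_right_edge(10, test_level))  # True: still went to the edge
```

Both objectives fit the training data equally well; which one the trained policy actually internalizes is exactly the inner-alignment question.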
Yep, it’s a leap. It’s justified though IMO; we really do know so very little about these systems… I would be quite surprised if it turns out GPT-29 is a powerful agent with desires to influence the real world, but I wouldn’t be so surprised that I’d be willing to bet my eternal soul on it now. (Quantitatively I have something like 5% credence that it would be a powerful agent with desires to influence the real world.)
I am not sure your argument makes sense. Why think that its instincts and goals and whatnot refer only to which token to output in the domain of text? How is that different from saying “Whatever goals the CoinRun agent has, they surely aren’t about anything in the game; instead they must be about which virtual buttons to press”? GPT is clearly capable of referring to and thinking about things in the real world; if it didn’t have a passable model of the real world it wouldn’t be able to predict text so accurately.
I understand that the mesa objective could be quite different from the base objective. But
...wait, maybe something just clicked. We might suspect that the mesa objective looks like (roughly) influence-seeking since that objective is consistent with all of the outputs we’ve seen from the system (and moreover we might be even more suspicious that particularly influential systems were actually optimizing for influence all along), and maybe an agent-ish mesa-optimizer is selected because it’s relatively good at appearing to fulfill the base objective...?
I guess I (roughly) understood the inner alignment concern but still didn’t think of the mesa-optimizer as an agent… need to read/think more. Still feels likely that we could rule out agent-y-ness by saying something along the lines of “yes some system with these text inputs could be agent-y and affect the real world, but we know this system only looks at the relative positions of tokens and outputs the token that most frequently follows those; a system would need a fundamentally different structure to be agent-y or have beliefs or preferences” (and likely that some such thing could be said about GPT-3).
Yep! I recommend Gwern’s classic post on why tool AIs want to be agent AIs.

One somewhat plausible argument I’ve heard is that GPTs are merely feedforward networks and that agency is relatively unlikely to arise in such networks. And of course there’s also the argument that agency is most natural/incentivised when you are navigating some environment over an extended period of time, which GPT-N isn’t. There are lots of arguments like this we can make. But currently it’s all pretty speculative; the relationship between base and mesa objective is poorly understood; for all we know even GPT-N could be a dangerous agent. (Also, people mean different things by “agent” and most people don’t have a clear concept of agency anyway.)
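(For concreteness, here is a minimal sketch of the kind of loop people have in mind when they say GPT-N “isn’t navigating an environment over an extended period”: each step is a pure function of the visible token context, and the only state carried forward is the text itself. The `next_token_distribution` argument and the toy model below are hypothetical stand-ins, not any real API.)

```python
# Minimal sketch of autoregressive sampling. `next_token_distribution` is a
# hypothetical stand-in for a trained model: it maps a token context to a
# probability distribution over the next token. The only "memory" between
# steps in this loop is the growing token context passed back in; no hidden
# state persists across calls.

import random


def sample(next_token_distribution, prompt, max_new_tokens):
    context = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)  # pure function of the context
        tokens, probs = zip(*dist.items())
        context.append(random.choices(tokens, weights=probs, k=1)[0])
    return context


# Hypothetical toy "model" with a fixed next-token distribution.
def toy_model(context):
    return {"the": 0.5, "cat": 0.3, "sat": 0.2}


print(sample(toy_model, ["Once", "upon"], max_new_tokens=3))
```

(Real systems add caching, temperature, and so on, but those are implementation details; whether this structural picture actually rules out agency is, as said above, the speculative part.)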