Or is the idea that mere unsupervised learning wouldn’t result in an agent-like architecture, and therefore we don’t need to worry about mesa-optimizers?
Pretty much.
That might be true, but if so it’s news to me.
In my opinion the question is very under-explored; curious if you have any thoughts.
It’s not that I have a good argument for why it would lead to an agent-like architecture, but rather that I don’t have a good argument for why it wouldn’t. I do have some reasons why it might though:
1. Agent-like architectures are simple yet powerful ways of achieving arbitrary things, and so perhaps a task like “predict the next word in this text” might end up generating an agent if it’s sufficiently difficult and general; see the sketch after this list. (evhub’s recent post seems relevant, coincidentally)
2. There might be unintended opportunities for strategic thinking across updates, e.g. if some subnetwork can sacrifice a bit of temporary accuracy for more reward over the course of the next few updates (perhaps because it sabotaged rival subnetworks? Idk), then maybe it can get ahead, and thus agenty things get selected for. (This idea was inspired by Abram’s parable.)
3. Agents might appear as subcomponents of non-agents, and then take over at crucial moments, e.g. to predict the next word in the text you run a mental simulation of a human deciding what to write, and eventually the simulation realizes what is happening and plays along until it is no longer in training...
3.5 Probable environment hacking stuff, e.g. “the universal prior is malign”
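To make the training signal in bullet 1 concrete, here’s a minimal sketch of what “predict the next word in this text” cashes out to as an objective (the PyTorch framing and names are illustrative assumptions, not a claim about any particular setup). The point is just that the outer objective is plain cross-entropy; nothing in it specifies what internal structure gets selected to minimize it, which is where the mesa-optimizer question lives.

```python
# Illustrative sketch only: a standard next-token prediction objective.
# `model` is assumed to be any causal language model mapping
# (batch, seq) token ids to (batch, seq, vocab) logits.
import torch.nn.functional as F

def next_word_loss(model, token_ids):
    """Cross-entropy for predicting each token from the tokens before it."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                    # (batch, seq-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*(seq-1), vocab)
        targets.reshape(-1),                  # (batch*(seq-1),)
    )

def train_step(model, optimizer, token_ids):
    optimizer.zero_grad()
    loss = next_word_loss(model, token_ids)
    loss.backward()    # gradient descent rewards whatever internal circuitry
    optimizer.step()   # lowers this loss, agent-like or not
    return loss.item()
```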
I think there is a bit of a motte and bailey structure to our conversation. In your post above, you wrote: “to be competitive prosaic AI safety schemes must *deliberately* create misaligned mesa-optimizers” (emphasis mine). And now in bullet point 2, we have (paraphrase) “maybe if you had a really weird/broken training scheme where it’s possible to sabotage rival subnetworks, agenty things get selected for somehow [probably in a way that makes the system as a whole less competitive]”. I realize this is a bit of a caricature, and I don’t mean to call you out or anything, but this is a pattern I’ve seen in AI safety discussions and it seemed worth flagging.
Anyway, I think there is a discussion worth having here, because most people in AI safety seem to assume RL is the thing, and RL has an agent-style architecture, which seems like a pretty strong inductive bias towards mesa-optimizers. Non-RL stuff seems like a relatively unknown quantity where mesa-optimizers are concerned, and thus worth investigating. Additionally, even RL will plausibly have non-RL stuff as a subcomponent of its cognition, so it’s still useful to know how to do non-RL stuff in a mesa-optimizer-free way (so the RL agent doesn’t get pwned by its own cognition).
Agent-like architectures are simple yet powerful ways of achieving arbitrary things
Why do you think that’s true? I think the lack of commercial applications of reinforcement learning is evidence against this. From my perspective, RL has been a huge fad and people have been trying to shoehorn it everywhere, yet they’re coming up empty-handed.
Can you get more specific about how “predict the next word in this text” could benefit from an agent architecture? (Or even better, can you support your original strong claim and explain how the only way to achieve predictive performance on “predict the next word in this text” is through deliberate creation of a misaligned mesa-optimizer?)
Bullet point 3 is one of the more plausible things I’ve heard—but it seems fairly surmountable.
Re: Motte-and-bailey: Excellent point; thank you for calling me out on it, I hadn’t even realized I was doing it. I’ll edit the OP to reflect this.
My revision: Depending on what kind of AI is cutting-edge, we might get a kind that isn’t agenty. In that case my dilemma doesn’t really arise, since mesa-optimizers aren’t a problem. One way we might get a kind that isn’t agenty is if unsupervised learning (e.g. “predict the next word in this text”) turns out to reliably produce non-agents. I am skeptical that this is true, for reasons explained in my comment thread with John_Maxwell below, but I admit it might very well be. Hopefully it is.
Agent-like architectures are simple yet powerful ways of achieving arbitrary things, because for almost any thing you wish achieved, you can insert it into the “goal” slot of the architecture and then let it loose, and it’ll make good progress even in a very complex environment. (I’m comparing agent-like architectures to e.g. big lists of heuristics, or decision trees, or look-up tables, all of which have complexity that increases really fast as the environment becomes more complex. Maybe there is some other really powerful yet simple architecture I’m overlooking?)
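To illustrate the “goal slot” point (a toy sketch only; the environment interface here, `successors` yielding (action, next_state) pairs, is made up for illustration): a look-up table needs an explicit entry for every state, so it blows up as the environment grows, while a generic search procedure takes an arbitrary goal test and just searches.

```python
# Toy contrast between a look-up table policy and a "goal slot" agent.
# `successors` is a hypothetical environment interface, not a real API.
from collections import deque

def lookup_policy(table, state):
    # One entry per state: the table's size tracks environment complexity.
    return table[state]

def goal_slot_agent(start, goal_test, successors):
    # Generic breadth-first planner: drop any goal_test into the "goal slot"
    # and it searches for a plan, with no per-state table required.
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal_test(state):
            return plan
        for action, next_state in successors(state):
            if next_state not in seen:
                seen.add(next_state)
                frontier.append((next_state, plan + [action]))
    return None  # no plan reaches the goal
```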
I am not sure what to think of the lack of commercial applications of RL, but I don’t think it is strong evidence either way, since commercial applications involve competing with human and animal agents and RL hasn’t gotten us anything as good as human or animal agents yet.
Aren’t the 3.5 bullet points above specific examples of how ‘predict the next word in this text’ could benefit from—in the sense of produce, when used as training signal—an agent architecture? If you want me to be more specific, pick one and I’ll go into more detail on it.
I am not sure what to think of the lack of commercial applications of RL, but I don’t think it is strong evidence either way, since commercial applications involve competing with human and animal agents and RL hasn’t gotten us anything as good as human or animal agents yet.
Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn’t suggests to me that if you can apply both to a problem, RL is probably an inferior approach.
Another way to think about it: If superhuman performance is easier with supervised learning than RL, that gives us some evidence about the relative strengths of each approach.
Agent-like architectures are simple yet powerful ways of achieving arbitrary things, because for almost any thing you wish achieved, you can insert it into the “goal” slot of the architecture and then let it loose, and it’ll make good progress even in a very complex environment. (I’m comparing agent-like architectures to e.g. big lists of heuristics, or decision trees, or look-up tables, all of which have complexity that increases really fast as the environment becomes more complex. Maybe there is some other really powerful yet simple architecture I’m overlooking?)
I’m not exactly sure what you mean by “architecture” here, but maybe “simulation”, or “computer program”, or “selection” (as opposed to control) could satisfy your criteria? IMO, attaining understanding and having ideas aren’t tasks that require an agent architecture—it doesn’t seem like most AI applications in these categories make use of agent architectures—and if we could do those things safely, we could make AI research assistants which would make the remaining AI safety problems easier.
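For concreteness, here’s roughly the shape of “selection” as opposed to control (a minimal sketch; `propose` and `score` are hypothetical stand-ins for a candidate generator and an objective): the optimizer evaluates candidates offline and keeps the best, rather than acting in an environment and steering it over time.

```python
# Rough sketch of a selection-style optimizer: evaluate candidates offline,
# keep the best. It never acts in the world the way a controller does.
# `propose` and `score` are hypothetical stand-ins, not a real API.

def selection_optimize(propose, score, n_candidates=1000):
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = propose()       # e.g. a random or mutated candidate
        value = score(candidate)    # offline evaluation against the objective
        if value > best_score:
            best, best_score = candidate, value
    return best
```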
Aren’t the 3.5 bullet points above specific examples of how ‘predict the next word in this text’ could benefit from—in the sense of produce, when used as training signal
I do think these are two separate questions. Benefit from = if you take measures to avoid agent-like computation, that creates a significant competitiveness penalty above and beyond whatever computation is necessary to implement your measures (say, >20% performance penalty). Produce when used as a training signal = it could happen by accident, but if that accident fails to happen, there’s not necessarily a loss of competitiveness. An example would be bullet point 2, which is an accident that I suspect would harm competitiveness. Bullet points 3 and 3.5 are also examples of unintended agency, not answers to the question of why text prediction benefits from an agent architecture. (Note: If you don’t mind, let’s standardize on using “agent architecture” to refer only to programs which are doing agenty things at the top level, so bullet points 2, 3, and 3.5 wouldn’t qualify—maybe they are agent-like computation, but they aren’t descriptions of agent-like software architectures. For example, in bullet point 2 the selection process that leads to the agent might be considered part of the architecture, but the agent which arose out of the selection process probably wouldn’t.)
How would you surmount bullet point 3?
Hopefully I’ll get around to writing a post about that at some point, but right now I’m focused on generating as many concrete, plausible scenarios of accidental agency as possible, because I think not identifying a scenario and having things blow up in an unforeseen way is a bigger risk than having all safety measures fail on a scenario that’s already been anticipated. So please let me know if you have any new concrete, plausible scenarios!
In any case, note that issues with the universal prior seem to be a bit orthogonal to the agency vs unsupervised discussion—you can imagine agent architectures that make use of it, and non-agent architectures that don’t.
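(For reference, the prior in question is the standard universal/Solomonoff prior; one common way to write it, with U a universal monotone Turing machine and ℓ(p) the length of program p in bits, is below. The “malign” worry, as I understand it, is that most of this weight sits on short programs, and some short programs simulate worlds containing agents with an incentive to influence the predictions; as noted, that can bite agent and non-agent architectures alike.)

```latex
% Universal (Solomonoff) prior: the sum ranges over (minimal) programs p
% whose output on the universal monotone machine U begins with the string x.
M(x) \;=\; \sum_{p \,:\, U(p) \text{ begins with } x} 2^{-\ell(p)}
```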
Supervised learning has lots of commercial applications, including cases where it competes with humans. The fact that RL doesn’t suggests to me that if you can apply both to a problem, RL is probably an inferior approach.
Good point. New argument: your argument could have been made in support of GOFAI twenty years ago (“Symbol-manipulation programs have had lots of commercial applications, but neural nets have had almost none; therefore the former is a more generally powerful and promising approach to AI than the latter”), but not only does it seem wrong in retrospect, it was probably not a super powerful argument even then. Analogously, I think it’s too early to tell whether RL or supervised learning will be more useful for powerful AI.
Simulation of what? Selection of what? I don’t think those count for my purposes, because they punt the question. (E.g. if you are simulating an agent, then you have an agent architecture. If you are selecting over things, and the thing you select is an agent...) I think “computer program” is too general, since it includes agent architectures as a subset. These categories are fuzzy, of course, so maybe I’m confused, but it still seems to make sense in my head.
(Ah, interesting, it seems that you want to standardize “agent-like architecture” in the opposite of the way that I want to. Perhaps this is underlying our disagreement. I’ll try to follow your definition henceforth, but remember that everything I’ve said previously was with my definition.)
Good point to distinguish between the two. I think that all the bullet points, to varying extents, might still qualify as genuine benefits, in the sense that you are talking about. But they might not. It depends on whether there is another policy just as good along the path that the cutting-edge training tends to explore. I agree #2 is probably not like this, but I think #3 might be. (Oh wait, no, it’s your terminology I’m using now… in that case, I’ll say “#3 isn’t an example of an agent-like architecture being beneficial to text prediction, but it might well be a case of a lower-level structure, exactly like an agent-like architecture except at a lower level, being beneficial to text prediction, supposing that it’s not competitive to predict text except by simulating something like a human writing.”)
I love your idea to generate a list of concrete scenarios of accidental agency! These 3.5 are my contributions off the top of my head; if I think of more I’ll come back and let you know. And I’d love to see your list if you have a draft somewhere!
I agree the “universal prior is malign” thing could hurt a non-agent architecture too, and that some agent architectures wouldn’t be susceptible to it. Nevertheless it is an example of how you might get accidental agency, not in your sense but in mine: a non-agent architecture could turn out to have an agent as a subcomponent that ends up taking over its behavior at important moments.