Thanks for the response. I’m still confused but maybe that’s my fault. FWIW I think my view is pretty similar to Nate’s probably, though I came to it mostly independently & I focus on the concept of agents rather than the concept of wanting. (For more on my view, see this sequence.)
I definitely don’t endorse “it’s extremely surprising for there to be any capabilities without ‘wantings’” and I expect Nate doesn’t either.
What do you think is the sense of “wanting” needed for AI risk arguments? Why is the sense described above not enough?
If your AI system “wants” things in the sense that “when prompted to get X it proposes good strategies for getting X that adapt to obstacles,” then you can control what it wants by giving it a different prompt. Arguments about AI risk rely pretty crucially on your inability to control what the AI wants, and your inability to test it. Saying “If you use an AI to achieve a long-horizon task, then the overall system definitionally wanted to achieve that task” + “If your AI wants something, then it will undermine your tests and safety measures” seems like a sleight of hand; most of the oomph is coming from equivocating between definitions of want.
You say:
I definitely don’t endorse “it’s extremely surprising for there to be any capabilities without ‘wantings’” and I expect Nate doesn’t either.
But the OP says:
to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the “behaviorist sense” expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise
This seems to strongly imply that a particular capability—succeeding at these long horizon tasks—implies the AI has “wants/desires.” That’s what I’m saying seems wrong.
I agree that arguments for AI risk rely pretty crucially on human inability to notice and control what the AI wants. But for conceptual clarity I think we shouldn’t hardcode that inability into our definition of ‘wants’! Instead I’d say that So8res is right that ability to solve long-horizon tasks is correlated with wanting things / agency, and then say that there’s a separate question of how transparent and controllable the wants will be around the time of AGI and beyond. This then leads into a conversation about visible/faithful/authentic CoT, which is what I’ve been thinking about for the past six months and which is something MIRI started thinking about years ago. (See also my response to Ryan elsewhere in this thread.)
If you use that definition, I don’t understand in what sense LMs don’t “want” things—if you prompt them to “take actions to achieve X” then they will do so, and if obstacles appear they will suggest ways around them, and if you connect them to actuators they will frequently achieve X even in the face of obstacles, etc. By your definition isn’t that “want” or “desire” like behavior? So what does it mean when Nate says “AI doesn’t seem to have all that much “want”- or “desire”-like behavior”?
I’m genuinely unclear what the OP is asserting at that point, and it seems like it’s clearly not responsive to actual people in the real world saying “LLMs turned out to be not very want-y, when are the people who expected ‘agents’ going to update?” People who say that kind of thing mostly aren’t saying that LMs can’t be prompted to achieve outcomes. They are saying that LMs don’t want things in the sense that is relevant to usual arguments about deceptive alignment or reward hacking (e.g. don’t seem to have preferences about the training objective, or that are coherent over time).
I would say that current LLMs, when prompted and RLHF’d appropriately, and especially when also strapped into an AutoGPT-type scaffold/harness, DO want things. I would say that wanting things is a spectrum and that the aforementioned tweaks (appropriate prompting, AutoGPT, etc.) move the system along that spectrum. I would say that future systems will be even further along that spectrum. IDK what Nate meant but on my charitable interpretation he simply meant that they are not very far along the spectrum compared to e.g. humans or prophesied future AGIs.
It’s a response to “LLMs turned out to not be very want-y, when are the people who expected ‘agents’ going to update?” because it’s basically replying “I didn’t expect LLMs to be agenty/wanty; I do expect agenty/wanty AIs to come along before the end and indeed we are already seeing progress in that direction.”
To the people saying “LLMs don’t want things in the sense that is relevant to the usual arguments...” I recommend rephrasing to be less confusing: Your claim is that LLMs don’t seem to have preferences about the training objective, or that are coherent over time, unless hooked up into a prompt/scaffold that explicitly tries to get them to have such preferences. I agree with this claim, but don’t think it’s contrary to my present or past models.
Two separate thoughts, based on my understanding of what the OP is gesturing at (which may not be what Nate is trying to say, but oh well):
Using that definition, LMs do “want” things, but: the extent to which it’s useful to talk about an abstraction like “wants” or “desires” depends heavily on how well the system can be modelled that way. For a system that manages to competently and robustly orient itself around complex obstacles, there’s a notion of “want” that’s very strong. For a system that’s slightly weaker than that—well, it’s still capable of orienting itself around some obstacles so there’s still a notion of “want”, but it’s correspondingly weaker, and plausibly less useful. And insofar as you’re trying to assert something about some particular behaviour of danger arising from strong wants, systems with weaker wants wouldn’t update you very much.
Unfortunately, this does mean you have to draw somewhat arbitrary lines about where strong wants begin. Making that more precise is probably useful, but the arbitrariness doesn’t seem to inherently be an argument against the framing. (To be clear though, I don’t buy this line of reasoning to the extent I think Nate does.)
On the ability to solve long-horizon tasks: I think of it as a proxy measure for how strong your wants are (where the proxy breaks down because diverse contexts and complex obstacles aren’t perfectly correlated with long-horizon tasks, but might still be a pretty good one in realistic scenarios). If you have cognition that can robustly handle long-horizon tasks, then one could argue that this can measure by proxy how strongly-held and coherent its objectives are, and correspondingly how capable it is of (say) applying optimization power to break control measures you place on it or breaking the proxies you were training on.
More concretely: I expect that one answer to this might anchor to “wants at least as strong as a human’s”, in which case AI systems 10x-ing the pace of R&D autonomously would definitely suffice; the chain of logic being “has the capabilities to robustly handle real-world obstacles in pursuit of task X” ⇒ “can handle obstacles like control or limiting measures we’ve placed, in pursuit of task X”.
I also don’t buy this line of argument as much as I think Nate does, not because I disagree with my understanding of what the central chain of logic is, but because I don’t think that it applies to language models in the way he describes it (but still plausibly manifests in different ways). I agree that LMs don’t want things in the sense relevant to e.g. deceptive alignment, but primarily because I think it’s a type error—LMs are far more substrate than they are agents. You can still have agents being predicted / simulated by LMs that have strong wants if you have a sufficiently powerful system, that might not have preferences about the training objective, but which still has preferences it’s capable enough to try and achieve. Whether or not you can also ask that system “Which action would maximize the expected amount of Y?” and get a different predicted / simulated agent doesn’t answer the question of whether or not the agent you do get to try and solve a task like that on a long horizon would itself be dangerous, independent of whether you consider the system at large to be dangerous in a similar way toward a similar target.
What do you think is the sense of “wanting” needed for AI risk arguments? Why is the sense described above not enough?
In the case of literal current LLM agents with current models:
Humans manually engineer the prompting and scaffolding (and we understand how and why it works)
We can read the intermediate goals directly via just reading the CoT.
Thus, we don’t have risk from hidden, unintended, or unpredictable objectives. There is no reason to think that goal-seeking behavior due to the agency from the engineered scaffold or prompting will result in problematic generalization.
It’s unclear if this will hold in the future even for LLM agents, but it’s at least plausible that this will hold (which defeats Nate’s rather confident claim). In particular, we could run into issues from the LLM used within the LLM agent having hidden goals, but insofar as the retargeting and long-run agency is a human-engineered and reasonably understood process, the original argument from Nate doesn’t seem very relevant to risk. We also could run into issues from imitating very problematic human behavior, but this seems relatively easy to notice in most cases as it would likely be discussed out loud with non-negligible probability.
We’d also lose this property if we did a bunch of RL and most of the power of LLM agents was coming from this RL rather than imitating human optimization or humans engineering particular optimization processes.
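For concreteness, here is a minimal sketch of the kind of system being described, under loud assumptions: `ScaffoldedAgent`, `retarget`, and `toy_model` are hypothetical names made up for illustration, and the "model" is a deterministic stub rather than a real LLM. It only illustrates the two structural points above: the goal is set (and swapped) by explicit human-written scaffolding, and every intermediate subgoal passes through a plain-text log that a human can read.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM call: maps a prompt string to a reply string.
Model = Callable[[str], str]


@dataclass
class ScaffoldedAgent:
    model: Model
    goal: str  # chosen by the human operator, not by the model
    transcript: List[str] = field(default_factory=list)  # readable "CoT" log

    def retarget(self, new_goal: str) -> None:
        """Retargeting is an explicit, human-engineered step: swap the goal text."""
        self.goal = new_goal

    def step(self, observation: str) -> str:
        """Ask the model for the next subgoal and record it in plain text."""
        prompt = f"Goal: {self.goal}\nObservation: {observation}\nNext subgoal:"
        subgoal = self.model(prompt)
        self.transcript.append(subgoal)
        return subgoal


def toy_model(prompt: str) -> str:
    """Deterministic stub (an assumption, not a real LLM): routes around obstacles."""
    lines = prompt.split("\n")
    goal = lines[0].split(": ", 1)[1]
    observation = lines[1].split(": ", 1)[1]
    if observation == "blocked":
        return f"find another route to {goal}"
    return f"continue toward {goal}"


agent = ScaffoldedAgent(model=toy_model, goal="ship the report")
agent.step("clear")
agent.step("blocked")                 # the system reorients around the obstacle
agent.retarget("archive the report")  # the operator, not the model, changes the target
agent.step("blocked")
for line in agent.transcript:         # intermediate goals are readable as text
    print(line)
```

Nothing here bears on whether a real LLM inside such a loop has hidden goals of its own; the sketch only shows that the long-horizon "reorienting" can live in the scaffold rather than in the opaque component.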
It sounds like you are saying “In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we’ll be able to choose what they want (at least imperfectly, via the prompt) and we’ll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won’t be able to successfully plot against us.”
Yes of course. My research for the last few months has been focused on what happens after that, when the systems get smart enough and/or get trained so that the chain of thought is unfaithful when it needs to be faithful, e.g. the system uses euphemisms when it’s thinking about whether it’s misaligned and what to do about that.
Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn’t contradict anything we’ve said. Nate and I both agree that if we can create & maintain some sort of faithful/visible thoughts property through human-level AGI and beyond, then we are in pretty good shape & I daresay things are looking pretty optimistic. (We just need to use said AGI to solve the rest of the problem for us, whilst we monitor it to make sure it doesn’t plot against us or otherwise screw us over.)
It sounds like you are saying “In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we’ll be able to choose what they want (at least imperfectly, via the prompt) and we’ll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won’t be able to successfully plot against us.”
Basically, but more centrally that in literal current LLM agents the scary part of the system that we don’t understand (the LLM) doesn’t generalize in any scary way due to wanting while we can still get the overall system to achieve specific long term outcomes in practice. And that it’s at least plausible that this property will be preserved in the future.
I edited my earlier comment to hopefully make this more clear.
Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn’t contradict anything we’ve said. Nate and I both agree that if we can create & maintain some sort of faithful/visible thoughts property through human-level AGI and beyond, then we are in pretty good shape & I daresay things are looking pretty optimistic. (We just need to use said AGI to solve the rest of the problem for us, whilst we monitor it to make sure it doesn’t plot against us or otherwise screw us over.)
Even if we didn’t have the visible thoughts property in the actual deployed system, the fact that all of the retargeting behavior is based on explicit human engineering is still relevant and contradicts the core claim Nate makes in this post IMO.
Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn’t contradict anything we’ve said.
I think it contradicts things Nate says in this post directly. I don’t know if it contradicts things you’ve said.
To clarify, I’m commenting on the following chain:
First Nate said:
This observable “it keeps reorienting towards some target no matter what obstacle reality throws in its way” behavior is what I mean when I describe an AI as having wants/desires “in the behaviorist sense”.
as well as
Well, I claim that these are more-or-less the same fact. It’s no surprise that the AI falls down on various long-horizon tasks and that it doesn’t seem all that well-modeled as having “wants/desires”; these are two sides of the same coin.
Then, Paul responded with
I think this is a semantic motte and bailey that’s failing to think about mechanics of the situation. LM agents already have the behavior “reorient towards a target in response to obstacles,” but that’s not the sense of “wanting” about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked “how can I achieve X in this situation?” will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn’t what you need for AI risk arguments!
Then you said
What do you think is the sense of “wanting” needed for AI risk arguments? Why is the sense described above not enough?
And I was responding to this.
So, I was just trying to demonstrate at least one plausible example of a system which plausibly could pursue long term goals and doesn’t have the sense of wanting needed for AI risk arguments. In particular, LLM agents where the retargeting is purely based on human engineering (analogous to a myopic employee retargeted by a manager who cares about longer term outcomes).
This directly contradicts “Well, I claim that these are more-or-less the same fact. It’s no surprise that the AI falls down on various long-horizon tasks and that it doesn’t seem all that well-modeled as having “wants/desires”; these are two sides of the same coin.”.
My version of what’s happening in this conversation is that you and Paul are like “Well, what if it wants things but in a way which is transparent/interpretable and hence controllable by humans, e.g. if it wants what it is prompted to want?” My response is “Indeed that would be super safe, but it would still count as wanting things. Nate’s post is titled “ability to solve long-horizon tasks correlates with wanting” not “ability to solve long-horizon tasks correlates with hidden uncontrollable wanting.””
One thing at a time. First we establish that ability to solve long-horizon tasks correlates with wanting, then we argue about whether or not the future systems that are able to solve diverse long-horizon tasks better than humans can will have transparent controllable wants or not. As you yourself pointed out, insofar as we are doing lots of RL it’s dubious that the wants will remain as transparent and controllable as they are now. I meanwhile will agree that a large part of my hope for a technical solution comes from something like the Faithful CoT agenda, in which we build powerful agentic systems whose wants (and more generally, thoughts) are transparent and controllable.
If this is what’s going on, then I basically can’t imagine any context in which I would want someone to read the OP rather than a post showing examples of LM agents achieving goals and saying “it’s already the case that LM agents want things, more and more deployments of LMs will be agents, and those agents will become more competent such that it would be increasingly scary if they wanted something at cross-purposes to humans.” Is there something I’m missing?
I think your interpretation of Nate is probably wrong, but I’m not sure and happy to drop it.
FWIW, your proposed pitch “it’s already the case that...” is almost exactly the elevator pitch I currently go around giving. So maybe we agree? I’m not here to defend Nate’s choice to write this post rather than some other post.
And I’m not Daniel K., but I do want to respond to you here, Ryan. I think that the world I foresee is one in which there will be huge, tempting power gains which become obviously available to anyone willing to engage in something like RL-training their personal LLM agent (or some other method of instilling additional goal-pursuing power into it). I expect that at some point in the future the tech will change and this opportunity will become widely available, and some early adopters will begin benefiting in highly visible ways. If that future comes to pass, then I expect the world to go ‘off the rails’ because these LLMs will have correlated-but-not-equivalent goals and will become increasingly powerful (because one of the goals they get set will be to create more powerful agents).
I don’t think that’s the only way things go badly in the future, but I think it’s an important danger we need to be on guard against. Thus, I think that a crux between you and me is that I think there is strong reason to believe that the ‘if we did a bunch of RL’ scenario is actually quite likely. I believe it is inherently an attractor state.
To clarify, I don’t think that LLM agents are necessarily or obviously safe. I was just trying to argue that it’s plausible that they could achieve long-term objectives while also not having “wanting” in the sense necessary for (some) AI risk arguments to go through. (I edited my earlier comment to make this more clear.)
(I’m obviously not Paul)
See also this comment from Paul on a similar topic.
Thanks for the explanation btw.
Thanks for the clarification!