I very much agree that instrumentality makes agents agenty. It seems like we need them to be agenty to get stuff done. Whether it’s translating data into a report or researching new cancer drugs, we have instrumental goals we want help with. And those instrumental goals have important subgoals, like making sure no one switches you off before you accomplish the goal.
You know all of that; you’re thinking that useful work gets done using solely training. I think that only works if the training produces a general-purpose search to effectively do instrumental goal-directed behavior with arbitrary subgoals appropriate to the task. But I don’t have a good argument for why I think that human-style problem solving will be vastly more efficient than trying to train useful human-level capabilities into something without real instrumental goal-seeking with flexible subgoals.
I guess the closest I can come is that it seems very difficult to create something smart enough to solve complex tasks, but so inflexible that it can’t figure out new valuable subgoals.
Thanks for those citations, I really appreciate them! Four of them are my articles, and I’m so glad you found them valuable. And and I loved Roger Dearnaley’s why my p(doom) went down on the same topics.
I agree there’s pressure towards instrumental goals once LLMs get agentized, where I think I diverge is that the feedback will be a lot denser and way more constraining than evolution on human minds, so much so that I think a lot of the instrumental goals that does arise is very aimable by default. More generally, I consider the instrumental convergence that made humans destroy everyone else, including gorillas and chimpanzees as very much outilers, and I think that human feedback/human alignment attempts will be far more effective in aiming instrumental convergence than what chimpanzees and gorillas did, or what evolution did to humans.
Another way to say it is conditional on instrumental goals arising in LLMs after agentization, I expect them to be very aimable and controllable by default.
I think I’m understanding you now, and I think I agree.
You might be saying the same thing I’ve expressed something like: LLMs already follow instructions well enough to serve as the cognitive core of an LLM cognitive architecture, where the goals are supplied as prompts from surrounding scaffolding. Improvements in LLMs need merely maintain the same or better levels of aimability. Occasional mistakes and even Waluigi hostile simulacra will be overwhelmed by the remainder of well-aimed behavior and double-checking mechanisms.
Or you may be addressing LLms that are agentized in a different way: by applying RL for achieving specified insstrumental goals over many steps of cognition and actions with tool calls.
I’m much more uneasy about that route, and distrubed that Demis Hassabis described the path to AGI as Gemini combined with AlphaZero. But if that’s only part of the training, and it works well enough, I think a prompted goal and scaffolded double-checks could be enough. Adding reflection and self-editing is another way to ensure that largely useful behavior outweighs occasional mistakes and hostile simulacra in the core LLM/foundation model.
I think my point is kind of like that, but more so emphasizing the amount of feedback we can give compared to evolution, and more importantly training on goals that have denser rewards tends to provide for safer AI systems.
To address this scenario:
Or you may be addressing LLms that are agentized in a different way: by applying RL for achieving specified insstrumental goals over many steps of cognition and actions with tool calls.
I’m much more uneasy about that route, and distrubed [sic] that Demis Hassabis described the path to AGI as Gemini combined with AlphaZero.
The big difference here is that I expect conditional on Demis Hassabis’s plans working, the following things make things easier to constrain the solution in ways that help with safety and alignment:
I don’t expect sparse reward RL to work, and to expect it to require a densely defined reward, which constrains the shape of solutions a lot, and I think there is a real chance we can add other constraints to the reward function to rule out more unsafe solutions.
It will likely involve non-instrumental world models, and in particular I think there are real ways to aim instrumental convergence (Albeit unlike in the case of predictive models, you might not have a base of non-instrumentally convergent behavior, so be careful with how you’ve set up your constraints.)
I should note that a lot of the arguments for RL breaking things more compared to LLMs, while sort of correct, are blunted a lot because compared to natural selection which probably used 1046-1048 flops of compute, which is way more than any realistic run, conditioning on TAI/AGI/ASI occuring in this century, essentially allowed for ridiculously sparse rewards like “inclusive genetic fitness”, and evolution hasn’t interfered nearly as much a human will interfere with their AI.
So to answer the question, my answer is it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I’d probably recommend starting from a base of a pre-trained model like GPT-N though to maximize our safety and alignment chances.
Here are some more links and quotes on Rl and non-instrumental world models:
This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don’t know how to correctly specify bounds.
Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from “complete failure” to “insane useless crap.” It’s notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
I’ve read each of the posts you cite thoroughly, some of them recently. I remain unclear on one thing: how do you expect to have a densely defined reinforcement signal? I can see that happening if you have some other system estimating success in arbitrary situations; that would be dense but very noisy. Which might be fine.
It would be noisy, but still dense. It wouldn’t include goals like “maximize success across tasks and time”. Unless the agent was made coherent and autonomous—in which case the reflectively stable center of all of that RL training might be something like that.
I think mostly about AGI that is fully autonomous and therefore becomes coherent around its goals, for better or worse. I wonder if that might be another important difference of perspective. You said
it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I don’t understand why they wouldn’t intrinsically be agents after that RL training?
I want to understand, because I believe refining my model of what AGI will first look like should help a lot with working through alignment schemes adequately before they’re tried.
My thought is that they’d need to take arbitrary goals, and create arbitrary subgoals, which training wouldn’t always cover. There are an infinite number of valuable tasks in the world. But I can also see the argument that most useful tasks fall into categories, and training on those categories might be not just useful but adequate for most of what we want from AGI.
If that’s the type of scenario you’re addressing, I think that’s plausible for many AGI projects. But I think the same argument I make for LLMs and other “oracle” AGI: someone will turn it into a full real agent very soon; it will have more economic value, but even if it doesn’t, people will do it just for the hell of it, because it’s interesting.
With LLMs it’s as simple as repeating the prompt “keep working on that problem, pursuing goal X, using tools Y”. With another architecture, it might be a little different- but turning adequate intelligence into a true agent is almost trivial. Some monkey will pull that lever almost as soon as it’s available.
You’ve probably heard that argument somewhere before, so I may well be misunderstanding your scenario still.
Thanks for the dialogue here, this is useful for my work on my draft post “how we’ll try to align AGI”.
I remain unclear on one thing: how do you expect to have a densely defined reinforcement signal?
Basically, via lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it’s done early in the training run, before it can try to deceive or manipulate us.
More generally, the abuse of synthetic data means we have complete control over the inputs to the AI model, which means we can very easily detect stuff like deception and takeover risk.
For example, we can feed RL and LLM agents information about interpretability techniques not working, despite them actually working, or feed them exploits that are both easy and large for misaligned AI to do that seem to work, but doesn’t actually work.
It’s best to make large synthetic datasets now, so that we can apply it continuously throughout AGI/ASI training, and in particular do it before it is capable of learning deceptiveness/training games.
I think mostly about AGI that is fully autonomous and therefore becomes coherent around its goals, for better or worse. I wonder if that might be another important difference of perspective. You said
it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I don’t understand why they wouldn’t intrinsically be agents after that RL training?
If that’s the type of scenario you’re addressing, I think that’s plausible for many AGI projects. But I think the same argument I make for LLMs and other “oracle” AGI: someone will turn it into a full real agent very soon; it will have more economic value, but even if it doesn’t, people will do it just for the hell of it, because it’s interesting.
With LLMs it’s as simple as repeating the prompt “keep working on that problem, pursuing goal X, using tools Y”. With another architecture, it might be a little different- but turning adequate intelligence into a true agent is almost trivial. Some monkey will pull that lever almost as soon as it’s available.
You’ve probably heard that argument somewhere before, so I may well be misunderstanding your scenario still.
I was just referring to this post on how RL policies aren’t automatically agents, without other assumptions. I agree that they will likely be agentized by someone if RL doesn’t agentize them, and I agree with your assumptions on why they will be agentic RL/LLM AIs.
Also, the argument against synthetic data working because raters make large amounts of compactly describable errors has evidence against it, at least in the data-constrained case.
At a broader level, my point is that even conditional on you being correct that fully autonomous AI that is coherent across goals will be trained by somebody soon, the path to being coherent and autonomous is both important and influenceable to be more aligned by us.
Thanks for the dialogue here, this is useful for my work on my draft post “how we’ll try to align AGI”.
And thank you for being willing to read so much. I will ask you to read more posts and comments here, so that I can finally explicate what exactly is the plan to align AGI via RL or LLMs, which is large synthetic datasets.
I just finished reading all of those links. I was familiar with Roger Dearnaleys’ proposal of synthetic data for alignment but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, but not fully thought through. It does seem to have a higher tax/effort than the methods so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
I got caught up reading the top comment thread above the Turntrout/Wentworth exchange you linked. I’d somehow missed that by being off-grid when the excellent All the Shoggoths Merely Players came out. It’s my nomination for SOTA of the current alignment difficulty discussion.
I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI- but then quite possibly tragically extinct ourselves with standard human combativeness/paranoia or foolishness—If we solve alignment, do we die anyway?
I think you’ve read that and we’ve had a discussion there, but I’m leaving that link here as the next step in this discussion now that we’ve reached approximate convergence.
I just finished reading all of those links. I was familiar with Roger Dearnaleys’ proposal of synthetic data for alignment but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, but not fully thought through. It does seem to have a higher tax/effort than the methods so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
I agree it has a higher tax rate than RLHF, but to make the case for lower tax rates than people think, it’s because synthetic data will likely be a huge part of what makes AGI into ASI, as models require a lot of data, and synthetic data is a potentially huge industry in futures where AI progress is very high, because the amount of human data is both way too limiting for future AIs, and probably doesn’t show superhuman behavior like we want from LLMs/RL.
Thus huge amounts of synthetic data will be heavily used as part of capabilities progress, meaning we can incentivize them to also put alignment data in the synthetic data.
I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI- but then quite possibly tragically extinct ourselves with standard human combativeness/paranoia or foolishness—If we solve alignment, do we die anyway?
This is why I think we will need to use targeted removals of capabilities like LEACE combined with using synthetic data to remove infohazardous knowledge, combined with not open-weighting/open-sourcing models as AIs get more capable and only allowing controlled API use.
Keeping a superintelligence ignorant of certain concepts sounds impossible. Even a “real AGI” of the type I expect soon will be able to reason and learn, causing it to rapidly rediscover any concepts you’ve carefully left out of the training set. Leaving out this relatively easy capability (to reason and learn online) will hurt capabilities, so you’d have a huge uphill battle in keeping it out of deployed AGI. At least one current projects have already accomplished limited (but impressive) forms of this as part of their strategy to create useful LM agents. So I don’t think it’s getting rolled back or left out.
I agree with you that there are probably better methods to handle the misuse risk, and note I also pointed out them as options, not exactly guarantees.
And yeah, I agree with this specifically:
but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
Thanks for mentioning that.
Now that I think about it, I agree that it’s only a stop gap for misuse, and yeah if there is even limited generalization ability, I agree that LLMs will be able to rediscover dangerous knowledge, so we will need to make LLMs that don’t let users completely make bio-weapons for example.
I very much agree that instrumentality makes agents agenty. It seems like we need them to be agenty to get stuff done. Whether it’s translating data into a report or researching new cancer drugs, we have instrumental goals we want help with. And those instrumental goals have important subgoals, like making sure no one switches you off before you accomplish the goal.
You know all of that; you’re thinking that useful work gets done using solely training. I think that only works if the training produces a general-purpose search to effectively do instrumental goal-directed behavior with arbitrary subgoals appropriate to the task. But I don’t have a good argument for why I think that human-style problem solving will be vastly more efficient than trying to train useful human-level capabilities into something without real instrumental goal-seeking with flexible subgoals.
I guess the closest I can come is that it seems very difficult to create something smart enough to solve complex tasks, but so inflexible that it can’t figure out new valuable subgoals.
Thanks for those citations, I really appreciate them! Four of them are my articles, and I’m so glad you found them valuable. And and I loved Roger Dearnaley’s why my p(doom) went down on the same topics.
I agree there’s pressure towards instrumental goals once LLMs get agentized, where I think I diverge is that the feedback will be a lot denser and way more constraining than evolution on human minds, so much so that I think a lot of the instrumental goals that does arise is very aimable by default. More generally, I consider the instrumental convergence that made humans destroy everyone else, including gorillas and chimpanzees as very much outilers, and I think that human feedback/human alignment attempts will be far more effective in aiming instrumental convergence than what chimpanzees and gorillas did, or what evolution did to humans.
Another way to say it is conditional on instrumental goals arising in LLMs after agentization, I expect them to be very aimable and controllable by default.
I think I’m understanding you now, and I think I agree.
You might be saying the same thing I’ve expressed something like: LLMs already follow instructions well enough to serve as the cognitive core of an LLM cognitive architecture, where the goals are supplied as prompts from surrounding scaffolding. Improvements in LLMs need merely maintain the same or better levels of aimability. Occasional mistakes and even Waluigi hostile simulacra will be overwhelmed by the remainder of well-aimed behavior and double-checking mechanisms.
Or you may be addressing LLms that are agentized in a different way: by applying RL for achieving specified insstrumental goals over many steps of cognition and actions with tool calls.
I’m much more uneasy about that route, and distrubed that Demis Hassabis described the path to AGI as Gemini combined with AlphaZero. But if that’s only part of the training, and it works well enough, I think a prompted goal and scaffolded double-checks could be enough. Adding reflection and self-editing is another way to ensure that largely useful behavior outweighs occasional mistakes and hostile simulacra in the core LLM/foundation model.
I think my point is kind of like that, but more so emphasizing the amount of feedback we can give compared to evolution, and more importantly training on goals that have denser rewards tends to provide for safer AI systems.
To address this scenario:
The big difference here is that I expect conditional on Demis Hassabis’s plans working, the following things make things easier to constrain the solution in ways that help with safety and alignment:
I don’t expect sparse reward RL to work, and to expect it to require a densely defined reward, which constrains the shape of solutions a lot, and I think there is a real chance we can add other constraints to the reward function to rule out more unsafe solutions.
It will likely involve non-instrumental world models, and in particular I think there are real ways to aim instrumental convergence (Albeit unlike in the case of predictive models, you might not have a base of non-instrumentally convergent behavior, so be careful with how you’ve set up your constraints.)
I should note that a lot of the arguments for RL breaking things more compared to LLMs, while sort of correct, are blunted a lot because compared to natural selection which probably used 1046-1048 flops of compute, which is way more than any realistic run, conditioning on TAI/AGI/ASI occuring in this century, essentially allowed for ridiculously sparse rewards like “inclusive genetic fitness”, and evolution hasn’t interfered nearly as much a human will interfere with their AI.
I got the flops number from this website:
https://www.getguesstimate.com/models/10685
So to answer the question, my answer is it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I’d probably recommend starting from a base of a pre-trained model like GPT-N though to maximize our safety and alignment chances.
Here are some more links and quotes on Rl and non-instrumental world models:
https://www.lesswrong.com/posts/rZ6wam9gFGFQrCWHc/#mT792uAy4ih3qCDfx
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/?commentId=QciMJ9ehR9xbTexcc
Where we can validly turn utility maximization over plans and predictions into world states.
And finally a link on how to control an LLM’s behavior, which while not related too much to RL, is nontheless interesting:
https://www.lesswrong.com/posts/JviYwAk5AfBR7HhEn/how-to-control-an-llm-s-behavior-why-my-p-doom-went-down-1
I’ve read each of the posts you cite thoroughly, some of them recently. I remain unclear on one thing: how do you expect to have a densely defined reinforcement signal? I can see that happening if you have some other system estimating success in arbitrary situations; that would be dense but very noisy. Which might be fine.
It would be noisy, but still dense. It wouldn’t include goals like “maximize success across tasks and time”. Unless the agent was made coherent and autonomous—in which case the reflectively stable center of all of that RL training might be something like that.
I think mostly about AGI that is fully autonomous and therefore becomes coherent around its goals, for better or worse. I wonder if that might be another important difference of perspective. You said
I don’t understand why they wouldn’t intrinsically be agents after that RL training?
I want to understand, because I believe refining my model of what AGI will first look like should help a lot with working through alignment schemes adequately before they’re tried.
My thought is that they’d need to take arbitrary goals, and create arbitrary subgoals, which training wouldn’t always cover. There are an infinite number of valuable tasks in the world. But I can also see the argument that most useful tasks fall into categories, and training on those categories might be not just useful but adequate for most of what we want from AGI.
If that’s the type of scenario you’re addressing, I think that’s plausible for many AGI projects. But I think the same argument I make for LLMs and other “oracle” AGI: someone will turn it into a full real agent very soon; it will have more economic value, but even if it doesn’t, people will do it just for the hell of it, because it’s interesting.
With LLMs it’s as simple as repeating the prompt “keep working on that problem, pursuing goal X, using tools Y”. With another architecture, it might be a little different- but turning adequate intelligence into a true agent is almost trivial. Some monkey will pull that lever almost as soon as it’s available.
You’ve probably heard that argument somewhere before, so I may well be misunderstanding your scenario still.
Thanks for the dialogue here, this is useful for my work on my draft post “how we’ll try to align AGI”.
Basically, via lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it’s done early in the training run, before it can try to deceive or manipulate us.
More generally, the abuse of synthetic data means we have complete control over the inputs to the AI model, which means we can very easily detect stuff like deception and takeover risk.
For example, we can feed RL and LLM agents information about interpretability techniques not working, despite them actually working, or feed them exploits that are both easy and large for misaligned AI to do that seem to work, but doesn’t actually work.
More here:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1
It’s best to make large synthetic datasets now, so that we can apply it continuously throughout AGI/ASI training, and in particular do it before it is capable of learning deceptiveness/training games.
I was just referring to this post on how RL policies aren’t automatically agents, without other assumptions. I agree that they will likely be agentized by someone if RL doesn’t agentize them, and I agree with your assumptions on why they will be agentic RL/LLM AIs.
https://www.lesswrong.com/posts/rmfjo4Wmtgq8qa2B7/think-carefully-before-calling-rl-policies-agents
Also, the argument against synthetic data working because raters make large amounts of compactly describable errors has evidence against it, at least in the data-constrained case.
Some relevant links are these:
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/#74DdsQ7wtDnx4ChDX
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/#R9Bfu6tzmuWRCT6DB
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/?commentId=AoxYQR9jLSLtjvLno#AoxYQR9jLSLtjvLno
At a broader level, my point is that even conditional on you being correct that fully autonomous AI that is coherent across goals will be trained by somebody soon, the path to being coherent and autonomous is both important and influenceable to be more aligned by us.
And thank you for being willing to read so much. I will ask you to read more posts and comments here, so that I can finally explicate what exactly is the plan to align AGI via RL or LLMs, which is large synthetic datasets.
I just finished reading all of those links. I was familiar with Roger Dearnaleys’ proposal of synthetic data for alignment but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, but not fully thought through. It does seem to have a higher tax/effort than the methods so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
I got caught up reading the top comment thread above the Turntrout/Wentworth exchange you linked. I’d somehow missed that by being off-grid when the excellent All the Shoggoths Merely Players came out. It’s my nomination for SOTA of the current alignment difficulty discussion.
I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI- but then quite possibly tragically extinct ourselves with standard human combativeness/paranoia or foolishness—If we solve alignment, do we die anyway?
I think you’ve read that and we’ve had a discussion there, but I’m leaving that link here as the next step in this discussion now that we’ve reached approximate convergence.
I agree it has a higher tax rate than RLHF, but to make the case for lower tax rates than people think, it’s because synthetic data will likely be a huge part of what makes AGI into ASI, as models require a lot of data, and synthetic data is a potentially huge industry in futures where AI progress is very high, because the amount of human data is both way too limiting for future AIs, and probably doesn’t show superhuman behavior like we want from LLMs/RL.
Thus huge amounts of synthetic data will be heavily used as part of capabilities progress, meaning we can incentivize them to also put alignment data in the synthetic data.
This is why I think we will need to use targeted removals of capabilities like LEACE combined with using synthetic data to remove infohazardous knowledge, combined with not open-weighting/open-sourcing models as AIs get more capable and only allowing controlled API use.
Here’s the LEACE paper and code:
https://github.com/EleutherAI/concept-erasure/pull/2
https://github.com/EleutherAI/concept-erasure
https://github.com/EleutherAI/concept-erasure/releases/tag/v0.2.0
https://arxiv.org/abs/2306.03819
https://blog.eleuther.ai/oracle-leace
I’ll reread that post again.
Agreed on the capabilities advantages of synthetic data; so it might not be much of a tax at all to mix in some alignment.
I don’t think removing infohazardous knowledge will work all the way into dangerous AGI, but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
Keeping a superintelligence ignorant of certain concepts sounds impossible. Even a “real AGI” of the type I expect soon will be able to reason and learn, causing it to rapidly rediscover any concepts you’ve carefully left out of the training set. Leaving out this relatively easy capability (to reason and learn online) will hurt capabilities, so you’d have a huge uphill battle in keeping it out of deployed AGI. At least one current projects have already accomplished limited (but impressive) forms of this as part of their strategy to create useful LM agents. So I don’t think it’s getting rolled back or left out.
I agree with you that there are probably better methods to handle the misuse risk, and note I also pointed out them as options, not exactly guarantees.
And yeah, I agree with this specifically:
Thanks for mentioning that.
Now that I think about it, I agree that it’s only a stop gap for misuse, and yeah if there is even limited generalization ability, I agree that LLMs will be able to rediscover dangerous knowledge, so we will need to make LLMs that don’t let users completely make bio-weapons for example.