So personal intent alignment is basically all we get except in perhaps very small groups.
I like @Edward P. Könings’s statement here on what people are actually doing when they try to improve society:
⦾ Carrying out a simplification or homogenization of the multiple preferences of the individuals that make up that society;
⦾ Modeling your own personal preferences as if these were the preferences of society as a whole.
2. While I agree that there are definitely paths to danger, I see reality as less offense biased than LWers/rationalists tend to think, enough so that I think that a multi-polar scenario doesn’t leave us automatically/highly doomed.
(Biology is probably the closest exception, but I also expect this to be fixable).
(That said, it would be good to have a policy on when to stop open-weighting/open-sourcing powerful AIs, because of the risk.)
Also, this is an uncompleted sentence: “which create more illusory disagreements between those who mean personal intent alignment”.
On footnote 4, I think a general crux is that as LLMs get more coherent and powerful, I still expect them to be corrigible by default, because the utility function that they are maximizing doesn’t imply that they will try to assert their existence, which is:
The ideal predictor’s utility function is instead strictly over the model’s own outputs, conditional on inputs.
And I think there will be more constraints on AI development than on human development, such that at least in training, that unbounded/large instrumental convergence is very unlikely to be rewarded as much as LWers assumed.
I agree that they will be easy to agentize, and that many people will try to agentize LLMs, but the value of a capable, not unbounded/very instrumentally convergent AI is very valuable, as it unblocks the path to corrigibility/instruction following AGI/ASI.
Agree with you in that we should probably aim for instruction following for the first AGI/ASI.
This part is very important, and I agree wholeheartedly with this point:
There’s an intuition that intent alignment isn’t workable for a full AGI; something that’s competent or self-aware usually[5] has its own goals, so doesn’t just follow instructions.
But that intuition is is based on our experience with existing minds. What if that synthetic being’s explicit, considered goal is to approximately follow instructions?
I think it’s possible for a fully self-aware, goal-oriented AGI to have its goal be, loosely speaking, a pointer to someone else’s goals. No human is oriented this way, but it seems conceptually coherent to want to do, with all of your heart, just what someone else wants.
I think a lot of alignment discourse was thrown off in assuming that what the properties of human minds, especially for values/alignment properties were what AI systems had to look like in the limit of ASI, and more generally I think people heavily overestimated how much evidence human/evolution analogies brought on questions of AI alignment, compared to current Deep Learning/AI systems of today.
Indeed, I actually expect a lot of AIs to have corrigibility/personal intent alignment/instruction following/DWIMAC by default, given a minimally instrumentally convergent base.
Finally, I think one important implication is that if we are in a world where it’s easy to align AIs to instruction following/personal intent, politics starts mattering again, and as AI takeoff happens, who your AIs are aligned to on politics will probably become a very important factor in how much you use your AIs.
So personal intent alignment is basically all we get except in perhaps very small groups.
I want to disagree here. I think that a widely acceptable compromise on political rules, and the freedom to pursue happiness on one’s own terms without violating others’ rights, is quite achievable and desirable. I think that having a powerful AI establish/maintain the best possible government given the conflicting sets of values held by all parties is a great outcome. I agree that this isn’t what is generally meant by ‘values alignment’, but I think it’s a more useful thing to talk about.
I do agree that large groups of humans do seem to inevitably have contradictory values such that no perfect resolution is possible. I just think that that is beside the point, and not what we should even be fantasizing about. I also agree that most people who seem excited about ‘values alignment’ mean ‘alignment to their own values’. I’ve had numerous conversations with such people about the problem of people with harmful intent towards others (e.g. sadism, vengeance). I have yet to receive anything even remotely resembling a coherent response to this. Averaging values doesn’t solve the problem, there are weird bad edge cases that that falls into. Instead, you need to focus on a widely (but not necessarily unanimously) acceptable political compromise.
I’m glad you agree on the importance of PIA being workable for real AGI. I sometimes wonder if I’m hallucinating this huge elephant in the room.
I’m not sure I’d a expect DWIMAC type alignment to emerge by default if you mean that an LLM-centered agent might just decide that’s it’s reflectively stable central goal. It might, but I wouldn’t want to bet the farm on it.
If you mean that by default this is what the first successful AGI projects will try, I agree completely. I think it will look like there’s very little sensible choice in the matter once people are really thinking about increasingly competent but still subhuman agents.
Finally, I think one important implication is that if we are in a world where it’s easy to align AIs to instruction following/personal intent, politics starts mattering again, and as AI takeoff happens, who your AIs are aligned to on politics will probably become a very important factor in how much you use your AIs.
This was exactly my conclusion after writing about and discussing this scenario for If we solve alignment, do we die anyway?. Politics are how we’ll succeed or fail at alignment as a species. I hate this conclusion, because politics is one area I have no expertise or competence in.
And I think there will be more constraints on AI development than on human development, such that at least in training, that unbounded/large instrumental convergence is very unlikely to be rewarded as much as LWers assumed.
See also these posts: [...]
I agree that they will be easy to agentize, and that many people will try to agentize LLMs, but the value of a capable, not unbounded/very instrumentally convergent AI is very valuable, as it unblocks the path to corrigibility/instruction following AGI/ASI.
I don’t really understand any of those statements, even after rereading those posts.
You might mean constraints from developers or constraints in AGIs self-improvement/learning.
I see unbounded instrumental convergence as the default in a competent reflective agent; it doesn’t need to be rewarded explicitly, it’s just what beings with goals do.
By “the value of a capable, not unbounded/very instrumentally convergent AI” you might mean either one that is or isn’t very instrumentally convergent. If you mean not very instrumentally convergent, as above, I think that’s the default and hard to avoid. I don’t think it’s possible without some very careful and difficult engineering of bounded goals (like Max Harms proposes for his definition of corrigibility, linked above). Just emitting behaviors that follow instructions isn’t nearly as useful as pursuing user-defined goals autonomously.
The hope is that for an instruction-following PIA AGI, its instrumentally convergent subgoals all align with the principal’s goals. That’s if you’ve defined/trained instruction following just right; but you can adjust it if you catch the discrepencies before the AGI is a lot smarter than you and capable of escape.
Last and least: On the impossibility of value alignment: you can’t align to everyone’s values, as you say. But you can align to the overlap among everyone’s values, something like “I’d like to be able to do whatever I want to the extent it’s possible without other people keeping me from doing what I want”. I think that’s what people typically mean by value alignment. I like Empowerment is (almost) All We Need on this.
You might mean constraints from developers or constraints in AGIs self-improvement/learning.
I’d say the closest thing I’m arguing for is from constraints on the reward function.
Re instrumental convergence being natural and unboundedly dangerous, I think a crux is that the reason why instrumental convergence is so unbounded and natural for humans doesn’t generalize to the AI case, which is that humans were essentially trained on ridiculously sparse reward and basically very long-term feedback from evolution at best on reward, and I think the type of capabilities that naturally leads to instrumental convergence being unboundedly dangerous is also the area where AI research just completely sucks at.
I think future super-intelligent AI like agentized LLMs or Model-Free/Model Based RL will be way less incentivized to learn unboundedly large instrumental convergence (at least the dangerous ones like deceptive alignment and seeking power), because of much, much denser feedback, and much more reward shaping.
I do think it would still be too easy to give it extremely harmful goals, but that’s a separate concern.
The short version is that the incentives to make AI more capable are less related to making them have dangerous instrumental convergence, because you can bound the instrumental convergence way more via very dense feedback, probably thousands or millions of times more dense feedback than what evolution had.
I very much agree that instrumentality makes agents agenty. It seems like we need them to be agenty to get stuff done. Whether it’s translating data into a report or researching new cancer drugs, we have instrumental goals we want help with. And those instrumental goals have important subgoals, like making sure no one switches you off before you accomplish the goal.
You know all of that; you’re thinking that useful work gets done using solely training. I think that only works if the training produces a general-purpose search to effectively do instrumental goal-directed behavior with arbitrary subgoals appropriate to the task. But I don’t have a good argument for why I think that human-style problem solving will be vastly more efficient than trying to train useful human-level capabilities into something without real instrumental goal-seeking with flexible subgoals.
I guess the closest I can come is that it seems very difficult to create something smart enough to solve complex tasks, but so inflexible that it can’t figure out new valuable subgoals.
Thanks for those citations, I really appreciate them! Four of them are my articles, and I’m so glad you found them valuable. And and I loved Roger Dearnaley’s why my p(doom) went down on the same topics.
I agree there’s pressure towards instrumental goals once LLMs get agentized, where I think I diverge is that the feedback will be a lot denser and way more constraining than evolution on human minds, so much so that I think a lot of the instrumental goals that does arise is very aimable by default. More generally, I consider the instrumental convergence that made humans destroy everyone else, including gorillas and chimpanzees as very much outilers, and I think that human feedback/human alignment attempts will be far more effective in aiming instrumental convergence than what chimpanzees and gorillas did, or what evolution did to humans.
Another way to say it is conditional on instrumental goals arising in LLMs after agentization, I expect them to be very aimable and controllable by default.
I think I’m understanding you now, and I think I agree.
You might be saying the same thing I’ve expressed something like: LLMs already follow instructions well enough to serve as the cognitive core of an LLM cognitive architecture, where the goals are supplied as prompts from surrounding scaffolding. Improvements in LLMs need merely maintain the same or better levels of aimability. Occasional mistakes and even Waluigi hostile simulacra will be overwhelmed by the remainder of well-aimed behavior and double-checking mechanisms.
Or you may be addressing LLms that are agentized in a different way: by applying RL for achieving specified insstrumental goals over many steps of cognition and actions with tool calls.
I’m much more uneasy about that route, and distrubed that Demis Hassabis described the path to AGI as Gemini combined with AlphaZero. But if that’s only part of the training, and it works well enough, I think a prompted goal and scaffolded double-checks could be enough. Adding reflection and self-editing is another way to ensure that largely useful behavior outweighs occasional mistakes and hostile simulacra in the core LLM/foundation model.
I think my point is kind of like that, but more so emphasizing the amount of feedback we can give compared to evolution, and more importantly training on goals that have denser rewards tends to provide for safer AI systems.
To address this scenario:
Or you may be addressing LLms that are agentized in a different way: by applying RL for achieving specified insstrumental goals over many steps of cognition and actions with tool calls.
I’m much more uneasy about that route, and distrubed [sic] that Demis Hassabis described the path to AGI as Gemini combined with AlphaZero.
The big difference here is that I expect conditional on Demis Hassabis’s plans working, the following things make things easier to constrain the solution in ways that help with safety and alignment:
I don’t expect sparse reward RL to work, and to expect it to require a densely defined reward, which constrains the shape of solutions a lot, and I think there is a real chance we can add other constraints to the reward function to rule out more unsafe solutions.
It will likely involve non-instrumental world models, and in particular I think there are real ways to aim instrumental convergence (Albeit unlike in the case of predictive models, you might not have a base of non-instrumentally convergent behavior, so be careful with how you’ve set up your constraints.)
I should note that a lot of the arguments for RL breaking things more compared to LLMs, while sort of correct, are blunted a lot because compared to natural selection which probably used 1046-1048 flops of compute, which is way more than any realistic run, conditioning on TAI/AGI/ASI occuring in this century, essentially allowed for ridiculously sparse rewards like “inclusive genetic fitness”, and evolution hasn’t interfered nearly as much a human will interfere with their AI.
So to answer the question, my answer is it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I’d probably recommend starting from a base of a pre-trained model like GPT-N though to maximize our safety and alignment chances.
Here are some more links and quotes on Rl and non-instrumental world models:
This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don’t know how to correctly specify bounds.
Fortunately, this seems to still apply to capabilities at the moment- the expected result for using RL in a sufficiently unconstrained environment often ranges from “complete failure” to “insane useless crap.” It’s notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
I’ve read each of the posts you cite thoroughly, some of them recently. I remain unclear on one thing: how do you expect to have a densely defined reinforcement signal? I can see that happening if you have some other system estimating success in arbitrary situations; that would be dense but very noisy. Which might be fine.
It would be noisy, but still dense. It wouldn’t include goals like “maximize success across tasks and time”. Unless the agent was made coherent and autonomous—in which case the reflectively stable center of all of that RL training might be something like that.
I think mostly about AGI that is fully autonomous and therefore becomes coherent around its goals, for better or worse. I wonder if that might be another important difference of perspective. You said
it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I don’t understand why they wouldn’t intrinsically be agents after that RL training?
I want to understand, because I believe refining my model of what AGI will first look like should help a lot with working through alignment schemes adequately before they’re tried.
My thought is that they’d need to take arbitrary goals, and create arbitrary subgoals, which training wouldn’t always cover. There are an infinite number of valuable tasks in the world. But I can also see the argument that most useful tasks fall into categories, and training on those categories might be not just useful but adequate for most of what we want from AGI.
If that’s the type of scenario you’re addressing, I think that’s plausible for many AGI projects. But I think the same argument I make for LLMs and other “oracle” AGI: someone will turn it into a full real agent very soon; it will have more economic value, but even if it doesn’t, people will do it just for the hell of it, because it’s interesting.
With LLMs it’s as simple as repeating the prompt “keep working on that problem, pursuing goal X, using tools Y”. With another architecture, it might be a little different- but turning adequate intelligence into a true agent is almost trivial. Some monkey will pull that lever almost as soon as it’s available.
You’ve probably heard that argument somewhere before, so I may well be misunderstanding your scenario still.
Thanks for the dialogue here, this is useful for my work on my draft post “how we’ll try to align AGI”.
I remain unclear on one thing: how do you expect to have a densely defined reinforcement signal?
Basically, via lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it’s done early in the training run, before it can try to deceive or manipulate us.
More generally, the abuse of synthetic data means we have complete control over the inputs to the AI model, which means we can very easily detect stuff like deception and takeover risk.
For example, we can feed RL and LLM agents information about interpretability techniques not working, despite them actually working, or feed them exploits that are both easy and large for misaligned AI to do that seem to work, but doesn’t actually work.
It’s best to make large synthetic datasets now, so that we can apply it continuously throughout AGI/ASI training, and in particular do it before it is capable of learning deceptiveness/training games.
I think mostly about AGI that is fully autonomous and therefore becomes coherent around its goals, for better or worse. I wonder if that might be another important difference of perspective. You said
it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I don’t understand why they wouldn’t intrinsically be agents after that RL training?
If that’s the type of scenario you’re addressing, I think that’s plausible for many AGI projects. But I think the same argument I make for LLMs and other “oracle” AGI: someone will turn it into a full real agent very soon; it will have more economic value, but even if it doesn’t, people will do it just for the hell of it, because it’s interesting.
With LLMs it’s as simple as repeating the prompt “keep working on that problem, pursuing goal X, using tools Y”. With another architecture, it might be a little different- but turning adequate intelligence into a true agent is almost trivial. Some monkey will pull that lever almost as soon as it’s available.
You’ve probably heard that argument somewhere before, so I may well be misunderstanding your scenario still.
I was just referring to this post on how RL policies aren’t automatically agents, without other assumptions. I agree that they will likely be agentized by someone if RL doesn’t agentize them, and I agree with your assumptions on why they will be agentic RL/LLM AIs.
Also, the argument against synthetic data working because raters make large amounts of compactly describable errors has evidence against it, at least in the data-constrained case.
At a broader level, my point is that even conditional on you being correct that fully autonomous AI that is coherent across goals will be trained by somebody soon, the path to being coherent and autonomous is both important and influenceable to be more aligned by us.
Thanks for the dialogue here, this is useful for my work on my draft post “how we’ll try to align AGI”.
And thank you for being willing to read so much. I will ask you to read more posts and comments here, so that I can finally explicate what exactly is the plan to align AGI via RL or LLMs, which is large synthetic datasets.
I just finished reading all of those links. I was familiar with Roger Dearnaleys’ proposal of synthetic data for alignment but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, but not fully thought through. It does seem to have a higher tax/effort than the methods so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
I got caught up reading the top comment thread above the Turntrout/Wentworth exchange you linked. I’d somehow missed that by being off-grid when the excellent All the Shoggoths Merely Players came out. It’s my nomination for SOTA of the current alignment difficulty discussion.
I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI- but then quite possibly tragically extinct ourselves with standard human combativeness/paranoia or foolishness—If we solve alignment, do we die anyway?
I think you’ve read that and we’ve had a discussion there, but I’m leaving that link here as the next step in this discussion now that we’ve reached approximate convergence.
I just finished reading all of those links. I was familiar with Roger Dearnaleys’ proposal of synthetic data for alignment but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, but not fully thought through. It does seem to have a higher tax/effort than the methods so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
I agree it has a higher tax rate than RLHF, but to make the case for lower tax rates than people think, it’s because synthetic data will likely be a huge part of what makes AGI into ASI, as models require a lot of data, and synthetic data is a potentially huge industry in futures where AI progress is very high, because the amount of human data is both way too limiting for future AIs, and probably doesn’t show superhuman behavior like we want from LLMs/RL.
Thus huge amounts of synthetic data will be heavily used as part of capabilities progress, meaning we can incentivize them to also put alignment data in the synthetic data.
I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI- but then quite possibly tragically extinct ourselves with standard human combativeness/paranoia or foolishness—If we solve alignment, do we die anyway?
This is why I think we will need to use targeted removals of capabilities like LEACE combined with using synthetic data to remove infohazardous knowledge, combined with not open-weighting/open-sourcing models as AIs get more capable and only allowing controlled API use.
Keeping a superintelligence ignorant of certain concepts sounds impossible. Even a “real AGI” of the type I expect soon will be able to reason and learn, causing it to rapidly rediscover any concepts you’ve carefully left out of the training set. Leaving out this relatively easy capability (to reason and learn online) will hurt capabilities, so you’d have a huge uphill battle in keeping it out of deployed AGI. At least one current projects have already accomplished limited (but impressive) forms of this as part of their strategy to create useful LM agents. So I don’t think it’s getting rolled back or left out.
I agree with you that there are probably better methods to handle the misuse risk, and note I also pointed out them as options, not exactly guarantees.
And yeah, I agree with this specifically:
but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
Thanks for mentioning that.
Now that I think about it, I agree that it’s only a stop gap for misuse, and yeah if there is even limited generalization ability, I agree that LLMs will be able to rediscover dangerous knowledge, so we will need to make LLMs that don’t let users completely make bio-weapons for example.
Re value alignment to all of humanity, I’ll say 2 things:
I believe it is mostly impossible except in corner/edge cases like everyone having the same preferences, because of this post:
https://www.lesswrong.com/posts/YYuB8w4nrfWmLzNob/thatcher-s-axiom
So personal intent alignment is basically all we get except in perhaps very small groups.
I like @Edward P. Könings’s statement here on what people are actually doing when they try to improve society:
2. While I agree that there are definitely paths to danger, I see reality as less offense biased than LWers/rationalists tend to think, enough so that I think that a multi-polar scenario doesn’t leave us automatically/highly doomed.
(Biology is probably the closest exception, but I also expect this to be fixable).
(That said, it would be good to have a policy on when to stop open-weighting/open-sourcing powerful AIs, because of the risk.)
Also, this is an uncompleted sentence: “which create more illusory disagreements between those who mean personal intent alignment”.
On footnote 4, I think a general crux is that as LLMs get more coherent and powerful, I still expect them to be corrigible by default, because the utility function that they are maximizing doesn’t imply that they will try to assert their existence, which is:
And I think there will be more constraints on AI development than on human development, such that at least in training, that unbounded/large instrumental convergence is very unlikely to be rewarded as much as LWers assumed.
See also these posts:
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/implied-utilities-of-simulators-are-broad-dense-and-shallow
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
https://www.lesswrong.com/posts/vs49tuFuaMEd4iskA/one-path-to-coherence-conditionalization
I agree that they will be easy to agentize, and that many people will try to agentize LLMs, but the value of a capable, not unbounded/very instrumentally convergent AI is very valuable, as it unblocks the path to corrigibility/instruction following AGI/ASI.
Agree with you in that we should probably aim for instruction following for the first AGI/ASI.
This part is very important, and I agree wholeheartedly with this point:
I think a lot of alignment discourse was thrown off in assuming that what the properties of human minds, especially for values/alignment properties were what AI systems had to look like in the limit of ASI, and more generally I think people heavily overestimated how much evidence human/evolution analogies brought on questions of AI alignment, compared to current Deep Learning/AI systems of today.
Indeed, I actually expect a lot of AIs to have corrigibility/personal intent alignment/instruction following/DWIMAC by default, given a minimally instrumentally convergent base.
Finally, I think one important implication is that if we are in a world where it’s easy to align AIs to instruction following/personal intent, politics starts mattering again, and as AI takeoff happens, who your AIs are aligned to on politics will probably become a very important factor in how much you use your AIs.
I want to disagree here. I think that a widely acceptable compromise on political rules, and the freedom to pursue happiness on one’s own terms without violating others’ rights, is quite achievable and desirable. I think that having a powerful AI establish/maintain the best possible government given the conflicting sets of values held by all parties is a great outcome. I agree that this isn’t what is generally meant by ‘values alignment’, but I think it’s a more useful thing to talk about.
I do agree that large groups of humans do seem to inevitably have contradictory values such that no perfect resolution is possible. I just think that that is beside the point, and not what we should even be fantasizing about. I also agree that most people who seem excited about ‘values alignment’ mean ‘alignment to their own values’. I’ve had numerous conversations with such people about the problem of people with harmful intent towards others (e.g. sadism, vengeance). I have yet to receive anything even remotely resembling a coherent response to this. Averaging values doesn’t solve the problem, there are weird bad edge cases that that falls into. Instead, you need to focus on a widely (but not necessarily unanimously) acceptable political compromise.
Thanks for the detailed response!
I’m glad you agree on the importance of PIA being workable for real AGI. I sometimes wonder if I’m hallucinating this huge elephant in the room.
I’m not sure I’d a expect DWIMAC type alignment to emerge by default if you mean that an LLM-centered agent might just decide that’s it’s reflectively stable central goal. It might, but I wouldn’t want to bet the farm on it.
If you mean that by default this is what the first successful AGI projects will try, I agree completely. I think it will look like there’s very little sensible choice in the matter once people are really thinking about increasingly competent but still subhuman agents.
This was exactly my conclusion after writing about and discussing this scenario for If we solve alignment, do we die anyway?. Politics are how we’ll succeed or fail at alignment as a species. I hate this conclusion, because politics is one area I have no expertise or competence in.
I don’t really understand any of those statements, even after rereading those posts.
You might mean constraints from developers or constraints in AGIs self-improvement/learning.
I see unbounded instrumental convergence as the default in a competent reflective agent; it doesn’t need to be rewarded explicitly, it’s just what beings with goals do.
By “the value of a capable, not unbounded/very instrumentally convergent AI” you might mean either one that is or isn’t very instrumentally convergent. If you mean not very instrumentally convergent, as above, I think that’s the default and hard to avoid. I don’t think it’s possible without some very careful and difficult engineering of bounded goals (like Max Harms proposes for his definition of corrigibility, linked above). Just emitting behaviors that follow instructions isn’t nearly as useful as pursuing user-defined goals autonomously.
The hope is that for an instruction-following PIA AGI, its instrumentally convergent subgoals all align with the principal’s goals. That’s if you’ve defined/trained instruction following just right; but you can adjust it if you catch the discrepencies before the AGI is a lot smarter than you and capable of escape.
Last and least: On the impossibility of value alignment: you can’t align to everyone’s values, as you say. But you can align to the overlap among everyone’s values, something like “I’d like to be able to do whatever I want to the extent it’s possible without other people keeping me from doing what I want”. I think that’s what people typically mean by value alignment. I like Empowerment is (almost) All We Need on this.
Again, thanks for engaging closely with this!
Note, I edited my first comment to also include this link, which I somehow forgot to do, and I especially appreciate footnote 3 on that post:
https://www.lesswrong.com/posts/EBKJq2gkhvdMg5nTQ/instrumentality-makes-agents-agenty
To respond to this:
I’d say the closest thing I’m arguing for is from constraints on the reward function.
Re instrumental convergence being natural and unboundedly dangerous, I think a crux is that the reason why instrumental convergence is so unbounded and natural for humans doesn’t generalize to the AI case, which is that humans were essentially trained on ridiculously sparse reward and basically very long-term feedback from evolution at best on reward, and I think the type of capabilities that naturally leads to instrumental convergence being unboundedly dangerous is also the area where AI research just completely sucks at.
I think future super-intelligent AI like agentized LLMs or Model-Free/Model Based RL will be way less incentivized to learn unboundedly large instrumental convergence (at least the dangerous ones like deceptive alignment and seeking power), because of much, much denser feedback, and much more reward shaping.
I do think it would still be too easy to give it extremely harmful goals, but that’s a separate concern.
The short version is that the incentives to make AI more capable are less related to making them have dangerous instrumental convergence, because you can bound the instrumental convergence way more via very dense feedback, probably thousands or millions of times more dense feedback than what evolution had.
Some more links on the topic:
https://www.lesswrong.com/posts/dcoxvEhAfYcov2LA6/agentized-llms-will-change-the-alignment-landscape
https://www.lesswrong.com/posts/JviYwAk5AfBR7HhEn/how-to-control-an-llm-s-behavior-why-my-p-doom-went-down-1
https://www.lesswrong.com/posts/DfJCTp4MxmTFnYvgF/goals-selected-from-learned-knowledge-an-alternative-to-rl
https://www.lesswrong.com/posts/xqqhwbH2mq6i4iLmK/we-have-promising-alignment-plans-with-low-taxes
https://www.lesswrong.com/posts/ogHr8SvGqg9pW5wsT/capabilities-and-alignment-of-llm-cognitive-architectures
I very much agree that instrumentality makes agents agenty. It seems like we need them to be agenty to get stuff done. Whether it’s translating data into a report or researching new cancer drugs, we have instrumental goals we want help with. And those instrumental goals have important subgoals, like making sure no one switches you off before you accomplish the goal.
You know all of that; you’re thinking that useful work gets done using solely training. I think that only works if the training produces a general-purpose search to effectively do instrumental goal-directed behavior with arbitrary subgoals appropriate to the task. But I don’t have a good argument for why I think that human-style problem solving will be vastly more efficient than trying to train useful human-level capabilities into something without real instrumental goal-seeking with flexible subgoals.
I guess the closest I can come is that it seems very difficult to create something smart enough to solve complex tasks, but so inflexible that it can’t figure out new valuable subgoals.
Thanks for those citations, I really appreciate them! Four of them are my articles, and I’m so glad you found them valuable. And and I loved Roger Dearnaley’s why my p(doom) went down on the same topics.
I agree there’s pressure towards instrumental goals once LLMs get agentized, where I think I diverge is that the feedback will be a lot denser and way more constraining than evolution on human minds, so much so that I think a lot of the instrumental goals that does arise is very aimable by default. More generally, I consider the instrumental convergence that made humans destroy everyone else, including gorillas and chimpanzees as very much outilers, and I think that human feedback/human alignment attempts will be far more effective in aiming instrumental convergence than what chimpanzees and gorillas did, or what evolution did to humans.
Another way to say it is conditional on instrumental goals arising in LLMs after agentization, I expect them to be very aimable and controllable by default.
I think I’m understanding you now, and I think I agree.
You might be saying the same thing I’ve expressed something like: LLMs already follow instructions well enough to serve as the cognitive core of an LLM cognitive architecture, where the goals are supplied as prompts from surrounding scaffolding. Improvements in LLMs need merely maintain the same or better levels of aimability. Occasional mistakes and even Waluigi hostile simulacra will be overwhelmed by the remainder of well-aimed behavior and double-checking mechanisms.
Or you may be addressing LLms that are agentized in a different way: by applying RL for achieving specified insstrumental goals over many steps of cognition and actions with tool calls.
I’m much more uneasy about that route, and distrubed that Demis Hassabis described the path to AGI as Gemini combined with AlphaZero. But if that’s only part of the training, and it works well enough, I think a prompted goal and scaffolded double-checks could be enough. Adding reflection and self-editing is another way to ensure that largely useful behavior outweighs occasional mistakes and hostile simulacra in the core LLM/foundation model.
I think my point is kind of like that, but more so emphasizing the amount of feedback we can give compared to evolution, and more importantly training on goals that have denser rewards tends to provide for safer AI systems.
To address this scenario:
The big difference here is that I expect conditional on Demis Hassabis’s plans working, the following things make things easier to constrain the solution in ways that help with safety and alignment:
I don’t expect sparse reward RL to work, and to expect it to require a densely defined reward, which constrains the shape of solutions a lot, and I think there is a real chance we can add other constraints to the reward function to rule out more unsafe solutions.
It will likely involve non-instrumental world models, and in particular I think there are real ways to aim instrumental convergence (Albeit unlike in the case of predictive models, you might not have a base of non-instrumentally convergent behavior, so be careful with how you’ve set up your constraints.)
I should note that a lot of the arguments for RL breaking things more compared to LLMs, while sort of correct, are blunted a lot because compared to natural selection which probably used 1046-1048 flops of compute, which is way more than any realistic run, conditioning on TAI/AGI/ASI occuring in this century, essentially allowed for ridiculously sparse rewards like “inclusive genetic fitness”, and evolution hasn’t interfered nearly as much a human will interfere with their AI.
I got the flops number from this website:
https://www.getguesstimate.com/models/10685
So to answer the question, my answer is it would be good for you to think of alignment methods on agentized RL systems like AlphaZero, but that they aren’t intrisincally agents, and are not much more dangerous than LLMs provided you’ve constrained the reward function enough.
I’d probably recommend starting from a base of a pre-trained model like GPT-N though to maximize our safety and alignment chances.
Here are some more links and quotes on Rl and non-instrumental world models:
https://www.lesswrong.com/posts/rZ6wam9gFGFQrCWHc/#mT792uAy4ih3qCDfx
https://www.lesswrong.com/posts/k48vB92mjE9Z28C3s/?commentId=QciMJ9ehR9xbTexcc
Where we can validly turn utility maximization over plans and predictions into world states.
And finally a link on how to control an LLM’s behavior, which while not related too much to RL, is nontheless interesting:
https://www.lesswrong.com/posts/JviYwAk5AfBR7HhEn/how-to-control-an-llm-s-behavior-why-my-p-doom-went-down-1
I’ve read each of the posts you cite thoroughly, some of them recently. I remain unclear on one thing: how do you expect to have a densely defined reinforcement signal? I can see that happening if you have some other system estimating success in arbitrary situations; that would be dense but very noisy. Which might be fine.
It would be noisy, but still dense. It wouldn’t include goals like “maximize success across tasks and time”. Unless the agent was made coherent and autonomous—in which case the reflectively stable center of all of that RL training might be something like that.
I think mostly about AGI that is fully autonomous and therefore becomes coherent around its goals, for better or worse. I wonder if that might be another important difference of perspective. You said
I don’t understand why they wouldn’t intrinsically be agents after that RL training?
I want to understand, because I believe refining my model of what AGI will first look like should help a lot with working through alignment schemes adequately before they’re tried.
My thought is that they’d need to take arbitrary goals, and create arbitrary subgoals, which training wouldn’t always cover. There are an infinite number of valuable tasks in the world. But I can also see the argument that most useful tasks fall into categories, and training on those categories might be not just useful but adequate for most of what we want from AGI.
If that’s the type of scenario you’re addressing, I think that’s plausible for many AGI projects. But I think the same argument I make for LLMs and other “oracle” AGI: someone will turn it into a full real agent very soon; it will have more economic value, but even if it doesn’t, people will do it just for the hell of it, because it’s interesting.
With LLMs it’s as simple as repeating the prompt “keep working on that problem, pursuing goal X, using tools Y”. With another architecture, it might be a little different- but turning adequate intelligence into a true agent is almost trivial. Some monkey will pull that lever almost as soon as it’s available.
You’ve probably heard that argument somewhere before, so I may well be misunderstanding your scenario still.
Thanks for the dialogue here, this is useful for my work on my draft post “how we’ll try to align AGI”.
Basically, via lots of synthetic data that always shows the AI acting aligned even when the human behaves badly, as well as synthetic data to make misaligned agents reveal themselves safely, and in particular it’s done early in the training run, before it can try to deceive or manipulate us.
More generally, the abuse of synthetic data means we have complete control over the inputs to the AI model, which means we can very easily detect stuff like deception and takeover risk.
For example, we can feed RL and LLM agents information about interpretability techniques not working, despite them actually working, or feed them exploits that are both easy and large for misaligned AI to do that seem to work, but doesn’t actually work.
More here:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
https://www.lesswrong.com/posts/oRQMonLfdLfoGcDEh/a-bitter-lesson-approach-to-aligning-agi-and-asi-1
It’s best to make large synthetic datasets now, so that we can apply it continuously throughout AGI/ASI training, and in particular do it before it is capable of learning deceptiveness/training games.
I was just referring to this post on how RL policies aren’t automatically agents, without other assumptions. I agree that they will likely be agentized by someone if RL doesn’t agentize them, and I agree with your assumptions on why they will be agentic RL/LLM AIs.
https://www.lesswrong.com/posts/rmfjo4Wmtgq8qa2B7/think-carefully-before-calling-rl-policies-agents
Also, the argument against synthetic data working because raters make large amounts of compactly describable errors has evidence against it, at least in the data-constrained case.
Some relevant links are these:
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/#74DdsQ7wtDnx4ChDX
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/#R9Bfu6tzmuWRCT6DB
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/?commentId=AoxYQR9jLSLtjvLno#AoxYQR9jLSLtjvLno
At a broader level, my point is that even conditional on you being correct that fully autonomous AI that is coherent across goals will be trained by somebody soon, the path to being coherent and autonomous is both important and influenceable to be more aligned by us.
And thank you for being willing to read so much. I will ask you to read more posts and comments here, so that I can finally explicate what exactly is the plan to align AGI via RL or LLMs, which is large synthetic datasets.
I just finished reading all of those links. I was familiar with Roger Dearnaleys’ proposal of synthetic data for alignment but not Beren Millidge’s. It’s a solid suggestion that I’ve added to my list of likely stacked alignment approaches, but not fully thought through. It does seem to have a higher tax/effort than the methods so I’m not sure we’ll get around to it before real AGI. But it doesn’t seem unlikely either.
I got caught up reading the top comment thread above the Turntrout/Wentworth exchange you linked. I’d somehow missed that by being off-grid when the excellent All the Shoggoths Merely Players came out. It’s my nomination for SOTA of the current alignment difficulty discussion.
I very much agree that we get to influence the path to coherent autonomous AGI. I think we’ll probably succeed in making aligned AGI- but then quite possibly tragically extinct ourselves with standard human combativeness/paranoia or foolishness—If we solve alignment, do we die anyway?
I think you’ve read that and we’ve had a discussion there, but I’m leaving that link here as the next step in this discussion now that we’ve reached approximate convergence.
I agree it has a higher tax rate than RLHF, but to make the case for lower tax rates than people think, it’s because synthetic data will likely be a huge part of what makes AGI into ASI, as models require a lot of data, and synthetic data is a potentially huge industry in futures where AI progress is very high, because the amount of human data is both way too limiting for future AIs, and probably doesn’t show superhuman behavior like we want from LLMs/RL.
Thus huge amounts of synthetic data will be heavily used as part of capabilities progress, meaning we can incentivize them to also put alignment data in the synthetic data.
This is why I think we will need to use targeted removals of capabilities like LEACE combined with using synthetic data to remove infohazardous knowledge, combined with not open-weighting/open-sourcing models as AIs get more capable and only allowing controlled API use.
Here’s the LEACE paper and code:
https://github.com/EleutherAI/concept-erasure/pull/2
https://github.com/EleutherAI/concept-erasure
https://github.com/EleutherAI/concept-erasure/releases/tag/v0.2.0
https://arxiv.org/abs/2306.03819
https://blog.eleuther.ai/oracle-leace
I’ll reread that post again.
Agreed on the capabilities advantages of synthetic data; so it might not be much of a tax at all to mix in some alignment.
I don’t think removing infohazardous knowledge will work all the way into dangerous AGI, but I don’t think it has to—there’s a whole suite of other alignment techniques for language model agents that should suffice together.
Keeping a superintelligence ignorant of certain concepts sounds impossible. Even a “real AGI” of the type I expect soon will be able to reason and learn, causing it to rapidly rediscover any concepts you’ve carefully left out of the training set. Leaving out this relatively easy capability (to reason and learn online) will hurt capabilities, so you’d have a huge uphill battle in keeping it out of deployed AGI. At least one current projects have already accomplished limited (but impressive) forms of this as part of their strategy to create useful LM agents. So I don’t think it’s getting rolled back or left out.
I agree with you that there are probably better methods to handle the misuse risk, and note I also pointed out them as options, not exactly guarantees.
And yeah, I agree with this specifically:
Thanks for mentioning that.
Now that I think about it, I agree that it’s only a stop gap for misuse, and yeah if there is even limited generalization ability, I agree that LLMs will be able to rediscover dangerous knowledge, so we will need to make LLMs that don’t let users completely make bio-weapons for example.