the more we crank up generalization ability, the better its alignment
To me that seems almost correct, in a way that is dangerous. I’d agree with the statement that
the more we crank up generalization ability, the better the AI’s ability to align to any given set of values/goals
But for that to lead to the AI being aligned with “good” values, we also need to somehow get the AI to choose/want to align with “good” values. (Whatever “good” even means; maybe humanity’s CEV?) And that does not happen on its own, AFAICT.
But for that to lead to the AI being aligned with “good” values, we also need to somehow get the AI to choose/want to align with “good” values.
And that does not happen on its own, AFAICT.
I agree that the assumption of generality, on its own, doesn’t actually allow you to align an AI’s values to a specific human’s values or intentions, given that the embeddedness of the world allows everything to be manipulated, including a human’s values.
Thus you can solve the problem of alignment in 2 ways:
Resolve the embedded alignment problems and try to align the AI with a human’s values, i.e. essentially get alignment while learning online in the world.
This is essentially Reinforcement Learning from Human Feedback’s method for alignment.
Dissolve the embedded alignment problems by finding a way to translate the ontology of Cartesianism, including its boundaries, into an embedded world via offline learning, in a way that makes sense.
This is essentially Pretraining from Human Feedback’s method for alignment.
So I made the assumption that the AI can’t hack the data set used for human values, and that assumption can be enforced in offline learning, but not in online learning.
I don’t quite understand what you’re saying; I get the impression we’re using different ontologies/vocabularies. I’m curious to understand your model of alignment, and below are a bunch of questions. I’m uncertain whether it’s worth the time to bridge the conceptual gap, though—feel free to drop this conversation if it feels opportunity-costly.
(1.)
Are you saying that if we assumed agents to be Cartesian[1], then you’d know a solution to the problem of {how could a weak agent train and align a very powerful agent}? If yes, could you outline that solution?
(2.)
Resolve the embedded alignment problems [...] This is essentially Reinforcement Learning from Human Feedback’s method for alignment
How does RLHF solve problems of embedded alignment? I’m guessing you’re referring to something other than the problems outlined in Embedded Agency?
(3.)
What exact distinction do you mean by “online” vs “offline”?
Given that any useful/”pivotal” AI would need to learn new things about the world (and thus, modify its own models/mind) in order to form and execute a useful/pivotal plan, it would have to learn “online”, no?
(4.)
the data set used for human values
What kind of data set did you have in mind here? A data set s.t. training an AI on it in some appropriate way would lead to the AI being aligned to human values? Could you give a concrete example of such a data set (and training scheme)?
e.g. software agents in some virtual world, programmed such that agents are implemented as some well-defined datatype/class, and agent-world interactions can only happen via a well-defined simple interface, running on a computer that cannot be hacked from within the simulation.
Are you saying that if we assumed agents to be Cartesian[1], then you’d know a solution to the problem of {how could a weak agent train and align a very powerful agent}? If yes, could you outline that solution?
Yes, and while I can’t describe the solution in full, I can say this:
The first step would be to scale up the experiment of Pretraining from Human Feedback by using a larger data set, then curating it for alignment. In particular, we can even try to design a data set such that it uses words like freedom, justice, alignment, and more value-laden words.
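To give a purely illustrative sketch of the kind of curation I mean (the word list, scoring heuristic, and threshold below are placeholders I made up, and obviously far cruder than anything a real attempt would use):

```python
# Purely illustrative: a crude keyword-density filter for "value-laden" documents.
VALUE_LADEN_WORDS = {"freedom", "justice", "alignment", "fairness", "honesty"}

def value_density(text: str) -> float:
    """Fraction of the document's words that appear in the value-laden list."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in VALUE_LADEN_WORDS for w in words) / len(words)

def curate(corpus: list[str], min_density: float = 0.01) -> list[str]:
    """Keep only documents that talk about the value-laden concepts we care about."""
    return [doc for doc in corpus if value_density(doc) >= min_density]

# curate(["A short essay on justice and freedom.", "A recipe for soup."])
# keeps only the first document.
```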
But the real power of the Cartesian agent for alignment is that agent-world interactions can only happen over a well-defined, simple interface, and the computer isn’t hackable. That immediately means that many AI risk stories evaporate: you can only learn legitimate generalizations, not deceptive generalizations leading to deceptive alignment, and you can’t amplify Goodhart or hack the human’s values. That makes alignment simpler, since we don’t need cumbersome protections that incur alignment taxes, and we can bring it back to simple iteration until we succeed.
How does RLHF solve problems of embedded alignment? I’m guessing you’re referring to something other than the problems outlined in Embedded Agency?
You’re right that it doesn’t do well; the main strategy is to reward the AI whenever it takes good-looking actions. The problem is, as you can see, that it tries to align an agent after it’s already capable, and as the agent gets more powerful this becomes increasingly dangerous. In particular, since the AI controls the learning schedule, it will probably be incentivized to hack the human, since the human’s values are just another thing in the world and there’s no well-defined interface under our control.
Pretraining from Human Feedback, by contrast, does the alignment before the model gains significantly greater capabilities.
What exact distinction do you mean by “online” vs “offline”? Given that any useful/”pivotal” AI would need to learn new things about the world (and thus, modify its own models/mind) in order to form and execute a useful/pivotal plan, it would have to learn “online”, no?
I agree that one eventually has to shed the Cartesian boundary and do online learning/training. But a key thing we’ve learned from deep learning is that we can translate the ontology of Cartesianism without too much trouble. In particular, one can do a whole lot of training inside a Cartesian setting (offline learning), and when the model moves to the embedded, online setting, it generalizes the capabilities learned in the Cartesian setting really well. If one could do this for alignment, the problem transforms into a solvable one, since during that training we can ensure there is no way to hack the human’s values or hack the environment.
And indeed Pretraining from Human Feedback is the closest thing I’ve seen to an actual implementation of this.
The embeddedness of the world dominates asymptotically, but we only need a finite period of Cartesian learning to generalize correctly in the embedded setting.
To spell out the distinction between offline and online learning: in offline learning we give the AI batches of data, SGD guides it to learn the patterns and algorithms that explain that data, then we give it new data, and so on. The important point is that the human is in control of that data, so there’s no way for the data to be hacked, unlike in online learning. In particular, the interface is well specified: text.
In online learning, the AI takes the lead and selects its own data points, and the most we can do is give it reward or punishment. Nearly all human learning is online learning, since we select our own data points.
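To make that distinction concrete, here’s a toy sketch of the two loops (deliberately simplified, and not meant to stand in for any real training setup; all the names are made up):

```python
import random

# Offline learning (toy): the human fixes the data set in advance, so the
# learner has no way to influence what it is trained on.
def offline_learning(curated_dataset, lr=0.1, epochs=50):
    w = 0.0
    for _ in range(epochs):
        for x, y in curated_dataset:      # batches chosen entirely by the human
            w += lr * (y - w * x) * x     # plain SGD on a fixed corpus
    return w

# Online learning (toy): the agent picks its own "data points" (actions), and
# all we supply from outside is a reward signal.
def online_learning(reward_fn, actions, steps=500, eps=0.1):
    value = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for _ in range(steps):
        # the agent, not the human, decides what to try next
        a = random.choice(actions) if random.random() < eps else max(value, key=value.get)
        r = reward_fn(a)                  # the only channel we control
        counts[a] += 1
        value[a] += (r - value[a]) / counts[a]
    return value

if __name__ == "__main__":
    # Offline: learn y = 2x from a human-chosen data set.
    data = [(x, 2.0 * x) for x in range(1, 6)]
    print("offline weight:", round(offline_learning(data), 3))
    # Online: an epsilon-greedy agent choosing among three actions.
    rewards = {"a": 0.1, "b": 0.5, "c": 0.9}
    print("online values:", online_learning(lambda a: rewards[a], list(rewards)))
```

The point of the contrast is just who picks the next training example: in the first loop it is always the human-curated data set; in the second it is the agent itself.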
What kind of data set did you have in mind here? A data set s.t. training an AI on it in some appropriate way would lead to the AI being aligned to human values? Could you give a concrete example of such a data set (and training scheme)?
Unfortunately, that’s outside my expertise, so I can’t really do this concretely. I am aiming to give something like a possibility proof for alignment, as well as to mention some practical ways to achieve the idea. The implementation is left to other people.
Now to gesture at one of my largest cruxes: I think the human method of learning, which is almost all online, doesn’t need to be replicated in AI. A large part of the alignment problems in humans ultimately stems from the fact that we are almost always online learners, so we constantly have chances to hack or manipulate our environments, whether physical, social, or otherwise, and so we do hack our environments. When we train AI, we don’t need to replicate that method, because we’ve learned that we can do a lot more offline learning to get the AI to learn the basic concepts.
If we could reliably do offline learning in humans, I think alignment could be totally solved at least in the single-person case, and while larger groups would have problems staying aligned unless they had special properties, we could plausibly solve alignment for much larger groups too.
In essence, the human architecture almost totally prohibits offline learning, while AI architecture does permit offline learning.
The real weakness of offline learning is the cost. The cost is amortizable, so it’s actually pretty cheap over multiple training runs, and you don’t need to recompute rewards; but evolution and humans had to work with much smaller energy budgets, at least until the 19th century, whereas in the 21st century that energy and data can be applied to compute. So the large upfront cost couldn’t be purchased, even though offline learning works far better for alignment in the long run.
Also, I’m focusing on the embedded agency problems of Goodhart’s law and subsystem alignment, and I’m not addressing the decision theory or embedded world-model problems. In essence, I’m focusing only on problems that are alignment-related, not capabilities-related.
I know this is a long comment, but I hope you understand why I’m communicating so differently, and why I’m so optimistic about alignment improving as capabilities scale.
scale up the experiment of Pretraining from Human Feedback by using a larger data set
AFAICT, PHF doesn’t solve any of the core problems of alignment. IIUC, PHF is still using an imperfect reward model trained on a finite amount of human signals-of-approval; I’d tentatively expect scaling up PHF (to ASI) to result in death-or-worse by Goodhart. Haven’t thought about PHF very thoroughly though, so I’m uncertain here.
we can even try to design a data set such that it uses words like freedom, justice, alignment, and more value-laden words
Did you mean something like “(somehow) design a data set such that, in order to predict token-sequences in that data set, the AI has to learn the real-world structure of things we care about, like freedom, justice, alignment, etc.”? [1]
can only learn legitimate generalizations, not deceptive generalizations leading to deceptive alignment
I don’t understand this. What difference are you pointing at with “deceptive” vs “legitimate” generalizations? How does {AI-human (and/or AI-env) interactions being limited to a simple interface} preclude {learning “deceptive” generalizations}?
I’m under the impression that entirely “legitimate” generalizations can (and a priori probably will) lead to “deception”; see e.g. https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness. Do you disagree with that? (If yes, how?)
Side note: I don’t understand what you mean by this (in the given context).
can’t [...] hack the human’s values
I don’t see how this follows. IIUC, the proposition here is something like
If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can’t hack the humans.
Is that a reasonable representation of what you’re saying?
If yes, consider: What if we replace “the AI” with “Anonymous” and “the humans” with “the web server”? Then we get
If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can’t hack the web server
...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn’t rely on hardware effects like rowhammering.
(I agree that limiting human-AI interactions to a simple interface would be helpful, but I think it’s far from sufficient (to guarantee any form of safety).)
IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier. I’m still confused as to why you think that. To the extent that I understood the reasons you presented, I think they’re incorrect (as outlined above). (Maybe I’m misunderstanding something.)
I’m kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!
I think a naively designed data set containing lots of {words that are value-laden for English-speaking humans} would not cut it, for hopefully obvious reasons.
I’m kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!
This will be my last comment here. Thank you for trying to explain why you disagree with me!
IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier.
I’m impressed that you passed my ITT. I think analogies to other alignment problems, like the human alignment problem, miss that the human case is the most difficult setting, and we don’t need to play on that difficulty, because AI is very different from humans.
AFAICT, PHF doesn’t solve any of the core problems of alignment.
While I definitely overclaimed on how much it solves the alignment problems, I think this is underselling its accomplishments. It’s an incomplete solution, in that it doesn’t do everything on its own, but it does carry a lot of weight.
To talk about deceptive alignment more specifically: deceptive alignment is essentially where the AI isn’t aligned with human goals and tries to hide that fact. One of the key prerequisites for deceptive alignment is that the AI is optimizing a non-myopic goal. It’s the most dangerous form of apparent alignment, since the AI is aligned only for instrumental, not terminal, reasons.
What Pretraining from Human Feedback did was finally marry a myopic goal with competitive capabilities, and once the myopic goal of conditional training is in place, deceptive alignment goes away, since optimizing a non-myopic goal is one of its key prerequisites.
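To make “the myopic goal of conditional training” concrete, here’s a rough sketch of how I understand that objective (the control-token names, the threshold, and score_fn are placeholders, not the paper’s exact setup):

```python
# Rough sketch of conditional training: tag each document with a control token
# computed offline, then train with ordinary next-token prediction.
GOOD, BAD = "<|good|>", "<|bad|>"

def tag_document(text: str, score_fn, threshold: float = 0.5) -> str:
    """Prepend a control token based on an offline reward/classifier score.

    The score is computed once, on a fixed corpus, before training starts,
    so the model being trained never influences its own training data.
    """
    tag = GOOD if score_fn(text) >= threshold else BAD
    return f"{tag} {text}"

def build_conditional_corpus(corpus, score_fn):
    # Every document keeps its original text; the alignment signal enters only
    # through the prepended token, so the objective stays a myopic, per-token
    # prediction loss over a static data set.
    return [tag_document(doc, score_fn) for doc in corpus]

# At sampling time you condition on the "good" token (e.g. prompt the model with
# GOOD + " " + user_prompt), asking it to imitate the well-behaved slice of the
# training distribution.
```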
I don’t see how this follows. IIUC, the proposition here is something like
If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can’t hack the humans.
Is that a reasonable representation of what you’re saying?
If yes, consider: What if we replace “the AI” with “Anonymous” and “the humans” with “the web server”? Then we get
If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can’t hack the web server
...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn’t rely on hardware effects like rowhammering.
This is definitely right, and I did overclaim here, though I do remember that Pretraining from Human Feedback claimed to do this:
Conditional training (as well as other PHF objectives) is purely offline: the LM is not able to affect its own training distribution. This is unlike RLHF, where the LM learns from self-generated data and thus is more likely to lead to risks from auto-induced distribution shift or gradient hacking.
That vindicates a narrower claim about the AI’s inability to hack or affect its own training distribution, though I don’t know how much it supports my broader thesis about immunity to hacking.
To give another reason why I’m so optimistic about alignment: I think alignment is scalable. Put another way, while Pretraining from Human Feedback is imperfect right now (though even with its imperfections, my view is that it would avoid X-risk almost entirely), small, consistent improvements in the vein of empirical work will eventually make models far more aligned than the original Pretraining from Human Feedback work did. In the case of more data, they tested this, and alignment increased with more data.
To edit a quote from Thoth Hermes:
Yudkowsky was wrong in the tendency to assume that certain abstractions just don’t apply whenever intelligence or capability is scaled way up.
This essentially explains my issues with the idea that alignment isn’t scalable.