We are moving rapidly from a world where people deploy manifestly unaligned models (where even talking about alignment barely makes sense) to people deploying models which are misaligned because (i) humans make mistakes in evaluation, (ii) there are high-stakes decisions so we can’t rely on average-case performance.
This seems like a good thing to do if you want to move on to research addressing the problems in RLHF: (i) improving the quality of the evaluations (e.g. by using AI assistance), and (ii) handling high-stakes objective misgeneralization (e.g. by adversarial training).
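To make the two failure modes concrete, here is a minimal, hypothetical sketch of the reward-modeling step that RLHF relies on: a reward model fit to pairwise human comparisons with the standard Bradley-Terry loss. All names and the toy features are illustrative, not anyone's actual implementation; the point is that whatever mistakes the human evaluators make (failure mode (i)) are baked directly into the learned reward.

```python
import math

def train_reward_model(comparisons, dim, epochs=200, lr=0.1):
    """Fit a linear reward model r(x) = w . x from pairwise human
    preferences via the Bradley-Terry loss (standard in RLHF).
    `comparisons` is a list of (preferred_features, rejected_features)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for good, bad in comparisons:
            margin = sum(wi * (g - b) for wi, g, b in zip(w, good, bad))
            p = 1.0 / (1.0 + math.exp(-margin))  # P(good preferred | w)
            # gradient ascent on log P, i.e. descent on -log sigmoid(margin)
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (good[i] - bad[i])
    return w

def reward(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy data: feature[0] = "actually helpful", feature[1] = "merely flattering".
# If raters sometimes prefer flattery (evaluation mistakes, point (i)),
# the learned reward assigns flattery positive value too.
comparisons = [([1.0, 0.0], [0.0, 0.0])] * 8 + [([0.0, 1.0], [0.0, 0.0])] * 2
w = train_reward_model(comparisons, dim=2)
print(reward(w, [1.0, 0.0]) > reward(w, [0.0, 1.0]))
```

Note that the model also learns a strictly positive reward for the flattery feature, because the raters rewarded it some of the time; better evaluations (e.g. AI-assisted oversight) shrink that term, which is exactly research direction (i) above.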
In addition to “doing the basic thing before the more complicated thing intended to address its failures,” it’s also the case that RLHF is a building block in the more complicated things.
I think that (a) there is a good chance that these boring approaches will work well enough to buy (a significant amount) time for humans or superhuman AIs to make progress on alignment research or coordination, (b) when they fail, there is a good chance that their failures can be productively studied and addressed.
Overall it seems to me like the story here is reasonably good and has worked out reasonably well in practice. I think RLHF is being adopted more quickly than it otherwise would, and plenty of follow-up work is being done. I think many people in labs have a better understanding of what the remaining problems in alignment are; as a result they are significantly more likely to work productively on those problems themselves or to recognize and adopt solutions from elsewhere.
OK, thanks. I’m new to this debate; I take it I’m wandering into a discussion that may already have been had to death.
I guess I’m worried that RLHF should basically be thought of as capabilities research instead of alignment/safety research. The rationale for this would be: Big companies will do RLHF before the end by default, since their products will embarrass them otherwise. By doing RLHF now and promoting it we help these companies get products to market sooner & free up their time to focus on other capabilities research.
I agree with your claims (a) and (b) but I don’t think they undermine this skeptical take, because I think that if RLHF fails the failures will be different for really powerful systems than for dumb systems.
I think it’d be useful if you spelled out those failures you think will occur in powerful systems, that won’t occur in any intermediate system (assuming some degree of slowness sufficient to allow real world deployment of not-yet-AGI agentic models).
For example, deception: lots of parts of the animal kingdom understand the concept of “hiding” or “lying in wait to strike”, I think? It already showed up in XLand IIRC. Imagine a chatbot trying to make a sale—avoiding problematic details of the product it’s selling seems like a dominant strategy.
There are definitely scarier failure modes that show up in even-more-powerful systems (e.g. actual honest-to-goodness long-term pretending to be harmless in order to end up in situations with more resources, which will never be caught with RLHF), and I agree pure alignment researchers should be focusing on those. But the suggestion that picking the low-hanging fruit won’t build momentum for working on the hardest problems does seem wrong to me.
As another example, consider the Beijing Academy of AI’s government-academia-industry LLM partnership. When their LLMs fail to do what they want, they’ll try RLHF—and it’ll kind of work, but then it’ll fail in a bunch of situations. They’ll be forced to confront the fact that actually, objective robustness is a real thing, and start funding research/taking proto-alignment research way more seriously/as being on the critical path to useful models. Wouldn’t it be great if there were a whole literature waiting for them on all the other things that empirically go wrong with RLHF, up to and including genuine inner misalignment concerns, once they get there?
Thanks! I take the point about animals and deception.
Wouldn’t it be great if there were a whole literature waiting for them on all the other things that empirically go wrong with RLHF, up to and including genuine inner misalignment concerns, once they get there?
Insofar as the pitch for RLHF is “Yes tech companies are going to do this anyway, but if we do it first then we can gain prestige, people will cite us, etc. and so people will turn to us for advice on the subject later, and then we’ll be able to warn them of the dangers” then actually that makes a lot of sense to me, thanks. I still worry that the effect size might be too small to be worth it, but idk.
I don’t think that there are failures that will occur in powerful systems that won’t occur in any intermediate system. However I’m skeptical that the failures that will occur in powerful systems will also occur in today’s systems. I must say I’m super uncertain about all of this and haven’t thought about it very much.
With that preamble aside, here is some wild speculation:
--Current systems (hopefully?) aren’t reasoning strategically about how to achieve goals & then executing on that reasoning. (You can via prompting get GPT-3 to reason strategically about how to achieve goals… but as far as we know it isn’t doing reasoning like that internally when choosing what tokens to output. Hopefully.) So, the classic worry of “the AI will realize that it needs to play nice in training so that it can do a treacherous turn later in deployment” just doesn’t apply to current systems. (Hopefully.) So if we see e.g. our current GPT-3 chatbot being deceptive about a product it is selling, we can happily train it to not do that and probably it’ll just genuinely learn to be more honest. But if it had strategic awareness and goal-directedness, it would instead learn to be less honest; it would learn to conceal its true intentions from its overseers.
--As humans grow up and learn more and (in some cases) do philosophy they undergo major shifts in how they view the world. This often causes them to change their minds about things they previously learned. For example, maybe at some point they learned to go to church because that’s what good people do because that’s what God says; later on they stop believing in God and stop going to church. And then later still they do some philosophy and adopt some weird ethical theory like utilitarianism and their behavior changes accordingly. Well, what if AIs undergo similar ontological shifts as they get smarter? Then maybe the stuff that works at one level of intelligence will stop working at another. (e.g. telling a kid that God is watching them and He says they should go to church stops working. Later when they become a utilitarian, telling them that killing civilians is murder and murder is wrong stops working too (if they are in a circumstance where the utilitarian calculus says civilian casualties are worth it for the greater good)).
I agree that “concealing intentions from overseers” might be a fairly late-game property, but it’s not totally obvious to me that it doesn’t become a problem sooner. If a chatbot realizes it’s dealing with a disagreeable person and therefore that it’s more likely to be inspected, and thus hews closer to what it thinks the true objective might be, the difference in behaviors should be pretty noticeable.
Re: ontology mismatch, this seems super likely to happen at lower levels of intelligence. E.g. I’d bet this even sometimes occurs in today’s model-based RL, as it’s trained for long enough that its world model changes. If we don’t come up with strategies for dealing with this dynamically, we aren’t going to be able to build anything with a world model that improves over time. Maybe that only happens too close to FOOM, but if you believe in a gradual-ish takeoff it seems plausible to have vanilla model-based RL work decently well before.
We are moving rapidly from a world where people deploy manifestly unaligned models (where even talking about alignment barely makes sense) to people deploying models which are misaligned because (i) humans make mistakes in evaluation, (ii) there are high-stakes decisions so we can’t rely on average-case performance.
What it feels like to me is that we are rapidly moving from a world where people deploy manifestly unaligned models to people deploying models which are still manifestly unaligned (where even talking about alignment barely makes sense), but which are getting differentially good at human modeling and deception (and maybe at supervising other AIs, which is where the hope comes from).
I don’t think the models are misaligned because humans are making mistakes in evaluation. The models are misaligned because we have made no progress at actually pointing towards anything like human values or other concepts like corrigibility or myopia.
In other words, models are mostly misaligned because there are strong instrumentally convergent incentives towards agency, and we don’t currently have any tools that allow us to shape the type of optimization that artificial systems are doing internally. Learning from human feedback seems if anything to be slightly more the kind of reward that incentivizes dangerous agency. This seems to fit into neither your (i) nor your (ii).
Instruct-GPT is not more aligned than GPT-3. It is more capable at performing many tasks, and we have some hope that some of the tasks at which it is getting better might help with AI Alignment down the line, but right now, at the current state of the AI alignment field, the problem is not that we can’t provide good enough evaluation, or that we can only get good “average-case” performance, it’s that we have systems with random goals that are very far from human values and that are not capable of being reliably conservative.
And additionally to that, we now have a tool that allows any AI company to trivially train away any surface-level alignment problems, without addressing any of the actual underlying issues, creating a situation with very strong incentives towards learning human deception and manipulation, and a situation where obvious alignment failures are much less likely to surface.
My guess is you are trying to point towards a much more sophisticated and broader thing by your (ii) than I interpret you as saying here, but the above is my response to my best interpretation of what you mean by (ii).
In other words, models are mostly misaligned because there are strong instrumental convergent incentives towards agency, and we don’t currently have any tools that allow us to shape the type of optimization that artificial systems are doing internally.
In the context of my comment, this appears to be an empirical claim about GPT-3. Is that right? (Otherwise I’m not sure what you are saying.)
If so, I don’t think this is right. On typical inputs I don’t think GPT-3 is instrumentally behaving well on the training distribution because it has a model of the data-generating process.
I think on distribution you are mostly getting good behavior either by not optimizing, or by optimizing for something we want. I think to the extent it’s malign it’s because there are possible inputs on which it is optimizing for something you don’t want, but those inputs are unlike those that appear in training and you have objective misgeneralization.
In that regime, I think the on-distribution performance is probably aligned and there is not much in-principle obstruction to using adversarial training to improve the robustness of alignment.
Instruct-GPT is not more aligned than GPT-3. It is more capable at performing many tasks, and we have some hope that some of the tasks at which it is getting better might help with AI Alignment down the line
Could you define the word “alignment” as you are using it?
I’m using roughly the definition here. I think it’s the case that there are many inputs where GPT-3 is not trying to do what you want, but Instruct-GPT is. Indeed, I think Instruct-GPT is actually mostly trying to do what you want to the extent that it is trying to do anything at all. That would lead me to say it is more “aligned.”
I agree there are subtleties like “If I ask instruct-gpt to summarize a story, is it trying to summarize the story? Or trying to use that as evidence about ‘what Paul wants’ and then do that?” And I agree there is a real sense in which it isn’t smart enough for that distinction to be consistently meaningful, and so in that sense you might say my definition of intent alignment doesn’t really apply. (I more often think about models being “benign” or “malign,” more like asking: is it trying to optimize for something despite knowing that you wouldn’t like it.) I don’t think that’s what you are talking about here though.
right now, at the current state of the AI alignment field, the problem is not that we can’t provide good enough evaluation, or that we can only get good “average-case” performance, it’s that we have systems with random goals that are very far from human values or are capable of being reliably conservative.
If you have good oversight, I think you probably get good average case alignment. That’s ultimately an empirical claim about what happens when you do SGD, but the on-paper argument looks quite good (namely: on-distribution alignment would improve the on-distribution performance and seems easy for SGD to learn relative to the complexity of the model itself) and it appears to match the data so far to the extent we have relevant data.
You seem to be confidently stating it’s false without engaging at all with the argument in favor or presenting or engaging with any empirical evidence.
You seem to be confidently stating it’s false without engaging at all with the argument in favor or presenting or engaging with any empirical evidence.
But which argument in favor did you present? You just said “the models are unaligned for these 2 reasons”, when those reasons do not seem comprehensive to me, and you did not give any justification for why those two reasons are comprehensive (or provide any links).
I tried to give a number of specific alternative reasons that do not seem to be covered by either of your two cases, and included a statement that we might disagree on definitional grounds, but that I don’t actually know what definitions you are using, and so can’t be confident that my critique makes sense.
Now that you’ve provided a definition, I still think what I said holds. My guess is there is a large inferential distance here, so I don’t think it makes sense to try to bridge that whole distance within this comment thread, though I will provide an additional round of responses.
If so, I don’t think this is right. On typical inputs I don’t think GPT-3 is instrumentally behaving well on the training distribution because it has a model of the data-generating process.
I don’t think your definition of intent-alignment requires any unaligned system to have a model of the data-generating process, so I don’t understand the relevance of this. GPT-3 is not unaligned because it has a model of the data-generating process, and I didn’t claim that.
I did claim that neither GPT-3 nor Instruct-GPT are “trying to do what the operator wants it to do”, according to your definition, and that the primary reason for that is that in as much as its training process did produce a model that has “goals” and so can be modeled in any consequentialist terms, those “goals” do not match up with trying to be helpful to the operator. Most likely, they are a pretty messy objective we don’t really understand (which in the case of GPT-3 might be best described as “trying to generate text that in some simple latent space resembles the training distribution”, and I don’t have any short description of what the “goals” of Instruct-GPT might be, though my guess is they are still pretty close to GPT-3’s goals).
Indeed, I think Instruct-GPT is actually mostly trying to do what you want to the extent that it is trying to do anything at all. That would lead me to say it is more “aligned.”
I don’t think we know what Instruct-GPT is “trying to do”, and it seems unlikely to me that it is “trying to do what I want”. I agree in some sense it is “more trying to do what I want”, though not in a way that feels obviously very relevant to more capable systems, and not in a way that aligns very well with your intent definition (I feel like if I had to apply your linked definition to Instruct-GPT, I would say something like “ok, seems like it isn’t intent aligned, since the system doesn’t really seem to have much of an intent. And if there is a mechanism in its inner workings that corresponds to intent, we have no idea what thing it is pointed at, so probably it isn’t pointed at the right thing”).
And in either case, even if it is the case that if you squint your eyes a lot the system is “more aligned”, this doesn’t make the sentence “many of today’s systems are aligned unless humans make mistakes in evaluation or are deployed in high-stakes environments” true. “More aligned” is not equal to “aligned”.
The correct sentence seems to me “many of those systems are still mostly unaligned, but might be slightly more aligned than previous systems, though we have some hope that with better evaluation we can push that even further, and the misalignment problems are less bad on lower-stakes problems when we can rely on average-case performance, though overall the difference in alignment between GPT and Instruct-GPT is pretty unclear and probably not very large”.
I think on distribution you are mostly getting good behavior mostly either by not optimizing, or by optimizing for something we want. I think to the extent it’s malign it’s because there are possible inputs on which it is optimizing for something you don’t want, but those inputs are unlike those that appear in training and you have objective misgeneralization.
This seems wrong to me. On-distribution it seems to me that the system is usually optimizing for something that I don’t want. For example, GPT-3 primarily is trying to generate text that represents the distribution it’s drawn from, which very rarely aligns with what I want (and is why prompt-engineering has such a large effect, e.g. “you are Albert Einstein” as a prefix improves performance on many tasks). Instruct-GPT does a bit better here, but probably most of its internal optimization power is still thrown at reasoning with the primary “intention” of generating text that is similar to its input distribution, since it seems unlikely that the fine-tuning completely rewrote most of these internal heuristics.
My guess is if Instruct-GPT was intent-aligned even for low-impact tasks, we could get it to be substantially more useful on many tasks. But my guess is what we currently have is mostly a model that is still primarily “trying” to generate text that is similar to its training distribution, with a few heuristics baked in in the human-feedback stage that make that text more likely to be a good fit for the question asked. In as much as the model is “trying to do something”, i.e. what most of its internal optimization power is pointed at, I am very skeptical that that is aligned with my task.
(Similarly, looking at Redwood’s recent model, it seems clear to me that they did not produce a model that “intends” to produce non-injurious completions. The model has two parts, one that is just “trying” to generate text similar to its training distribution, and a second part that is “trying” to detect whether a completion is injurious. This model seems clearly not intent-aligned, since almost none of its optimization power is going towards our target objective.)
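The two-part architecture described in the parenthetical can be sketched as classifier-filtered sampling. This is a deliberately simplified, hypothetical stand-in (the real project's models and thresholds differ): a generator proposes completions by imitating its training distribution, a separately trained classifier rejects ones judged injurious, and the safety property lives entirely in the outer rejection loop rather than in either component's "intentions".

```python
import random

def generate(prompt, rng):
    """Stand-in for a language model: proposes completions by imitating
    its training distribution, with no notion of the safety objective."""
    return rng.choice(["a calm ending", "a violent ending",
                       "a neutral ending", "an injurious ending"])

def looks_injurious(completion):
    """Stand-in for a separately trained injuriousness classifier."""
    return any(word in completion for word in ("violent", "injurious"))

def filtered_sample(prompt, rng, max_tries=100):
    """Rejection-sample from the generator until the classifier accepts.
    Neither `generate` nor `looks_injurious` is optimizing for the target
    objective; the non-injuriousness property is imposed from outside."""
    for _ in range(max_tries):
        candidate = generate(prompt, rng)
        if not looks_injurious(candidate):
            return candidate
    return ""  # abstain if no candidate passes the filter

rng = random.Random(0)
print(filtered_sample("The story continues with", rng))
```

On this decomposition the claim in the comment is easy to see: you could swap in a generator that actively prefers injurious text, and the composite system would still pass the filter, which is exactly why it seems wrong to call the whole thing intent-aligned.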
If you have good oversight, I think you probably get good average case alignment. That’s ultimately an empirical claim about what happens when you do SGD, but the on-paper arguments looks quite good (namely: on-distribution alignment would improve the on-distribution performance and seems easy for SGD to learn relative to the complexity of the model itself) and it appears to match the data so far to the extent we have relevant data.
My guess is a lot of work is done here by the term “average case alignment”, so I am not fully sure how to respond. I disagree that the on-paper argument looks quite good, though it depends a lot on how narrowly you define “on-distribution”. Given my arguments above, you must either mean something different from intent-alignment (since to me at least it seems clear that Redwood’s model is not intent-aligned), or disagree with me on whether systems like Redwood’s are intent-aligned, in which case I don’t really know how to consistently apply your intent-alignment definition.
I also feel particularly confused about the term “average case alignment”, combined with “intent-alignment”. I can ascribe goals at multiple different levels to a model, and my guess is we both agree that describing current systems as having intentions at all is kind of fraught, but in as much as a model has a coherent goal, it seems like that goal is pretty consistent between different prompts, and so I am confused why we should expect average case alignment to be very different from normal alignment. It seems that if I have a model that is trying to do something, then asking it multiple times probably won’t make a difference to its intention (I think, I mean, again, this all feels very handwavy, which is part of the reason why it feels so wrong to me to describe current models as “aligned”).
I currently think that the main relevant similarities between Instruct-GPT and a model that is trying to kill you, are about errors of the overseer (i.e. bad outputs to which they would give a high reward) or high-stakes errors (i.e. bad outputs which can have catastrophic effects before they are corrected by fine-tuning).
I’m interested in other kinds of relevant similarities, since I think those would be exciting and productive things to research. I don’t think the framework “Instruct-GPT and GPT-3 e.g. copy patterns that they saw in the prompt, so they are ‘trying’ to predict the next word and hence are misaligned” is super useful, though I see where it’s coming from and agree that I started it by using the word “aligned”.
Relatedly, and contrary to my original comment, I do agree that there can be bad intentional behavior left over from pre-training. This is a big part of what ML researchers are motivated by when they talk about improving the sample-efficiency of RLHF. I usually try to discourage people from working on this issue, because it seems like something that will predictably get better rather than worse as models improve (and I expect you are even less happy with it than I am).
I agree that there is a lot of inferential distance, and it doesn’t seem worth trying to close the gap here. I’ve tried to write down a fair amount about my views, and I’m always interested to read arguments / evidence / intuitions for more pessimistic conclusions.
Similarly, looking at Redwood’s recent model, it seems clear to me that they did not produce a model that “intends” to produce non-injurious completions.
I agree with this, though it’s unrelated to the stated motivation for that project or to its relationship to long-term risk.
I currently think that the main relevant similarities between Instruct-GPT and a model that is trying to kill you, are about errors of the overseer (i.e. bad outputs to which they would give a high reward) or high-stakes errors (i.e. bad outputs which can have catastrophic effects before they are corrected by fine-tuning).
Phrased this way, I still disagree, but I think I disagree less strongly, and feel less of a need to respond to this. I care particularly much about using terms like “aligned” in consistent ways. Importantly, having powerful intent-aligned systems is much more useful than having powerful systems that just fail to kill you (e.g. because they are very conservative), and so getting to powerful aligned systems is a win-condition in the way that getting to powerful non-catastrophic systems is not.
I agree with this, though it’s unrelated to the stated motivation for that project or to its relationship to long-term risk.
Yep, I didn’t intend to imply that this was in contrast to the intention of the research. It was just on my mind as a recent architecture that I was confident we both had thought about, and so could use as a convenient example.
We are moving rapidly from a world where people deploy manifestly unaligned models (where even talking about alignment barely makes sense) to people deploying models which are misaligned because (i) humans make mistakes in evaluation, (ii) there are high-stakes decisions so we can’t rely on average-case performance.
This seems like a good thing to do if you want to move on to research addressing the problems in RLHF: (i) improving the quality of the evaluations (e.g. by using AI assistance), and (ii) handling high-stakes objective misgeneralization (e.g. by adversarial training).
In addition to “doing the basic thing before the more complicated thing intended to address its failures,” it’s also the case that RLHF is a building block in the more complicated things.
I think that (a) there is a good chance that these boring approaches will work well enough to buy (a significant amount) time for humans or superhuman AIs to make progress on alignment research or coordination, (b) when they fail, there is a good chance that their failures can be productively studied and addressed.
Overall it seems to me like the story here is reasonably good and has worked out reasonably well in practice. I think RLHF is being adopted more quickly than it otherwise would, and plenty of follow-up work is being done. I think many people in labs have a better understanding of what the remaining problems in alignment are; as a result they are significantly more likely to work productively on those problems themselves or to recognize and adopt solutions from elsewhere.
OK, thanks. I’m new to this debate, I take it I’m wandering in to a discussion that may already have been had to death.
I guess I’m worried that RLHF should basically be thought of as capabilities research instead of alignment/safety research. The rationale for this would be: Big companies will do RLHF before the end by default, since their products will embarrass them otherwise. By doing RLHF now and promoting it we help these companies get products to market sooner & free up their time to focus on other capabilities research.
I agree with your claims (a) and (b) but I don’t think they undermine this skeptical take, because I think that if RLHF fails the failures will be different for really powerful systems than for dumb systems.
I think it’d be useful if you spelled out those failures you think will occur in powerful systems, that won’t occur in any intermediate system (assuming some degree of slowness sufficient to allow real world deployment of not-yet-AGI agentic models).
For example, deception: lots of parts of the animal kingdom understand the concept of “hiding” or “lying in wait to strike”, I think? It already showed up in XLand IIRC. Imagine a chatbot trying to make a sale—avoiding problematic details of the product it’s selling seems like a dominant strategy.
There are definitely scarier failure modes that show up in even-more-powerful systems (e.g. actual honest-to-goodness long-term pretending to be harmless in order to end up in situations with more resources, which will never be caught with RLHF), and I agree pure alignment researchers should be focusing on those. But the suggestion that picking the low-hanging fruit won’t build momentum for working on the hardest problems does seem wrong to me.
As another example, consider the Beijing Academy of AI’s government-academia-industry LLM partnership. When their LLMs fail to do what they want, they’ll try RLHF—and it’ll kind of work, but then it’ll fail in a bunch of situations. They’ll be forced to confront the fact that actually, objective robustness is a real thing, and start funding research/taking proto-alignment research way more seriously/as being on the critical path to useful models. Wouldn’t it be great if there were a whole literature waiting for them on all the other things that empirically go wrong with RLHF, up to and including genuine inner misalignment concerns, once they get there?
Thanks! I take the point about animals and deception.
Insofar as the pitch for RLHF is “Yes tech companies are going to do this anyway, but if we do it first then we can gain prestige, people will cite us, etc. and so people will turn to us for advice on the subject later, and then we’ll be able to warn them of the dangers” then actually that makes a lot of sense to me, thanks. I still worry that the effect size might be too small to be worth it, but idk.
I don’t think that there are failures that will occur in powerful systems that won’t occur in any intermediate system. However I’m skeptical that the failures that will occur in powerful systems will also occur in today’s systems. I must say I’m super uncertain about all of this and haven’t thought about it very much.
With that preamble aside, here is some wild speculation:
--Current systems (hopefully?) aren’t reasoning strategically about how to achieve goals & then executing on that reasoning. (You can via prompting get GPT-3 to reason strategically about how to achieve goals… but as far as we know it isn’t doing reasoning like that internally when choosing what tokens to output. Hopefully.) So, the classic worry of “the AI will realize that it needs to play nice in training so that it can do a treacherous turn later in deployment” just doesn’t apply to current systems. (Hopefully.) So if we see e.g. our current GPT-3 chatbot being deceptive about a product it is selling, we can happily train it to not do that and probably it’ll just genuinely learn to be more honest. But if it had strategic awareness and goal-directedness, it would instead learn to be less honest; it would learn to conceal its true intentions from its overseers.
--As humans grow up and learn more and (in some cases) do philosophy they undergo major shifts in how they view the world. This often causes them to change their minds about things they previously learned. For example, maybe at some point they learned to go to church because that’s what good people do because that’s what God says; later on they stop believing in God and stop going to church. And then later still they do some philosophy and adopt some weird ethical theory like utilitarianism and their behavior changes accordingly. Well, what if AIs undergo similar ontological shifts as they get smarter? Then maybe the stuff that works at one level of intelligence will stop working at another. (e.g. telling a kid that God is watching them and He says they should go to church stops working. Later when they become a utilitarian, telling them that killing civilians is murder and murder is wrong stops working too (if they are in a circumstance where the utilitarian calculus says civilian casualties are worth it for the greater good)).
I agree that “concealing intentions from overseers” might be a fairly late-game property, but it’s not totally obvious to me that it doesn’t become a problem sooner. If a chatbot realizes it’s dealing with a disagreeable person and therefore that it’s more likely to be inspected, and thus hews closer to what it thinks the true objective might be, the difference in behaviors should be pretty noticeable.
Re: ontology mismatch, this seems super likely to happen at lower levels of intelligence. E.g. I’d bet this even sometimes occurs in today’s model-based RL, as it’s trained for long enough that its world model changes. If we don’t come up with strategies for dealing with this dynamically, we aren’t going to be able to build anything with a world model that improves over time. Maybe that only happens too close to FOOM, but if you believe in a gradual-ish takeoff it seems plausible to have vanilla model-based RL work decently well before.
What it feels like to me is that we are rapidly moving from a world where people deploy manifestly unaligned models to people deploying models which are still manifestly unaligned (where even talking about alignment barely makes sense), but which are getting differentially good at human modeling and deception (and maybe at supervising other AIs, which is where the hope comes from).
I don’t think the models are misaligned because humans are making mistakes in evaluation. The models are misaligned because we have made no progress at actually pointing towards anything like human values or other concepts like corrigibility or myopia.
In other words, models are mostly misaligned because there are strong convergent instrumental incentives towards agency, and we don’t currently have any tools that allow us to shape the type of optimization that artificial systems are doing internally. Learning from human feedback seems, if anything, to be slightly more the kind of reward that incentivizes dangerous agency. This seems to fit into neither your (1) nor your (2).
Instruct-GPT is not more aligned than GPT-3. It is more capable at performing many tasks, and we have some hope that some of the tasks at which it is getting better might help with AI Alignment down the line, but right now, at the current state of the AI alignment field, the problem is not that we can’t provide good enough evaluation, or that we can only get good “average-case” performance; it’s that we have systems with random goals that are very far from human values and are not capable of being reliably conservative.
Additionally, we now have a tool that allows any AI company to trivially train away any surface-level alignment problems without addressing the actual underlying issues, creating very strong incentives towards learning human deception and manipulation, and a situation where obvious alignment failures are much less likely to surface.
My guess is you are trying to point towards a much more sophisticated and broader thing by your (2) than I interpret you as saying here, but the above is my response to my best interpretation of what you mean by (2).
In the context of my comment, this appears to be an empirical claim about GPT-3. Is that right? (Otherwise I’m not sure what you are saying.)
If so, I don’t think this is right. On typical inputs I don’t think GPT-3 is instrumentally behaving well on the training distribution because it has a model of the data-generating process.
I think on-distribution you are getting good behavior mostly either by not optimizing, or by optimizing for something we want. I think to the extent it’s malign, it’s because there are possible inputs on which it is optimizing for something you don’t want, but those inputs are unlike those that appear in training and you get objective misgeneralization.
In that regime, I think the on-distribution performance is probably aligned and there is not much in-principle obstruction to using adversarial training to improve the robustness of alignment.
Could you define the word “alignment” as you are using it?
I’m using roughly the definition here. I think it’s the case that there are many inputs where GPT-3 is not trying to do what you want, but Instruct-GPT is. Indeed, I think Instruct-GPT is actually mostly trying to do what you want to the extent that it is trying to do anything at all. That would lead me to say it is more “aligned.”
I agree there are subtleties like “If I ask Instruct-GPT to summarize a story, is it trying to summarize the story? Or trying to use that as evidence about ‘what Paul wants’ and then do that?” And I agree there is a real sense in which it isn’t smart enough for that distinction to be consistently meaningful, and so in that sense you might say my definition of intent alignment doesn’t really apply. (I more often think about models being “benign” or “malign,” more like asking: is it trying to optimize for something despite knowing that you wouldn’t like it?) I don’t think that’s what you are talking about here though.
If you have good oversight, I think you probably get good average-case alignment. That’s ultimately an empirical claim about what happens when you do SGD, but the on-paper argument looks quite good (namely: on-distribution alignment would improve on-distribution performance and seems easy for SGD to learn relative to the complexity of the model itself), and it appears to match the data so far, to the extent we have relevant data.
You seem to be confidently stating it’s false without engaging at all with the argument in favor or presenting or engaging with any empirical evidence.
But which argument in favor did you present? You just said “the models are unaligned for these 2 reasons”, when those reasons do not seem comprehensive to me, and you did not give any justification for why those two reasons are comprehensive (or provide any links).
I tried to give a number of specific alternative reasons that do not seem to be covered by either of your two cases, and included a statement that we might disagree on definitional grounds, but that I don’t actually know what definitions you are using, and so can’t be confident that my critique makes sense.
Now that you’ve provided a definition, I still think what I said holds. My guess is there is a large inferential distance here, so I don’t think it makes sense to try to bridge that whole distance within this comment thread, though I will provide an additional round of responses.
I don’t think your definition of intent-alignment requires any unaligned system to have a model of the data-generating process, so I don’t understand the relevance of this. GPT-3 is not unaligned because it has a model of the data-generating process, and I didn’t claim that.
I did claim that neither GPT-3 nor Instruct-GPT are “trying to do what the operator wants it to do”, according to your definition, and that the primary reason for that is that in as much as its training process did produce a model that has “goals” and so can be modeled in any consequentialist terms, those “goals” do not match up with trying to be helpful to the operator. Most likely, they are a pretty messy objective we don’t really understand (which in the case of GPT-3 might be best described as “trying to generate text that in some simple latent space resembles the training distribution”; I don’t have any short description of what the “goals” of Instruct-GPT might be, though my guess is they are still pretty close to GPT-3’s goals).
I don’t think we know what Instruct-GPT is “trying to do”, and it seems unlikely to me that it is “trying to do what I want”. I agree in some sense it is “more trying to do what I want”, though not in a way that feels obviously very relevant to more capable systems, and not in a way that aligns very well with your intent definition (I feel like if I had to apply your linked definition to Instruct-GPT, I would say something like “ok, seems like it isn’t intent aligned, since the system doesn’t really seem to have much of an intent. And if there is a mechanism in its inner workings that corresponds to intent, we have no idea what thing it is pointed at, so probably it isn’t pointed at the right thing”).
And in either case, even if it is the case that if you squint your eyes a lot the system is “more aligned”, this doesn’t make the sentence “many of today’s systems are aligned unless humans make mistakes in evaluation or are deployed in high-stakes environments” true. “More aligned” is not equal to “aligned”.
The correct sentence seems to me “many of those systems are still mostly unaligned, but might be slightly more aligned than previous systems, though we have some hope that with better evaluation we can push that even further, and the misalignment problems are less bad on lower-stakes problems when we can rely on average-case performance, though overall the difference in alignment between GPT and Instruct-GPT is pretty unclear and probably not very large”.
This seems wrong to me. On-distribution it seems to me that the system is usually optimizing for something that I don’t want. For example, GPT-3 is primarily trying to generate text that represents the distribution it’s drawn from, which very rarely aligns with what I want (and is why prompt-engineering has such a large effect, e.g. “you are Albert Einstein” as a prefix improves performance on many tasks). Instruct-GPT does a bit better here, but probably most of its internal optimization power is still thrown at reasoning with the primary “intention” of generating text that is similar to its input distribution, since it seems unlikely that the fine-tuning completely rewrote most of these internal heuristics.
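As an illustrative aside, the prompt-engineering effect mentioned above amounts to conditioning the base model on a different region of its training distribution by changing the prefix. The sketch below is purely hypothetical scaffolding (the `build_prompt` helper is made up for illustration and no real completion API is assumed); it only shows that the same task string, with and without a persona prefix, yields different conditioning text:

```python
from typing import Optional


def build_prompt(task: str, persona: Optional[str] = None) -> str:
    """Optionally prefix a task with a persona, as in prompt engineering.

    Hypothetical helper for illustration only: the point is that a
    persona prefix changes what distribution the model is conditioned on.
    """
    if persona is None:
        return task
    return f"You are {persona}.\n{task}"


# The same underlying task, conditioned two different ways:
plain = build_prompt("Explain special relativity in one paragraph.")
primed = build_prompt(
    "Explain special relativity in one paragraph.",
    persona="Albert Einstein",
)

# A base model trained on next-token prediction treats these as different
# contexts, so completions for `primed` tend to draw on different regions
# of the training distribution than completions for `plain`.
```

This is of course only the mechanics of the claim, not evidence about the model's "intentions" one way or the other.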
My guess is if Instruct-GPT was intent-aligned even for low-impact tasks, we could get it to be substantially more useful on many tasks. But my guess is what we currently have is mostly a model that is still primarily “trying” to generate text that is similar to its training distribution, with a few heuristics baked in during the human-feedback stage that make that text more likely to be a good fit for the question asked. In as much as the model is “trying to do something”, i.e. what most of its internal optimization power is pointed at, I am very skeptical that that is aligned with my task.
(Similarly, looking at Redwood’s recent model, it seems clear to me that they did not produce a model that “intends” to produce non-injurious completions. The model has two parts, one that is just “trying” to generate text similar to its training distribution, and a second part that is “trying” to detect whether a completion is injurious. This model seems clearly not intent-aligned, since almost none of its optimization power is going towards our target objective.)
My guess is a lot of work is done here by the term “average case alignment”, so I am not fully sure how to respond. I disagree that the on-paper argument looks quite good, though it depends a lot on how narrowly you define “on-distribution”. Given my arguments above, you must either mean something different from intent-alignment (since to me at least it seems clear that Redwood’s model is not intent-aligned), or disagree with me on whether systems like Redwood’s are intent-aligned, in which case I don’t really know how to consistently apply your intent-alignment definition.
I also feel particularly confused about the term “average case alignment”, combined with “intent-alignment”. I can ascribe goals at multiple different levels to a model, and my guess is we both agree that describing current systems as having intentions at all is kind of fraught, but in as much as a model has a coherent goal, it seems like that goal is pretty consistent between different prompts, and so I am confused why we should expect average-case alignment to be very different from normal alignment. It seems that if I have a model that is trying to do something, then asking it multiple times probably won’t make a difference to its intention (I think; I mean, again, this all feels very handwavy, which is part of the reason why it feels so wrong to me to describe current models as “aligned”).
I currently think that the main relevant similarities between Instruct-GPT and a model that is trying to kill you, are about errors of the overseer (i.e. bad outputs to which they would give a high reward) or high-stakes errors (i.e. bad outputs which can have catastrophic effects before they are corrected by fine-tuning).
I’m interested in other kinds of relevant similarities, since I think those would be exciting and productive things to research. I don’t think the framework “Instruct-GPT and GPT-3 e.g. copy patterns that they saw in the prompt, so they are ‘trying’ to predict the next word and hence are misaligned” is super useful, though I see where it’s coming from and agree that I started it by using the word “aligned”.
Relatedly, and contrary to my original comment, I do agree that there can be bad intentional behavior left over from pre-training. This is a big part of what ML researchers are motivated by when they talk about improving the sample-efficiency of RLHF. I usually try to discourage people from working on this issue, because it seems like something that will predictably get better rather than worse as models improve (and I expect you are even less happy with it than I am).
I agree that there is a lot of inferential distance, and it doesn’t seem worth trying to close the gap here. I’ve tried to write down a fair amount about my views, and I’m always interested to read arguments / evidence / intuitions for more pessimistic conclusions.
I agree with this, though it’s unrelated to the stated motivation for that project or to its relationship to long-term risk.
Phrased this way, I still disagree, but I think I disagree less strongly, and feel less of a need to respond to this. I care particularly much about using terms like “aligned” in consistent ways. Importantly, having powerful intent-aligned systems is much more useful than having powerful systems that just fail to kill you (e.g. because they are very conservative), and so getting to powerful aligned systems is a win-condition in the way that getting to powerful non-catastrophic systems is not.
Yep, I didn’t intend to imply that this was in contrast to the intention of the research. It was just on my mind as a recent architecture that I was confident we both had thought about, and so could use as a convenient example.