OK, thanks. I’m new to this debate, I take it I’m wandering in to a discussion that may already have been had to death.
I guess I’m worried that RLHF should basically be thought of as capabilities research instead of alignment/safety research. The rationale for this would be: Big companies will do RLHF before the end by default, since their products will embarrass them otherwise. By doing RLHF now and promoting it we help these companies get products to market sooner & free up their time to focus on other capabilities research.
I agree with your claims (a) and (b) but I don’t think they undermine this skeptical take, because I think that if RLHF fails the failures will be different for really powerful systems than for dumb systems.
I think it’d be useful if you spelled out those failures you think will occur in powerful systems, that won’t occur in any intermediate system (assuming some degree of slowness sufficient to allow real world deployment of not-yet-AGI agentic models).
For example, deception: lots of parts of the animal kingdom understand the concept of “hiding” or “lying in wait to strike”, I think? It already showed up in XLand IIRC. Imagine a chatbot trying to make a sale—avoiding problematic details of the product it’s selling seems like a dominant strategy.
There are definitely scarier failure modes that show up in even-more-powerful systems (e.g. actual honest-to-goodness long-term pretending to be harmless in order to end up in situations with more resources, which will never be caught with RLHF), and I agree pure alignment researchers should be focusing on those. But the suggestion that picking the low-hanging fruit won’t build momentum for working on the hardest problems does seem wrong to me.
As another example, consider the Beijing Academy of AI’s government-academia-industry LLM partnership. When their LLMs fail to do what they want, they’ll try RLHF—and it’ll kind of work, but then it’ll fail in a bunch of situations. They’ll be forced to confront the fact that actually, objective robustness is a real thing, and start funding research/taking proto-alignment research way more seriously/as being on the critical path to useful models. Wouldn’t it be great if there were a whole literature waiting for them on all the other things that empirically go wrong with RLHF, up to and including genuine inner misalignment concerns, once they get there?
Thanks! I take the point about animals and deception.
Wouldn’t it be great if there were a whole literature waiting for them on all the other things that empirically go wrong with RLHF, up to and including genuine inner misalignment concerns, once they get there?
Insofar as the pitch for RLHF is “Yes tech companies are going to do this anyway, but if we do it first then we can gain prestige, people will cite us, etc. and so people will turn to us for advice on the subject later, and then we’ll be able to warn them of the dangers” then actually that makes a lot of sense to me, thanks. I still worry that the effect size might be too small to be worth it, but idk.
I don’t think that there are failures that will occur in powerful systems that won’t occur in any intermediate system. However I’m skeptical that the failures that will occur in powerful systems will also occur in today’s systems. I must say I’m super uncertain about all of this and haven’t thought about it very much.
With that preamble aside, here is some wild speculation:
--Current systems (hopefully?) aren’t reasoning strategically about how to achieve goals & then executing on that reasoning. (You can via prompting get GPT-3 to reason strategically about how to achieve goals… but as far as we know it isn’t doing reasoning like that internally when choosing what tokens to output. Hopefully.) So, the classic worry of “the AI will realize that it needs to play nice in training so that it can do a treacherous turn later in deployment” just doesn’t apply to current systems. (Hopefully.) So if we see e.g. our current GPT-3 chatbot being deceptive about a product it is selling, we can happily train it to not do that and probably it’ll just genuinely learn to be more honest. But if it had strategic awareness and goal-directedness, it would instead learn to be less honest; it would learn to conceal its true intentions from its overseers.
--As humans grow up and learn more and (in some cases) do philosophy they undergo major shifts in how they view the world. This often causes them to change their minds about things they previously learned. For example, maybe at some point they learned to go to church because that’s what good people do because that’s what God says; later on they stop believing in God and stop going to church. And then later still they do some philosophy and adopt some weird ethical theory like utilitarianism and their behavior changes accordingly. Well, what if AIs undergo similar ontological shifts as they get smarter? Then maybe the stuff that works at one level of intelligence will stop working at another. (e.g. telling a kid that God is watching them and He says they should go to church stops working. Later when they become a utilitarian, telling them that killing civilians is murder and murder is wrong stops working too (if they are in a circumstance where the utilitarian calculus says civilian casualties are worth it for the greater good)).
I agree that “concealing intentions from overseers” might be a fairly late-game property, but it’s not totally obvious to me that it doesn’t become a problem sooner. If a chatbot realizes it’s dealing with a disagreeable person and therefore that it’s more likely to be inspected, and thus hews closer to what it thinks the true objective might be, the difference in behaviors should be pretty noticeable.
Re: ontology mismatch, this seems super likely to happen at lower levels of intelligence. E.g. I’d bet this even sometimes occurs in today’s model-based RL, as it’s trained for long enough that its world model changes. If we don’t come up with strategies for dealing with this dynamically, we aren’t going to be able to build anything with a world model that improves over time. Maybe that only happens too close to FOOM, but if you believe in a gradual-ish takeoff it seems plausible to have vanilla model-based RL work decently well before.
OK, thanks. I’m new to this debate, I take it I’m wandering in to a discussion that may already have been had to death.
I guess I’m worried that RLHF should basically be thought of as capabilities research instead of alignment/safety research. The rationale for this would be: Big companies will do RLHF before the end by default, since their products will embarrass them otherwise. By doing RLHF now and promoting it we help these companies get products to market sooner & free up their time to focus on other capabilities research.
I agree with your claims (a) and (b) but I don’t think they undermine this skeptical take, because I think that if RLHF fails the failures will be different for really powerful systems than for dumb systems.
I think it’d be useful if you spelled out those failures you think will occur in powerful systems, that won’t occur in any intermediate system (assuming some degree of slowness sufficient to allow real world deployment of not-yet-AGI agentic models).
For example, deception: lots of parts of the animal kingdom understand the concept of “hiding” or “lying in wait to strike”, I think? It already showed up in XLand IIRC. Imagine a chatbot trying to make a sale—avoiding problematic details of the product it’s selling seems like a dominant strategy.
There are definitely scarier failure modes that show up in even-more-powerful systems (e.g. actual honest-to-goodness long-term pretending to be harmless in order to end up in situations with more resources, which will never be caught with RLHF), and I agree pure alignment researchers should be focusing on those. But the suggestion that picking the low-hanging fruit won’t build momentum for working on the hardest problems does seem wrong to me.
As another example, consider the Beijing Academy of AI’s government-academia-industry LLM partnership. When their LLMs fail to do what they want, they’ll try RLHF—and it’ll kind of work, but then it’ll fail in a bunch of situations. They’ll be forced to confront the fact that actually, objective robustness is a real thing, and start funding research/taking proto-alignment research way more seriously/as being on the critical path to useful models. Wouldn’t it be great if there were a whole literature waiting for them on all the other things that empirically go wrong with RLHF, up to and including genuine inner misalignment concerns, once they get there?
Thanks! I take the point about animals and deception.
Insofar as the pitch for RLHF is “Yes tech companies are going to do this anyway, but if we do it first then we can gain prestige, people will cite us, etc. and so people will turn to us for advice on the subject later, and then we’ll be able to warn them of the dangers” then actually that makes a lot of sense to me, thanks. I still worry that the effect size might be too small to be worth it, but idk.
I don’t think that there are failures that will occur in powerful systems that won’t occur in any intermediate system. However I’m skeptical that the failures that will occur in powerful systems will also occur in today’s systems. I must say I’m super uncertain about all of this and haven’t thought about it very much.
With that preamble aside, here is some wild speculation:
--Current systems (hopefully?) aren’t reasoning strategically about how to achieve goals & then executing on that reasoning. (You can via prompting get GPT-3 to reason strategically about how to achieve goals… but as far as we know it isn’t doing reasoning like that internally when choosing what tokens to output. Hopefully.) So, the classic worry of “the AI will realize that it needs to play nice in training so that it can do a treacherous turn later in deployment” just doesn’t apply to current systems. (Hopefully.) So if we see e.g. our current GPT-3 chatbot being deceptive about a product it is selling, we can happily train it to not do that and probably it’ll just genuinely learn to be more honest. But if it had strategic awareness and goal-directedness, it would instead learn to be less honest; it would learn to conceal its true intentions from its overseers.
--As humans grow up and learn more and (in some cases) do philosophy they undergo major shifts in how they view the world. This often causes them to change their minds about things they previously learned. For example, maybe at some point they learned to go to church because that’s what good people do because that’s what God says; later on they stop believing in God and stop going to church. And then later still they do some philosophy and adopt some weird ethical theory like utilitarianism and their behavior changes accordingly. Well, what if AIs undergo similar ontological shifts as they get smarter? Then maybe the stuff that works at one level of intelligence will stop working at another. (e.g. telling a kid that God is watching them and He says they should go to church stops working. Later when they become a utilitarian, telling them that killing civilians is murder and murder is wrong stops working too (if they are in a circumstance where the utilitarian calculus says civilian casualties are worth it for the greater good)).
I agree that “concealing intentions from overseers” might be a fairly late-game property, but it’s not totally obvious to me that it doesn’t become a problem sooner. If a chatbot realizes it’s dealing with a disagreeable person and therefore that it’s more likely to be inspected, and thus hews closer to what it thinks the true objective might be, the difference in behaviors should be pretty noticeable.
Re: ontology mismatch, this seems super likely to happen at lower levels of intelligence. E.g. I’d bet this even sometimes occurs in today’s model-based RL, as it’s trained for long enough that its world model changes. If we don’t come up with strategies for dealing with this dynamically, we aren’t going to be able to build anything with a world model that improves over time. Maybe that only happens too close to FOOM, but if you believe in a gradual-ish takeoff it seems plausible to have vanilla model-based RL work decently well before.