I agree a fraction of the way, but when doing a dependency check, I feel like there are some conditions where the standard arguments go through.
I sketched out my view on the dependencies in “Where do you get your capabilities from?”. The TL;DR is that I think ChatGPT-style training basically consists of two different ways of obtaining capabilities:
Imitating internet text, which gives the models capabilities to do the-sorts-of-things-humans-do because generating such text requires some such capabilities.
Reinforcement learning from human feedback on plans, where people evaluate the implications of the proposals the AI comes up with, and rate how good they are.
I think both of these are basically quite safe. They do have some issues, but probably not of the style usually discussed by rationalists working in AI alignment, and possibly not even issues going beyond any other technological development.
The basic principle for why they are safe-ish is that all of the capabilities they gain are obtained through human capabilities. So for example, while RLHF-on-plans may optimize for tricking human raters to the detriment of how the rater intended the plans to work out, this “tricking” will also sacrifice the capabilities of the plans, because the only reason more effective plans are rated better is because humans recognize their effectiveness and rate them better.
Or relatedly, consider nearest unblocked strategy, a common proposal for why alignment is hard. This only applies if the AI is able to consider an infinitude of strategies, which again only applies if it can generate its own strategies once the original human-generated strategies have been blocked.
Why should we believe the “consistent-across-situations inner goals → deceptive alignment” mechanistic claim about how SGD works? Here are the main arguments I’m aware of:
These are the main arguments that the rationalist community seems to be pushing about it. For instance, one time you asked about it, and LW just encouraged magical thinking around these arguments. (Even from people who I’d have thought would be clearer-thinking.)
The non-magical way I’d analyze the smiling reward is that while people imagine that in theory updating the policy with antecedent-computation-reinforcement should make it maximize reward, in practice the information signal in this is very sparse and imprecise, so in practice it is going to take exponential time, and therefore something else will happen beforehand.
What is this “something else”? Probably something like: the human is going to reason about whether the AI’s activities are making progress on something desirable, and in those cases, the human is going to press the antecedent-computation-reinforcement button. Which again boils down to the AI copying the human’s capabilities, and therefore being relatively safe. (E.g. if you start seeing the AI studying how to deceive humans, you’re probably gonna punish it instead, or at least if you reward it then it more falls under the dual-use risk framework than the alignment risk framework.)
(Of course one could say “what if we just instruct the human to only reward the AI based on results and not incremental progress?”, but in that case the answer to what is going to happen before the AI does a treacherous turn is “the company training the AI runs out of money”.)
There’s something seriously wrong with how LWers fixated on reward-for-smiling and rationalized an explanation of why simplicity bias or similar would make this go into a treacherous turn.
OK, so this is how far I’m with you. Rationalist stories of AI progress are basically very wrong, and some commonly endorsed threat models don’t point at anything serious. But what then?
The short technical answer is that reward is only antecedent-computation-reinforcement for policy-gradient-based reinforcement learning. Value-based and model-based approaches (traditionally temporal-difference learning, but I think the SOTA is DreamerV3, which is pretty cool) use the reward to learn a value function, which they then optimize in a more classical way, allowing them to create novel capabilities in precisely the way that can eventually lead to deception, treacherous turn, etc.
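To make that distinction concrete, here is a minimal sketch on a toy setup. Everything in it (the function names, the learning rates, the three-state example) is an illustrative assumption, not how ChatGPT, AlphaStar, or DreamerV3 is actually trained. The first update only reinforces whatever action was just taken; the second learns a value estimate, and the third function then picks actions by optimizing against that estimate, including actions that were never themselves rewarded.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """Policy-gradient (REINFORCE-style) update.

    Reward acts as a multiplier on the gradient of log pi(action), so it only
    strengthens (or weakens) the computation that produced the taken action.
    """
    probs = softmax(logits)
    # d/d(logit_i) of log pi(action) is one_hot(action)_i - probs_i.
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def td0_step(values, state, reward, next_state, lr=0.1, gamma=0.9):
    """TD(0) update: move V(state) toward reward + gamma * V(next_state)."""
    target = reward + gamma * values[next_state]
    new_values = list(values)
    new_values[state] += lr * (target - values[state])
    return new_values


def greedy_action(values, successors):
    """Pick the action whose successor state has the highest learned value.

    `successors` maps action name -> resulting state (a hypothetical,
    hand-written model of the environment).
    """
    return max(successors, key=lambda a: values[successors[a]])


# Policy gradient: only the action that was taken gets pushed up.
logits = reinforce_step([0.0, 0.0], action=1, reward=1.0)

# Value learning followed by optimization against the learned values.
values = td0_step([0.0, 0.0, 1.0], state=1, reward=0.0, next_state=2)
choice = greedy_action(values, {"left": 0, "right": 2})
```

The point of the contrast is that `greedy_action` can select something the learner has never been rewarded for doing, as long as the learned values say it leads somewhere good.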
One obvious proposal is “shit, let’s not do that, ChatGPT seems like a pretty good and safe alternative, and it’s not like there’s any hype behind this anyway”. I’m not sure that proposal is right, because e.g. AlphaStar was pretty hype, and it was trained with one of these methods. But it sure does seem that there’s a lot of opportunity in ChatGPT now, so at least this seems like a directionally-correct update for a lot of people to make (e.g. stop complaining so much about the possibility of an even bigger GPT-5; it’s almost certainly safer to scale it up than it is to come up with algorithms that can improve model power without improving model scale).
However, I do think there are a handful of places where this falls apart:
Your question briefly mentioned “with clever exploration bonuses”. LW didn’t really reply much to it, but it seems likely that this could be the thing that does your question in. Maybe there are some clever model-free exploration bonuses, but if so I have never heard of them. The most advanced exploration bonuses I have heard of are from the Dreamer line of models, and those models have precisely the sort of consequentialist reasoning abilities that start being dangerous.
My experience is that language models exhibit a phenomenon I call “transposons”, where (especially when fed back into themselves after deep levels of scaffolding) there are some ideas that they end up copying too much after prompting, clogging up the context window. I expect there will end up being strong incentives for people to come up with techniques to remove transposons, and I expect the most effective techniques will be based on some sort of consequences-based feedback system which again brings us back to essentially the original AI threat model.
I think security is going to be a big issue, along different lines: hostile nations, terrorists, criminals, spammers, trolls, competing companies, etc. In order to achieve strong security, you need to be robust against adversarial attacks, which probably means continually coming up with new capabilities to fend them off. I guess one could imagine that humans will be coming up with those new capabilities, but that seems likely to be defeated by adversaries using AIs to come up with security holes, and regardless it seems like having humans come up with the new capabilities will be extremely expensive, so probably people will focus on the AI side of things. To some extent, people might classify this as dual-use threats rather than alignment threats, but at least the arms race element would also generate alignment threats I think?
I think the way a lot of singularitarians thought of AI is that general agency consists of advanced adaptability and wisdom, and we expected that researchers would develop a lot of artificial adaptability, and then eventually the systems would become adaptable enough to generate wisdom faster than people do, and then overtake society. However what happened was that a relatively shallow form of adaptability was developed, and then people loaded all of human wisdom into that shallow adaptability, which turned out to be immensely profitable. But I’m not sure it’s going to continue to work this way; eventually we’re gonna run out of human wisdom to dump into the models, plus the models reduce the incentive for humans to share our wisdom in public. So just as a general principle, continued progress seems like it’s eventually going to switch to improving the ability to generate novel capabilities, which in turn is going to put us back to the traditional story? (Except possibly with a weirder takeoff? Slow takeoff followed by ??? takeoff or something. Idk.) Possibly this will be delayed because the creators of the LLMs find patches, e.g. I think we’re going to end up seeing the possibility of allowing people to “edit” the LLM responses rather than just doing upvotes/downvotes, but this again motivates some capabilities development because you need to filter out spammers/trolls/etc., which is probably best done through considering the consequences of the edits.
If we zoom out a bit, what’s going on here? Here are some answers:
Conditioning on capabilities: There are strong reasons to expect capabilities to keep on improving, so we should condition on that. This leads to a lot of alignment issues, due to folk decision theory theorems. However, we don’t know how capabilities will develop, so we lose the ability to reason mechanistically when conditioning on capabilities, which means that there isn’t a mechanistic answer to all questions related to it. If one doesn’t keep track of this chain of reasoning, one might assume that there is a known mechanistic answer, which leads to the types of magical thinking seen in that other thread.
If people don’t follow-the-trying, that can lead to a lot of misattributions.
When considering agency in general, rather than consequentialist-based agency in particular, instrumental convergence is more intuitive in disjunctive form than in implication form. I.e. rather than “if an AI robustly achieves goals, then it will resist being shut down”, say “either an AI resists being shut down, or it doesn’t robustly achieve goals”.
Don’t trust rationalists too much.
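The disjunctive form of instrumental convergence in the list above is just the classical rewriting of an implication. Writing $R$ for “the AI robustly achieves goals” and $S$ for “the AI resists being shut down” (symbols chosen here purely for illustration):

```latex
\[
  (R \rightarrow S) \;\equiv\; (S \lor \neg R)
\]
```

Both sides are false only in the case where $R$ holds and $S$ does not, so the two phrasings rule out exactly the same situation; the difference is only in which one is easier to think with.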
One additional speculative thing I have been thinking of which is kind of off-topic and doesn’t fit in neatly with the other things:
Could there be “natural impact regularization” or “impact regularization by default”? Specifically, imagine you use a general-purpose search procedure which recursively invokes itself to solve subgoals for the purpose of solving some bigger goal.
If the search procedure’s solutions to subgoals “change things too much”, then they’re probably not going to be useful. E.g. for Rubik’s cubes, if you want to swap some of the cuboids, it does you no good if those swaps leave the rest of the cube scrambled.
Thus, to some extent, powerful capabilities would have to rely on some sort of impact regularization.
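A toy illustration of the Rubik’s cube point, as a sketch rather than an actual cube model: the two “moves” below are made-up permutations of seven positions, and all function names are hypothetical. Simply chaining the two moves disturbs every position, whereas combining them as a commutator (a standard cubing trick) yields a move that only touches three positions, which is the kind of low-impact building block a recursive search would want for its subgoals.

```python
def compose(p, q):
    """Return the permutation 'apply q first, then p'."""
    return tuple(p[q[i]] for i in range(len(p)))


def inverse(p):
    """Return the permutation that undoes p."""
    inv = [0] * len(p)
    for i, image in enumerate(p):
        inv[image] = i
    return tuple(inv)


def commutator(a, b):
    """Return a * b * a^-1 * b^-1, with the rightmost factor applied first."""
    return compose(a, compose(b, compose(inverse(a), inverse(b))))


def moved_points(p):
    """Positions that p does not leave in place."""
    return {i for i, image in enumerate(p) if i != image}


# Two made-up "moves": each cycles four positions, and they share position 3.
a = (1, 2, 3, 0, 4, 5, 6)  # 0 -> 1 -> 2 -> 3 -> 0
b = (0, 1, 2, 4, 5, 6, 3)  # 3 -> 4 -> 5 -> 6 -> 3

high_impact = moved_points(compose(a, b))    # all seven positions move
low_impact = moved_points(commutator(a, b))  # only positions 0, 3 and 4 move
```

A subgoal solver restricted to moves like `compose(a, b)` would keep undoing its own earlier work; one that can find moves like the commutator gets narrow changes that compose cleanly.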
The bias-focused rationalists like to map decision-theoretic insights to human mistakes, but I instead like to map decision-theoretic insights to human capacities and experiences. I’m thinking that natural impact regularization is related to the notion of “elegance” in engineering. Like if you have some bloated tool to solve a problem, then even if it’s not strictly speaking an issue because you can afford the resources, it might feel ugly because it’s excessive and puts mild constraints on your other underconstrained decisions, and so on. Meanwhile a simple, minimal solution often doesn’t have this.
Natural impact regularization wouldn’t guarantee safety, since it still allows deviations that don’t interfere with the AI’s function, but it sort of reduces one source of danger which I had been thinking about lately. Namely, I had been thinking that the instrumental incentive is to search for powerful methods of influencing the world, where “power” connotes the sort of raw power that unstoppably forces a lot of change, but really the instrumental incentive is often to search for “precise” methods of influencing the world, where one can push in a lot of information to effect narrow change. (A complication is that any one agent can only have so much bandwidth. I’ve been thinking bandwidth is probably going to become a huge area of agent foundations, and that it’s been underexplored so far. (Perhaps because everyone working in alignment sucks at managing their bandwidth?))
Maybe another word for it would be “natural inner alignment”, since in a sense the point is that capabilities inevitably select for inner alignment.
Sorry if I’m getting too rambly.