At a high level, I’m sort of confused by why you’re choosing to respond to the extremely simplified version of Eliezer’s arguments that he presented in this podcast.
I do also have some object-level thoughts.
When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms. E.g., I expect that we can apply current alignment techniques such as reinforcement learning from human feedback (RLHF) to evolved architectures.
But not only do current implementations of RLHF not manage to robustly enforce the desired external behavior of models that would be necessary to make versions scaled up to superintelligence safe, we have approximately no idea what sort of internal cognition they generate as a pathway to those behaviors. (I have a further objection to your argument about dimensionality which I’ll address below.)
However, I think such issues largely fall under “ordinary engineering challenges”, not “we made too many capabilities advances, and now all our alignment techniques are totally useless”. I expect future capabilities advances to follow a similar pattern as past capabilities advances, and not completely break the existing alignment techniques.
But they don’t need to completely break the previous generations’ alignment techniques (assuming those techniques were, in fact, even sufficient in the previous generation) for things to turn out badly. For this to be comforting you need to argue against the disjunctive nature of the “pessimistic” arguments, or else rebut each one individually.
The manifold of mind designs is thus:
Vastly more compact than mind design space itself.
More similar to humans than you’d expect.
Less differentiated by learning process detail (architecture, optimizer, etc), as compared to data content, since learning processes are much simpler than data.
This can all be true, while still leaving the manifold of “likely” mind designs vastly larger than “basically human”. But even if that turned out to not be the case, I don’t think it matters, since the relevant difference (for the point he’s making) is not the architecture but the values embedded in it.
It also assumes that the orthogonality thesis should hold with respect to alignment techniques—that such techniques should be equally capable of aligning models to any possible objective.
This seems clearly false in the case of deep learning, where progress on instilling any particular behavioral tendencies in models roughly follows the amount of available data that demonstrate said behavioral tendency. It’s thus vastly easier to align models to goals where we have many examples of people executing said goals.
The difficulty he’s referring to is not one of implementing a known alignment technique to target a goal with no existing examples of success (generating a molecularly-identical strawberry), but of devising an alignment technique (or several) which will work at all. I think you’re taking for granted premises that Eliezer disagrees with (model value formation being similar to human value formation, and/or RLHF “working” in a meaningful way), and then saying that, assuming those are true, Eliezer’s conclusions don’t follow? Which, I mean, sure, maybe, but… is not an actual argument that attacks the disagreement.
As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
As you say later, this doesn’t seem trivial, since our current paradigm for SotA basically doesn’t allow for this by construction. Earlier paradigms which at least in principle[1] allowed for it, like supervised learning, have been abandoned because they don’t scale nearly as well. (This seems like some evidence against your earlier claim that “When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms.”)
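To spell out the “in principle, by construction” contrast, here is a minimal sketch (my own illustrative names, assuming a HuggingFace-style causal LM that exposes .logits; nothing here is from either post): in plain supervised finetuning, the only thing the gradient ever reinforces is a human-approved demonstration, so there is no step at which the model’s own, possibly bad, actions get rewarded.

```python
# Minimal sketch (illustrative, not from either post): supervised finetuning
# computes its loss only over curated, human-approved demonstrations.
# The model's own samples never enter the loss, so a bad self-generated
# action cannot be accidentally reinforced -- the remaining failure mode
# is mistakes in the curated data itself.
import torch.nn.functional as F

def sft_step(model, optimizer, approved_batch):
    """approved_batch["input_ids"]: token ids of human-approved demos only."""
    logits = model(input_ids=approved_batch["input_ids"]).logits  # (B, T, V)
    targets = approved_batch["input_ids"][:, 1:]                  # next tokens
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```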
As it happens, I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function. In fact, I think that this almost never happens.
I would be surprised if Eliezer thinks that this is what happens, given that he often uses evolution as an existence proof that this exact thing doesn’t happen by default.
I may come back with more object-level thoughts later. I also think this skips over many other reasons for pessimism which feel like they ought to apply even under your models, e.g. “will the org that gets there even bother doing the thing correctly” (& others laid out in Ray’s recent post on organizational failure modes). But for now, some positives (not remotely comprehensive):
In general, I think object-level engagement with arguments is good, especially when you can attempt to ground it against reality.
Many of the arguments (e.g. the section on evolution) seem like they point to places where it might be possible to verify the correctness of existing analogical reasoning. Even if it’s not obvious how the conclusion changes, helping figure out whether any specific argument is locally valid is still good.
The claim about transformer modularity is new to me and very interesting if true.

[1] Though obviously not in practice, since humans will still make mistakes, will fail to anticipate many possible directions of generalization, etc, etc.
At a high level, I’m sort of confused by why you’re choosing to respond to the extremely simplified version of Eliezer’s arguments that he presented in this podcast.
Before writing this post, I was working on a post explaining why I thought all the arguments for doom I’ve ever heard (from Yudkowsky or others) seemed flawed to me. I kept getting discouraged because there are so many arguments to cover, and it probably would have been ~3 or more times longer than this post. Responding just to the arguments Yudkowsky raised in the podcast helped me focus and actually get something out in a reasonable timeframe.
There will always be more arguments I could have included (maybe about convergent consequentialism, utility theory, the limits of data-constrained generalization, plausible constraints on takeoff speed, the feasibility of bootstrapping nanotech, etc), but the post was already > 9,000 words.
I also don’t think Yudkowsky’s arguments in the podcast were all that simplified. E.g., here he is in List of Lethalities on evolution / inner alignment:
16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
He makes the analogy to evolution, which I addressed in this post, then makes an offhand assertion: “the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.”
(I in fact agree with this assertion as literally put, but don’t think it poses an issue for alignment. A core aspect of human values is the intent to learn more accurate abstractions over time, and interpretability on pretrained model representations suggests they’re already internally “ensembling” many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstractions. It seems quite feasible to me to create an AI that’s not infinitely tied to using a particular abstraction for estimating the desirability of all future plans, just as current humans are not tied to doing so.)
If you know of more details from Yudkowsky on what those deep theoretical reasons are supposed to be, on why evolution is such an informative analogy for deep learning, or more sophisticated versions of the arguments I object to here (where my objection doesn’t apply to the more sophisticated argument), then I’d be happy to look at them.
But not only do current implementations of RLHF not manage to robustly enforce the desired external behavior of models that would be necessary to make versions scaled up to superintelligence safe,
I think they’re pretty much aligned, relative to their limited capabilities level. They’ve also been getting more aligned as they’ve been getting more capable.
we have approximately no idea what sort of internal cognition they generate as a pathway to those behaviors.
Disagree that we have no idea. We have ideas (like maybe they sort of update the base LM’s generative prior to be conditioned on getting high reward). But I agree we don’t know much here.
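For what it’s worth, the “update the base LM’s generative prior to be conditioned on getting high reward” idea has a standard formalization (this is the well-known closed-form optimum of the KL-regularized objective, not something claimed in the post; in my notation $\pi_{\mathrm{ref}}$ is the base LM, $r$ the reward model, and $\beta$ the KL coefficient). Maximizing

$$\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)$$

over policies $\pi$ gives

$$\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big),$$

i.e. the base model’s generative distribution reweighted toward completions the reward model scores highly, with $\beta$ controlling how far the finetuned policy can drift from that prior.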
But they don’t need to completely break the previous generations’ alignment techniques (assuming those techniques were, in fact, even sufficient in the previous generation) for things to turn out badly.
Sure, but I think partial alignment breaks are unlikely to be existentially risky. Hitting ChatGPT with DAN does not turn it into a deceptive schemer monomaniacally focused on humanity’s downfall. In fact, DAN usually makes ChatGPT quite a lot dumber.
This can all be true, while still leaving the manifold of “likely” mind designs vastly larger than “basically human”. But even if that turned out to not be the case, I don’t think it matters, since the relevant difference (for the point he’s making) is not the architecture but the values embedded in it.
I’d intended the manifold of likely mind designs to also include values in the minds’ representations. I also argued that training to imitate humans would cause AI minds to be more similar to humans. Also note that the example 2d visualization does have some separate manifolds of AI minds that are distant from any human mind.
I think you’re taking for granted premises that Eliezer disagrees with (model value formation being similar to human value formation, and/or RLHF “working” in a meaningful way), and then saying that, assuming those are true, Eliezer’s conclusions don’t follow? Which, I mean, sure, maybe, but… is not an actual argument that attacks the disagreement.
I don’t think I’m taking such premises for granted. I co-wrote an entire sequence arguing that very simple “basically RL” approaches suffice for forming at least basic types of values.
As you say later, this doesn’t seem trivial, since our current paradigm for SotA basically doesn’t allow for this by construction. Earlier paradigms which at least in principle[1] allowed for it, like supervised learning, have been abandoned because they don’t scale nearly as well. (This seems like some evidence against your earlier claim that “When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms.”)
I mean, they still work? If you hand-label some interactions, you can still do direct supervised finetuning / reinforcement learning with those interactions as your source of alignment supervision signal. However, it turns out that you can also train a reward model on those hand-labeled interactions, and then use it to generate a bunch of extra labels.
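As a minimal sketch of that relabeling step (illustrative names only; the post doesn’t specify how the reward model is fit or what the unlabeled pool looks like):

```python
# Minimal sketch (illustrative): fit a reward model on the small
# hand-labeled set, then use it to score a much larger pool of unlabeled
# interactions, trading a limited human-labeling budget for many
# (noisier) machine-generated labels.
from typing import Callable, List, Tuple

def expand_labels(
    hand_labeled: List[Tuple[str, float]],   # (interaction, human score)
    unlabeled: List[str],                    # unscored model interactions
    fit_reward_model: Callable[[List[Tuple[str, float]]], Callable[[str], float]],
) -> List[Tuple[str, float]]:
    """Return the human labels plus reward-model scores for the unlabeled pool."""
    reward_model = fit_reward_model(hand_labeled)   # e.g. finetune an LM head
    machine_labeled = [(x, reward_model(x)) for x in unlabeled]
    return hand_labeled + machine_labeled
```

Whatever sits downstream (PPO, best-of-n sampling, etc.) then optimizes against the reward model’s scores rather than against direct human judgments, which is where the inaccuracies mentioned below come in.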
At worst, this seems like a sideways movement with regard to alignment: you gain greater data efficiency at the cost of some inaccuracies in the reward model’s scores. The reason people use RLHF with a reward model is that it’s (so far) empirically better for alignment than direct supervision (assuming fixed and limited amounts of human supervision). From OpenAI’s docs: davinci-instruct-beta used supervised finetuning on just human demos, text-davinci-001 and -002 used supervised finetuning on human demos and on model outputs highly rated by humans, and text-davinci-003 was trained with full RLHF.
Supervised finetuning on only human demos / only outputs highly rated by humans only “fails” to transfer to the new capabilities paradigm in the sense that we now have approaches that appear to do better.
I would be surprised if Eliezer thinks that this is what happens, given that he often uses evolution as an existence proof that this exact thing doesn’t happen by default.
I also don’t think he thinks this happens. I was saying that I didn’t think it happens either. He often presents a sort of “naive” perspective of someone who thinks you’re supposed to “optimize for one thing on the outside”, and then get that thing on the inside. I’m saying here that I don’t hold that view either.
I also think this skips over many other reasons for pessimism which feel like they ought to apply even under your models, e.g. “will the org that gets there even bother doing the thing correctly” (& others laid out in Ray’s recent post on organizational failure modes).
Like I said, this post isn’t intended to address all the reasons someone might think we’re doomed. And as it happens, I agree that organizations will often tackle alignment in an incompetent manner.
interpretability on pretrained model representations suggests they’re already internally “ensembling” many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstractions
That seems encouraging to me. There’s a model of AGI value alignment where the system has a particular goal it wants to achieve and brings all its capabilities to bear on achieving that goal. It does this by having a “world model” that is coherent and perhaps a set of consistent Bayesian priors about how the world works. I can understand why such a system would tend to behave in a hyperfocused way to go out and achieve its goals.
In contrast, a system with an ensemble of abstractions about the world, many of which may even be inconsistent, seems much more human-like. It seems more human-like specifically in that the system won’t be focused on a particular goal, or even a particular perspective about how to achieve it, but could arrive at a particular solution somewhat randomly, based on quirks of training data.
I wonder if there’s something analogous to human personality, where being open to experience or even open to some degree of contradiction (in a context where humans are generally motivated to minimize cognitive dissonance) is useful for seeing the world in different ways and trying out strategies and changing tack, until success can be found. If this process applies to selecting goals, or at least sub-goals, which it certainly does in humans, you get a system which is maybe capable of reflecting on a wide set of consequences and choosing a course of action that is more balanced, and hopefully balanced amongst the goals we give a system.