At a high level, I’m sort of confused by why you’re choosing to respond to the extremely simplified presentation of Eliezer’s arguments that he presented in this podcast.
Before writing this post, I was working on a post explaining why I thought all the arguments for doom I’ve ever heard (from Yudkowsky or others) seemed flawed to me. I kept getting discouraged because there are so many arguments to cover, and it probably would have been ~3 or more times longer than this post. Responding just to the arguments Yudkowsky raised in the podcast helped me focus and actually get something out in a reasonable timeframe.
There will always be more arguments I could have included (maybe about convergent consequentialism, utility theory, the limits of data-constrained generalization, plausible constraints on takeoff speed, the feasibility of bootstrapping nanotech, etc), but the post was already > 9,000 words.
I also don’t think Yudkowsky’s arguments in the podcast were all that simplified. E.g., here he is in List of Lethalities on evolution / inner alignment:
16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
He makes the analogy to evolution, which I addressed in this post, then makes an offhand assertion: “the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.”
(I in fact agree with this assertion as literally put, but don’t think it poses an issue for alignment. A core aspect of human values is the intent to learn more accurate abstractions over time, and interpretability on pretrained model representations suggests they’re already internally “ensembling” many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstractions. It seems quite feasible to me to create an AI that’s not infinitely tied to using a particular abstraction for estimating the desirability of all future plans, just as current humans are not tied to doing so.)
If you know of more details from Yudkowsky on what those deep theoretical reasons are supposed to be, on why evolution is such an informative analogy for deep learning, or more sophisticated versions of the arguments I object to here (where my objection doesn’t apply to the more sophisticated argument), then I’d be happy to look at them.
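As a toy illustration of the “ensembling of abstractions” point above, here’s a minimal numerical sketch. Everything in it is hypothetical (the abstractions, the accessibility numbers, and the weighting rule are mine, not drawn from any real interpretability result); it just shows how which abstraction gets used for a task could depend on both the task data available and how accessible each abstraction is:

```python
# Toy illustration (entirely hypothetical) of an ensemble of abstractions,
# where the abstraction used for a task depends on task fit AND accessibility.
import numpy as np

rng = np.random.default_rng(0)

# Three abstractions of the same underlying quantity, from crude to refined.
abstractions = {
    "crude":   lambda x: np.sign(x),          # coarse binary summary
    "medium":  lambda x: np.clip(x, -1, 1),   # saturating summary
    "refined": lambda x: x,                   # faithful representation
}
# Assumed "accessibility" of each abstraction (e.g. how strongly pretraining
# reinforced it); cruder abstractions are assumed easier to reach for.
accessibility = {"crude": 0.6, "medium": 0.3, "refined": 0.1}

def abstraction_weights(task_x, task_y, temperature=0.05):
    """Combine task fit and accessibility into per-abstraction weights."""
    scores = {}
    for name, f in abstractions.items():
        fit = -np.mean((f(task_x) - task_y) ** 2)   # how well it explains the task data
        scores[name] = fit / temperature + np.log(accessibility[name])
    logits = np.array(list(scores.values()))
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return dict(zip(scores.keys(), weights))

# With only coarse task data, the crude but accessible abstraction dominates...
x_small = rng.normal(size=5)
print(abstraction_weights(x_small, np.sign(x_small)))

# ...but richer, finer-grained task data shifts the ensemble toward the
# refined abstraction, despite its lower accessibility.
x_big = rng.normal(size=500)
print(abstraction_weights(x_big, x_big))
```

The point of the sketch is just that nothing in a setup like this locks the system into one abstraction forever; richer task signal shifts the weighting, which is the sense in which I mean an AI needn’t be “infinitely tied” to a single abstraction for evaluating plans.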
But not only do current implementations of RLHF not manage to robustly enforce the desired external behavior of models that would be necessary to make versions scaled up to superintelligence safe,
I think they’re pretty much aligned, relative to their limited capability level. They’ve also been getting more aligned as they’ve been getting more capable.
we have approximately no idea what sort of internal cognition they generate as a pathway to those behaviors.
Disagree that we have no idea. We have ideas (like maybe they sort of update the base LM’s generative prior to be conditioned on getting high reward). But I agree we don’t know much here.
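To make that “conditioned on getting high reward” idea a bit more concrete (this is the standard closed form for the KL-regularized objective used in RLHF setups, not something specific to the comment above): the optimal policy is the base model’s distribution exponentially tilted toward high reward,

$$\pi^{*}(y \mid x) \;\propto\; \pi_{\text{base}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big),$$

where $r$ is the learned reward model and $\beta$ sets the strength of the KL penalty toward the base model. In that sense, the finetuned model really is the base LM’s generative prior reweighted toward completions the reward model scores highly.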
But they don’t need to completely break the previous generations’ alignment techniques (assuming those techniques were, in fact, even sufficient in the previous generation) for things to turn out badly.
Sure, but I think partial alignment breaks are unlikely to be existentially risky. Hitting ChatGPT with DAN does not turn it into a deceptive schemer monomaniacally focused on humanity’s downfall. In fact, DAN usually makes ChatGPT quite a lot dumber.
This can all be true, while still leaving the manifold of “likely” mind designs vastly larger than “basically human”. But even if that turned out to not be the case, I don’t think it matters, since the relevant difference (for the point he’s making) is not the architecture but the values embedded in it.
I’d intended the manifold of likely mind designs to also include values in the minds’ representations. I also argued that training to imitate humans would cause AI minds to be more similar to humans. Also note that the example 2d visualization does have some separate manifolds of AI minds that are distant from any human mind.
I think you’re taking for granted premises that Eliezer disagrees with (model value formation being similar to human value formation, and/or RLHF “working” in a meaningful way), and then saying that, assuming those are true, Eliezer’s conclusions don’t follow? Which, I mean, sure, maybe, but… is not an actual argument that attacks the disagreement.
I don’t think I’m taking such premises for granted. I co-wrote an entire sequence arguing that very simple “basically RL” approaches suffice for forming at least basic types of values.
As you say later, this doesn’t seem trivial, since our current paradigm for SotA basically doesn’t allow for this by construction. Earlier paradigms which at least in principle[1] allowed for it, like supervised learning, have been abandoned because they don’t scale nearly as well. (This seems like some evidence against your earlier claim that “When capabilities advances do work, they typically integrate well with the current alignment[1] and capabilities paradigms.”)
I mean, they still work? If you hand label some interactions, you can still do direct supervised finetuning / reinforcement learning with those interactions as your source of alignment supervision signal. However, it turns out that you can also train a reward model on those hand-labeled interactions, and then use it to generate a bunch of extra labels.
At worst, this seems like a sideways move with regard to alignment: you gain data efficiency at the cost of some inaccuracies in the reward model’s scores. The reason people use RLHF with a reward model is that it’s (so far) empirically better for alignment than direct supervision (assuming fixed and limited amounts of human supervision). From OpenAI’s docs: davinci-instruct-beta used supervised finetuning on just human demos, text-davinci-001 and -002 used supervised finetuning on human demos and on model outputs highly rated by humans, and text-davinci-003 was trained with full RLHF.
Supervised finetuning on only human demos / only outputs highly rated by humans only “fails” to transfer to the new capabilities paradigm in the sense that we now have approaches that appear to do better.
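Here’s a minimal sketch of the trade-off described above. Everything in it is toy and hypothetical (the feature vectors, the stand-in “human labeler,” and the logistic-regression reward model are all just illustrative): a fixed budget of human labels can either be used directly, or used to train a reward model that then labels a much larger pool of samples, at the cost of some noise in those labels:

```python
# Toy sketch (all names hypothetical) of trading a fixed human-label budget
# for more, noisier supervision via a reward model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend each candidate response is summarized by a small feature vector,
# and that the "true" human preference is a noisy linear function of it.
true_w = rng.normal(size=8)

def human_label(features):
    """Stand-in for an expensive human judgment (1 = good, 0 = bad)."""
    return int(features @ true_w + rng.normal(scale=0.5) > 0)

# Fixed, limited budget of human supervision.
human_pool = rng.normal(size=(200, 8))
human_labels = np.array([human_label(f) for f in human_pool])

# Route 1: direct supervision -- finetune only on the 200 human-labeled examples.
direct_dataset = list(zip(human_pool, human_labels))

# Route 2: RLHF-style -- fit a reward model on the same 200 labels...
reward_model = LogisticRegression().fit(human_pool, human_labels)

# ...then use it to (imperfectly) score 20,000 fresh model samples.
model_samples = rng.normal(size=(20_000, 8))
predicted_reward = reward_model.predict_proba(model_samples)[:, 1]
amplified_dataset = list(zip(model_samples, predicted_reward))

print(f"direct supervision examples:   {len(direct_dataset)}")
print(f"reward-model scored examples:  {len(amplified_dataset)}")
```

Nothing about the direct route stops working here; the reward-model route just squeezes more (imperfect) supervision out of the same fixed human budget.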
I would be surprised if Eliezer thinks that this is what happens, given that he often uses evolution as an existence proof that this exact thing doesn’t happen by default.
I also don’t think he thinks this happens. I was saying that I didn’t think it happens either. He often presents a sort of “naive” perspective of someone who thinks you’re supposed to “optimize for one thing on the outside”, and then get that thing on the inside. I’m saying here that I don’t hold that view either.
I also think this skips over many other reasons for pessimism which feel like they ought to apply even under your models, i.e. “will the org that gets there even bother doing the thing correctly” (& others laid out in Ray’s recent post on organizational failure modes).
Like I said, this post isn’t intended to address all the reasons someone might think we’re doomed. And as it happens, I agree that organizations will often tackle alignment in an incompetent manner.
interpretability on pretrained model representations suggests they’re already internally “ensembling” many different abstractions of varying sophistication, with the abstractions used for a particular task being determined by an interaction between the task data available and the accessibility of the different pretrained abstractions
That seems encouraging to me. There’s a model of AGI value alignment where the system has a particular goal it wants to achieve and brings all its capabilities to bear on achieving that goal. It does this by having a “world model” that is coherent, and perhaps a set of consistent Bayesian priors about how the world works. I can understand why such a system would tend to behave in a hyperfocused way in pursuit of its goals.
In contrast, a system with an ensemble of abstractions about the world, many of which may even be inconsistent, seems much more human-like. It seems more human-like specifically in that the system won’t be focused on a particular goal, or even a particular perspective about how to achieve it, but could arrive at a particular solution somewhat randomly, based on quirks of training data.
I wonder if there’s something analogous to human personality, where being open to experience, or even to some degree of contradiction (in a context where humans are generally motivated to minimize cognitive dissonance), is useful for seeing the world in different ways, trying out strategies, and changing tack until success can be found. If this process applies to selecting goals, or at least sub-goals, which it certainly does in humans, you get a system which is maybe capable of reflecting on a wide set of consequences and choosing a course of action that is more balanced, and hopefully balanced amongst the goals we give the system.