For the latter (toy environments) part, I think we need to distinguish a couple possible types of argument:
1. Alignment is a property of a training procedure. I.e., the goal is to find a training procedure that will reliably build aligned models, in whatever environment we run it in. We run that training procedure in sandbox environments, and it always builds aligned models. Next, we run that same training procedure (from scratch) in the real world, and we should expect it to likewise build an aligned model.
2. Alignment is a property of a particular trained model. So we train a model in a sandbox, verify that it’s aligned (somehow), and then use that very same trained model in the real world. (The two flows are sketched in code after the next list.)
And also:
A. We’re going to have strong theoretical reasons to expect alignment, and we’re going to use sandbox testing to validate those theories.
B. We’re going to have an unprincipled approach that might or might not create aligned models, and we’re going to use sandbox testing to explore / tweak specific trained models and/or explore / tweak the training approach.
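To keep 1 and 2 straight, here’s a purely illustrative sketch of the two flows; train, is_aligned, and the environment objects are placeholders I’m making up, not anything real:

```python
# Purely illustrative sketch of the 1-vs-2 distinction.
# train(), is_aligned(), and the environment objects are placeholders, not real APIs.

def deploy_via_procedure(train, is_aligned, sandbox_envs, real_world):
    # (1) Alignment as a property of the training procedure:
    # rerun the same procedure from scratch in each sandbox, check each result,
    # then run it one more time, from scratch, in the real world.
    assert all(is_aligned(train(env), env) for env in sandbox_envs)
    return train(real_world)  # a brand-new model, built in deployment conditions

def deploy_via_model(train, is_aligned, sandbox_env):
    # (2) Alignment as a property of one particular trained model:
    # train once in a sandbox, verify that specific model, then ship it as-is.
    model = train(sandbox_env)
    assert is_aligned(model, sandbox_env)
    return model  # the very same artifact goes into the real world
```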
I think Nate is talking about 2 & B, and you’re talking about 1 (I’m not sure whether with A or B).
I think that 2 is fraught because “aligned” has a different meaning in a sandbox versus the real world. In the sandbox, an “aligned” model would be trying to help / empower / whatever the sandbox inhabitants, and in the real world, an “aligned” model would be trying to help / empower / whatever “humanity”.
I think that 1 is potentially fraught too, at least in the absence of A, in that it’s conceivable that we’d find a training procedure that will reliably build aligned models when run in sandboxes while reliably building misaligned models when run in the real world.
I think A is definitely what I’m hoping for, and that 1 & A would make me very optimistic. I think B is pretty fraught, again for the same reason as above—no sandbox is exactly the same as the real world, with “the likely absence of real live humans in the sandbox” being a particularly important aspect of that.
I think there’s a continuum between B and A, and the more we can move from B towards A, the better I feel.
And I think my own time is better spent on trying to move from B towards A, compared to thinking through how to make the most realistic sandboxes possible. But I’m happy for you and anyone else to be doing the latter. And I’m also strongly in favor of people building tools and culture to make it more likely that future AGI programmers will actually do sandbox testing—I have advocated for one aspect of that here.
I mostly agree with these distinctions, but I would add a third and layer them:
1. Aligned architecture: An architectural prior that results in aligned agents after training in most complex environments of relevance.
2. Aligned training procedure: The combination of an architectural prior and early training environment design (not a specific environment, but a set of constraints/properties) which results in agents that are then aligned in a wide variety of deployment environments—including those substantially different from the early training environment (out-of-distribution robustness).
3. Aligned agent: The specific outcome of 1 or 2 which is known to be aligned (at least currently).
Current foundation models also vaguely follow this flow, in the sense that the company/org controls the initial training environment and may enforce ‘alignment’ there before deployment (at which point other actors may be allowed to continue training through transfer or fine-tuning).
Humans are only 2-aligned rather than 1-aligned, and I think there are some good reasons for this. Given only simple circuits to guide/align the learning process, I think the best that can be done is to actually exploit distribution shift: use a specialized early training environment that is carefully crafted to provide positive examples that can be classified robustly. These then form the basis of a robust learned distance-based classifier that leverages the generative model (I have more on this that I should probably put into a short post, but that’s basically the proxy matching idea).
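A minimal sketch of what I mean, assuming embed() stands in for features from the agent’s learned generative model (the function names and curated examples are illustrative, not a real system):

```python
# Hypothetical sketch of the distance-based "proxy matching" classifier described above.
# embed() stands in for features from the agent's learned generative model; it and the
# curated positive examples are assumptions for illustration, not a real API.
import numpy as np

def build_prototype(positive_examples, embed):
    """Average embedding of the hand-crafted positive examples from the early training env."""
    return np.mean([embed(x) for x in positive_examples], axis=0)

def proxy_match_score(situation, prototype, embed):
    """Cosine similarity to the 'good' prototype; higher means a closer proxy match."""
    z = embed(situation)
    return float(z @ prototype / (np.linalg.norm(z) * np.linalg.norm(prototype) + 1e-8))
```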
So ideally we’d want 1, but we’ll probably need to settle for 2. Once we have designs (architecture + early training) that work well in sandbox, we’d probably still want to raise them initially in a special early training environment that has the right properties, then gradually introduce them to the real world and deploy (resulting in 3).
As for A vs. B: ideally you may want A, but you mostly settle for B. That’s just how the world often works, and how DL progressed. We now have a more established theory of how DL works as approximate Bayesian inference, but what actually drove most progress was B-style tinkering.
I think A is definitely what I’m hoping for, and that 1 & A would make me very optimistic. I think B is pretty fraught, again for the same reason as above—no sandbox is exactly the same as the real world, with “the likely absence of real live humans in the sandbox” being a particularly important aspect of that.
Well, that isn’t quite right—when the AGI advances to human level, there are human-equivalents in the sandbox.
Also sandboxes shouldn’t be exactly the same as the real world! The point is to create a distribution of environments that is wider than the real world, to thereby cover it and especially the future.
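One way to cash that out, purely as a sketch in the spirit of domain randomization (the parameter names and ranges here are made up):

```python
# Illustrative sketch of a "wider than the real world" sandbox distribution,
# in the spirit of domain randomization. The parameter names and ranges are made up.
import random

REAL_WORLD_RANGES = {"other_agent_density": (0.1, 0.5), "resource_scarcity": (0.2, 0.8)}
SANDBOX_RANGES    = {"other_agent_density": (0.0, 1.0), "resource_scarcity": (0.0, 1.0)}

# Sanity check: every real-world range sits inside the corresponding sandbox range,
# so the sandbox distribution covers (and exceeds) the real-world one.
assert all(SANDBOX_RANGES[k][0] <= lo and hi <= SANDBOX_RANGES[k][1]
           for k, (lo, hi) in REAL_WORLD_RANGES.items())

def sample_sandbox_env():
    """Draw one environment config from the deliberately over-wide sandbox distribution."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in SANDBOX_RANGES.items()}
```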