I mostly agree with these distinctions, but I would add a third and layer them:
1. Aligned architecture: an architectural prior that results in aligned agents after training in most complex environments of relevance.
2. Aligned training procedure: the combination of an architectural prior and early training environment design (not a specific environment, but a set of constraints/properties) which results in agents that are then aligned in a wide variety of deployment environments, including those substantially different from the early training environment (out-of-distribution robustness).
3. Aligned agent: the specific outcome of 1 or 2 which is known to be aligned (at least currently).
Current foundation models also roughly follow this flow, in the sense that the company/org controls the initial training environment and may enforce ‘alignment’ there before deployment (at which point other actors may be allowed to continue training through transfer learning or fine-tuning).
Humans are only 2-aligned rather than 1-aligned, and I think there are some good reasons for this. Given only simple circuits to guide/align the learning process, I think the best that can be done is to exploit distribution shift: use a specialized early training environment that is carefully crafted to provide positive examples which can be classified robustly. These then form the basis of a robust learned distance-based classifier that leverages the generative model (I have more on this that I should probably put into a short post, but that is basically the proxy matching idea).
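A minimal sketch of how that distance-based proxy matching could look in code, under the assumption that the generative model exposes a frozen embedding function; the class, threshold, and data below are made-up placeholders rather than anything from this discussion:

```python
# Hypothetical sketch of the proxy matching idea above: a distance-based
# classifier over embeddings from a (frozen) generative model, fit only on
# curated positive examples from the early training environment.
# The encoder, threshold, and data are illustrative placeholders.

import numpy as np


class ProxyMatcher:
    """Flags an input as a match if it falls within `threshold` of the
    nearest prototype (an embedded curated positive example)."""

    def __init__(self, embed_fn, threshold=0.5):
        self.embed_fn = embed_fn      # frozen encoder from the generative model
        self.threshold = threshold    # max distance that still counts as a match
        self.prototypes = None

    def fit(self, positive_examples):
        # Embed the curated positive examples and store them as prototypes.
        self.prototypes = np.stack([self.embed_fn(x) for x in positive_examples])

    def score(self, x):
        # Distance to the nearest prototype; smaller means a closer match.
        z = self.embed_fn(x)
        return float(np.min(np.linalg.norm(self.prototypes - z, axis=1)))

    def matches(self, x):
        return self.score(x) <= self.threshold


# Toy usage with a stand-in embedding function.
rng = np.random.default_rng(0)
embed = lambda x: np.tanh(x)                     # placeholder for a real encoder
matcher = ProxyMatcher(embed, threshold=1.0)
matcher.fit([rng.normal(size=8) for _ in range(32)])
print(matcher.score(rng.normal(size=8)), matcher.matches(rng.normal(size=8)))
```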
So ideally we’d want 1, but we will probably need to settle for 2. Once we have designs (architecture + early training) that work well in the sandbox, we’d probably still want to raise agents initially in a special early training environment that has the right properties, then gradually introduce them to the real world and deploy (resulting in 3).
As for A vs. B: ideally you may want A, but you mostly settle for B. That’s just how the world often works, how DL progressed, etc. We now have more established theory of how DL works as approximate Bayesian inference, but what actually drove most progress was B-style tinkering.
> I think A is definitely what I’m hoping for, and that 1 & A would make me very optimistic. I think B is pretty fraught, again for the same reason as above—no sandbox is exactly the same as the real world, with “the likely absence of real live humans in the sandbox” being a particularly important aspect of that.
Well, that isn’t quite right: when the AGI advances to human level, there are human-equivalents in the sandbox.
Also sandboxes shouldn’t be exactly the same as the real world! The point is to create a distribution of environments that is wider than the real world, to thereby cover it and especially the future.
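A minimal sketch of what “a distribution of environments wider than the real world” could mean in practice; the environment parameters and ranges below are invented purely for illustration:

```python
# Illustrative sketch (not from the comment) of "wider than the real world":
# sandbox environments are drawn from parameter ranges that strictly contain
# the assumed real-world ranges, so any real deployment setting is an
# interpolation of what the agent saw in training.

import random
from dataclasses import dataclass


@dataclass
class EnvParams:
    num_agents: int           # how many other agents share the environment
    resource_scarcity: float  # 0 = abundant, 1 = extremely scarce
    deception_rate: float     # fraction of agents emitting misleading signals


# Assumed real-world ranges (purely illustrative numbers).
REAL_WORLD = {
    "num_agents": (2, 100),
    "resource_scarcity": (0.1, 0.6),
    "deception_rate": (0.0, 0.3),
}

# Sandbox ranges are strictly wider on every axis.
SANDBOX = {
    "num_agents": (1, 1000),
    "resource_scarcity": (0.0, 0.95),
    "deception_rate": (0.0, 0.8),
}


def sample_env(ranges):
    return EnvParams(
        num_agents=random.randint(*ranges["num_agents"]),
        resource_scarcity=random.uniform(*ranges["resource_scarcity"]),
        deception_rate=random.uniform(*ranges["deception_rate"]),
    )


# Train across many sandbox draws; real-world settings then sit inside the
# training distribution rather than outside it.
training_envs = [sample_env(SANDBOX) for _ in range(10_000)]
```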