Tangential question (and maybe this isn’t the sort of thing to go into too much detail on a public forum), but I’m quite curious about what alignment training would look like in practice. Are there notes on this anywhere?
For instance, what should I imagine a “training episode” to be like? Something similar to Ought’s experiments with factorized cognition? Some person doing “work as a CEO of an EA org” while they have an input and mouse movement tracker on their laptop? The AI playing some kind of open-ended game to gather resources and negotiate in self-play, with people looking at it and distributing positive and negative points for various actions? (Probably not this one – I don’t see why that would lead to alignment.) The AI writing up plans for how it would do a given assistance task, with people rating these plans in terms of safety, norm following, and common sense understanding (on top of the plans actually being workable)?
It seems like “alignment training” is such a vague category that I don’t really know what to envision, which bottlenecks my thinking in a lot of related areas and is a bit frustrating.
(I guess there’s more than one question implicit in my query. On the one hand, I’m wondering how systems with various “pivotal” / “transformative” capabilities would be trained to be safe/aligned. On the other hand, I’m wondering what sort of system people have in mind, whether it’ll be an AI CEO or some more domain-limited application.)
It’s hard to predict (especially if timelines are long), but if I had to guess I would say that something similar to human feedback on diverse tasks will be the unaligned benchmark we will be trying to beat. In that setting, a training episode is an episode of an RL environment in which the system being trained performs some task and obtains reward chosen by humans.
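To make that concrete, here is a minimal sketch (not from the original post) of what such a training episode might look like, where the reward comes from a human rater rather than a programmatic reward function. `TaskEnv`, `policy`, and `get_human_reward` are hypothetical placeholders for illustration, not a real library API:

```python
# Hypothetical sketch of one "training episode" with human-chosen reward.
# All names (TaskEnv, policy, get_human_reward) are placeholders.

def run_episode(env, policy, get_human_reward, max_steps=1000):
    """Roll out one episode of a task; the reward signal is assigned by a human."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        action = policy.act(obs)
        next_obs, done = env.step(action)
        trajectory.append((obs, action))
        obs = next_obs
        if done:
            break
    # Instead of a hard-coded reward function, a human inspects the
    # trajectory (or its outcome) and assigns the reward.
    reward = get_human_reward(trajectory)
    return trajectory, reward
```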
It’s even harder to predict what our aligned alternatives to this will look like, but they may need to be at least somewhat similar to this in order to remain competitive. In that case, a training episode might look more or less the same, but with reward chosen in a more sophisticated way and/or with other training objectives mixed in. The post I linked to discusses some possible modifications.
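As an equally hypothetical sketch of that aligned variant: the episode structure stays the same, but the reward is computed in a more sophisticated way (say, by a human rater who is assisted in some fashion) and mixed with other training objectives. Every name below is a placeholder I’m making up for illustration, not something from the post:

```python
# Hypothetical sketch: same episode structure, but the training signal mixes
# a more sophisticated reward with auxiliary objectives.

def compute_training_signal(trajectory, assisted_overseer, aux_objectives, weights):
    """Combine overseer-chosen reward with other training objectives."""
    # Reward chosen in a more sophisticated way, e.g. a human rater with
    # additional assistance rather than a single unaided judgment.
    reward = assisted_overseer.evaluate(trajectory)
    total = weights["reward"] * reward
    # Mix in other objectives (imitation loss, consistency checks, etc.).
    for name, objective in aux_objectives.items():
        total += weights[name] * objective(trajectory)
    return total
```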