Epistemic status: I’m somewhat confident this is a useful axis along which to describe and compare alignment strategies and perspectives, but I’m pretty uncertain which end of it is better to focus on. I could be missing important considerations, or weighing the considerations listed here inaccurately.
When thinking about technical alignment strategy, I’m unsure whether it’s better to focus on trying to align systems that already exist (and are misaligned), or to focus on procedures that train aligned models from the start.
The first case is harder, which means focusing on it is closer to minimax optimization (minimize the badness of the worst-case outcome), which is often a good objective to pursue. Plus, it could be argued that it’s more realistic, because the current AI paradigm of “learn a bunch of stuff about the world with an SSL prediction loss, then fine-tune it to point at actually useful tasks” is very similar to this, and might require us to align pre-trained, unaligned systems.
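To make the paradigm concrete, here is a minimal sketch of the two phases, assuming a toy GRU next-token predictor and random stand-in data; it is an illustration of the general shape of “SSL pretraining, then fine-tuning”, not anyone’s actual training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 32, 64

class TinyLM(nn.Module):
    """A deliberately tiny next-token predictor standing in for an SSL model."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # logits over the next token at each position

model = TinyLM()
pretrain_opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Phase 1: SSL pretraining -- predict the next token of unlabeled sequences
# (random integers here standing in for real observations of the world).
corpus = torch.randint(0, VOCAB, (128, 16))
for _ in range(100):
    logits = model(corpus[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), corpus[:, 1:].reshape(-1))
    pretrain_opt.zero_grad(); loss.backward(); pretrain_opt.step()

# Phase 2: fine-tuning -- point the pretrained model at a "useful task",
# here an arbitrary binary classification head on the final hidden state.
clf_head = nn.Linear(DIM, 2)
finetune_opt = torch.optim.Adam(
    list(model.parameters()) + list(clf_head.parameters()), lr=1e-4)
inputs = torch.randint(0, VOCAB, (64, 16))
labels = torch.randint(0, 2, (64,))
for _ in range(50):
    hidden, _ = model.rnn(model.embed(inputs))
    loss = F.cross_entropy(clf_head(hidden[:, -1]), labels)
    finetune_opt.zero_grad(); loss.backward(); finetune_opt.step()
```

The alignment question in this frame is about what happens in phase 2: whether the fine-tuning step merely points an already-formed (and possibly misaligned) system at a task, or whether it is where alignment is actually produced.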
However, I don’t think only focusing on this paradigm is necessarily correct. I think it’s important that we have methods that can train aligned models (more or less) from scratch, while still being roughly competitive with other methods. I’m uncertain how hard fine-tuning SSL prediction models is, and I think the frame of “we’re training models to be a certain way” admits different solutions and strategies than “we need to align a potential superintelligence”.
One way to cash this out more concretely is the extent to which you view amplification or debate as fine-tuning for alignment, versus as a method of learning new capabilities (while maintaining alignment properties).
SSL models trained on real observations can be thought of as maps of the territory; tuning them without keeping this point of view in mind risks changing the map in ways motivated by something other than aligning it with the territory.
In particular, fine-tuning might distort the map, and amplification might generate sloppy fiction that gets accepted as territory. A proper use of fine-tuning in this frame is as a search for high-fidelity depictions of aligned agents: zooming in on the map by conditioning it on the particular situations we are looking for. This is different from realigning the whole map with reinforcement learning, which risks losing touch with the ground truth of the original training data. And a proper use of amplification is as reflection on blank or low-fidelity regions of the map: extrapolating past/future details and other hidden variables of the territory from what the map does show, and learning how they look when added to the map.
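A toy numerical sketch of the contrast (my own illustration, with made-up labels, probabilities, and reward): the pretrained model is a joint distribution over situations and behaviors; conditioning zooms in on one region of that map without altering it, while reward-based reweighting (the exponential tilt that KL-regularized reward maximization produces) shifts probability mass across the whole map, away from the pretraining data.

```python
import numpy as np

# The "map": a joint distribution over (situation, behavior) pairs,
# standing in for what an SSL model has learned from real observations.
# The labels and numbers are arbitrary assumptions for illustration.
situations = ["deployment", "training"]
behaviors = ["honest", "sycophantic"]
p_joint = np.array([
    [0.30, 0.20],   # P(situation="deployment", behavior)
    [0.25, 0.25],   # P(situation="training",   behavior)
])

# Conditioning: zoom in on the part of the map we care about,
# leaving the map itself untouched.
p_deploy = p_joint[0] / p_joint[0].sum()   # P(behavior | deployment)
print("conditioned on deployment:", dict(zip(behaviors, p_deploy.round(3))))

# RL-style reweighting: exponentially tilt the whole map toward a reward
# signal. If the reward is even slightly off, probability mass drifts away
# from the ground truth of the original training data everywhere at once.
reward = np.array([
    [0.0, 1.0],   # suppose sycophancy happens to score higher
    [0.0, 1.0],
])
beta = 5.0  # how hard we optimize the reward relative to staying on-distribution
p_tilted = p_joint * np.exp(beta * reward)
p_tilted /= p_tilted.sum()
print("reward-tilted map:", p_tilted.round(3))
```

In the conditioned case the map still reflects the data it was trained on, just restricted to the situations we asked about; in the tilted case the entire distribution has been reshaped by the reward, whether or not the reward tracks the territory.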