It seems like most/all large models (especially language models) will first be trained in a similar way: self-supervised learning on large unlabelled raw datasets (such as web text). There appears to be limited room for manoeuvre or creativity in shaping the objective or training process at this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data.
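To make the compression framing concrete: the standard pretraining loss is next-token cross-entropy, which (measured in nats or bits) is exactly the code length the model assigns to the data, so driving the loss down is the same as learning a better compressor. A minimal sketch of this equivalence, using random stand-in logits rather than a real model:

```python
# Minimal sketch (not any lab's actual training code) of why next-token
# cross-entropy is a compression objective: summed NLL in bits is the code
# length needed to encode the tokens under the model's predictive distribution.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 16
logits = torch.randn(seq_len, vocab_size)          # stand-in for model outputs
tokens = torch.randint(vocab_size, (seq_len,))     # stand-in for training data

# Negative log-likelihood summed over the sequence, in nats.
nll_nats = F.cross_entropy(logits, tokens, reduction="sum")
# Convert to bits: this is the length of an (arithmetic-coded) encoding of the
# tokens under the model, so a lower loss means a shorter encoding.
code_length_bits = nll_nats / torch.log(torch.tensor(2.0))
print(f"code length: {code_length_bits.item():.1f} bits for {seq_len} tokens")
```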
The next stage, when we try to direct the model to perform a certain task (whether trivially via prompting, via fine-tuning on human preference data, or by some other method), seems to be where most of the variance in outcomes/safety will come in, at least in the current paradigm. I therefore think it could be worth ML safety researchers focusing on analyzing and optimizing this second stage as a way of narrowing the problem/experiment space. Mech interp focused on the reward model used in RLHF could be an interesting direction here.
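For concreteness about the object such mech interp work would study: RLHF reward models are typically trained with a pairwise (Bradley-Terry style) preference loss over human comparisons. A minimal sketch below, assuming a toy linear reward head over fixed feature vectors rather than a full transformer:

```python
# Minimal sketch of the pairwise preference objective commonly used to train
# RLHF reward models. The "reward model" here is a hypothetical linear head
# over precomputed features, standing in for a full language model backbone.
import torch
import torch.nn.functional as F

feature_dim = 32
reward_model = torch.nn.Linear(feature_dim, 1)  # scalar reward per response

# Hypothetical batch of representations for (chosen, rejected) response pairs
# taken from human preference comparisons.
chosen = torch.randn(8, feature_dim)
rejected = torch.randn(8, feature_dim)

r_chosen = reward_model(chosen).squeeze(-1)
r_rejected = reward_model(rejected).squeeze(-1)

# Bradley-Terry style loss: push the reward of the human-preferred response
# above that of the rejected one. This trained reward model is the artefact a
# mech interp analysis would then probe.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(f"preference loss: {loss.item():.3f}")
```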