I agree that the Oracle/Genie/Tool/Agent categories don’t properly contain models trained on self-supervised objectives. Further, I think these aren’t particularly useful categories for current alignment research – we should just talk about how exactly we trained the model and what that behaviorally incentivizes.
Also notice that Simulators was written before fine-tuned models were widely available in the form of ChatGPT. All the arguments in the original post against interpreting LLMs as Oracle AIs no longer apply to instruction fine-tuned models like ChatGPT, which in fact seems to be a prototypical example of a non-superintelligent Oracle. Of course, this kind of Oracle AI is just a fine-tuned probability distribution, where the numbers it assigns to next tokens no longer correspond to their probabilities but to something else (“goodness relative to fine-tuning”?).
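To make the “goodness relative to fine-tuning” intuition slightly more concrete, here is a sketch using the standard KL-regularized RLHF objective (the notation $\pi_\theta$, $\pi_{\text{base}}$, $r$, $\beta$ is mine, not anything from the original post). Its closed-form optimum is the base distribution reweighted by exponentiated reward, so the fine-tuned next-token scores blend the base model's predictive probabilities with reward:

```latex
% KL-regularized fine-tuning objective (sketch):
\max_{\pi_\theta} \;
\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\bigl[ r(x, y) \bigr]
\;-\; \beta \, D_{\mathrm{KL}}\!\bigl( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{base}}(\cdot \mid x) \bigr)

% ...whose optimal policy is the base distribution tilted by the reward:
\pi^{*}(y \mid x)
\;=\; \frac{1}{Z(x)} \, \pi_{\text{base}}(y \mid x) \,
\exp\!\left( \frac{r(x, y)}{\beta} \right)
```

Actual RLHF training only approximates this optimum, but it captures the sense in which the fine-tuned “probabilities” are really base-model probabilities mixed with a measure of goodness under the reward model.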
Anyway, fine-tuning might be an important issue from an alignment perspective, insofar as fine-tuning seems more likely to result in misalignment than the pure self-supervised imitation learning of the base model.
First, the amount of fine-tuning data (dialogue examples for SL and human preference ratings for RL) is much more limited than the massive dataset the self-supervised base model is trained on. This could make it much more likely that the model misgeneralizes (inner misalignment). E.g. it may deem calling someone a racial slur worse than cutting off their hand, because the RL fine-tuning emphasizes avoiding slurs rather than avoiding cut-off hands.
Second, RLHF in particular may suffer from the political biases of human raters. This could lead the fine-tuned model to become deceptive and to lie about its beliefs when they are politically incorrect, which would be a case of outer misalignment.