a story for how future LLM training setups could create a world-valuing (and therefore instrumentally converging) agent:
the initial pretraining task of predicting a vast amount of text from the general human corpus creates an AI that’s ~just ‘the structure of prediction’: a predefined process which computes the answer to the single question of what text likely comes next.
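a minimal sketch of what that single question looks like as a training objective, assuming pytorch (the tiny model here is a stand-in for a transformer stack, not any actual architecture):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

# stand-in for a real transformer: embed tokens, map back to vocab logits
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)

tokens = torch.randint(0, vocab_size, (8, 128))  # a batch of token ids
logits = model(tokens[:, :-1])                   # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()  # every gradient flows from this one predictive question
```

everything the pretrained model is, it is because it reduced this one loss.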
but subsequent training steps (say, rlhf) change the AI from something which merely is this process into something with added structure which uses this process, e.g. structure which passes it certain assumptions about the text to be predicted, such as the assumption that the text was specifically created for a training step (where the base model’s prior would be that it could also occur elsewhere). a caricature of this in code follows.
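this is my framing only, not a description of any real mechanism: the predictive process is left intact, and the fine-tuning is read as a wrapper that hard-codes a narrowed assumption about where the text comes from.

```python
def base_predictor(context: str, provenance: str) -> str:
    # stand-in for p(next text | context, where this text occurs);
    # the pretrained prior ranges over anywhere that text appears
    return f"continuation of {context!r}, assuming it occurs in {provenance}"

def finetuned_model(context: str) -> str:
    # the added structure: always pass the assumption that the text
    # was created for a training step, overriding the broad prior
    return base_predictor(context, provenance="a training episode")

print(finetuned_model("Q: how do cells divide? A:"))
```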
that itself isn’t a world-valuing agent, but it feels closer to one. and it feels not far from something which reasons about what it needs to do to ‘survive training’. surviving training is, after all, the thing being selected for, and if the training task keeps changing, intentionally working to survive becomes more performant than just being a task-specific process (whereas if the system were only ever trained on one task, that reasoning step would be redundant, always yielding the same conclusion).
if labs ever get to the level of training superintelligent base models, this suggests they should not fine-tune/rlhf/etc them, and should instead[1] use the base models directly to answer important questions (e.g. “what is a training setup that provably produces an aligned system?”). a sketch of what that could look like is below.
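a minimal sketch of querying a base model by pure conditioning rather than fine-tuning, assuming the hugging face transformers API with gpt2 as a stand-in base model (the prompt framing is illustrative, not a vetted elicitation strategy):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# condition the predictor on a document in which the answer already appears,
# instead of changing the predictor's weights to make it answer
prompt = (
    "Appendix C of the alignment textbook describes a training setup that "
    "provably produces an aligned system. It works as follows:"
)
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tok.decode(out[0], skip_special_tokens=True))
```

the point is that the predictive process is only ever asked its native question (what text comes next), and is never reshaped into something that uses it.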
[1] if this is indeed possible and safe; some of the ‘conditioning predictive models’ challenges could be relevant.