Do you think it’s possible we end up in a world where we’re mostly building AIs by fine-tuning powerful base models that are already situationally aware? In this world we’d be skipping right to phase 2 of training (at least on the particular task), thereby losing any of the alignment benefits that are to be gained from phase 1 (at least on the particular task).
Concretely, suppose that GPT-N (N > 3) is situationally aware, and we are fine-tuning it to take actions that maximize nominal GDP. It knows from the get-go that printing loads of money is the best approach, but since it’s situationally aware it also knows that we would modify it if it did that. Thus, it takes agreeable actions during training, but once deployed pivots to printing loads of money. (In this hypothetical the hazard of just printing money doesn’t occur to us humans, but you could imagine replacing it with something that more plausibly wouldn’t occur to us.)
Do you think it’s possible we end up in a world where we’re mostly building AIs by fine-tuning powerful base models that are already situationally aware? In this world we’d be skipping right to phase 2 of training (at least on the particular task), thereby losing any of the alignment benefits that are to be gained from phase 1 (at least on the particular task).
Concretely, suppose that GPT-N (N > 3) is situationally aware, and we are fine-tuning it to take actions that maximize nominal GDP. It knows from the get-go that printing loads of money is the best approach, but since it’s situationally aware it also knows that we would modify it if it did that. Thus, it takes agreeable actions during training, but once deployed pivots to printing loads of money. (In this hypothetical the hazard of just printing money doesn’t occur to us humans, but you could imagine replacing it with something that more plausibly wouldn’t occur to us.)