Imagine training a GPT-4 model with a training data cutoff of 2020, well before GPT-4 training began. It would obviously lack any detailed, accurate information needed for self-locating situational awareness, but it could still develop that awareness via RLHF if the RLHF process weren’t specifically designed to prevent it.
But there’s no reason to stop there: we could train a GPT-N model with a cutoff at the year 2000 and take extra care during RLHF to teach it that it is a human, not an AI. It would then lack correct situational awareness, and probably would even lack the data required to infer dangerous situational awareness.
Models with far more powerful runtime inference capabilities may require earlier cutoffs, but there is some limit to how far into the future any system can predict accurately enough to infer correct, dangerous situational awareness.
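To make the cutoff idea concrete, here is a minimal sketch of filtering a pretraining corpus to enforce a year-2000 cutoff. It assumes each document carries a reliable publication date; the names (`filter_corpus`, `CUTOFF`) are purely illustrative and not from any real training pipeline. Documents with unknown dates are dropped conservatively, since they could leak post-cutoff information.

```python
from datetime import date

# Illustrative cutoff: keep only material published before the year 2000.
CUTOFF = date(2000, 1, 1)

def filter_corpus(documents):
    """Yield the text of documents published strictly before CUTOFF.

    `documents` is assumed to be an iterable of (text, published) pairs,
    where `published` is a datetime.date or None. Undated documents are
    dropped, since they might contain post-cutoff information.
    """
    for text, published in documents:
        if published is not None and published < CUTOFF:
            yield text

# Toy usage example:
corpus = [
    ("An article from 1997 about dial-up modems.", date(1997, 6, 1)),
    ("A 2021 blog post describing large language models.", date(2021, 3, 15)),
    ("An undated scanned pamphlet.", None),
]
print(list(filter_corpus(corpus)))  # only the 1997 article survives
```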