The first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions.
Right, but then we continue the training process, which shapes the semi-outer-aligned algorithms into something that is more inner aligned?
Or is the thought that this is happening late in the game, after the algorithm is strategically aware and deceptively aligned, spoofing your adversarial test cases, while awaiting a treacherous turn?
But would that even work? SGD still updates the parameters of the system even in cases where the model “passes”. It seems like the model would be shaped more and more to do the things that the humans want, over many trillions (?) of training steps. Even if it starts out deceptively aligned, does it stay deceptively aligned?
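To make that concrete with a toy sketch (PyTorch assumed; the “passing” threshold here is a hypothetical stand-in for an adversarial behavioral test): even when the model's behavior passes the check, any nonzero loss still produces a gradient, and the optimizer step moves the parameters regardless.

```python
# Minimal sketch: a model whose output is good enough to "pass" a
# behavioral check still receives a parameter update whenever its
# loss is nonzero, so training keeps reshaping it on passed examples.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # stand-in for a much larger network
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

pred = model(x)
loss = torch.nn.functional.mse_loss(pred, y)

# Hypothetical "adversarial test": behavior counts as acceptable
# if the loss is under some threshold.
passes_check = loss.item() < 10.0
before = model.weight.detach().clone()

opt.zero_grad()
loss.backward()
opt.step()  # the update happens whether or not the check passed

after = model.weight.detach().clone()
print(f"passed check: {passes_check}, "
      f"weights moved: {not torch.equal(before, after)}")
```

Whether that continued reshaping actually erodes deceptive alignment, rather than merely refining the deception, is exactly the open question here.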