Thanks for posting this! Not only does a model have to develop complex situational awareness and a long-term goal to become deceptively aligned, but it also has to develop these around the same time as it learns to understand the training goal, or earlier. I recently wrote a detailed, object-level argument that this is very unlikely. I would love to hear what you think of it!
I’ll give it a listen now.