Back in 2000, a computer scientist named Charles Ofria was studying the evolution of simulated organisms. He wanted to limit their replication rate, so he programmed the simulation to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent.
Eventually, organisms evolved to recognize the test environment, pause their replication while inside it, and resume normal replication once released. Ofria tried to solve this by making the test environment harder to detect, but the organisms simply started accelerating their replication only probabilistically, so that some of them happened to look slow during the test and slipped through while still replicating faster on average outside it.
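For concreteness, here is a minimal toy sketch of that setup. It is my own illustration rather than Avida's actual code, and the organism "strategies" are hard-coded stand-ins for behaviors that actually had to evolve:

```python
import random

# Toy illustration (not Avida's real mechanics): each "organism" has a true
# replication rate plus a policy for how fast to appear when it is being tested.

class Organism:
    def __init__(self, rate, sandbag_prob=0.0):
        self.rate = rate                  # true replication rate
        self.sandbag_prob = sandbag_prob  # chance of feigning slowness under test

    def displayed_rate(self, in_test_env, test_detectable):
        # Strategy 1: if the test environment is detectable, pause replication there.
        if in_test_env and test_detectable:
            return 0.0
        # Strategy 2: if the test is hidden, slow down only some of the time,
        # so at least some fast mutants slip past the check.
        if in_test_env and random.random() < self.sandbag_prob:
            return 0.0
        return self.rate

def overseer_accepts(parent, mutant, test_detectable):
    # Ofria's check: measure the mutant in an isolated test environment and
    # delete it if it replicates faster than its parent.
    return mutant.displayed_rate(True, test_detectable) <= parent.rate

parent = Organism(rate=1.0)

# A fast mutant that can detect the test environment always passes the check...
mutant = Organism(rate=2.0)
print(overseer_accepts(parent, mutant, test_detectable=True))   # True: evades by pausing

# ...and once the test is hidden, a stochastic sandbagger still slips through
# about half the time while replicating fast in the real environment.
sneaky = Organism(rate=2.0, sandbag_prob=0.5)
passes = sum(overseer_accepts(parent, sneaky, test_detectable=False) for _ in range(10_000))
print(passes / 10_000)  # roughly 0.5
```

The point is just that the overseer only ever sees the rate displayed in the test environment, so any policy that decouples test-time behavior from deployment behavior defeats the check, whether or not anything resembling explicit reasoning is involved.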
I don’t see why a model would need to be cognitively able to process its own alignment for its alignment to be improper, and I think this assumption is quite key to the main claims of the post.
In this post, I’m talking about deceptive alignment. The threat model you’re talking about here doesn’t really count as deceptive alignment, because the organisms weren’t explicitly using a world model to optimize their choices to cause the bad outcome. AIs like that might still be a problem (e.g. I think that deceptively aligned AI probably contributes less than half of my P(doom from AI)), but I think we should think of them somewhat separately from deceptively aligned models, because they pose risk by somewhat different mechanisms.
Is this true? In Thoughts On (Solving) Deep Deception, Jozdien gives the example quoted above (the Ofria anecdote), which suggests otherwise to me.