Of course the default outcome of finetuning on any subset of data with easy-to-predict biases is that you don’t shift the model’s inductive biases on the vast majority of the distribution. This isn’t because of an analogy with evolution; it’s a necessity of how we train big transformers. In this case, the AI will likely just learn to speak the “corrigible language” the same way it learned to speak French, and this will make approximately zero difference to any of its internal cognition, unless you are doing transformations to its internal chain of thought that substantially change its performance on the actual tasks you are trying to optimize for.
This is a pretty helpful answer.
(Though you keep referencing the AI’s chain of thought. I wasn’t imagining training over the chain of thought. I was imagining training over the AI’s outputs, whatever those are in the relevant domain.)
I don’t understand what it would mean for “outputs” to be corrigible, so I feel like you must be talking about the internal chain of thought here? The output of a corrigible AI and a non-corrigible AI is the same for almost all tasks? They both try to perform any task as well as possible; the difference is how they relate to the task and how they handle interference.