I see what you’re arguing. I’m actually not sure what to do about ongoing power shifts; I’m currently thinking of the problem as “we have a benchmark system A and want to design a system B that is competitive with A”. The benchmark system should be the original AI system (before it does things like hiding its prior). Of course we can’t do that if A already has weird beliefs, so in this case we have to do something like tracing back to the process that produced A. Hopefully this results in a system that has “symmetric” weird beliefs (e.g. if A falsely thinks it owns a lot of resources, then we design B to also falsely think it owns a lot of resources).
At some meta level, we could see any process that produces an AI (e.g. some AI research/development/deployment strategy implemented by humans) as an agent A and then attempt to design an aligned competitive version B of this process. This high-level picture looks pretty sketchy at the moment.