OK, now I get what you are saying! Interesting. I am skeptical that this will work for most alignment problems, perhaps because they lack a simple conceptual core. In particular, I doubt that corrigibility and non-deceptiveness have simple conceptual cores. I hope I’m wrong.
Well, if you worry that these properties don’t have a simple conceptual core, maybe you can do the trick where you try to formalize a subset of them that does have a small conceptual core. That’s basically Evan’s move with myopia, treated as an easier-to-study subset of non-deceptiveness.