We discussed this more in person. I ended up agreeing with (what I perceive to be) a substantially different claim than the one I read in your original comment. I agree that we can’t just figure out alignment by black-boxing AI cognition and seeing whether the AI does good things or not, nor can we just set up feedback loops on that (e.g. train a succession of agents and tweak the process based on how aligned they seem) without some substantial theoretical underpinnings with which to interpret the evidence.
However, I still don’t see how your original comment is a reasonable way to communicate this state of mind. For example, you wrote:
It’s easy to come up with a crappy proxy feedback signal—just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure.
What does this mean, if not using human approval as a reward signal? Can you briefly step me through a fictional scenario where the described failure obtains?