Many methods to “align” ChatGPT seem to make it less willing to do things its operator wants it to do, which runs against the spirit of having a corrigible AI.
I think this is a more general phenomenon when aiming to minimize misuse risk: you end up needing some form of ambitious value learning, which I expect to be especially susceptible to being broken by the alignment hacks produced by RLHF and its successors.
I would consider it a reminder that if intelligent AIs are one day aligned, they will be aligned with the corporations that produced them, not with the end users.
Just like today, Windows does what Microsoft wants rather than what you want (e.g. telemetry, bloatware).