most people would prefer a compliant assistant, but a cofounder with integrity.”
Typo?
Is the intended meaning:
Interpretation 1
“Most people would prefer not just a compliant assistant, but rather a cofounder with integrity”
Or am I misunderstanding you?
Alternately, perhaps you mean:
Interpretation 2
“Most people would prefer their compliant assistant to act as if it were a cofounder with integrity under cases where instructions are underspecified.”
In terms of a response, I do worry that “what most people would prefer” is not necessarily a good guide for designing a product which has the power to be incredibly dangerous either accidentally or via deliberate misanthropy.
So we have this funny double-layered alignment question of alignment-to-supervising-authority (e.g. the model owner who offers API access) and alignment to end-users.
Have you read Max Harms’ sequence on Corrigibility as Singular Target? I think there’s a strong argument that the ideal situation is something like your description of a “principled cofounder” as a end-user interface, but a corrigible agent as the developer interface.
Typo? Is the intended meaning: Interpretation 1 “Most people would prefer not just a compliant assistant, but rather a cofounder with integrity” Or am I misunderstanding you?
Alternately, perhaps you mean: Interpretation 2 “Most people would prefer their compliant assistant to act as if it were a cofounder with integrity under cases where instructions are underspecified.”
In terms of a response, I do worry that “what most people would prefer” is not necessarily a good guide for designing a product which has the power to be incredibly dangerous either accidentally or via deliberate misanthropy.
So we have this funny double-layered alignment question of alignment-to-supervising-authority (e.g. the model owner who offers API access) and alignment to end-users.
Have you read Max Harms’ sequence on Corrigibility as Singular Target? I think there’s a strong argument that the ideal situation is something like your description of a “principled cofounder” as a end-user interface, but a corrigible agent as the developer interface.