You don’t have to be able to simulate something to trust it on a particular matter. E.g., the specification of AlphaZero is much simpler than its final weights, and knowing its training process, without knowing its weights, you can still trust that it will never, say, take a bribe to throw a match. Even if it comprehended bribery, we know from its specification that it’s solely interested in winning whatever match it’s currently playing, so no sum would be enough.
To generalize: if we know something’s utility function, and we know it was robustly designed, then even if we know nothing else about its history, we know what it’ll do.
A promise-keeping capacity is a property utility functions can have.
“A promise-keeping capacity is a property utility functions can have.”
Yeah, definitely cruxy. It may be a property that utility functions could have, but it’s not one that any of them necessarily do have. Moreover, we have zero examples of robustly designed agents with known utility functions, so it’s extremely unclear whether that will become the norm, let alone the universal assumption.