For one thing, you’d have to explicitly come up with the utility function before you can prove the AI follows it.
You can either make an AI that will proveably do what you mean, or make one that will hopefully figure out what you meant when you said “do what I mean,” and do that.
When I picture what a proven-Friendly AI looks like, I think of something where it’s goals are 1)Using a sample of simulated humans, generalize to unpack ‘do what I mean’ followed by 2)Make satisfying that your utility function.
Proving those two steps each rigorously would produce a proven-Friendly AI without an explicit utility function. Proving step 1 to be safe would obviously be very difficult; proving step 2 to be safe would probably be comparatively easy. Both, however, are plausibly rigorously provable.
For one thing, you’d have to explicitly come up with the utility function before you can prove the AI follows it.
You can either make an AI that will proveably do what you mean, or make one that will hopefully figure out what you meant when you said “do what I mean,” and do that.
When I picture what a proven-Friendly AI looks like, I think of something where it’s goals are 1)Using a sample of simulated humans, generalize to unpack ‘do what I mean’ followed by 2)Make satisfying that your utility function.
Proving those two steps each rigorously would produce a proven-Friendly AI without an explicit utility function. Proving step 1 to be safe would obviously be very difficult; proving step 2 to be safe would probably be comparatively easy. Both, however, are plausibly rigorously provable.
This is what I mean by an explicit utility function. An implicit one is where it never actually calculates utility, like how humans work.