You admit that friendliness is not guaranteed. That means that you’re not wrong, which is a good sign, but it doesn’t fix the problem that friendliness isn’t guaranteed. You have as many tries as you want for intelligence, but only one for friendliness. How do you expect to manage it on the first try?
It also isn’t clear to me that this is the best strategy. In order to get that provably friendly thing to work, you have to deal with an explicit, unchanging utility function, which means that friendliness has to be right from the beginning. If you deal with an implicit utility function that will change as the AI comes to understand itself better, you could program an AI to recognise pictures of smiles, then let it learn that the smiles correspond to happy humans and update its utility function accordingly, until it (hopefully) decides on “do what we mean”.
It seems to me that part of the friendliness proof would require proving that the AI will follow its explicit utility function. This would be impossible. The AI is not capable of perfect Solomonoff induction, and will always have some bias, no matter how small. This means that its implicit utility function will never quite match its explicit utility function. Am I missing something here?
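To make that contrast a bit more concrete, here’s a toy sketch in Python (everything in it is invented for illustration; it isn’t anyone’s actual proposal): one agent has a utility function written down explicitly and fixed forever, while the other keeps an implicit estimate that it updates from evidence, and that estimate only ever approximates what we actually meant.

```python
# Toy sketch, invented for this comment: a fixed, explicit utility function
# versus an implicit, learned estimate that is updated from evidence and
# only ever approximates the real target.

import random

def explicit_utility(outcome):
    # Written down once, before the AI ever runs; it can never be corrected.
    return 1.0 if outcome == "smile_in_picture" else 0.0

class ValueLearner:
    """Keeps an implicit, updatable estimate of how much each outcome is worth."""

    def __init__(self):
        # Starts out caring only about pictures of smiles.
        self.estimate = {"smile_in_picture": 1.0, "happy_human": 0.0}

    def observe(self, outcome, observed_value, lr=0.1):
        # Nudge the estimate toward what the evidence suggests we meant.
        old = self.estimate.get(outcome, 0.0)
        self.estimate[outcome] = old + lr * (observed_value - old)

learner = ValueLearner()
for _ in range(100):
    # Evidence that smiles are only a noisy proxy for happy humans.
    learner.observe("happy_human", 1.0)
    learner.observe("smile_in_picture", 0.8 + 0.1 * random.random())

print(explicit_utility("happy_human"))  # 0.0 forever, no matter what is learned
print(learner.estimate)                 # drifts toward valuing happy humans,
                                        # but never exactly matches the target
```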
Typo?
Again, I think “provably friendly thing” mischaracterizes what MIRI thinks will be possible.
I’m not sure exactly what you’re saying in the rest of your comment. Have you read the section on indirect normativity in Superintelligence? I’d start there.
Given the apparent misconceptions about MIRI’s work even among LWers, it seems like you need to write a Main post clarifying what MIRI does and does not claim, and does and does not work on.
Fixed.
From what I can gather, there’s still supposed to be some kind of proof, even if it’s just the mathematical kind where you’re not really certain because there might be an error in it. The intent is to have some sort of program that maximizes a utility function U, and then to explicitly write that utility function as something along the lines of “do what I mean”.
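Roughly the kind of structure I have in mind, as a sketch only, with invented names rather than anything MIRI has published:

```python
# U is written explicitly, and the rest of the program just maximizes it.

def U(outcome):
    # The part that would have to encode "do what I mean",
    # and the part any friendliness proof would have to be about.
    return 1.0 if outcome == "did_what_was_meant" else 0.0

def expected_utility(action, model):
    # model(action) is assumed to return (outcome, probability) pairs.
    return sum(p * U(outcome) for outcome, p in model(action))

def choose(actions, model):
    return max(actions, key=lambda a: expected_utility(a, model))

# Toy usage: a "world model" that says asking first usually works out.
model = lambda a: [("did_what_was_meant", 0.9 if a == "ask_first" else 0.2),
                   ("did_something_else", 0.1 if a == "ask_first" else 0.8)]
print(choose(["ask_first", "guess"], model))   # "ask_first"
```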
I’m not sure what you’re referring to. Can you give me a link?
Superintelligence is a recent book by Nick Bostrom.
I think this is incorrect. If it isn’t, it at least requires some proof.
For one thing, you’d have to explicitly come up with the utility function before you can prove the AI follows it.
You can either make an AI that will provably do what you mean, or make one that will hopefully figure out what you meant when you said “do what I mean,” and do that.
When I picture what a proven-Friendly AI looks like, I think of something where its goals are 1) using a sample of simulated humans, generalize to unpack ‘do what I mean’, followed by 2) make satisfying that your utility function.
Proving those two steps each rigorously would produce a proven-Friendly AI without an explicit utility function. Proving step 1 to be safe would obviously be very difficult; proving step 2 to be safe would probably be comparatively easy. Both, however, are plausibly rigorously provable.
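Very roughly, and with every name below being a hypothetical stand-in rather than anything that exists, the shape I’m imagining is something like this:

```python
# A very loose sketch of the two-step picture above; the genuinely hard
# part is everything the step-1 stub hides.

def generalize_values(simulated_humans):
    # Step 1 stub: generalize from a sample of simulated humans to a
    # specification of what "do what I mean" actually refers to.
    return lambda outcome: sum(h(outcome) for h in simulated_humans) / len(simulated_humans)

class Agent:
    def __init__(self):
        self.utility = None

def adopt_as_utility_function(agent, spec):
    # Step 2: make satisfying that specification the agent's utility function.
    agent.utility = spec
    return agent

# Toy usage, with "simulated humans" that just score outcomes directly.
humans = [lambda o: 1.0 if o == "asked_first" else 0.0,
          lambda o: 0.8 if o == "asked_first" else 0.1]
agent = adopt_as_utility_function(Agent(), generalize_values(humans))
print(agent.utility("asked_first"))   # 0.9
```

Step 2 is little more than an assignment here, which is some of the intuition for why proving it safe should be comparatively easy; all of the real difficulty is hidden inside the step 1 stub.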
This is what I mean by an explicit utility function. An implicit one is where it never actually calculates utility, like how humans work.
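A toy contrast, purely my own illustration: the first agent below literally computes a utility number for each option, while the second just acts from a policy and never calculates utility at all.

```python
class ExplicitAgent:
    def __init__(self, utility):
        self.utility = utility            # a real function it evaluates

    def act(self, options):
        return max(options, key=self.utility)

class ImplicitAgent:
    def __init__(self, policy):
        self.policy = policy              # e.g. learned habits or heuristics

    def act(self, options):
        # No utility is ever computed; preferences are only implicit in
        # whatever the policy happens to do, much like humans.
        return self.policy(options)

explicit = ExplicitAgent(utility=len)                  # prefers longer names
implicit = ImplicitAgent(policy=lambda opts: opts[0])  # just grabs the first
print(explicit.act(["tea", "coffee"]))   # "coffee"
print(implicit.act(["tea", "coffee"]))   # "tea"
```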
Those points were excellent, and it is no credit to LW that the comment was on negative karma when I encountered it.
No, the approach based on provable correctness isn’t a 100% guarantee, and, since it involves an unupdateable UF, it has the additional disadvantage that if you don’t get the UF right the first time, you can’t tweak it.
The alternative family of approaches, based on flexibility, training and acculturation, has often been put forward by MIRI’s critics... and MIRI has never quantified why the one approach is better than the other.