I think you're assigning too little weight to provably friendly, but this is pretty funny.
How much would you be willing to wager that you will be able to follow the proof of Friendliness for the specific AI which gets implemented?
Very little. I don’t like my odds. If Eliezer has provable friendliness theorems but not an AI, it’s in his and everyone’s interest to distribute the generalized theorem as widely as possible, so that anyone working on recursively self-improving AGI has a chance to make it friendly, which means the algorithms will be checked by many people, publicly. If Eliezer has the theorems and an AI ready to implement, there’s nothing I can do about it at all. So why worry?
You could Trust Eliezer and everyone who checked his theorems and then put money towards the supercomputer cluster which is testing every string of bits for ‘friendly AI’.
In fact, at that point that project should become humanity’s only notable priority.
Probably not. That project does not sound efficient (to the point that it sounds infeasible).
Does the expected return on “this has a one in 2^(10^10) chance of generating a proven friendly AI” lose to anything else in terms of expected result? If so, we should be doing that to the exclusion of FAI research already.
Yes. It loses to any sane strategy of finding an FAI that is not “testing every string of bits”.
Testing every string of bits is a nice ‘proof of concept’ translation of “can prove a string of bits is an FAI if it in fact is an FAI” to “actually have an FAI”. It just isn’t one that works in practice unless, say, you have a halting oracle in your pocket.
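(To put a rough number on the 2^(10^10) figure in the exchange above, here is a minimal back-of-the-envelope calculation in Python; the probability is taken straight from the comment, everything else is illustrative.)

```python
import math

# The quoted success probability is 1 / 2**(10**10). The float value itself
# underflows to zero, so work with its base-10 logarithm instead.
log10_p = -(10**10) * math.log10(2)
print(f"log10 of the success probability ≈ {log10_p:.3e}")  # ≈ -3.01e9

# For comparison, the observable universe holds on the order of 10**80 atoms,
# i.e. only ~80 orders of magnitude. Any alternative strategy whose success
# probability is not similarly astronomical wins the expected-value comparison.
```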
If I had a halting oracle, I would have AGI.
By your logic, testing an actual AI against the proofs of friendliness isn’t worth the expense, because there might be an inconclusive result.
No you wouldn’t.
No, my logic doesn’t claim that either.
I write a program which recognizes AGI and halts when it finds one, testing all programs less than 1 GB. If the oracle says that program halts, I narrow the search space; if it says it doesn’t, I expand it. Each search takes only the time used by the halting oracle.
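(A minimal Python sketch of the search procedure described in the comment above. `halting_oracle`, `recognizes_agi`, and `enumerate_programs` are hypothetical stand-ins introduced only for illustration; the whole point of the exchange is that no such oracle actually exists.)

```python
def halting_oracle(program):
    """Hypothetical constant-time halting oracle: True iff `program` halts."""
    raise NotImplementedError("no such oracle exists")

def recognizes_agi(program):
    """Hypothetical test that recognizes an AGI when run on one."""
    raise NotImplementedError("assumed by the comment above")

def enumerate_programs(max_bytes, prefix=()):
    """Hypothetical enumeration of all programs up to `max_bytes`
    whose bitstring starts with `prefix`."""
    raise NotImplementedError("illustrative only")

def searcher(max_bytes, prefix=()):
    """A program that halts iff some program in the given search space
    is recognized as an AGI."""
    def run():
        for program in enumerate_programs(max_bytes, prefix):
            if recognizes_agi(program):
                return program  # halt: found one
        while True:             # otherwise never halt
            pass
    return run

def find_agi():
    # Expand the search space (starting at 1 GB) until the oracle says
    # the searcher halts, i.e. an AGI exists within the bound.
    max_bytes = 2**30
    while not halting_oracle(searcher(max_bytes)):
        max_bytes *= 2
    # Narrow the search space one bit at a time; each step is a single
    # oracle call, so the whole search costs only oracle time.
    # (A real run would stop as soon as the prefix encodes a complete program.)
    prefix = []
    for _ in range(8 * max_bytes):
        if halting_oracle(searcher(max_bytes, tuple(prefix + [0]))):
            prefix.append(0)
        else:
            prefix.append(1)
    return prefix  # the bits of a program the recognizer accepts
```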
We already have a way to prove that a string of bits is FAI; it is trivial to generalize that to AI.
“I have a string of bits. I do not know if it is FAI or not, but I have a way of testing that is perfectly specific and very sensitive. The odds of an AI researcher writing an FAI are vanishingly small. Should I test it if it was written by an AI researcher? Should I test it if it isn’t?”
Yes, if you have that proof and a halting oracle then you have AGI. That is vastly different to the claim “If I had a halting oracle I would have an AGI”.
That proof was a prerequisite for the idea of trying the brute-force method of finding the FAI.
Someone could also set an end condition that only an AGI is likely to ever reach and use iterated UTMs to find a program that reaches the desired end condition. I’m not sure how to figure out a priori what someone smarter than me would do and test for it, but it isn’t the first thing I would do with a constant-time halting oracle. The first thing I would do is test a program that found additional implications of first-order logic and halted when it found a contradiction. Then I would add the inverse of the Gödel statement and test that program.
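(A minimal sketch of that first experiment, under the same caveat: `halting_oracle`, `enumerate_proofs`, `FALSUM`, and the axiom/Gödel helpers named in the trailing comments are all hypothetical, used only to make the idea concrete.)

```python
def halting_oracle(program):
    """Hypothetical constant-time halting oracle: True iff `program` halts."""
    raise NotImplementedError("no such oracle exists")

def enumerate_proofs(axioms):
    """Hypothetical brute-force enumeration of every statement derivable
    from `axioms` in first-order logic."""
    raise NotImplementedError("illustrative only")

FALSUM = "⊥"  # stand-in for a derived contradiction

def contradiction_searcher(axioms):
    """A program that halts iff `axioms` derive a contradiction."""
    def run():
        for theorem in enumerate_proofs(axioms):
            if theorem == FALSUM:
                return theorem  # halt: contradiction found
        # if the theory is consistent, this enumeration never terminates
    return run

def is_consistent(axioms):
    # One oracle call settles consistency, a question ordinary computation
    # can only ever confirm in the negative (by eventually finding a proof
    # of a contradiction).
    return not halting_oracle(contradiction_searcher(axioms))

# Then, per the comment above, add the negation of the Gödel sentence
# (hypothetical helpers) and ask again:
#   is_consistent(FOL_AXIOMS)
#   is_consistent(FOL_AXIOMS + [negation(goedel_sentence(FOL_AXIOMS))])
```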
I usually assume “provably friendly” means “will provably optimise for complex human-like values correctly” and thus includes both actual humanity-wide values and one person’s values (and one could plausibly switch between the two options at a late stage of the design process).
And, well, I meant for it to be a little funny, so I’ll take that as a win!
Friendly means something like “will optimize for the appropriate complex human-like values correctly.”
Saying “we don’t have clear criteria for appropriate human values” is just another way of saying that defining Friendly is hard.
Provably Friendly means we have a mathematical proof that an AI will be Friendly before we start running the AI.
An AI that gives its designer ultimate power over humanity is almost certainly not Friendly, even if it were Provably designer-godlike-powers-implementing.
How do you define “appropriate”? It seems a little circular. Friendly AI is AI that optimises for appropriate values, and appropriate values are the ones for which we’d want a Friendly AI to optimise.
You might say that “appropriate” values are ones which “we” would like to see the future optimised towards, but I think whether these even exist humanity-wide is an open question (and I’m leaning towards “no”), in which case you should probably have a contingency definition for what to do if they, in fact, do not.
I would also be shocked if there were a “provable” definition of “appropriate” (as opposed to the friendliness of the program being provable with respect to some definition of “appropriate”).