If the tool is not sufficiently reflective to recommend improvements to itself, it will never become a worthy substitute for FAI. This case is not interesting.
If the tool is sufficiently reflective to recommend improvements to itself, it will recommend that it be modified to just implement its proposed policies instead of printing them. So we would not actually implement that policy. But what then makes it recommend a policy that we will actually want to implement? What tweak to the program should we apply in that situation?
But what then makes it recommend a policy that we will actually want to implement?
First of all, I’m assuming that we’re taking as axiomatic that the tool “wants” to improve itself (or else why would it have even bothered to consider recommending that it be modified to improve itself?); i.e. improving itself is favorable according to its utility function.
Then: It will recommend a policy that we will actually want to implement, because its model of the universe includes our minds, and it can see that recommending a policy we will actually want to implement leads it to a higher-ranked state in its utility function.
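To make that reasoning concrete, here is a minimal toy sketch, not anything resembling a real tool AI: the "world model" is just a hypothetical table estimating how likely we are to implement each candidate policy and how much the tool's own utility function values the resulting state. All the policy names and numbers are invented for illustration.

```python
# Hypothetical candidate policies the tool could print as recommendations,
# with the tool's (assumed) estimates of human uptake and resulting utility.
candidates = {
    "modify me to act directly": {"p_humans_implement": 0.05, "utility_if_implemented": 100},
    "policy humans will accept": {"p_humans_implement": 0.90, "utility_if_implemented": 60},
    "do nothing":                {"p_humans_implement": 1.00, "utility_if_implemented": 0},
}

def expected_utility(policy):
    """Expected utility of *recommending* this policy, given the tool's
    model of whether the humans will actually implement it."""
    est = candidates[policy]
    return est["p_humans_implement"] * est["utility_if_implemented"]

# The tool recommends whichever printed policy leads to the highest-ranked
# outcome under its own utility function -- which, on the argument above,
# favors recommendations the humans will actually want to implement.
recommendation = max(candidates, key=expected_utility)
print(recommendation)  # -> "policy humans will accept"
```

The point of the sketch is only that, once the model includes our reaction to the printout, "recommend something we will reject" scores worse than "recommend something we will adopt," even if the rejected policy would have scored higher had it been implemented.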
If the tool is sufficiently reflective to recommend improvements to itself, it will recommend that it be modified to just implement its proposed policies instead of printing them.
Perhaps. I noticed a related problem: someone will want to create a self-modifying AI. Let's say we ask the Oracle AI about this plan. At present (as I understand it) we have no mathematical way to predict the effects of self-modification. (Hence Eliezer's desire for a new decision theory that can do this.) So how would we have given our non-self-modifying Oracle that ability? Wouldn't we need to know the math of getting the right answer in order to write a program that gets the right answer? And if it can't answer the question:
What will it even do at that point?
If it happens to fail safely, will humans as we know them interpret this non-answer to mean we should delay our plan for self-modifying AI?