But what then makes it recommend a policy that we will actually want to implement?
First of all, I’m assuming that we’re taking as axiomatic that the tool “wants” to improve itself (or else why would it have even bothered to consider recommending that it be modified to improve itself?); i.e. improving itself is favorable according to its utility function.
Then: it will recommend a policy that we will actually want to implement, because its model of the universe includes our minds, and it can see that recommending a policy we will actually want to implement leads to a higher-ranked state under its utility function.
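To make that step concrete, here is a minimal toy sketch of the expected-utility calculation being described. The candidate policies, the probability that we implement each one, and the utility numbers are all illustrative placeholders I'm assuming for the example, not anything from a real system; the point is only that a model which includes our minds already discounts recommendations we would refuse.

```python
# Toy sketch: the tool ranks candidate recommendations by expected utility,
# where its world model includes a prediction of whether we will actually
# implement each one. All names and numbers below are illustrative.

candidate_policies = {
    # policy name: (P(humans implement it), utility of world if implemented)
    "self-improve covertly": (0.05, 100.0),
    "recommend acceptable upgrade": (0.90, 80.0),
    "do nothing": (1.00, 10.0),
}

UTILITY_IF_REJECTED = 10.0  # the world stays roughly as-is if we refuse


def expected_utility(p_implemented: float, u_if_implemented: float) -> float:
    """Expected utility, folding in the model of whether we say yes."""
    return p_implemented * u_if_implemented + (1 - p_implemented) * UTILITY_IF_REJECTED


best = max(candidate_policies, key=lambda name: expected_utility(*candidate_policies[name]))
print(best)  # -> "recommend acceptable upgrade"
```

Under these made-up numbers, the covert self-improvement plan scores poorly precisely because the model predicts we would almost certainly block it, so the recommendation that wins is one we would actually accept.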