I think that framing is rather strange, because in the Minecraft example the superintelligent diamond tool maximizer doesn’t need to understand code or human language. It simply searches for plans that maximize diamond tools.
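To be concrete about what I mean by “simply searches”: here is a minimal sketch, where `candidate_plans`, `simulate`, and `count_diamond_tools` are hypothetical stand-ins for the maximizer’s plan space, world model, and hard-coded objective.

```python
# Hypothetical sketch: the maximizer scores plans with its own world model and
# objective; nothing in this loop requires understanding code or human language.

def best_plan(candidate_plans, simulate, count_diamond_tools):
    """Return the plan whose simulated outcome contains the most diamond tools."""
    return max(candidate_plans,
               key=lambda plan: count_diamond_tools(simulate(plan)))
```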
But assuming you could ask that question through a suitable interface the SI understood—and given some reasons to trust that giving the correct answers is instrumentally rational for the SI—then yes I agree that should work.
Ok. So yeah, I agree that in the hypothetical, actually being able to ask that question to the SI is the hard part (as opposed, for example, to it being hard for the SI to answer accurately).
My framing is definitely different from yours. The statement, as I framed it, could be interesting, but it doesn’t seem to me to answer the question about utility functions. It doesn’t explain how the code that’s found actually encodes the idea of diamonds and does its thinking in a way that’s really, thoroughly aimed at making there be diamonds. It does that somehow, and the superintelligence knows how it does that. But we don’t, so we, unlike the superintelligence, can’t use that analysis to be justifiably confident that the code will actually lead to diamonds. (We can be justifiably confident of that by some other route, e.g. because we asked the SI.)
Sure, but at that point you have substituted trust in the code representing the idea of diamonds for trust in an SI aligned to give you the correct code.
Yeah.
Maybe a more central source of the difference in our views is that I don’t view training signals as identical to utility functions. They’re obviously related somehow, but they play different roles in systems. So to me, changing the training signal will obviously affect the trained system’s goals in some way, but it won’t be identical to the operation of writing some objective into an agent’s utility function, and the non-identity will become very relevant for a very intelligent system.
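A minimal sketch of the distinction I have in mind, with `rollout`, `update`, and the rest as hypothetical stand-ins: the utility function is consulted by the agent at decision time, whereas the training signal only ever touches the system as a score on sampled behavior.

```python
def utility_maximizer_act(options, utility):
    """An explicit utility maximizer: the objective is written into the agent,
    which scores options with it at decision time."""
    return max(options, key=utility)

def train_policy(params, rollout, training_signal, update, n_steps):
    """A trained system: the training signal only shapes the parameters.
    Whatever goals the resulting policy has are implicit in `params`, and
    need not be identical to `training_signal`."""
    for _ in range(n_steps):
        behavior = rollout(params)
        params = update(params, training_signal(behavior))
    return params
```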
Another thing to say, if you like the outer / inner alignment distinction:
1. Yes, if you have an agent that’s competent to predict some feature X of the world “sufficiently well”, and you’re able to extract the agent’s prediction, then you’ve made a lot of progress towards outer alignment for X (see the sketch after this list); but
2. unfortunately your predictor agent is probably dangerous, if it’s able to predict X even when asked about what happens when very intelligent systems are acting, and
3. there’s still the problem of inner alignment (and in particular we haven’t clarified utility functions—the way in which the trained system chooses its thinking and its actions to be useful for achieving its goal—a clarification we wouldn’t need if we had the predictor agent, but that agent is unsafe).
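Here is the sketch referenced in point 1, assuming a hypothetical extractable `predictor` with a `predict_diamonds` method; points 2 and 3 are precisely what it leaves out (whether the predictor itself is safe, and how the system that ends up pursuing this objective works inside).

```python
def outer_objective(plan, predictor):
    """Score a plan by the extracted prediction of the feature X we care about
    (here, resulting diamonds). This is the outer-alignment progress of point 1."""
    return predictor.predict_diamonds(plan)

def choose_plan(candidate_plans, predictor):
    """Pick the plan the predictor scores highest; this says nothing about whether
    the predictor is safe (point 2) or how a trained agent thinks inside (point 3)."""
    return max(candidate_plans, key=lambda plan: outer_objective(plan, predictor))
```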