Sure, but at that point you have replaced trust in the code representing the idea of diamonds with trust in an SI aligned to give you the correct code.
Yeah.
Maybe a thing more central to how our views differ is that I don’t view training signals as identical to utility functions. They’re obviously related somehow, but they play different roles in systems. So to me, changing the training signal will obviously affect the trained system’s goals in some way, but it isn’t the same operation as writing some objective into an agent’s utility function, and that difference will become very relevant for a very intelligent system.
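A minimal sketch of that distinction in Python, with toy stand-ins throughout (none of the names, objectives, or numbers below come from the conversation): in (a) the objective is literally written into the agent, so editing it edits the agent's goal by definition; in (b) a training signal only shapes parameters, and whatever goals the resulting policy ends up with is a further empirical question.

```python
import numpy as np

# (a) Explicit utility function: the goal *is* the code.
def utility(outcome: np.ndarray) -> float:
    """A hand-written objective: here, just the 'diamond count' in a toy outcome encoding."""
    return float(outcome.sum())

def utility_maximizer(candidate_outcomes):
    # Rewriting `utility` rewrites this agent's goal, by construction.
    return max(candidate_outcomes, key=utility)

# (b) Training signal: a loss that shapes parameters during training.
def train_policy(initial_params, training_signal, steps=100, lr=0.01):
    """Gradient descent on a scalar training signal, via finite differences to keep the
    sketch dependency-free. The signal selects parameters that score well on it; nothing
    here writes the signal into the trained system as an objective it represents or pursues.
    """
    params = np.array(initial_params, dtype=float)
    eps = 1e-4
    for _ in range(steps):
        base = training_signal(params)
        grad = np.zeros_like(params)
        for i in range(params.size):
            bumped = params.copy()
            bumped[i] += eps
            grad[i] = (training_signal(bumped) - base) / eps
        params -= lr * grad
    return params

if __name__ == "__main__":
    # (a): changing `utility` directly changes which outcome gets picked.
    outcomes = [np.array([1.0, 0.0]), np.array([2.0, 3.0])]
    print(utility_maximizer(outcomes))

    # (b): the trained parameters do well on the signal (squared error to a target here),
    # but what the resulting policy "wants" is not settled by this loop.
    target = np.array([1.0, -2.0])
    signal = lambda p: float(((p - target) ** 2).sum())
    print(train_policy(np.zeros(2), signal))
```

The optimizer choice is irrelevant to the point; any training procedure leaves the same gap between "parameters that scored well on the signal" and "an agent whose utility function is the signal".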
Another thing to say, if you like the outer / inner alignment distinction:
1. Yes, if you have an agent that’s competent to predict some feature X of the world “sufficiently well”, and you’re able to extract the agent’s prediction, then you’ve made a lot of progress towards outer alignment for X; but
2. unfortunately your predictor agent is probably dangerous, if it’s able to predict X even when you ask it about what happens when very intelligent systems are acting, and
3. there’s still the problem of inner alignment (and in particular we haven’t clarified utility functions, meaning the way in which the trained system chooses its thinking and its actions to be useful for achieving its goal; we wouldn’t need that clarification if we had the predictor-agent, but that agent is unsafe).
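To make point 1 concrete, here is a hypothetical Python sketch, with every name and interface invented for illustration: it shows the narrow sense in which an extractable, competent predictor of X gives you a usable outer-alignment-style score, while leaving points 2 and 3 untouched.

```python
from typing import Callable, Sequence

Plan = str  # hypothetical stand-in for whatever representation plans actually have

def signal_from_predictor(predict_x: Callable[[Plan], float]) -> Callable[[Plan], float]:
    """Wrap an extracted prediction of feature X as a score over plans.

    This is the sense in which it is progress on *outer* alignment for X: the score is
    at least about the right feature of the world. It says nothing about whether running
    the predictor on questions involving very intelligent actors is safe (point 2), nor
    about whether a system trained against this score ends up internally pursuing X (point 3).
    """
    return lambda plan: predict_x(plan)

def best_plan(plans: Sequence[Plan], score: Callable[[Plan], float]) -> Plan:
    return max(plans, key=score)

if __name__ == "__main__":
    # Toy stand-in predictor: "expected diamonds" is just a word count here.
    toy_predictor = lambda plan: float(plan.count("diamond"))
    score = signal_from_predictor(toy_predictor)
    print(best_plan(["dig for coal", "synthesize a diamond", "synthesize diamond after diamond"], score))
```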