I think the thing you call a “wrapper mind” is actually quite difficult to build in practice. I have a draft document providing some intuition as to why this is the case. The core idea is that there’s no “ground truth” regarding how to draw a learning system’s agency boundaries. Specialization within the learning system tends to produce a sort of “fuzzy haze” of possible internal subagents with varying values. Systems with learned values will reliably face issues with those values balancing against and conflicting with each other, and changing over time.
I have a hard time understanding how this could be possible. If I have an oracle, and then I write a computer program like:
while(true) {
var cb = ask_oracle("retrieve the syscall that will produce the most paperclips in conjunction with what further iterations of this loop will run");
cb();
}
Is that not sufficient? Am I not allowed to use ask_oracle as an abstraction for consulting a machine intelligence? Why not?
I agree that, if you have an oracle that already knows how to pursue any arbitrary specified goals, then it’s easy to make a competent wrapper agent. However, I don’t think it’s that easy to factor out values from the “search over effective plans” part of cognition. If you train a system to be competent at pursuing many different goals, then what I think you have, by default, is a system that actually has many different goals. It’s not trivial to then completely replace those goals with your intended single goal.
If you train a system to be competent at pursuing a single goal, what I think you end up with (assuming a complex environment) is a system that’s inner misaligned wrt that single goal, and instead pursues a broader distribution over shallow, environment-specific proxies for that single goal. See: evolution versus the human reward system / values.
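To make the contrast concrete, here’s a minimal sketch (in TypeScript; every type and function name here is hypothetical and purely illustrative, not anyone’s actual proposal) of the factoring the oracle program above takes for granted, versus the shape I’m claiming training tends to produce:

// The wrapper-mind picture assumes values factor cleanly out of plan search:
// a single goal slot, plus a goal-agnostic planner consulted through it.
type Goal = (outcome: string) => number;              // scores candidate outcomes
type Planner = (goal: Goal, state: string) => string; // goal-agnostic plan search

function wrapperAgent(planner: Planner, goal: Goal, state: string): string {
  // Swap in any goal and the same planner serves it, ask_oracle-style.
  return planner(goal, state);
}

// What I claim training on many goals in a complex environment produces instead:
// a bundle of learned heuristics, each with its own shallow proxy objective
// baked into how it acts. There is no single slot you can overwrite to
// retarget the whole system.
interface LearnedHeuristic {
  proxyGoal: string;                // e.g. "acquire resources", "imitate the demonstrator"
  act: (state: string) => string;
}

function trainedAgent(heuristics: LearnedHeuristic[], state: string): string {
  // Arbitration among conflicting sub-policies is itself value-laden;
  // this fixed pick is just a stand-in for that messy process.
  const chosen = heuristics[0];
  return chosen.act(state);
}

The point of the sketch is only that the second shape has no clean place to plug in “maximize paperclips”: the values are distributed through the learned machinery rather than sitting in a single argument you can set from outside.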