Expecting wrapper-minds as an appropriate notion of human value is the result of following selection theorem reasoning. Consistent decision-making seems to imply wrapper-minds, and there is furthermore a convergent drive toward their formation as mesa-optimizers under optimization pressure. It is therefore expected that AGIs become wrapper-minds in short order (or at least eventually), even if they are not immediately designed this way.
As I currently understand your points, they seem like not much evidence at all towards the wrapper-mind conclusion.
Why are wrapper-minds an “appropriate notion” of human values, when AFAICT they seem diametrically opposite on many axes (e.g. time-varying, context-dependent)?
Why do you think consistent decision-making implies wrapper-minds?
What is “optimization pressure”, and where is it coming from? What is optimizing the policy networks? SGD? Are the policy networks supposed to be optimizing themselves to become wrapper-minds? (I sketch the two readings I can imagine right after these questions.)
Why should we expect unitary mesa-optimizers with globally activated goals, when AFAICT we have never observed this, nor seen relatively large amounts of evidence for it? (I’d be excited to pin down a bet with you about policy internals and generalization.)
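To make that third question concrete, here is a toy sketch of the two readings of “optimization pressure” I can imagine; every name and signature in it is an illustrative assumption of mine rather than anything from the post or a real training setup:

```python
# Toy sketch of two readings of "optimization pressure" (illustrative names only).
import torch

def outer_pressure(policy_net, loss_fn, batches, lr=1e-3):
    """Reading 1: SGD is the optimizer. It nudges the policy network's parameters
    toward lower loss; the network is just the object being selected on."""
    opt = torch.optim.SGD(policy_net.parameters(), lr=lr)
    for batch in batches:
        opt.zero_grad()
        loss_fn(policy_net, batch).backward()
        opt.step()

def inner_pressure(candidate_plans, internal_grader):
    """Reading 2: the trained policy itself runs a search, scoring candidate plans
    with some internally represented objective; this is the mesa-optimizer picture."""
    return max(candidate_plans, key=internal_grader)
```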
wrapper-minds are best at achieving goals, including humanity’s goals
Seems doubtful to me, insofar as we imagine wrapper-minds to be grader-optimizers which globally optimize the output of some utility function over all states/universe-histories/whatever, or of some EU function over all plans.
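To be explicit about the picture I find doubtful, here is a minimal formalization; the notation is my own sketch, on the assumption that the grader scores whole plans:

$$
p^{*} \;\in\; \arg\max_{p \in \mathcal{P}} \; \mathbb{E}\big[\, U(h) \mid p \,\big]
$$

Here $\mathcal{P}$ is the space of all candidate plans, $h$ ranges over states or universe-histories, and $U$ is a single fixed utility (or EU) function whose output the grader reports; the wrapper-mind commits to whatever plan globally maximizes that one score.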
As I currently understand your points, they seem like not much evidence at all towards the wrapper-mind conclusion.
There are two wrapper-mind conclusions, and the purpose of my comment was to frame the distinction between them. The post seems to conflate them in the context of AI risk: it mostly talks about one of them while alluding to AI-risk relevance that seems to instead mostly concern the other. I cited the standard reasons for taking either of them seriously, in the forms that make conflating them easy. That doesn’t mean I accept the relevance of those reasons.
You can take a look at this comment for something about my own position on human values, which doesn’t seem directly relevant to this post or my comments here. Specifically, I agree that human values don’t have wrapper-mind character, either as expressed in people or as likely to get expressed in sufficiently human-like AGIs, but I expect that it’s a good idea for humans or those AGIs to eventually build wrapper-minds to manage the universe (and this point seems much more relevant to AI risk). I’ve maintained this distinction for a while.