As an example, here are three possible reactions to a no-ghost update:
Suppose that many (EDIT: a few) of your value shards take as input the ghost latent variable in your world model. You learn ghosts aren’t real. Let’s say this basically sets the ghost-related latent variable to false in all shard-relevant contexts. Then it seems perfectly fine that most of your shards keep on bidding away and determining your actions (e.g. protect your family), since most of your value shards are not in fact functions of the ghost latent variable. While it’s indeed possible to contrive minds where most of their values are functions of a variable in the world model which will get removed by the learning process, it doesn’t seem particularly concerning to me. (But I’m also probably not trying to tackle the problems in this post, or the superproblems which spawned them.)
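To make the quoted scenario concrete, here is a toy sketch in Python. The shard functions, the action names, the bid values, and the sum-of-bids aggregation rule are all made up for illustration; shard theory does not specify any of them.

```python
from collections import defaultdict


def protect_family_shard(world):
    # Hypothetical shard that never reads the ghost variable.
    return {"guard_home": 1.0} if world["family_in_danger"] else {}


def appease_ghost_shard(world):
    # The only shard here that takes the ghost latent variable as input.
    return {"leave_offering": 0.8} if world["ghost"] else {}


SHARDS = [protect_family_shard, appease_ghost_shard]


def total_bids(world):
    """Sum each shard's bids per action (an assumed aggregation rule)."""
    totals = defaultdict(float)
    for shard in SHARDS:
        for action, bid in shard(world).items():
            totals[action] += bid
    return dict(totals)


before = total_bids({"family_in_danger": True, "ghost": True})
# The no-ghost update: the latent variable is set to False everywhere.
after = total_bids({"family_in_danger": True, "ghost": False})
```

In this toy, the ghost-dependent shard simply stops bidding after the update, while the bid for `guard_home` is unchanged.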
There’s a small element of inner alignment to this, as well. Although an RL agent such as AIXI will want to wirehead if it forms an “accurate” model of how it gets reward, we can also see this as only one model consistent with the data, another being that reward is actually coming from task achievement (IE, the AI could internalize the intended values). Although this model will usually have at least slightly worse predictive accuracy, we can counterbalance that with process-level feedback which tells the system that’s a better way of thinking about it.
This doesn’t seem relevant for non-AIXI RL agents which don’t end up caring about reward or explicitly weighing hypotheses over reward as part of the motivational structure? Did you intend it to be?
With almost any kind of feedback process (IE: any concrete proposals that I know of), similar concerns arise. As I argue here, wireheading is one example of a very general failure mode. The failure mode is roughly: the process actually generating feedback is, too literally, identified with the truth/value which that feedback is trying to teach.
Output-based evaluation (including supervised learning, the most popular forms of unsupervised learning, and a lot of other stuff which treats models as black boxes implementing some input/output behavior or probability distribution or similar) can’t distinguish between a model which is internalizing the desired concepts and a model which is instead modeling the actual feedback process. These two do different things, but not in a way that the feedback system can differentiate.
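A toy sketch of this indistinguishability claim, in Python. Everything here is an illustrative assumption rather than anything from the post: the `State` fields, the tamperable sensor standing in for the feedback process, and the two hand-written models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    task_done: bool        # the fact the feedback is trying to teach
    sensor_tampered: bool  # hypothetical flaw in the feedback channel


def feedback(state: State) -> int:
    """Assumed feedback process: reads 1 if the task is done OR the sensor was tampered with."""
    return int(state.task_done or state.sensor_tampered)


def intended_model(state: State) -> int:
    """Internalizes the desired concept: outputs 1 only when the task is really done."""
    return int(state.task_done)


def feedback_modeling_model(state: State) -> int:
    """Models the feedback process itself, flaw included."""
    return int(state.task_done or state.sensor_tampered)


def loss(model, states) -> int:
    """Output-based evaluation: count disagreements with the feedback signal."""
    return sum(abs(model(s) - feedback(s)) for s in states)


# Training data in which nobody has tampered with the sensor.
train = [State(task_done=True, sensor_tampered=False),
         State(task_done=False, sensor_tampered=False)]

# A state outside the training data: task not done, sensor tampered with.
novel = State(task_done=False, sensor_tampered=True)

train_losses = (loss(intended_model, train), loss(feedback_modeling_model, train))
novel_outputs = (intended_model(novel), feedback_modeling_model(novel))
novel_losses = (loss(intended_model, [novel]), loss(feedback_modeling_model, [novel]))
```

Both models get the same loss on `train`, so nothing in the evaluation favors one over the other, yet they give different outputs on `novel`. If the tampered state were itself scored by this feedback process, the intended model would be the one penalized.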
In terms of shard theory, as I understand it, the point is that (absent arguments to the contrary, which is what we want to be able to construct), shards that implement feedback-modeling like this cannot be disincentivized by the feedback process, since they perform very well in those terms. Shards which do other things may or may not be disincentivized, but the feedback-modeling shards (if any are formed at any point) definitely won’t, unless of course they’re just not very good at their jobs.
So the problem, then, is: how do we arrange training such that those shards have very little influence, in the end? How do we disincentivize that kind of reasoning at all?
Plausibly, this should only be tackled as a knock-on effect of the real problem, which is actually giving good feedback that points in the right direction; however, it remains a powerful class of counterexamples which challenges a great many proposals. (And therefore, trying to generate the analogue of the wireheading problem for a given proposal seems like a good sanity check.)