Really enjoying this sequence! For the purposes of relaxed adversarial training, I definitely want something closer to absolute myopia than just dynamic consistency. However, I think even absolute myopia is insufficient for the purposes of preventing deceptive alignment.[1] For example, as is mentioned in the post, an agent that manipulates the world via self-fulfilling prophecies could still be absolutely myopic. However, that’s not the only way in which I think an absolutely myopic agent could still be a problem. In particular, there’s an important distinction regarding the nature of the agent’s objective function, one I think about a lot when I think about myopia, that seems to be missing here.
In particular, I think there’s a distinction between agents with objective functions over states of the world vs. their own subjective experience vs. their output.[2] For the sort of myopia that I want, I think you basically need the last thing; that is, you need an objective function which is just a function of the agent’s output and completely disregards the consequences of that output. Having an objective function over the state of the world bleeds into full agency too easily, and having an objective function over your own subjective experience leads to the possibility of wanting to gather resources to run simulations of yourself in order to capture your own subjective experience. If your agent simply isn’t considering/thinking about how its output will affect the world at all, however, then I think you might be safe.[3]
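To illustrate the distinction, here’s roughly how I picture an objective over predicted world states versus an objective purely over the output itself (just a toy sketch of my own; all the names here are made up for illustration):

```python
from typing import Callable

Output = str
WorldState = dict  # hypothetical stand-in for a world state

def state_based_objective(
    output: Output,
    predict_consequences: Callable[[Output], WorldState],
    utility_over_states: Callable[[WorldState], float],
) -> float:
    # Scores the output by the value of the world state it is predicted to
    # cause -- the kind of objective that bleeds into full agency.
    return utility_over_states(predict_consequences(output))

def output_based_objective(
    output: Output,
    score_output: Callable[[Output], float],
) -> float:
    # Scores the output directly (e.g., a prediction's accuracy against the
    # ground truth), never consulting a model of the output's downstream effects.
    return score_output(output)
```

The point is just that the second signature never even takes a world model as an argument.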
Instead, I want it to just be thinking about making the best prediction it can. Note that this still leaves open the door for self-fulfilling prophecies, though if the selection of which self-fulfilling prophecy to go with is non-adversarial (i.e. there’s no deceptive alignment), then I don’t think I’m very concerned about that.
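As a toy example of what selection among self-fulfilling prophecies means here (entirely made up, not from the post): suppose the outcome depends on the prediction, and more than one prediction is consistent with the outcome it causes.

```python
def outcome_given_prediction(prediction: str) -> str:
    # Toy world: the market crashes if and only if a crash is predicted.
    return "crash" if prediction == "crash" else "no crash"

# Both candidate predictions are fixed points -- each one comes true if made --
# so accuracy alone doesn't pin down which prophecy the predictor outputs.
fixed_points = [p for p in ("crash", "no crash") if outcome_given_prediction(p) == p]
print(fixed_points)  # ['crash', 'no crash']
```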
I like the way you are almost able to turn this into a ‘positive’ account (the way generalized objectives are a positive account of myopic goals, but speaking in terms of failure to make certain Pareto improvements is not). However, I worry that any goal over states can be converted to a goal over outputs which amounts to the same thing, by calculating the expected value of the action according to the old goal. Presumably you mean some sufficiently simple action-goal so as to exclude this.
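To spell out the conversion I have in mind (my own notation, nothing from the post): given a utility function $U$ over states and a world model $P(s \mid a)$, one can always define

$$U_{\text{out}}(a) \;=\; \mathbb{E}_{s \sim P(\,\cdot \mid a)}\big[U(s)\big],$$

which is literally a function of the action/output $a$ alone, yet reproduces exactly the behavior of the original goal over states.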
Yeah, I agree. I almost said “simple function of the output,” but I don’t actually think simplicity is the right metric here. It’s more like “a function of the output that doesn’t go through the consequences of said output.”
In particular, I think there’s a distinction between agents with objective functions over states of the world vs. their own subjective experience vs. their output.
This line of thinking seems very important to me!
The following point might be obvious to Evan, but it’s probably not obvious to everyone: Objective functions over the agent’s output should probably not be interpreted as objective functions over the physical representation of the output (e.g. the configuration of atoms in certain RAM memory cells). That would just be a special case of objective functions over world states. Rather, we should probably be thinking about objective functions over the output as it is formally defined by the code of the agent (when interpreting “the code of the agent” as a mathematical object, like a Turing machine, and using a formal mapping from code to output).
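Here is a rough sketch of that distinction as I picture it (my own illustrative framing, with made-up names, not a formalism from the post):

```python
from typing import Callable

def formal_output(agent_code: Callable[[str], str], observation: str) -> str:
    # "The output as defined by the code": apply the agent, viewed as a pure
    # mathematical function (a Turing machine), to its input. No reference to
    # atoms, RAM cells, or any other physical encoding of the result.
    return agent_code(observation)

def objective_over_output(
    agent_code: Callable[[str], str],
    observation: str,
    score: Callable[[str], float],
) -> float:
    # The objective only ever sees the abstract output value; a goal about the
    # physical configuration that encodes it would instead be a special case of
    # a goal over world states.
    return score(formal_output(agent_code, observation))
```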
Perhaps the following analogy can convey this idea: think about a human facing Newcomb’s problem. The person has the following instrumental goal: “be a person that does not take both boxes” (because that makes Omega put $1,000,000 in the first box). Now imagine that that was the person’s terminal goal rather than an instrumental goal. That person might be analogous to a program that “wants” to be a program whose (formally defined) output maximizes a given utility function.
[1] I mentioned this to Abram earlier and he agrees with this, though I think it’s worth putting here as well.
[2] Note that I don’t think that these possibilities are actually comprehensive; I just think that they’re some of the most salient ones.