The term “RL agent” means an agent with an architecture from a certain class, amenable to a specific kind of training. Since you are discussing RL agents in this post, I think it could be misleading to use human examples and analogies (“travelling across the world to do cocaine”) in it, because humans are not RL agents, neither on the level of wetware biological architecture (i.e., neurons and synapses don’t represent a policy) nor on the abstract, cognitive level. On the cognitive level, even RL-by-construction agents of sufficient intelligence, trained in sufficiently complex and rich environments, will probably exhibit the dynamic of Active Inference agents, as I note below.
It’s not completely clear to me what you mean by “selection for agents” and “selection for reward”: do you mean RL training itself, or evolutionary tweaking of the agent’s architecture and hyperparameters, which is itself guided by the agent’s score (i.e., the reward) within a larger process of “finding the agent that does the task best”? The latter process can and probably will select for “reward optimizers”.
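To make the distinction concrete, here is a toy sketch of the two processes I have in mind (everything here, including the toy “environment” and all names, is my own illustration, not anything from your post): the inner loop is RL training, where reward only supplies an update signal to one agent’s parameters; the outer loop is evolutionary selection over whole agents, guided by their score.

```python
import random

def episode_return(weight):
    """Toy 'environment': expected return is highest when the agent's
    single parameter is close to 1.0; returns are noisy."""
    return -abs(weight - 1.0) + random.gauss(0, 0.1)

def rl_training(weight, steps=200, lr=0.05):
    """Inner loop (RL training): reward only supplies local updates
    to one agent's parameters; no agent is discarded here."""
    for _ in range(steps):
        # noisy finite-difference estimate of the reward gradient
        grad = (episode_return(weight + 0.1) - episode_return(weight - 0.1)) / 0.2
        weight += lr * grad
    return weight

def evolutionary_selection(generations=20, population_size=16):
    """Outer loop ('selection for agents'): whole agents are kept or
    discarded purely on the basis of their score, i.e. the reward."""
    agents = [random.uniform(-2.0, 2.0) for _ in range(population_size)]
    for _ in range(generations):
        agents = [rl_training(w) for w in agents]       # each agent is trained...
        agents.sort(key=episode_return, reverse=True)   # ...then reward selects among agents
        survivors = agents[: population_size // 2]
        mutants = [w + random.gauss(0, 0.2) for w in survivors]
        agents = survivors + mutants
    return agents[0]

print(evolutionary_selection())
```

Note that in this sketch the inner loop never compares agents by reward; only the outer loop does.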
Another reason to not expect the selection argument to work is that it’s instrumentally convergent for most inner agent values to not become wireheaders, for them to not try hitting the reward button.
I think that before the agent can hit the particular attractor of reward-optimization, it will hit an attractor in which it optimizes for some aspect of a historical correlate of reward.
We train agents which intelligently optimize for e.g. putting trash away, and this reinforces trash-putting-away computations, which activate in a broad range of situations so as to steer agents into a future where trash has been put away. An intelligent agent will model the true fact that, if the agent reinforces itself into caring about antecedent-computation-reinforcement, then it will no longer navigate to futures where trash is put away. Therefore, it decides to not hit the reward button.
This reasoning follows for most inner goals by instrumental convergence.
On my current best model, this is why people usually don’t wirehead. They learn their own values via deep RL, like caring about dogs, and these actual values are opposed to the person they would become if they wirehead.
Don’t some people terminally care about reward?
I think so! I think that generally intelligent RL agents will have secondary, relatively weaker values around reward, but that reward will not be a primary motivator. Under my current (weakly held) model, an AI will only start thinking about reward after it has reinforced other kinds of computations (e.g. putting away trash). More on this in later essays.
I think that Active Inference is a simpler representation of the same ideas which doesn’t use the concepts of attractors, reward, reinforcement, antecedent computation, utility, and so on. Instead of explicitly representing utilities, Active Inference agents only have (stronger or weaker) beliefs about the world, including beliefs about themselves (“the kind of agent/creature/person I am”), and fulfil these beliefs through actions (self-evidencing). In humans, “rewarding” neurotransmitters regulate learning and belief updates.
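To gesture at what this replacement looks like formally (my own gloss on the standard discrete-time formulation in Friston and colleagues’ work, not anything from your post): what a reward-centric frame calls “utility” enters only as a prior belief $\tilde P(o_\tau)$ over which observations the agent expects to encounter, and policies $\pi$ are scored by expected free energy, roughly

$$
G(\pi) \approx \sum_{\tau} \Big( \underbrace{-\,\mathbb{E}_{Q(o_\tau \mid \pi)}\big[\ln \tilde P(o_\tau)\big]}_{\text{pragmatic (preference) term}} \;-\; \underbrace{\mathbb{E}_{Q(o_\tau \mid \pi)}\big[ D_{\mathrm{KL}}\big(Q(s_\tau \mid o_\tau, \pi) \,\|\, Q(s_\tau \mid \pi)\big)\big]}_{\text{epistemic (information-gain) term}} \Big),
$$

with policies selected by something like $P(\pi) \propto \exp(-\gamma\, G(\pi))$. Acting so as to bring about the observations one already “expects” is the self-evidencing mentioned above.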
The question which is really interesting to me is how inevitable it is that the Active Inference dynamic emerges as a result of training RL agents to certain levels of capability/intelligence.
Reward probably won’t be a deep RL agent’s primary optimization target
The longer I look at this statement (and at its shorter version, “Reward is not the optimization target”), the less I understand what it’s supposed to mean, considering that “optimisation” might refer to the agent’s training process as well as to the “test” process (even if they overlap or coincide). It seems to me that your idea can be stated more concretely as: “the more intelligent/capable RL agents (whether model-based or model-free) become in the process of training with the currently conventional training algorithms, the less susceptible they will be to wireheading, rather than actively seeking it”?
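To make the two readings of “optimization target” concrete, here is a minimal model-free sketch (a toy bandit of my own construction; none of the names or numbers come from the post). In the training loop, reward is straightforwardly what the algorithm climbs: it enters as a scalar weight on the log-probability gradient (credit assignment). In the deployed policy, nothing computes or maximizes reward at all.

```python
import math, random

# Toy two-armed bandit: arm 1 pays more than arm 0.
ARM_REWARD = {0: 0.2, 1: 1.0}

def sample_action(theta):
    """Stochastic policy with one parameter: P(action = 1) = sigmoid(theta)."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    action = 1 if random.random() < p1 else 0
    return action, p1

def train(theta=0.0, lr=0.1, steps=2000):
    """Training-time sense: reward is the target of the *training process*.
    It shows up only as a multiplier on the log-prob gradient (REINFORCE)."""
    for _ in range(steps):
        action, p1 = sample_action(theta)
        reward = ARM_REWARD[action]
        grad_log_prob = (1.0 - p1) if action == 1 else -p1   # d/dtheta of log pi(a|theta)
        theta += lr * reward * grad_log_prob                 # reward-weighted local update
    return theta

def deploy(theta, steps=10):
    """'Test'-time sense: the trained policy just executes its learned
    computation; it never evaluates or maximizes reward."""
    return [sample_action(theta)[0] for _ in range(steps)]

trained_theta = train()
print(deploy(trained_theta))   # mostly picks arm 1 after training
```

On this reading, reward is unambiguously the optimization target of the training algorithm; whether a sufficiently capable trained policy ends up representing and pursuing reward itself is then exactly the wireheading question.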
reward provides local updates to the agent’s cognition via credit assignment; reward is not best understood as specifying our preferences
The first part of this statement is about RL agents; the second is about humans. I think the second part doesn’t make a lot of sense: humans should not be analysed as RL agents in the first place, because they are not RL agents, as I argued above.
Stop worrying about finding “outer objectives” which are safe to maximize.[9] I think that you’re not going to get an outer-objective-maximizer (i.e. an agent which maximizes the explicitly specified reward function).
Instead, focus on building good cognition within the agent.
In my ontology, there’s only an inner alignment problem: How do we grow good cognition inside of the trained agent?
Unfortunately, it’s far from obvious to me that Active Inference agents (which sufficiently intelligent RL agents will apparently become by default) are corrigible even in principle. As I noted in the post, such an agent can discover the Free Energy Principle (or read about it in the literature), form a belief that it is an Active Inference agent, and then disregard anything humans try to impose on it, because that would contradict its belief that it is an Active Inference agent.