Thoughts on the internal game theory section:
The two examples given (law-school value drift and heroin) seem to distinguish the internal-game-theory hypothesis from the reward-as-inner-goal hypothesis in humans, but they don’t seem to rule out other, equally plausible models, e.g.:
Several discrete shards arise by incremental reinforcement; individually they are not optimizers, but acting in conjunction they are. Negotiation between them mostly doesn’t happen because the individual shards are incapable of thinking strategically. The thought-pattern in which the shards activate in conjunction is itself reinforced later. The resulting bounded optimizer’s goal is an aggregate of thousands of random proxies for reward contributed by the constituent shards. (See the toy sketch after this list.)
e.g. one shard contains a bit of world-model because it controls crawling around your room as a baby; another does search because it’s in charge of finding the right word when you first learn to speak. Or something; I’m not a neuroscientist.
The same, but the shards that make up the optimizer are not discrete; the brain’s architecture offers some smooth gradient path toward forming a bounded optimizer.
Shards are incapable of aggregating into an optimizer on their own, but some coherence property (predictive coding or something) is also reinforced, producing agenty behavior without any negotiation between shards.
...
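To make the first hypothesis concrete, here is a minimal toy sketch. Everything in it is invented for illustration (the "smell" proxy signal, both shard rules): two stimulus-response shards, neither of which searches or plans, whose summed votes nonetheless implement hill-climbing on a proxy signal.

```python
# Toy sketch of hypothesis 1: two dumb stimulus-response shards, neither an
# optimizer on its own, whose combined behavior hill-climbs a proxy signal.
# (The "smell" proxy and both shard rules are invented for illustration.)

def smell(x: int) -> int:
    """A proxy signal over a 1-D world, peaked at x = 7."""
    return -(x - 7) ** 2

def shard_right(x: int) -> int:
    """Fires +1 when the proxy improves one step to the right; no planning."""
    return 1 if smell(x + 1) > smell(x) else 0

def shard_left(x: int) -> int:
    """Fires -1 when the proxy improves one step to the left; no planning."""
    return -1 if smell(x - 1) > smell(x) else 0

def step(x: int) -> int:
    """No negotiation between shards: their votes are simply summed."""
    return x + shard_right(x) + shard_left(x)

x = 0
for _ in range(10):
    x = step(x)
print(x)  # 7 -- the ensemble climbs to the proxy's peak, though no shard optimizes
```

Neither shard represents a goal or performs search, but reinforcing the pattern of their joint activation yields something that behaves like a bounded optimizer for the proxy.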
In each of these cases, the agent will avoid going to law school or using heroin simply because preserving its utility function is instrumentally convergent, regardless of whether the agent is a coalition of shards or some other construction. Also, from the shard-theory posts and docs I’ve read (e.g. “Reward is not the optimization target”), these other hypotheses seem just as plausible as the internal-game-theory hypothesis, so I’m not sure why the internal-game-theory one in particular is considered the shard-theory position.
Also, it doesn’t make sense to say “if shard theory is true” if shard theory is a research program. It would be clearer if shard-theory hypotheses were stated explicitly, and all such statements rephrased as “if hypothesis X is true”. I expect shard theory to yield many confirmed and many disconfirmed predictions.
(That was sloppy phrasing on my part, yeah. “If shard theory is the right basic outlook on RL agents...” would be better.)