Reinforcing based on naively extrapolated trajectories produces double binds. We have a reinforcer R and an agent A. R doesn’t want A to be too X or too not-X. Whenever A does something that’s uncommonly X-ish, R notices that A seems to be shifting towards X-ishness in general. If that shift continued as a trajectory, A would end up way too X-ish. So, to head that off, R negatively reinforces A. Likewise, R punishes anything that’s uncommonly not-X-ish. As an agent, A is trying to figure out which trajectory to be on, so R isn’t mistaken that A is often putting itself on trajectories which naively imply a bad end state. But A is put in an impossible situation: any uncommon act, in either direction, gets extrapolated to a bad endpoint and punished, so there is nothing A can do in the moment that doesn’t draw punishment. Rather than extrapolating each deviation out to its end state, R must model that R and A will continue their feedback cycle in the future, so that later feedback can correct any drift before it goes that far.
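A minimal sketch of the dynamic, with made-up specifics (the one-dimensional trait x, the band width BAND, the extrapolation HORIZON, and both reinforcer functions are illustrative assumptions, not anything from the setup above). The "naive" reinforcer extrapolates each step out to a trajectory endpoint; the "feedback-aware" one only reacts when x actually leaves the acceptable band, trusting that it will get to give feedback again later.

```python
import random

BAND = 1.0      # |x| <= BAND counts as "not too X, not too not-X" (assumed)
HORIZON = 20    # how far the naive reinforcer extrapolates one step (assumed)
STEP = 0.3      # size of A's exploratory moves (assumed)

def naive_reinforcer(x, dx):
    """Punish any step whose naive extrapolation ends up outside the band."""
    extrapolated = x + HORIZON * dx
    return -1.0 if abs(extrapolated) > BAND else +1.0

def feedback_aware_reinforcer(x, dx):
    """Punish only if the step takes x out of the band right now;
    later feedback can correct any further drift."""
    return -1.0 if abs(x + dx) > BAND else +1.0

def run(reinforcer, steps=1000, seed=0):
    rng = random.Random(seed)
    x, rewards = 0.0, []
    for _ in range(steps):
        dx = rng.choice([-STEP, +STEP])  # A tries an uncommonly X-ish or not-X-ish act
        r = reinforcer(x, dx)
        rewards.append(r)
        if r > 0:
            x += dx                      # A keeps moves that weren't punished
    return sum(rewards) / len(rewards), x

if __name__ == "__main__":
    for name, rf in [("naive", naive_reinforcer),
                     ("feedback-aware", feedback_aware_reinforcer)]:
        mean_r, final_x = run(rf)
        print(f"{name:15s} mean reward {mean_r:+.2f}, final x {final_x:+.2f}")
```

Under the naive rule, every exploratory step in either direction gets punished, which is the double bind; under the feedback-aware rule, A is only punished when it would actually leave the band, and it stays inside the band anyway.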