Hmm, I would say that DQN “chooses actions based on their anticipated consequences” in that the Q-function incorporates an estimate of anticipated consequences. (Especially with a low discount rate, i.e. a discount factor close to 1.)
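For concreteness, the Q-value is by definition an estimate of the discounted sum of rewards that follow from taking action $a$ in state $s$ and then following the policy:

$$ Q^\pi(s,a) \;=\; \mathbb{E}_\pi\!\left[\, \sum_{k=0}^{\infty} \gamma^k \, r_{t+k} \;\Big|\; s_t = s,\; a_t = a \right] $$

so picking the argmax action is, in that sense, choosing based on anticipated consequences, and the closer $\gamma$ is to 1 the more heavily the long-run consequences are weighted in that estimate.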
I’m happy to say that model-based RL might be generically better at anticipating consequences (especially in novel circumstances) than model-free RL. Neither is perfect though.
DQN has an implicit plan encoded in the Q-function—i.e., in state S1 action A1 seems good, and that brings us to state S2 where action A2 seems good, etc. … all that stuff together is (IMO) an implicit plan, and such a plan can involve short-term sacrifices for longer-term benefit.
Whereas model-based RL with tree search (for example) has an explicit plan: at timestep T, it has an explicit representation of what it’s planning to do at timesteps T+1, T+2, ….
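To make the implicit-vs-explicit contrast concrete, here is a minimal Python sketch. It assumes a hypothetical `q_net(s)` returning per-action values, hypothetical `env(s, a)` and `model(s, a)` callables each returning `(next_state, reward)`, and a small discrete action set; all of these names are illustrative, not any particular library's API.

```python
import itertools

GAMMA = 0.99
ACTIONS = [0, 1, 2]  # illustrative discrete action set

def implicit_plan(q_net, env, s, horizon):
    """DQN-style acting: the 'plan' exists only implicitly, one greedy
    argmax at a time, as states are actually reached."""
    actions = []
    for _ in range(horizon):
        a = max(ACTIONS, key=lambda act: q_net(s)[act])  # greedy w.r.t. Q
        s, _ = env(s, a)  # the next choice depends on where we land
        actions.append(a)
    return actions  # only knowable by running the policy forward

def explicit_plan(model, s, horizon):
    """Model-based lookahead: at timestep T we already hold an explicit
    action sequence for T+1, T+2, ... (exhaustive depth-limited search)."""
    best_return, best_seq = float("-inf"), None
    for seq in itertools.product(ACTIONS, repeat=horizon):
        state, ret, discount = s, 0.0, 1.0
        for a in seq:
            state, r = model(state, a)  # simulate, don't act
            ret += discount * r
            discount *= GAMMA
        if ret > best_return:
            best_return, best_seq = ret, seq
    return list(best_seq)  # the explicit plan, held before acting
```

In the first function the action at T+2 isn't represented anywhere until the agent actually reaches that state; in the second it is written down at timestep T, which is the distinction being drawn here.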
Humans are able to make explicit plans too, although those plans don’t proceed one timestep at a time.
Sure, you can consider the TD-style unrolling in model-free RL a sort of implicit planning, but it’s not really consequentialist in most situations, since it can’t dynamically explore new relevant expansions of the state tree the way planning can. Or you could consider planning as a dynamic few-shot extension to fast learning / updating of the decision function.
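To illustrate what that TD-style propagation looks like, here is a one-step tabular Q-learning backup, a from-scratch sketch (not anyone's codebase) with state and action types left abstract:

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.99  # illustrative learning rate and discount
Q = defaultdict(float)    # Q-values keyed by (state, action)

def td_update(s, a, r, s_next, actions):
    """One-step Q-learning backup: information about consequences
    propagates backwards only along transitions the agent has actually
    experienced, not along branches a planner expands on demand."""
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```

A planner, by contrast, can spend compute at decision time expanding whichever branches of the state tree currently look relevant, which is the sense in which it acts like a dynamic few-shot extension of the learned decision function.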
Human planning is sometimes explicit and timestep-by-timestep (when playing certain board games, for example), when that is what efficient planning demands; but in the more general case human planning uses more complex approximations that jump more freely across spatio-temporal hierarchies.