Lukas Finnveden comments on The Dualist Predict-O-Matic ($100 prize)

Lukas Finnveden 22 Oct 2019 9:26 UTC
LW: 1 AF: 1
AF
Yes, that sounds more like reinforcement learning. It is not the design I’m trying to point at in this post.
Ok, cool, that explains it. I guess the main difference between RL and online supervised learning is whether the model takes actions that can affect its environment or only makes predictions of fixed data; so it seems plausible that someone training the Predict-O-Matic like that would think they’re doing supervised learning, while they’re actually closer to RL.
That description sounds a lot like SGD. I think you’ll need to be crisper for me to see what you’re getting at.
No need, since we already found the point of disagreement. (But if you’re curious, the difference is that SGD makes a change in the direction of the gradient, and this one wouldn’t.)
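
A toy sketch of the distinction drawn above between predicting fixed data and making predictions that can affect the environment. Everything in it is a hypothetical illustration: the function names, the constant 0.7, and the threshold rule in `self_fulfilling_world` are assumptions made for the example, not anything specified in the post. The same online update rule is run in both worlds; only the way the outcome is generated differs.

```python
def train_online(outcome_fn, p0, lr=0.1, steps=500):
    """Online squared-error training of a single scalar prediction p.

    Each step: announce p, observe y = outcome_fn(p), then take a gradient
    step on (p - y)**2, treating y as a constant the way a supervised
    learner would.
    """
    p = p0
    for _ in range(steps):
        y = outcome_fn(p)        # the world responds (or not) to the prediction
        grad = 2.0 * (p - y)     # gradient of (p - y)**2 w.r.t. p, y held fixed
        p -= lr * grad
    return p


def fixed_world(p):
    # Assumption: the outcome ignores the prediction entirely.
    return 0.7


def self_fulfilling_world(p):
    # Assumption: the event happens exactly when it is predicted as likely.
    return 1.0 if p > 0.5 else 0.0


if __name__ == "__main__":
    for start in (0.2, 0.9):
        print(start, train_online(fixed_world, start),
              train_online(self_fulfilling_world, start))
```

In the fixed world both starting points end up at the same prediction. In the self-fulfilling world the learner settles on whichever prediction its starting point leans toward, and either one ends with zero loss, so there is more than one "correct" answer.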
- John_Maxwell 22 Oct 2019 23:33 UTC
  LW: 2 AF: 1
  AF Parent
  
  it seems plausible that someone training the Predict-O-Matic like that would think they’re doing supervised learning, while they’re actually closer to RL.
  
  How’s that?
  - Lukas Finnveden 23 Oct 2019 9:45 UTC
    LW: 1 AF: 1
    AF Parent
    Assuming that people don’t think about the fact that Predict-O-Matic’s predictions can affect reality (which seems like it might have been true early on in the story, although it’s admittedly unlikely to be true for too long in the real world), they might decide to train it by letting it make predictions about the future (defining and backpropagating the loss once the future comes about). They might think that this is just like training on predefined data, but now the Predict-O-Matic can change the data that it’s evaluated against, so there might be any number of ‘correct’ answers (rather than exactly 1). Although it’s a blurry line, I’d say this makes its output more action-like and less prediction-like, so you could say that it makes the training process a bit more RL-like.
    - John_Maxwell 23 Oct 2019 23:05 UTC
      LW: 2 AF: 1
      AF Parent
      I think it depends on internal details of the Predict-O-Matic’s prediction process. If it’s still using SGD, SGD is not going to play the future forward to see the new feedback mechanism you’ve described and incorporate it into the loss function which is being minimized. However, it’s conceivable that given a dataset about its own past predictions and how they turned out, the Predict-O-Matic might learn to make its predictions “more self-fulfilling” in order to minimize loss on that dataset?
      - Lukas Finnveden 24 Oct 2019 10:44 UTC
        LW: 1 AF: 1
        AF Parent
        
        SGD is not going to play the future forward to see the new feedback mechanism you’ve described and incorporate it into the loss function which is being minimized
        
        My ‘new feedback mechanism’ is part of the training procedure. It’s not going to be good at that by ‘playing the future forward’, it’s going to become good at that by being trained on it.
        
        I suspect we’re using SGD in different ways, because everything we’ve talked about seems like it could be implemented with SGD. Do you agree that letting the Predict-O-Matic predict the future and rewarding it for being right, RL-style, would lead to it finding fixed points? Because you can definitely use SGD to do RL (first google result).
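
        A minimal sketch of what "using SGD to do RL" could look like in this setting, using the standard REINFORCE score-function estimator with plain stochastic gradient steps. The two-outcome world, the probabilities in `P_COMES_TRUE`, and all names are invented for illustration; this is not taken from the linked result or from the post.

        ```python
        import math
        import random

        # Hypothetical world: announcing 1 makes outcome 1 happen with prob 0.9,
        # announcing 0 makes outcome 0 happen with prob 0.6. Prediction 1 is
        # therefore the more self-fulfilling of the two.
        P_COMES_TRUE = {1: 0.9, 0: 0.6}


        def sigmoid(x):
            return 1.0 / (1.0 + math.exp(-x))


        def train_reinforce(steps=50_000, lr=0.02, seed=0):
            """Return the learned probability of announcing prediction 1."""
            rng = random.Random(seed)
            theta = 0.0  # logit of P(announce 1); starts at 50/50
            for _ in range(steps):
                p = sigmoid(theta)
                prediction = 1 if rng.random() < p else 0
                came_true = rng.random() < P_COMES_TRUE[prediction]
                reward = 1.0 if came_true else 0.0  # rewarded for being right
                # REINFORCE: d/dtheta log pi(prediction) = prediction - p,
                # so this is a stochastic gradient ascent step on expected reward.
                theta += lr * reward * (prediction - p)
            return sigmoid(theta)


        if __name__ == "__main__":
            print(train_reinforce())
        ```

        Under these assumptions the expected update is proportional to p(1 - p)(0.9 - 0.6), which is positive, so the policy drifts toward the prediction that most reliably makes itself come true.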
        
        - John_Maxwell 26 Oct 2019 15:57 UTC
          LW: 2 AF: 1
          AF Parent
          
          I suspect we’re using SGD in different ways, because everything we’ve talked about seems like it could be implemented with SGD. Do you agree that letting the Predict-O-Matic predict the future and rewarding it for being right, RL-style, would lead to it finding fixed points? Because you can definitely use SGD to do RL (first google result).
          
          Fair enough, I was thinking about supervised learning.