Yes, I am proposing something that is not a standard part of ML training.
Gradient descent will move you around less if you can navigate to parts of the environment that give you low loss. This setup is somewhere between RL and unsupervised learning, in the sense that it has state but you are using an autoregressive loss. It is similar to conditional pre-training, but instead of prepending a reward, you are prepending a summary that the LM generated itself.
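A minimal sketch of what this could look like, using GPT-2 via Hugging Face Transformers purely for concreteness; the `training_step` helper, the prompt format, the 64-token summary budget, and masking the loss on the summary tokens are all illustrative assumptions, not requirements of the method:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(context: str, next_observation: str) -> float:
    """One step of 'prepend a self-generated summary, then take autoregressive loss'."""
    # 1. The LM writes its own summary of the context so far. It is generated
    #    without gradients, so the summarizing behaviour is only shaped
    #    indirectly, via whatever makes the next observations easier to predict.
    prompt = context + "\n\nSummary:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        summary_ids = model.generate(prompt_ids, max_new_tokens=64, do_sample=True)
    summary = tokenizer.decode(
        summary_ids[0, prompt_ids.shape[1]:], skip_special_tokens=True
    )

    # 2. Prepend the summary (instead of a reward token, as in conditional
    #    pre-training) to the next chunk of observations and compute the
    #    ordinary next-token loss, masked so only the observation tokens count.
    prefix_ids = tokenizer(summary + "\n\n", return_tensors="pt").input_ids
    obs_ids = tokenizer(next_observation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, obs_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # ignore loss on the summary prefix

    loss = model(input_ids=input_ids, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```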
The gradient would indeed be flowing indirectly here, and the claim that actions would make the input more predictable is an empirical prediction that A) I could be wrong about, B) is not a crux for this method, and C) is not a crux for this article, unless the reader thinks that there is no way to train an AI in a human-like way and needs an existence proof.
Do LLMs learn to break their sensors?