A model that is just predicting the next word isn’t optimizing for strategies that look good to a human reviewer; it’s optimizing for truth itself (as contained in its training data). If you begin feeding its outputs back in as training inputs, there could be a feedback loop that creates such incentives, but if the model is general and sufficiently intelligent, you don’t need to do that. You can train it in a different domain and it will generalize to your domain of interest.
Even if you do that, you can try to ground the new data in reality in some way, for example by including experimental results. And the model won’t just absorb the new data as truth; it will incorporate it into its world model to make better predictions. If it’s fed a bunch of new Alignment Forum posts containing bad ideas that look good to humans, it will just predict that the Alignment Forum produces that kind of post, but that doesn’t mean there isn’t some prompt that can make it output what it actually thinks is correct.