Is there any way to do so given our current paradigm of pretraining and fine-tuning foundation models?
It’s not clear, because we don’t know what the solution might look like...
But there are certainly ways to improve the odds. For example, one could pretrain on heavily curated data (no atrocities, no betrayals, and so on). One could also use a curriculum, much as we do when teaching children, starting with “age-appropriate” texts first (a toy sketch of this combination appears at the end of this answer).
Then if we succeed in interpretability, we might be able to monitor and adjust what’s going on.
Here the remark about “alignment being fundamental” might come into play: we might figure out ways to replace Transformers with an architecture that is much easier to interpret.
All of these are likely to be positive steps, although without actually knowing what a solution looks like, it’s difficult to be sure...
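To make the curation-plus-curriculum idea a bit more concrete, here is a minimal Python sketch. It is purely illustrative and not based on anything anyone is known to actually do: the keyword blocklist, the readability heuristic, and the tiny example corpus are all hypothetical stand-ins for what would in reality be trained classifiers, human review, and web-scale data pipelines.

```python
# Toy sketch: filter out documents matching a hypothetical blocklist, then
# order the remainder easy-to-hard as a crude "age-appropriate first" curriculum.
import re

# Hypothetical keyword blocklist; a real curation pipeline would rely on
# trained classifiers and human review rather than keyword matching.
HARM_KEYWORDS = {"atrocity", "betrayal", "massacre"}


def is_curated(doc: str) -> bool:
    """Keep a document only if it mentions none of the blocked keywords."""
    words = set(re.findall(r"[a-z]+", doc.lower()))
    return words.isdisjoint(HARM_KEYWORDS)


def difficulty(doc: str) -> float:
    """Crude readability proxy: longer sentences and longer words count as harder."""
    sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    words = re.findall(r"[A-Za-z]+", doc)
    if not sentences or not words:
        return 0.0
    return len(words) / len(sentences) + sum(len(w) for w in words) / len(words)


def build_curriculum(corpus: list[str]) -> list[str]:
    """Drop blocked documents, then sort the rest from 'easiest' to 'hardest'."""
    return sorted((doc for doc in corpus if is_curated(doc)), key=difficulty)


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat. The dog ran after it.",
        "A long account of the betrayal that preceded the massacre.",
        "Thermodynamic equilibrium constrains the attainable efficiency of heat engines.",
    ]
    for doc in build_curriculum(corpus):
        print(doc)
```

Run on the tiny example corpus, this drops the second document and prints the simple sentences before the technical one; the real questions (what to exclude, how to measure “age-appropriateness” at scale) are exactly the hard, open parts.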
Pretraining on curated data seems like a simple idea. Are there any papers exploring this?
I’ve reviewed someone’s draft which suggests this for AI safety (I hope it will be made public soon).
But I’ve heard rumors that people are trying this in practice… And from what Janus says in the comments/answers to my question https://www.lesswrong.com/posts/tbJdxJMAiehewGpq2/impressions-from-base-gpt-4, I get a rather strong suspicion that GPT-4’s pretraining involved some data curation.
From Janus’ two comments there, I get the impression of a non-RLHF’d system which nevertheless tends to be much stronger than usual in its convictions (or rather, the virtual characters it creates tend to be stronger than usual in their convictions about the nature of their current reality). There might be multiple reasons for that, but some degree of data curation might be one of them.