This comment is to clarify some things, not to disagree too much with you:
yes, I think inner alignment is basically solved conditional on GPS working, for capabilities I think we still need some properties of the world model in addition to GPS.
Then we’d better start cracking on how to get GPS into LLMs.
Re world modeling, I believe that while LLMs do have a world model in at least some areas, it isn’t all that powerful or reliable. IMO the meta-bottleneck on GPS/world modeling is that both were very compute-expensive back in the day; as compute and data rise, people will start trying to put GPS/world-modeling capabilities into LLMs, and will succeed far more often than in the past.
And I believe that a lot of the world modeling stuff will start to become much more reliable and powerful as a result of scale and some early GPS.
yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don’t think it’s necessary for alignment (though I think some of their results or analogies of their results would show up in a full solution to alignment).
Perhaps so, though I’d bet on synthetic data/automated interpretability being the first way we practically get a full solution to alignment.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.
Thanks for clarifying that; now I understand what you’re saying.