Then we’ve converged almost completely, thanks for the conversation.
Thanks! I enjoyed the conversation too.
So you’re saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities, I think we still need some properties of the world model in addition to GPS.
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, the engineering of buildings and bridges, and more.
I agree, though, that if we had a reliable way to cross the formal-informal bridge, it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
Agreed.
My main thought on infrabayesianism is that while it's definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I'm not that comfortable using infrabayesianism, even if it actually worked.
I also don't believe it's necessary for alignment or for handling uncertainty.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, and I also don't think it's necessary for alignment (though I think some of their results, or analogues of their results, would show up in a full solution to alignment).
I wasn’t totally thinking of simulated reflection, but rather automated interpretability/alignment research.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use better terminology.
Yeah, a big assumption I admit to making is that GPS is quite aimable by default, due to the absence of adversarial cognition, at least for alignment purposes. But I want to see your solution first, because I still think this research could well be useful.
This comment is to clarify some things, not to disagree too much with you:
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities, I think we still need some properties of the world model in addition to GPS.
Then we’d better start cracking on how to get GPS into LLMs.
Re world modeling: I believe that while LLMs do have a world model in at least some areas, it isn't all that powerful or all that reliable. IMO the meta-bottleneck on GPS/world modeling is that they were very compute-expensive in the past, and as compute and data grow, people will start trying to put GPS/world-modeling capabilities into LLMs and will succeed far more often than they did before.
And I believe a lot of the world-modeling capability will become much more reliable and powerful as a result of scale and some early GPS.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, and I also don't think it's necessary for alignment (though I think some of their results, or analogues of their results, would show up in a full solution to alignment).
Perhaps so, though I’d bet on synthetic data/automated interpretability being the first way we practically get a full solution to alignment.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use better terminology.
Thanks for clarifying that; now I understand what you're saying.