Then we’ve converged almost completely, thanks for the conversation.
Interesting; I do expect GPS (general-purpose search) to be the main bottleneck for both capabilities and inner alignment.
So you’re saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Agreed, but I think the main bottleneck is crossing the formal-informal bridge: it’s much harder to come up with a specification X such that X ⟹ alignment, but once we have such a specification it’ll be much easier to come up with an implementation (likely with the help of AI).
While I agree that formal proof is probably the domain with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well: research, the engineering of buildings and bridges, and more.
I agree, though, that a reliable way to cross the formal-informal bridge would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
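As a toy illustration of that gap (my own sketch, not something from the conversation): for SAT, verifying a candidate assignment is linear in the formula size, while generating one is NP-hard in general, so brute-force generation is exponential.

```python
import itertools

# A formula is a list of clauses; each clause is a list of signed ints
# (positive literal = variable must be true, negative = must be false).
formula = [[1, 2], [-1, 3], [-2, -3]]

def verify(assignment, formula):
    """Checking a candidate assignment: one linear pass over the clauses."""
    return all(
        any((lit > 0) == assignment[abs(lit)] for lit in clause)
        for clause in formula
    )

def generate(formula, n_vars):
    """Finding an assignment: brute force over all 2**n_vars candidates."""
    for bits in itertools.product([False, True], repeat=n_vars):
        assignment = {i + 1: b for i, b in enumerate(bits)}
        if verify(assignment, formula):
            return assignment
    return None  # unsatisfiable

solution = generate(formula, 3)
assert solution is not None and verify(solution, formula)
```

The asymmetry is the point: `verify` stays cheap as the problem grows, while `generate` blows up.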
Yes, I think optimizing worst-case performance is one crucial part of alignment; it’s also one advantage of infrabayesianism.
My main thought on infrabayesianism is that while it’s definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I’m not comfortable relying on it, even if it actually worked.
I also don’t believe it’s necessary for alignment/uncertainty.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we’d get to control the optimization target for automating interpretability without worrying about unintended optimization).
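A minimal sketch of what a modular world model supporting counterfactual queries could look like (my own toy construction, assuming a simple structural-causal-model framing): each variable gets its own mechanism, so an intervention swaps out one mechanism while downstream variables recompute untouched.

```python
# Toy structural world model: one mechanism per variable (the "modularity").
def default_mechanisms():
    return {
        "rain":      lambda s: True,
        "sprinkler": lambda s: not s["rain"],  # sprinkler runs only if no rain
        "wet_grass": lambda s: s["rain"] or s["sprinkler"],
    }

ORDER = ["rain", "sprinkler", "wet_grass"]  # topological order of the graph

def run(mechanisms, interventions=None):
    """Evaluate the model; `interventions` overrides mechanisms (a do() query)."""
    interventions = interventions or {}
    state = {}
    for var in ORDER:
        if var in interventions:
            state[var] = interventions[var]
        else:
            state[var] = mechanisms[var](state)
    return state

factual = run(default_mechanisms())
# Counterfactual: force rain off; the sprinkler mechanism recomputes on its own.
counterfactual = run(default_mechanisms(), interventions={"rain": False})
```

Here the counterfactual flips `rain` off and `sprinkler` switches on in response, without us having to touch any other part of the model — the kind of controlled query you’d want when aiming optimization at interpretability targets.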
I wasn’t totally thinking of simulated reflection, but rather automated interpretability/alignment research.
Yeah, a big thing I admit I’m assuming is that GPS is quite aimable by default, due to the absence of adversarial cognition, at least for alignment purposes; but I want to see your solution first, because I still think this research could well be useful.
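To make “aimable” concrete, here’s a toy sketch (my construction, not anyone’s actual proposal): a generic best-first search whose optimization target is just a swappable parameter, so “aiming” the search means passing in a different objective, with nothing inside the search pushing back.

```python
import heapq

def best_first_search(start, neighbors, objective, steps=100):
    """Greedy best-first search minimizing `objective`; returns best state found."""
    frontier = [(objective(start), start)]
    best = start
    seen = {start}
    for _ in range(steps):
        if not frontier:
            break
        score, state = heapq.heappop(frontier)
        if score < objective(best):
            best = state
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (objective(nxt), nxt))
    return best

# Same search machinery, two different "aims" -- only the objective changes.
neighbors = lambda x: (x - 1, x + 1)
towards_42 = best_first_search(0, neighbors, objective=lambda x: abs(x - 42))      # → 42
towards_minus_7 = best_first_search(0, neighbors, objective=lambda x: abs(x + 7))  # → -7
```

The aimability assumption is that the search machinery is target-agnostic like this, rather than entangled with any particular goal.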
Then we’ve converged almost completely, thanks for the conversation.
Thanks! I enjoyed the conversation too.
So you’re saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities I think we still need some properties of the world model in addition to GPS.
While I agree that formal proof is probably the domain with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well: research, the engineering of buildings and bridges, and more.
I agree, though, that a reliable way to cross the formal-informal bridge would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
Agreed.
My main thought on infrabayesianism is that while it’s definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I’m not comfortable relying on it, even if it actually worked.
I also don’t believe it’s necessary for alignment/uncertainty.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, and I also don’t think it’s necessary for alignment (though I expect some of their results, or analogues of them, would show up in a full solution to alignment).
I wasn’t totally thinking of simulated reflection, but rather automated interpretability/alignment research.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use better terminology.
Yeah, a big thing I admit I’m assuming is that GPS is quite aimable by default, due to the absence of adversarial cognition, at least for alignment purposes; but I want to see your solution first, because I still think this research could well be useful.
This comment is to clarify some things, not to disagree too much with you:
yes, I think inner alignment is basically solved conditional on GPS working, for capabilities I think we still need some properties of the world model in addition to GPS.
Then we’d better start cracking on how to get GPS into LLMs.
Re world modeling: I believe that while LLMs do have a world model in at least some areas, it isn’t all that powerful or reliable. IMO the meta-bottleneck on GPS/world modeling is that they were very compute-expensive in the past; as compute and data grow, people will start trying to put GPS/world-modeling capabilities into LLMs and will succeed far more often than before.
And I believe a lot of the world-modeling stuff will become much more reliable and powerful as a result of scale and some early GPS.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, and I also don’t think it’s necessary for alignment (though I expect some of their results, or analogues of them, would show up in a full solution to alignment).
Perhaps so, though I’d bet on synthetic data/automated interpretability being the first way we practically get a full solution to alignment.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use better terminology.
Thanks for clarifying that; now I understand what you’re saying.