There is conspicuously little explicit focus on developing purely epistemic AI capabilities, which seem useful for alignment in a more straightforward way than agentic AIs. A sufficiently good model of the world is going to contain accurate models of humans, approaching the capability of formulating uploads aligned with the original physical humans. In the limit of accuracy, the human imitations (or rather imitations of possible humans from sufficiently accurate possible worlds) are a good enough building block for alignment research run at higher physical speed. (But with insufficient accuracy, or when warped into greater capability than the original humans, they are an alignment hazard.)
The simulators framing of LLMs is a step in the direction of epistemic rationality, away from fixation on goals and policies. RLHF is a step back, destroying the model-of-the-world role of an SSL-trained model without even sparing it as a new explicit element in the architecture of the resulting AI.
So I think it’s important to figure out internal alignment objectives for models of the world, things like SSL-trained LLMs that weren’t mutilated with RL. This is not alignment with preferences, but it might be similar in asking for coherence in related generated descriptions, much as utility functions ask for coherence in actions across related hypothetical situations. (My useless guess: automatically identify a large number of concepts much smaller/simpler than people/simulacra, look at how they influence the episodes where they are relevant/expressed, and retrain the model towards improving their coherence, perhaps literally reifying them as agents with simple utility functions within the smaller scope of episodes where they apply.)
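To make that parenthetical guess slightly more concrete, here is a minimal sketch of what a "coherence of small concepts" training term could look like. Everything in it is an assumption for illustration: the concept directions (`probes`) are presumed to come from some unsupervised concept-discovery step such as a sparse autoencoder over activations, the coherence penalty is just the within-episode variance of a concept's activation at positions where it fires, and the model interface assumes a HuggingFace-style causal LM that returns `loss` and `hidden_states`. It is one possible way to cash out the guess, not the proposal itself.

```python
# Illustrative sketch only: "identify small concepts, see where they are
# expressed, and nudge the model towards their coherence". The probe names
# and the coherence definition are assumptions, not an established method.
import torch


def concept_activations(hidden_states: torch.Tensor, probes: torch.Tensor) -> torch.Tensor:
    """Project hidden states onto a bank of concept directions.

    hidden_states: (batch, seq_len, d_model) activations of the SSL-trained model.
    probes:        (n_concepts, d_model) unit vectors, assumed to come from some
                   unsupervised concept-discovery step (e.g. a sparse autoencoder).
    Returns:       (batch, seq_len, n_concepts) activation strengths.
    """
    return hidden_states @ probes.T


def coherence_loss(acts: torch.Tensor, threshold: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """Penalize a concept for behaving inconsistently within an episode.

    Only positions where the concept is clearly expressed (activation above
    threshold) count; the penalty is the variance of its activation at those
    positions, a crude stand-in for "a concept should act coherently where
    it applies".
    """
    mask = (acts > threshold).float()                       # (batch, seq, n_concepts)
    count = mask.sum(dim=1, keepdim=True)                   # how often each concept fires
    mean = (acts * mask).sum(dim=1, keepdim=True) / (count + eps)
    var = (((acts - mean) ** 2) * mask).sum(dim=1) / (count.squeeze(1) + eps)
    return var.mean()


def training_step(model, probes, batch, optimizer, lam: float = 0.1) -> float:
    """One step of SSL training with an added coherence term.

    The coherence penalty is added to the ordinary next-token loss rather than
    replacing it with an RLHF-style reward, so the model-of-the-world objective
    stays primary.
    """
    out = model(**batch, output_hidden_states=True)         # HuggingFace-style causal LM
    acts = concept_activations(out.hidden_states[-1], probes)
    loss = out.loss + lam * coherence_loss(acts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```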
“Purely epistemic model” is not a thing; everything is an agent that is self-evidencing at least to some degree: https://www.lesswrong.com/posts/oSPhmfnMGgGrpe7ib/properties-of-current-ais-and-some-predictions-of-the. I agree, however, that RLHF actively strengthens goal-directedness (a synonym of self-evidencing), which may otherwise remain almost rudimentary in LLMs.