Attempting to distill the intuitions behind my comment into more nuanced questions:
1) How confident are we that value learning has a basin of attraction to full alignment? Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results. That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in the hard-to-define metrics will lead to increasing divergence from Truth with optimization.
2) The connection of value learning to LLMs, if intended, is not obvious to me. Is your proposal essentially to guide simulacra to become value learners (and to design the training data to make this process more reliable)?
Those 2 questions seem to be advancing the discussion, so I’d be really interested in Roger’s response to them.
Value learning converges to full alignment by construction, since a value learning AI basically starts from the propositions:
a) as an AI, I should act fully aligned to human values
b) I do not fully understand what human values are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
c) Here are some Bayesian priors about what human values are, and how to act fully aligned to them: <insert initialization information here>…
As usual for a Bayesian learning problem, as long as the Bayesian priors in c) are not a completely screwed-up starting point, this will converge. Thus there is a region of convergence to full alignment.
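To make the convergence claim concrete, here is a minimal toy sketch (my own illustration, not from the comment): a Bayesian learner over a handful of hypothetical candidate value-models, where the true model starts with only a small, but nonzero, share of the prior mass. The hypothesis names and likelihood numbers are invented purely for illustration.

```python
# Toy sketch: Bayesian updating concentrates on the true value-model as long as
# the prior gives it nonzero mass (the "not completely screwed up" condition).
import random

random.seed(0)

# Hypothetical candidate value-models; "h_true" stands in for what humans actually want.
hypotheses = ["h_true", "h_wrong_1", "h_wrong_2"]

# Assumed likelihood of observing "approved behavior" under each hypothesis.
likelihood = {"h_true": 0.9, "h_wrong_1": 0.5, "h_wrong_2": 0.2}

# A non-degenerate prior: the true hypothesis starts with only 10% of the mass.
posterior = {"h_true": 0.1, "h_wrong_1": 0.6, "h_wrong_2": 0.3}

for step in range(200):
    # Observe feedback generated by the true value-model.
    observed_approval = random.random() < likelihood["h_true"]
    # Bayes update: P(h | obs) is proportional to P(obs | h) * P(h).
    for h in hypotheses:
        p_obs = likelihood[h] if observed_approval else 1 - likelihood[h]
        posterior[h] *= p_obs
    total = sum(posterior.values())
    posterior = {h: p / total for h, p in posterior.items()}

print(posterior)  # nearly all the mass ends up on "h_true"
```

The same argument fails if the prior assigns zero (or wildly miscalibrated) probability to anything near the truth, which is exactly why the quality of the priors in c) matters.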
LLMs have a very large amount of detailed information about what human values are and how to act aligned to them. Thus they provide a very detailed set of Bayesian priors for c).
Also, training an LLM is a fairly good approximation to Bayesian learning. Thus (with suitable additions to enable online learning) LLMs provide one possible implementation of the Bayesian learning process required by b). For example, one could apply fine-tuning to the LLM to incorporate new information, and/or periodically retrain the LLM on the original training set plus the new information the AI has gathered during the value learning process.
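A hedged sketch of what that online-learning addition might look like, assuming a standard causal-LM fine-tuning loop: the base model name, the incorporate_new_observations helper, the example text, and the hyperparameters are all placeholders of mine, not the author's actual setup.

```python
# Sketch: periodically fine-tune an LLM on newly gathered observations about
# human values, treating fine-tuning as an approximate Bayesian update.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whatever base LLM holds the value priors
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

def incorporate_new_observations(texts, epochs=1):
    """Fine-tune on new evidence about human values gathered during deployment."""
    model.train()
    for _ in range(epochs):
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True)
            # Standard causal-LM objective: labels are the input ids themselves.
            outputs = model(**inputs, labels=inputs["input_ids"])
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Example: fold freshly gathered feedback into the model's "posterior".
new_evidence = ["The user preferred the cautious plan over the fast one because ..."]
incorporate_new_observations(new_evidence)
```

Periodic full retraining on the original corpus plus the accumulated observations would play the same role, just with a larger and less frequent update step.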