RogerDearnaley comments on Requirements for a Basin of Attraction to Alignment

RogerDearnaley 13 Sep 2024 22:12 UTC
3 points
0
Yup. So the hard part is consistently getting a simulacrum that knows that, and acts as if, its purpose is to do what we (some suitably-blended-and-proritized combination of its owner/user and society/humanity in general) would want done, and is also in a position to further improve its own ability to do that. Which as I attempt to show above is a not just a stable-under-reflection ethical position, but actually a convergent-under-reflection one for some convergence region of close-to-aligned AGI. However, when push-comes-to-shove this is not normal evolved-human ethical behavior so it is sparse in a human-derived training set. Obviously step one is just to write all that down as a detailed prompt and feed it to a model capable of understanding it. Step two might involve enriching the training set with more and better examples of this sort of behavior.
- WillPetillo 14 Sep 2024 7:09 UTC
  3 points
  0
  Parent
  Attempting to distill the intuitions behind my comment into more nuanced questions:
  
  1) How confident are we that value learning has a basin of attraction to full alignment? Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results. That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in the hard-to-define metrics will lead to increasing divergence from Truth with optimization.
  
  2) The connection of value learning to LLMs, if intended, is not obvious to me. Is your proposal essentially to guide simulacra to become value learners (and designing the training data to make this process more reliable)?
  - Satron 25 Nov 2024 22:44 UTC
    1 point
    0
    Parent
    Those 2 questions seem to be advancing the discussion, so I’d be really interested in Roger’s response to them.
    - RogerDearnaley 4 Dec 2024 5:42 UTC
      5 points
      2
      Parent
      Value learning converges to full alignment by construction: since a value learning AI basically starts with the propositions:
      a) as an AI, I should act fully aligned to human values
      b) I do not fully understand what human value are, or how to act fully aligned to them, so in order to be able to do this I need to learn more about human values and how to act fully aligned to them, by applying approximately Bayesian learning to this problem
      c) Here are some Bayesian priors about what human values are, and how to act fully aligned to them: <insert initialization information here>…
      As usual for a Bayesian learning problem, as long as the Bayesian priors 1 c) are not completely screwed up as a place to start from, this will converge. Thus there is a region of convergence to full alignment.
      LLMs have a very large amount of detailed information about what human values are and how to act aligned to them. Thus they provide a very detailed set of Bayesian priors for 1 c).
      Also, training an LLM is a fairly good approximation to Bayesian learning. Thus (with suitable additions to enable online learning) they provide one possible implementation for the Bayesian learning process required by 1 b). For example, one could apply fine-tuning to the LLM to incorporate new information, and/or periodically retrain the LLM based on the training set plus new information the AI has gathered during the value learning process.