Value learning converges to full alignment by construction: a value learning AI basically starts with the propositions:
a) As an AI, I should act fully aligned with human values.
b) I do not fully understand what human values are, or how to act fully aligned with them, so in order to do this I need to learn more about human values and how to act fully aligned with them, by applying approximately Bayesian learning to this problem.
c) Here are some Bayesian priors about what human values are, and how to act fully aligned with them: <insert initialization information here>…
As usual for a Bayesian learning problem, as long as the priors in proposition c) are not completely screwed up as a place to start from, this process will converge. Thus there is a region of convergence to full alignment.
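As a toy illustration of this convergence property, consider a Beta-Bernoulli model (my example, not from the source): the unknown "human value" is reduced to a single probability, and the prior is deliberately miscalibrated but not degenerate. Bayesian updating still pulls the posterior toward the truth:

```python
import random

random.seed(0)

# The unknown quantity to learn: the probability that humans endorse
# a given action. The prior below has mean 0.2 -- a poor starting
# guess, but it assigns nonzero mass near the truth, so learning
# can still converge (the "region of convergence").
true_p = 0.8
alpha, beta = 1.0, 4.0  # Beta prior parameters

for _ in range(5000):
    observation = 1 if random.random() < true_p else 0  # human feedback
    # Conjugate Bayesian update for a Bernoulli likelihood.
    alpha += observation
    beta += 1 - observation

posterior_mean = alpha / (alpha + beta)
print(posterior_mean)  # close to true_p despite the bad prior
```

Real value learning is vastly higher-dimensional, but the mechanism is the same: a non-degenerate prior plus enough evidence yields convergence.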
LLMs have a very large amount of detailed information about what human values are and how to act aligned with them. Thus they provide a very detailed set of Bayesian priors for proposition c).
Also, training an LLM is a fairly good approximation of Bayesian learning. Thus (with suitable additions to enable online learning) LLMs provide one possible implementation of the Bayesian learning process required by proposition b). For example, one could fine-tune the LLM to incorporate new information, and/or periodically retrain it on the original training set plus new information the AI has gathered during the value learning process.
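The fine-tune-plus-periodic-retrain loop described above can be sketched as follows. This is a minimal illustrative skeleton, not an LLM implementation: a running scalar estimate stands in for the model, and the method names `fine_tune` and `retrain` are hypothetical labels for the two update modes, not a real API.

```python
class ValueModel:
    """Stand-in for an LLM-based value model (a single scalar estimate)."""

    def __init__(self):
        self.estimate = 0.0

    def fine_tune(self, observation, lr=0.1):
        # Small incremental update on one new observation,
        # analogous to an online fine-tuning step.
        self.estimate += lr * (observation - self.estimate)

    def retrain(self, dataset):
        # Full refit on the training set plus all information
        # gathered so far, analogous to periodic retraining.
        self.estimate = sum(dataset) / len(dataset)


dataset = []   # accumulated observations from the value learning process
model = ValueModel()

observations = [0.9, 0.8, 0.85, 0.9, 0.8, 0.95] * 10
for step, obs in enumerate(observations, start=1):
    dataset.append(obs)
    model.fine_tune(obs)      # cheap online update every step
    if step % 20 == 0:
        model.retrain(dataset)  # expensive full retrain, done periodically
```

The design point is the trade-off: fine-tuning keeps the model current cheaply between retrains, while periodic retraining corrects any drift the incremental updates accumulate.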