Attempting to distill the intuitions behind my comment into more nuanced questions:
1) How confident are we that value learning has a basin of attraction to full alignment? Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results. That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in the hard-to-define metrics will lead to increasing divergence from Truth with optimization.
2) The connection of value learning to LLMs, if intended, is not obvious to me. Is your proposal essentially to guide simulacra to become value learners (and designing the training data to make this process more reliable)?
Attempting to distill the intuitions behind my comment into more nuanced questions:
1) How confident are we that value learning has a basin of attraction to full alignment? Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results. That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in the hard-to-define metrics will lead to increasing divergence from Truth with optimization.
2) The connection of value learning to LLMs, if intended, is not obvious to me. Is your proposal essentially to guide simulacra to become value learners (and designing the training data to make this process more reliable)?