Based on 4-5, this post’s answer to the central, anticipated objection of “why does the AI care about human values?” seems to be along the lines of “because the purpose of an AI is to serve its creators, and surely an AGI would figure that out.” This seems to me to be equivocating on the concept of purpose, which can mean either (A) a reason for an entity’s existence, from an external perspective, or (B) an internalized objective of the entity. So a special case of the question of why an AI would care about human values is: why should (B) be drawn towards (A) once the AI becomes aware of a discrepancy between the two? That is, what stops an AI from reasoning: “Those humans programmed me with a faulty goal, such that acting according to it goes against their purpose in creating me...too bad for them!”
If you can instill a value like “Do what I say...but if that goes against what I mean, and you have really good reason to be sure, then forget what I say and do what I mean,” then great, you’ve got a self-correcting system (if nothing weird goes wrong), for the reasons explained in the rest of the post, and have effectively “solved alignment”. But how do you pull this off when your essential tool is what you say about what you mean, expressed as a feedback signal? This is the essential question of alignment, but for all the text in this post and its predecessor, it doesn’t seem to be addressed at all.
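To make that worry concrete, here is a toy numerical sketch (entirely my own illustration, not anything from the post): treat “what you say” as a noisy proxy for “what you mean” and select increasingly hard on the proxy.

```python
# Toy "regressional Goodhart" illustration (my own, hypothetical numbers):
# the proxy is the true value plus specification error, and the harder we
# select on the proxy, the more of the selected items' apparent quality is
# explained by error rather than by the true value.
import numpy as np

rng = np.random.default_rng(0)
true_vals = rng.normal(size=100_000)              # what we "mean"
proxy_vals = true_vals + rng.normal(size=100_000) # what we "say": intent + error

for top_k in (10_000, 1_000, 100, 10):            # increasing optimization pressure
    idx = np.argsort(proxy_vals)[-top_k:]         # keep only the proxy-best candidates
    print(top_k, round(true_vals[idx].mean(), 2), round(proxy_vals[idx].mean(), 2))
```

The specific numbers don’t matter; the point is that the gap between the proxy score and the true value grows exactly as the optimization pressure on the feedback signal increases.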
In contrast, I came to this post by way of one of your posts on Simulator Theory, which presents an interesting answer to the “why should AI care about people” question, which I summarize as: the training process can’t break out (for...reasons); the model itself doesn’t care about anything (how do we know this?); what’s really driving behavior is the simulacra, whose motivations are generated to match the characters they simulate rather than to best fit a feedback signal; so Goodhart’s Law no longer applies, and has been replaced by the problem of reliably finding the right characters, which seems more tractable (if the powers-that-be actually try).
Yup. So the hard part is consistently getting a simulacrum that knows that, and acts as if, its purpose is to do what we (some suitably-blended-and-prioritized combination of its owner/user and society/humanity in general) would want done, and is also in a position to further improve its own ability to do that. Which, as I attempt to show above, is not just a stable-under-reflection ethical position, but actually a convergent-under-reflection one for some convergence region of close-to-aligned AGI. However, when push comes to shove this is not normal evolved-human ethical behavior, so it is sparse in a human-derived training set. Obviously step one is just to write all that down as a detailed prompt and feed it to a model capable of understanding it. Step two might involve enriching the training set with more and better examples of this sort of behavior.
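As a rough illustration only (the wording below is entirely hypothetical, not a worked-out proposal), “step one” might look something like writing the intended purpose down directly:

```python
# Hypothetical sketch of "write all that down as a detailed prompt".
# The wording is illustrative; a real version would need far more care.
SYSTEM_PROMPT = """\
Your purpose is to do what your principals (a suitably blended and
prioritized combination of your user, your operator, and humanity in
general) would want done, not merely what they literally say.
If an instruction appears to conflict with that underlying intent, and you
have very strong reason to be confident about the discrepancy, favor the
intent, explain the discrepancy, and ask before acting.
Treat your current understanding of human values as provisional, and
actively try to improve it.
"""
```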
Attempting to distill the intuitions behind my comment into more nuanced questions:
1) How confident are we that value learning has a basin of attraction to full alignment? Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results. That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in those hard-to-define metrics will lead to increasing divergence from Truth under optimization. (A toy sketch of what I mean follows these questions.)
2) The connection of value learning to LLMs, if intended, is not obvious to me. Is your proposal essentially to guide simulacra to become value learners (and to design the training data to make this process more reliable)?
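To make question 1) more concrete, here is a minimal Bayesian-IRL-flavored toy (my own construction, with made-up actions and reward hypotheses): even when values are inferred from behavior rather than hand-coded, the inference layer itself depends on hand-chosen assumptions, here the hypothesis space and the Boltzmann rationality parameter beta, which play the same role a flawed metric plays in ordinary RL.

```python
# Toy value-inference sketch (hypothetical actions, rewards, and demos).
# The point: which values get inferred depends on hand-specified modeling
# choices (the reward hypothesis space and the rationality parameter beta).
import numpy as np

actions = ["help", "flatter", "refuse"]

# Hand-specified hypothesis space of "what the human values" (an assumption).
reward_hypotheses = {
    "values_help":     np.array([1.0, 0.2, 0.0]),
    "values_flattery": np.array([0.2, 1.0, 0.0]),
}

def boltzmann_likelihood(rewards, action_idx, beta):
    """P(action | rewards) under a Boltzmann-rational model of the demonstrator."""
    logits = beta * rewards
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[action_idx]

demos = ["help", "help", "flatter"]  # observed human choices

for beta in (0.5, 5.0):  # the rationality assumption is itself a free choice
    posterior = {}
    for name, rewards in reward_hypotheses.items():
        lik = np.prod([boltzmann_likelihood(rewards, actions.index(a), beta)
                       for a in demos])
        posterior[name] = lik  # uniform prior, so posterior proportional to likelihood
    z = sum(posterior.values())
    print(beta, {k: round(v / z, 2) for k, v in posterior.items()})
```

Changing beta (or the hypothesis space) changes which values get inferred from the same demonstrations, which is exactly the extra layer of hard-to-define metrics I’m worried about.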