I don’t think the self-alignment problem depends on the notion of ‘human values’. I also don’t think “do what I said” solves it. “Do what I said” is roughly “aligning with the output of the aggregation procedure”, and:

- for most non-trivial requests, understanding what I said depends on a fairly complex model of what my words mean;
- often there will be tension between your words: strictly interpreted, “do not do damage” can mean “do nothing”, since basically any action carries some risk of some damage; when you tell an LLM to be both “harmless” and “helpful”, these requests point in different directions (a toy sketch follows this list);
- strong learners will learn what led you to say the words anyway.
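To make that tension concrete, here is a minimal sketch (the actions and numbers are my own toy assumptions, not anything from an actual training setup): a strict no-damage constraint selects inaction, while acting at all requires trading the two requests off against each other.

```python
# Toy illustration: "harmless" as a hard constraint vs. as a weighted trade-off.
actions = {
    "do_nothing":      {"helpfulness": 0.0, "harm_prob": 0.00},
    "answer_question": {"helpfulness": 0.8, "harm_prob": 0.01},
    "run_errand":      {"helpfulness": 1.0, "harm_prob": 0.05},
}

def strict_harmless(acts):
    """Strict reading: forbid any nonzero risk of damage, then be as helpful as possible."""
    safe = {k: v for k, v in acts.items() if v["harm_prob"] == 0.0}
    return max(safe, key=lambda k: safe[k]["helpfulness"])

def traded_off(acts, harm_weight=10.0):
    """Softer reading: helpfulness minus a penalty proportional to expected harm."""
    return max(acts, key=lambda k: acts[k]["helpfulness"] - harm_weight * acts[k]["harm_prob"])

print(strict_harmless(actions))  # -> "do_nothing": strict harmlessness forces inaction
print(traded_off(actions))       # -> "answer_question": the two requests get balanced
```

Under the strict reading, “do no damage” collapses into “do nothing”; any softer reading has to pick a weight, which is exactly the point where the two requests pull apart.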
I see the connection between self-alignment and human values as follows: the idea of human values assumes that a human has a stable set of preferences. Stability is an important part of the idea of human values. But the human motivational system is notoriously unstable: I want a drink, I have a drink, and now I don’t want a drink. The idea of “desires” may be a better fit than “human values”, as it is normal for desires to evolve and to contradict each other.
But the human motivational system is more complex than that: I have rules and I have desires, which often contradict each other and are in dynamic balance. For example, I have a rule not to drink alcohol and a desire for a drink. (A toy sketch of this structure follows below.)
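A minimal sketch of this structure, purely illustrative (the classes and the drinking example are mine, mirroring the text above, not a model anyone has proposed): desires satiate once satisfied, while a rule stays fixed and can veto them, so the two sit in dynamic tension.

```python
# Toy model: satiable desires vs. fixed rules that can veto them.
class Desire:
    def __init__(self, name, urge):
        self.name, self.urge = name, urge

    def satisfy(self):
        self.urge = 0.0  # satiation: "I have a drink and now I don't want a drink"

class Rule:
    def __init__(self, forbidden):
        self.forbidden = forbidden

    def permits(self, action):
        return action not in self.forbidden

def act(desire, rules):
    """Act on a desire only if no rule forbids it."""
    if all(r.permits(desire.name) for r in rules):
        desire.satisfy()
        return f"did: {desire.name}"
    return f"vetoed by rule: {desire.name}"

rules = [Rule(forbidden={"drink alcohol"})]

water = Desire("drink water", urge=0.7)
print(act(water, rules))    # -> did: drink water
print(water.urge)           # -> 0.0: the desire satiates, unlike a stable "value"

alcohol = Desire("drink alcohol", urge=0.9)
print(act(alcohol, rules))  # -> vetoed by rule: drink alcohol
print(alcohol.urge)         # -> 0.9: the urge persists, so rule and desire stay in tension
```

Neither the urge nor the rule is a stable preference ordering on its own; what the person “wants” is the shifting balance between them, which is why “values” is an awkward fit.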
Speaking about your bullet points: everything depends on the situation, and there are two main types of situations: (a) researchers start the first-ever AI for the first time; (b) a consumer uses a home robot for a task. In the second case, the robot has likely been trained on a very large dataset and knows what the good and bad outcomes are for almost all possible situations.