I agree with the core problem statement and most assumptions of the Pursuit of Happiness/Conventions Approach, but suggest a different solution: https://www.lesswrong.com/posts/rZWNxrzuHyKK2pE65/ai-alignment-as-a-translation-problem
I agree with OpenAI folks that generalisation is the key concept for understanding the alignment process. But I think that with their weak-to-strong generalisation agenda, they (as well as almost everyone else) apply it in the reverse direction: learning the values of weak agents (humans) doesn’t make sense. Rather, weak agents should learn the causal models that strong agents employ, so that they are able to express an informed value judgement. This is the way to circumvent the “absence of the ground truth for values” problem: instead, agents try to generalise their respective world models so that they sufficiently overlap, and then choose actions that seem net beneficial to both sides, without knowing how this value judgement was made by the other side.
In order to be able to generalise to shared world models with AIs, we must also engineer AIs to have human inductive biases from the beginning. Otherwise, this won’t be feasible. This observation makes “brain-like AGI” one of the most important alignment agendas in my view.
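To make the decision rule above concrete, here is a minimal toy sketch (mine, not from the post): each agent holds a private world model, the two sides compare only verdicts (“net beneficial or not”) once their models overlap enough, and neither sees how the other’s judgement was made. All names, the numeric benefit scale, and the overlap threshold are hypothetical illustration choices.

```python
# Toy illustration only: two agents with private world models agree on an action
# when both judge it net beneficial, without revealing how that judgement was made.
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    # Hypothetical world model: action -> expected net benefit on an arbitrary scale.
    world_model: dict[str, float]

    def judges_beneficial(self, action: str) -> bool:
        # The reasoning behind the number stays private; only the verdict is shared.
        return self.world_model.get(action, float("-inf")) > 0.0


def model_overlap(a: Agent, b: Agent) -> float:
    """Crude proxy for 'sufficiently overlapping' world models: the share of
    actions that both agents can evaluate at all."""
    shared = set(a.world_model) & set(b.world_model)
    union = set(a.world_model) | set(b.world_model)
    return len(shared) / len(union) if union else 0.0


def jointly_acceptable_actions(a: Agent, b: Agent, min_overlap: float = 0.5) -> list[str]:
    """Actions both sides independently judge net beneficial, considered only
    once the world models overlap enough for the comparison to be meaningful."""
    if model_overlap(a, b) < min_overlap:
        return []  # keep generalising the world models first
    shared = set(a.world_model) & set(b.world_model)
    return [act for act in sorted(shared)
            if a.judges_beneficial(act) and b.judges_beneficial(act)]


if __name__ == "__main__":
    human = Agent("human", {"deploy": -0.2, "pause": 0.4, "audit": 0.7})
    ai = Agent("ai", {"deploy": 0.3, "pause": 0.1, "audit": 0.5})
    print(jointly_acceptable_actions(human, ai))  # -> ['audit', 'pause']
```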
I don’t think “weak-to-strong generalization” is well described as “trying to learn the values of weak agents”.
Well, yes, it also includes learning weak agents’ models more generally, not just the “values”. But I think the point stands; it’s elaborated better in the linked post. As AIs will receive most of the same information that humans receive through always-on wearable sensors, there won’t be much left for AIs to learn from humans. Rather, it’s humans who will need to do their homework to increase the quality of their value judgements.
Could you clarify who you’re referring to by “strong” agents? You refer to humans as weak agents at the start, yet claim later that AIs should have human inductive biases from the beginning, which makes me think humans are the strong ones.