My impression continues to be that (4) is neglected. Stuart has been the most prolific person I can think of to work on this question, and after him it's a fast-falling power-law distribution: I've done some work, and beyond that not much else comes to mind in terms of work that addresses (4) in a technical manner that might lead to solutions useful for AI safety.
I have no doubt others have done things (Alexey has thought (and maybe published?) some on this), and others could probably forget my work or Stuart's as easily as I've forgotten theirs, because we don't have a lot of momentum on this problem right now to keep it fresh in our minds. Or so it seems to me now. I've had some good conversations with folks, and a few seem excited about working on (4) and seem qualified in various ways to do it, but no one but Stuart has yet produced very much published work on it.
(Yes, there is Eliezer’s work on CEV, which is more like a placeholder and wishful thinking than anything more serious, and it has probably accidentally been the biggest bottleneck to work on (4) because so many people I talk to say things like “oh, we can just do CEV and be done with this, so let’s worry about the real problems”.)
I agree there is a risk it is an impossible problem, and I actually think that risk is quite high, in that we may not be able to adequately aggregate human preferences in ways that result in something coherent. In that case I view safety and alignment as more about avoiding catastrophe and cutting down the space of aligned AI solutions to remove the things that clearly don't work, rather than building towards things that clearly do. I hope I'm being too pessimistic.
In my experience, people mostly haven’t had the view of “we can just do CEV, it’ll be fine” and instead have had the view of “before we figure out what our preferences are, which is an inherently political and messy question, let’s figure out how to load any preferences at all.”
It seems like there needs to be some interplay here: “what we can load” informs “what shape we should force our preferences into”, and “what shape our preferences actually are” informs “what loading needs to be capable of to count as aligned.”
I wouldn’t say it’s neglected, just that people are busy laying foundations and that it’s probably too early to tackle the problem directly. In particular, grounding the preferences of real-world agents is an obvious application for any potential theory of embedded agency. (At least the way I think about it, grounding models and preferences is the main problem of embedded agency.)