I agree that (4) needs to be taken seriously, as (1) and (2) are hard to succeed at without making a lot of progress on (4), and (3) is just a catch-all for every other approach. It is also the hardest, as it probably requires breaking a lot of new ground, so people tend to work on what appears solvable. I thought some people were working on it, though, no? There is also a chance of proving that “an actual grounded definition of human preferences” is impossible in a self-consistent way, and we would have to figure out what to do in that case. The latter feels like a real possibility to me.
My impression continues to be that (4) is neglected. Stuart has been the most prolific person I can think of working on this question, and it’s a fast-falling power-law distribution after that: I have done some work myself, and beyond that not much comes to mind that addresses (4) in a technical manner that might lead to solutions useful for AI safety.
I have no doubt others have done things (Alexey has thought about this, and maybe published some of it?), and others could probably forget my work or Stuart’s as easily as I’ve forgotten theirs, because we don’t have a lot of momentum on this problem right now to keep it fresh in our minds. Or so is my impression of things now. I’ve had some good conversations with folks, and a few seem excited about working on (4) and seem qualified to do it, but no one but Stuart has yet produced very much published work on it.
(Yes, there is Eliezer’s work on CEV, but that is more a placeholder and wishful thinking than anything serious, and it has probably accidentally been the biggest bottleneck to work on (4), because so many people I talk to say things like “oh, we can just do CEV and be done with this, so let’s worry about the real problems”.)
I agree there is a risk that it is an impossible problem, and I actually think that risk is quite high, in that we may not be able to adequately aggregate human preferences in ways that result in something coherent. In that case I view safety and alignment as being more about avoiding catastrophe and cutting down the aligned-AI solution space to remove the things that clearly don’t work, rather than building towards things that clearly do. I hope I’m being too pessimistic.
In my experience, people mostly haven’t had the view of “we can just do CEV, it’ll be fine” and instead have had the view of “before we figure out what our preferences are, which is an inherently political and messy question, let’s figure out how to load any preferences at all.”
It seems like there needs to be some interplay here: “what we can load” informs “what shape we should force our preferences into”, and “what shape our preferences actually are” informs “what loading needs to be capable of to count as aligned”.
I wouldn’t say it’s neglected, just that people are busy laying foundations and that it’s probably too early to tackle the problem directly. In particular, grounding the preferences of real-world agents is an obvious application for any potential theory of embedded agency. (At least the way I think about it, grounding models and preferences is the main problem of embedded agency.)
It’s not obvious to me why this ought to be the case. Could you elaborate?
Even if we succeeded at (1), it would be hard to know that we’d succeeded without progress on (4). If we’re using one or more proxies, we don’t have a way to talk about how accurate they are without (4): we can’t evaluate how closely the proxies match the thing they’re supposed to proxy without grounding that thing.
For (2), if we want to talk about “low-impact” or anything like it, then we need a grounding of what kind of impact we care about—and that question falls under (4). If we forget about some kind of impact that humans actually do care about, then we’re in trouble.
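To make that dependence concrete, here is a toy sketch (purely illustrative; `Outcome`, `proxy`, and `true_preferences` are hypothetical stand-ins, not anything from the agenda): any attempt to score a proxy’s accuracy takes the grounded preference function as an input, so without (4) the score simply can’t be computed.

```python
from typing import Callable, Iterable

# Hypothetical stand-in: treat an "outcome" as an opaque label.
Outcome = str

def proxy_error(
    proxy: Callable[[Outcome], float],
    true_preferences: Callable[[Outcome], float],  # exactly what (4) would have to supply
    outcomes: Iterable[Outcome],
) -> float:
    """Mean absolute gap between a proxy and the grounded preferences.

    Note the signature: without a grounded `true_preferences`, this
    comparison cannot be evaluated at all, however good the proxy is.
    """
    outcomes = list(outcomes)
    return sum(abs(proxy(o) - true_preferences(o)) for o in outcomes) / len(outcomes)
```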
Yep ^_^ I make those points in the research agenda (section 3).
Exactly. You explained it better than I could :)
I’m also curious why this should be so.
I also continue to disagree with Stuart that low impact, in particular, is intractable without learning human values.
To be precise: I argue low impact is intractable without learning a subset of human values; the full set is not needed.
Thanks for clarifying! I haven’t brought this up on your research agenda because I’d prefer to have the discussion during an upcoming sequence of mine, and it felt unfair to comment on your agenda with just “I disagree, but I won’t elaborate right now”.