Yeah, I’m not actually convinced humans are “aligned under reflection” in the relevant sense; there are lots of ways to do reflection, and as Holden himself notes in the top-level post:
You have just done a lot of steps, many of which involved reflection, with no particular way to get ‘back on track’ if you’ve done some of them in goofy ways
[...]
If the AI does a bunch of screwed-up reflection, it might thereby land in a state where it’d be realistic to do crazy stuff (as humans who have done a lot of reflection sometimes do).
It certainly seems to me that e.g. people like Ziz have done reflection in a “goofy” way, and that being human has not particularly saved them from deriving “crazy stuff”. Of course, humans doing reflection would still be confined to a subset of the mental moves being done by crazy minds made out of gradient descent on matrix multiplication, but it’s currently plausible to me that part of the danger arises simply from “reflection on (partially) incoherent starting points” getting really crazy really fast.
(It’s not yet clear to me how this intuition interfaces with my view on alignment hopes; you’d expect it to make things worse, but I actually think this is already “priced in” w.r.t. my P(doom), so explicating it like this doesn’t actually move me—which is about what you’d expect, and strive for, as someone who tries to track both their object-level beliefs and the implications of those beliefs.)
(EDIT: I mean, a lot of what I’m saying here is basically “CEV” might not be so “C”, and I don’t actually think I’ve ever bought that to begin with, so it really doesn’t come as an update for me. Still worth making explicit though, IMO.)
I hear you on this concern, but it basically seems similar (IMO) to a concern like: “The future of humanity after N more generations will be ~without value, due to all the reflection humans will do—and all the ways their values will change—between now and then.” A large set of “ems” gaining control of the future after a lot of “reflection” seems quite comparable to future humans having control over the future (also after a lot of effective “reflection”).
I think there’s some validity to worrying about a future with very different values from today’s. But I think misaligned AI is (reasonably) usually assumed to diverge in more drastic and/or “bad” ways than humans themselves would if they stayed in control; I think of this difference as the major driver of wanting to align AIs at all. And it seems Nate thinks that the hypothetical training process I outline above gets us something much closer to “misaligned AI” levels of value divergence than to “ems” levels of value divergence.
My view on why moral reflection can lead to things we perceive as bad ultimately comes down, I suspect, to the fact that there are too many valid answers to the question “What’s moral/ethical?” or “What’s the CEV?” Indeed, I think there are an infinite number of valid answers to these questions.
This leads to several issues for alignment:
Your endpoint in reflection completely depends on your starting assumptions, and these assumptions are choosable.
There is no safeguard against someone reflecting and ending up at a point where they harm someone else’s values. Thus, values that seem bad from our perspective can’t be guaranteed to be avoided.
The endpoints aren’t constrained by default, so you have to hope that the reflection process doesn’t lead to your values being diminished or violated.