[Question] Can someone explain to me why most researchers think alignment is probably something that is humanly tractable?
From the perspective of someone who mostly has no clue what they're talking about (that person being me), I don't understand why people working in AI safety seem to think that a successful alignment solution (as in, one that stops everyone from being killed or tortured) is something that is humanly achievable.
To be clearer: if someone is worried about AI x-risk but isn't also a doomer, then I don't know what they are hoping for.
I think there’s a fairly high chance that I’m largely making false assumptions here, and I understand that individual alignment schemes generally differ from each other significantly.
My worry is that the different approaches, underneath the layers of technical jargon that mostly go over my head, are all ultimately working towards something that looks like "accurately reverse-engineer human values and then accurately encode them".
If that characterization is correct, then I don't understand why the situation isn't largely considered hopeless. Do people think that task is much more doable/less complex than I do? If so, why?
Or am I just completely off-base here? (EDIT: by this I mean to ask whether I'm wrong in my assumption about what alignment approaches are ultimately trying to do.)