I’m an alignment researcher. It’s not obvious whether I’m a doomer (P(doom) ~ 50%), but I definitely think alignment is tractable and research is worth doing.
Quintin is on the extreme end—he’s treating “Learn human values as well as humans do / Learn the fuzzy concept that humans actually use when we talk about values” as a success condition, whereas I’d say we want “Learn human values with a better understanding than humans have / Learn the entire constellation of different fuzzy concepts that humans use when we talk about values, and represent them in a way that’s amenable to self-improvement and decision-making.”
But my reasons for optimism overlap with his. We're not trying to learn the Final Form of human values ourselves and then write it down. That problem would be really hard. We're just trying to build an AI that learns to model humans in a way that's sufficiently responsive to how humans want to be modeled.
The extra twist I'd add is that doing this well still requires answering a lot of philosophy-genre questions. But I'm optimistic about our ability to do that, too. We have a lot of advantages relative to philosophers. For example: if a property we'd want an AI to have turns out to be impossible, we don't keep arguing about it; we remember we're in the business of designing real-world solutions, and we ask for a different property instead.