I think many working in AI safety, and specifically those focused on alignment, basically don’t understand what values are and, alarmingly to me, haven’t invested much effort into figuring it out. I suspect this is motivated stopping: figuring out what values are and how they work is hard, and the problem doesn’t easily avail itself to methods well known to AI researchers, so it gets ignored or pushed off as something AI will figure out for us. As a result, values are usually treated at worst as some kind of black box and at best as an abstract mathematical construct akin to preferences. That’s a step beyond simply ignoring our ignorance of what values are, but it’s nowhere close to where I think we’ll need to be to build aligned AI.
I’ve tried to address this in the past, but I don’t know how much impact I’ve really had. The recent work on the shard theory of values gives me hope that the situation is changing.