There are two answers to this. The first is indirection strategies. Human values are very complex, too complex to write down correctly or program into an AI. But specifying a pointer that picks out a particular human brain or group of brains, and interprets the connectome of that brain as a set of values, might be easier. Or, really, any specification that’s able to conceptually represent humans as agents, if it successfully dodges all the corner cases about what counts, is something that a specification of values might be built around. We don’t know how to do this (can’t get a connectome, can’t convert a connectome into values, can’t interpret a human as an agent, and can’t convert an abstract agent into values). But all of these steps are things that are possible in principle, albeit difficult.
The second answer is that things look more complex when you don’t understand them, and the apparent complexity of human values might actually be an artifact of our confusion. I don’t think human values are simple in the way that philosophy tends to try to simplify them, but I think the algorithm by which humans acquire their values, given a lifetime of language inputs, might turn out to be a neat one-page algorithm, in the same way that the algorithm for a transformer is a neat one-page algorithm that captures all of grammar. This wouldn’t be a solution to alignment either, but it would be a decent starting point to build on.
I apologize for my ignorance, but are these things what people are actually trying in their own ways? Or are they really trying the thing that seems much, much crazier to me?
They’re mostly doing “train a language model on a bunch of data and hope human concepts and values are naturally present in the neural net that pops out”, which isn’t exactly either of these strategies. Currently it’s a bit of a struggle to get language models to go in an at-all-nonrandom direction (though there has been recent progress in that area). There are tidbits of deconfusion-about-ethics here and there on LW, but nothing I would call a research program.