So, something I am now wondering is: Why don’t Complexity of Value and Fragility of Value make alignment obviously impossible?
Maybe I’m misunderstanding the two theories, but don’t they basically boil down to “Human values are too complex to program”? Because that just seems like something that’s objectively correct. Trying to do exactly that seems like attempting to “solve” ethics, which looks pretty blatantly futile to me.
I (hopefully) suspect that I have the exact shape of the issue wrong, and that (most) people aren’t actually literally trying to reverse engineer human morality and then encode it.
If that actually is what everyone is trying to do, then why is it only considered “difficult” and not outright undoable?
There are two answers to this. The first is indirection strategies. Human values are very complex, too complex to write down correctly or program into an AI. But specifying a pointer that picks out a particular human brain or group of brains, and interprets the connectome of that brain as a set of values, might be easier. Or, really, any specification that’s able to conceptually represent humans as agents, if it successfully dodges all the corner cases about what counts, is something that a specification of values might be built around. We don’t know how to do any of this yet (we can’t extract a connectome, can’t interpret a connectome as a set of values, can’t formally point at a human as an agent, and can’t convert an abstract agent into a set of values). But all of these steps are things that are possible in principle, albeit difficult.
The second answer is that things look more complex when you don’t understand them, and the apparent complexity of human values might actually be an artifact of our confusion. I don’t think human values are simple in the way that philosophy tends to try to simplify them, but I think the algorithm by which humans acquire their values, given a lifetime of language inputs, might turn out to be a neat one-page algorithm, in the same way that the algorithm for a transformer is a neat one-page algorithm that captures all of grammar. This wouldn’t be a solution to alignment either, but it would be a decent starting point to build on.
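To make the "neat one-page algorithm" point concrete: the core of a transformer, scaled dot-product self-attention, really does fit in a few lines. This is a minimal illustrative sketch in numpy (the dimensions and random weights are made up; a real transformer adds multiple heads, MLP layers, positional encoding, and training):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Each token forms a query, key, and value vector, then mixes in
    # information from every other token, weighted by query-key similarity.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)               # same shape as X: (5, 8)
```

The point is not that this toy does anything useful, but that the mechanism which, at scale, absorbs grammar is this compact; the hope is that whatever mechanism absorbs values might be comparably compact.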
I apologize for my ignorance, but are these things what people are actually trying in their own ways? Or are they really trying the thing that seems much, much crazier to me?
They’re mostly doing “train a language model on a bunch of data and hope human concepts and values are naturally present in the neural net that pops out”, which isn’t exactly either of these strategies. Currently it’s a bit of a struggle to get language models to go in an at-all-nonrandom direction (though there has been recent progress in that area). There are tidbits of deconfusion-about-ethics here and there on LW, but nothing I would call a research program.
I don’t think most people are trying to explicitly write down all human values and then tell them to an AI. Here are some more promising alternatives:
- Tell an AI to “consult a human if you aren’t sure what to do”
- Instead of explicitly trying to write down human values, learn them by example (by watching human actions, or reading books, or…)
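The second alternative can be sketched as a toy: instead of writing a utility function down, infer it from observed choices. Everything below (the linear utility, the Bradley-Terry-style logistic fit, the synthetic data) is an illustrative stand-in, not anyone's actual research proposal:

```python
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0, 0.5])        # hidden "human values" (toy)

# Observed behavior: 200 choices between option pairs, where the human
# picks whichever option scores higher under the hidden utility.
A = rng.normal(size=(200, 3))
B = rng.normal(size=(200, 3))
chose_a = (A @ true_w > B @ true_w).astype(float)

# Infer the utility weights by logistic regression on feature
# differences (a Bradley-Terry-style preference model), via plain
# gradient ascent on the log-likelihood.
w = np.zeros(3)
d = A - B
for _ in range(500):
    p = 1 / (1 + np.exp(-d @ w))           # P(human chooses A)
    w += 0.1 * d.T @ (chose_a - p) / len(d)

# The learned w should point in roughly the same direction as true_w,
# even though true_w was never written down anywhere the learner saw.
cosine = w @ true_w / (np.linalg.norm(w) * np.linalg.norm(true_w))
print(cosine)
```

The gap between this toy and the real problem is, of course, the whole difficulty: real values aren't linear in three known features, and real human choices are noisy and sometimes mistaken.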