If you define what humans want in terms of states of the brain, and you don’t want the AI to just intervene directly on peoples’ brains, there’s a lot of extra work that has to happen, which I think will inevitably “de-purify” the values by making them dependent on context and on human behavior. Here’s what I think this might look like:
You have some model (“minimize prediction error”) that identifies what’s good, and you try to fit the brain’s actual physiology to this model in order to identify what’s going on physically when humans’ values are satisfied. But of course what humans want isn’t a certain brain state; humans want things to happen in the world. So your AI needs to learn what changes in the world constitute the “normal human context” in which it can apply this rating based on brain states. But heroin still exists, so if we don’t want the AI to learn that everyone wants heroin, this understanding of the external world has to start getting really value-laden, and maybe value-laden in a way based on human behaviors and not just human brain states.
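To make the heroin worry concrete, here’s a minimal toy sketch (in Python, with entirely made-up functions and data, not a proposal of anyone’s actual scheme): a planner that scores outcomes purely by predicted brain-state valence prefers the wirehead-style option, and the “normal human context” filter that avoids this is exactly where value-laden judgments about which causal routes to valence “count” sneak back in.

```python
# Toy sketch only: score candidate world-states by a predicted brain-state
# valence model, then see why a "normal human context" filter is needed.
# All functions, fields, and numbers here are hypothetical.

def predicted_valence(world_state: dict) -> float:
    """Stand-in for a model mapping a world-state to predicted brain valence
    (e.g. 'low prediction error'). Here it's just a hard-coded toy value."""
    return world_state["brain_valence"]

def in_normal_human_context(world_state: dict) -> bool:
    """The value-laden part: was the valence produced by the ordinary
    world-to-brain pathway, or by intervening on the brain directly?"""
    return not world_state["direct_brain_intervention"]

candidates = [
    {"name": "tasty meal",  "brain_valence": 0.7,  "direct_brain_intervention": False},
    {"name": "heroin drip", "brain_valence": 0.99, "direct_brain_intervention": True},
]

# Naive scoring picks the wirehead-style option...
best_naive = max(candidates, key=predicted_valence)

# ...while the filtered scoring has to smuggle in judgments about which
# routes to valence are legitimate, which is where context-dependent,
# behavior-based value information re-enters.
best_filtered = max(
    (c for c in candidates if in_normal_human_context(c)),
    key=predicted_valence,
)

print(best_naive["name"], best_filtered["name"])  # heroin drip, tasty meal
```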
One further thing to think about: this doesn’t engage with meta-ethics. Our meta-ethical desires are about things like what our desires should be, what are good procedural rules of decision-making (simple decision-making procedures often fail to care about continuity of human identity), and how to handle population ethics. The learn-from-behavior side puts these on equal footing with our desire for e.g. eating tasty food, because they’re all learned from human behavior. But if you ground our desire for tasty food in brain structure, this at the very least puts opinions on stuff like tasty food and opinions on stuff like theory of identity on very different footings, and might even cause some incompatibilities. Not sure.
Overall I think reading this post increased how good of an idea I think it is to try to ground human liking in terms of a model of brain physiology, but I think this information has to be treated quite indirectly. We can’t just give the AI preferences over human brain states, it needs to figure out what these brain states are referring to in the outside world, which is a tricky act of translation / communication in the sense of Quine and Grice.
I appreciate this sentiment, and I do think there’s a dangerous, bad reduction of values to valence grounded in the operation of the brain that ignores much of what we care about; but that extra stuff we care about is also expressed as valence grounded in the operation of the brain. All the concerns you bring up must be computed somewhere; that somewhere is human brains, and if what those brains do is “minimize prediction error” then those concerns are also expressions of prediction error minimization. This is what’s exciting to me about a grounding like the one I’m considering: it’s embedded in the world in a way that leaves nothing out (unless there’s some kind of “spooky” physics happening that we can’t observe, which I consider unlikely), so we naturally capture all the complexity you’re concerned about, though it may take quite a bit of computation to capture it all.
The difficulty is that we want to take human values and put them into an AI that doesn’t do prediction error minimization in the human sense, but instead does superhumanly competent search and planning. But if you have a specific scheme in mind that could outperform humans without leaving anything out, I’d be super interested.
As of yet, no, although this brings up an interesting point, which is that I’m looking at this stuff to find a precise grounding because I don’t think we can develop a plan that will work to our satisfaction without one. I realize lots of people disagree with me here, thinking that we need the method first and that the value grounding will be worked out instrumentally by the method. I dislike this because it leaves no way to verify the method other than by observing what an AI produced by that method does, and that’s a dangerous form of verification because of the risk of a “treacherous” turn that isn’t so much treacherous as it is the turn we could have predicted, if we’d had a solid theory of what the method we were using really implied about the thing we cared about, and if we’d bothered to work out what that thing fundamentally was.
Also I suspect we will be able to think of our desired AI in terms of control systems and set points, because I think we can do this for everything that’s “alive”, although it may not be the most natural abstraction to use for its architecture.
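For what I mean by “control systems and set points”, here’s a minimal illustrative sketch (the plant model and numbers are made up): a proportional controller senses a variable, compares it to a set point, and acts to shrink the error.

```python
# Minimal sketch of the "control system with a set point" framing: a
# proportional controller that acts to reduce the gap between a sensed
# variable and its set point. Purely illustrative; values are made up.

SET_POINT = 37.0   # e.g. a regulated temperature
GAIN = 0.5         # how aggressively the controller corrects error

def step(sensed: float) -> float:
    """One control step: compute error against the set point and return
    a corrective action proportional to it."""
    error = SET_POINT - sensed
    return GAIN * error

temperature = 30.0
for _ in range(20):
    temperature += step(temperature)   # the "plant" just adds the action

print(round(temperature, 2))  # converges toward 37.0
```

The claim in the comment is only that this kind of description can be fit to anything “alive” (and perhaps to the AI we want), not that the AI’s architecture would literally be built this way.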