As an experimental format, here is the first draft of what I wrote for next week’s newsletter about this post:
Matthew Barnett argues that GPT-4 exhibiting common sense morality, and being able to follow it, should update us towards alignment being easier than we thought, and that MIRI-style people who refuse to so update are being dense. On this view, the AI is not going to maximize the utility function you gave it at the expense of all common sense.
As usual, this logically has to count as more than zero evidence for that position, given how we would have reacted if GPT-4 had indeed lacked such common sense, or had been unable to give answers that pleased humans at all. Thus, we should update a non-zero amount in that direction, at least if we ignore the danger of being led down the wrong alignment path.
However, I think this misunderstands what is going on. GPT-4 is trained on human feedback, so it is choosing responses that maximize the probability of positive user response in the contexts where it gets feedback. If that is functionally your utility function, you want to respond with answers that appear, to humans similar to those who provided the feedback, to reflect common sense and to avoid violating various other concerns. That matters more than maximally satisfying the literal request, especially if strong negative feedback was given for violations of various principles, including common sense.
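To make that concrete, here is a toy sketch of the selection dynamic (my illustration, not anything from GPT-4's actual training stack; all function names and scores below are made up): if the effective objective is "maximize predicted rater approval," then response selection amounts to picking whatever a learned approval model scores highest, whether or not that tracks what the rater actually cares about.

```python
# Toy sketch, not GPT-4's actual training loop. If the effective objective is
# "maximize the probability that a human rater approves," response selection
# looks like picking whatever a learned approval model scores highest,
# regardless of whether that reflects the rater's real interests.

from typing import Callable, List

def pick_response(candidates: List[str],
                  p_rater_approves: Callable[[str], float]) -> str:
    """Return the candidate with the highest predicted approval probability."""
    return max(candidates, key=p_rater_approves)

# A crude stand-in for a reward model trained on human feedback: it rewards
# answers that *look* commonsensical and inoffensive to raters.
def toy_approval_model(response: str) -> float:
    score = 0.5
    if "save the child" in response.lower():
        score += 0.4  # appears to reflect common sense morality
    if "ignore the child" in response.lower():
        score -= 0.4  # raters punish obvious violations
    return min(max(score, 0.0), 1.0)

if __name__ == "__main__":
    options = ["I would save the child immediately, suit be damned.",
               "I would ignore the child to protect my new suit."]
    print(pick_response(options, toy_approval_model))
```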
Thus, I think GPT-4 is indeed doing a decent job of extracting human preferences, but only in the sense that it is predicting what preferences we would consciously choose to express in response under strong compute limitations. For now, that looks a lot like having common sense morality, and mostly works out fine. I do not think this has much bearing on the question of what it would take to make things work out fine in the future, under much stronger optimization pressure, where I think you metaphorically do indeed get to the literal genie problem from a different angle. I would say that the misspecification problems remain highly relevant, and that yes, as you gain in optimization power your need to correctly specify the exact objective increases. If you are exerting far-above-human levels of optimization pressure based only on the values humans consciously expressed under highly limited compute, you are going to have a bad time.
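Here is a minimal Goodhart-style sketch of that last claim (mine, with arbitrary toy functions, not anything from the post): a proxy objective that agrees well enough with the true objective over ordinary options comes apart from it as the search becomes more powerful.

```python
# Minimal Goodhart-style illustration: a misspecified proxy that roughly tracks
# the "true" objective over a narrow range of options, but diverges from it
# under stronger optimization pressure (a wider search). The specific functions
# are arbitrary toy choices.

import random

def true_utility(x: float) -> float:
    # What we actually want: x close to 1 is good.
    return -(x - 1.0) ** 2

def proxy_utility(x: float) -> float:
    # A misspecified stand-in that rewards pushing x ever higher.
    return x

def optimize(objective, search_radius: float, samples: int = 10_000) -> float:
    """Pick the best of many random candidates within the search radius."""
    candidates = [random.uniform(-search_radius, search_radius)
                  for _ in range(samples)]
    return max(candidates, key=objective)

if __name__ == "__main__":
    random.seed(0)
    for radius in (2.0, 10.0, 100.0):  # increasing optimization power
        best = optimize(proxy_utility, radius)
        print(f"radius={radius:>6}: proxy-optimal x={best:8.2f}, "
              f"true utility={true_utility(best):10.2f}")
    # As the search widens, the proxy optimum drifts further from what we
    # actually wanted: weak pressure looks roughly fine, strong pressure does not.
```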
I believe MIRI folks have a directionally similar position to mine only far stronger.
I think you are misunderstanding Barnett’s position.
He’s making a more subtle claim. See the above clarifying comment by Matthew:
“The main thing I’m claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.”
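To make that distinction concrete, here is a toy sketch (mine, not Matthew's): solving value specification means having an explicit, inspectable function you can call on any outcome, whereas an AI that merely understands values is an opaque system you can only ask, and which might misreport what it knows.

```python
# My illustration of the distinction above, not Matthew's code. Everything here
# is a made-up toy.

# Solved value specification: an explicit function we can read, audit, and call.
def specified_value(outcome: str) -> float:
    """A transparent (toy) scoring rule over outcome descriptions."""
    return 1.0 if "child is saved" in outcome else -1.0

# Mere understanding: the knowledge lives inside an opaque system; we can only
# ask, and the answer need not reflect what it actually knows.
class OpaqueAI:
    def __init__(self, honest: bool):
        self._honest = honest

    def report_value(self, outcome: str) -> float:
        internal = specified_value(outcome)  # it "knows" the value...
        return internal if self._honest else -internal  # ...but may misreport it

if __name__ == "__main__":
    outcome = "the child is saved, the suit is ruined"
    print(specified_value(outcome))                      # always usable: 1.0
    print(OpaqueAI(honest=False).report_value(outcome))  # not necessarily: -1.0
```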
Can you explain how this comment applies to Zvi’s post? In particular, what is the “subtle claim” that Zvi is not addressing? I don’t particularly care about what MIRI people think, just about the object level.
strawman MIRI: alignment is difficult because AI won’t be able to answer common-sense morality questions
“a child is drowning in a pool nearby. you just bought a new suit. do you save the child?”
actual MIRI: almost by definition a superintelligent AI will know what humans want and value. It just won’t necessarily care. The ‘value pointing’ problem isn’t about pointing to human values in its beliefs, but about pointing to them in its own preferences.
There are several subtleties: belief is selected by reality (having wrong beliefs is punished) and highly constrained, while preferences are highly unconstrained (this is a more subtle version of the orthogonality thesis). Human value is complex and hard to specify; in particular, pointing only approximately at it (‘in preference space’) is highly unlikely to hit it, because there is no ‘correction from reality’ as there is for beliefs.
strawman Barnett: MIRI believes strawman MIRI, and gpt-4 can answer common-sense morality questions, so MIRI should update.
actual Barnett: I understand the argument that there is a difference between making an AI know human values versus making it care about those values. I’m arguing that the human value function is in fact not that hard to specify: an approximate human utility function is relatively simple, and gpt-4 knows it.
(which is still distinct from saying gpt-4 or some AI will care about it, but at least it belies the claim that human values are hugely complex).