I did try to provide a causal story for why humans could be aligned to some value without relying much on societal incentives, so you can check out the second part of my comment.
It’s not surprising that human behaviour usually follows the values we infer from human behaviour.
My non-tautological claim is that the reason isn’t behavioral, but instead internal, and in particular the innate reward system plays a big role here.
In essence, my story of how humans come to be aligned with the values of the innate reward system doesn’t rely on a behavioral property.
I’ll reproduce it, so that you can see that it doesn’t rely on behavioral analysis:
There is a weak prior in the genome for things like not seizing power to kill people in your ingroup, and the prior is weak enough that we can treat it as a wildcard: aligning the system to some other value would work more or less as well.
The brain’s innate reward system uses something like DPO, RLHF, or whatever else is used to create a preference model, to guide the intelligence into being aligned with whatever values the innate reward system wants, say empathy for the ingroup, though this is only a motivating example.
It uses backprop or some weaker variant of it, and at a high level probably uses an optimizer that is at best comparable to gradient descent. Since it has white-box access and can update the brain in a targeted way, it can efficiently compute the direction that best improves its performance on, say, having empathy for the ingroup, though again this is a wildcard in that it could stand in for almost any values.
The loop of weak prior + innate reward system + an algorithm to implement it, like backprop or its weaker variants, means that by around age 25 the human is very aligned with the values the innate reward system put in place, like empathy for the ingroup. Again, this is only an example of an alignment target; you could put almost arbitrary alignment targets in there.
Critically, it makes very little reference to society or behavioral analysis, so I wasn’t making the mistake you said I made.
It is also no longer a tautology, as it depends on the innate reward system actually rewarding desired behavior by changing the brain’s weights, and removing the innate reward system or showing that the weak prior + value learning strategy was ineffective would break my thesis.
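To make this concrete, here is a minimal toy sketch of the loop described above (my own illustration, not a claim about the brain’s actual algorithm; the function name, the quadratic reward, and all the numbers are invented for the example): a behaviour vector starts from a weak near-zero prior, the ‘innate reward model’ is just a fixed target direction standing in for whatever values it happens to encode, and white-box gradient ascent pulls the learned behaviour toward that target, whichever target you pick.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16                                         # toy "behaviour" space

def develop(target_values, steps=2000, lr=0.01):
    """Developmental loop: gradient ascent on the innate reward pulls behaviour
    toward whatever values the reward system encodes, starting from a weak prior."""
    behaviour = 0.01 * rng.standard_normal(DIM)  # weak genomic prior: near-zero initialisation
    for _ in range(steps):
        # Toy innate reward r(b) = target . b - 0.5 * ||b||^2, so grad_b r = target - b.
        # "White-box access": the reward system can apply this exact gradient to the brain.
        behaviour += lr * (target_values - behaviour)
    return behaviour

# The wildcard point: swap in almost any target values and the same loop aligns to them.
for name in ("ingroup_empathy", "arbitrary_other_value"):
    target = rng.standard_normal(DIM)
    learned = develop(target)
    cosine = learned @ target / (np.linalg.norm(learned) * np.linalg.norm(target))
    print(f"{name}: cosine similarity of learned behaviour to the reward model's values = {cosine:.3f}")
```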
This still seems like the same error: what evidence do we have that tells us the “values the innate reward system put in place”? We have behaviour.
We don’t know that [system aimed for x and got x]. We know only [there’s a system that tends to produce x].
We don’t know the “values of the innate reward system”.
The reason I’m (thus far) uninterested in a story about the mechanism is that there’s nothing interesting to explain. You only get something interesting if you assume your conclusion: if you assume without justification that the reward system was aiming for x and got x, you might find it interesting to consider how that’s achieved, but this doesn’t give you evidence for the assumption you used to motivate your story in the first place.
In particular, I find it implausible that there’s a system that does aim for x and get x (unless the ‘system’ is the entire environment): If there are environmental regularities that tend to give you elements of x without your needing to encode them explicitly, those regularities will tend to be ‘used’ - since you get them for free. There’s no selection pressure to encode or preserve those elements of x.
If you want to sail quickly, you take advantage of the currents.
So I don’t think there’s any reasonable sense in which there’s a target being hit. If a magician has me select a card, looks at it, then tells me that’s exactly the card they were aiming for me to pick, I’m not going to spend energy working out how the ‘trick’ worked.
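As a toy illustration of the regularities point (my own construction, with made-up fitness numbers and an invented evolve helper, not a model of any real trait): a gene that directly encodes a trait carries a small cost, so selection only maintains it when the environment doesn’t already supply the trait for free.

```python
import random

random.seed(0)

POP, GENS, MUTATION = 200, 300, 0.005
BENEFIT, COST = 0.2, 0.05      # fitness benefit of having the trait, cost of encoding it genetically

def evolve(env_provides_trait: bool) -> float:
    """Return the final frequency of the gene that encodes the trait directly."""
    def fitness(has_gene: bool) -> float:
        has_trait = has_gene or env_provides_trait     # the environment can supply the trait for free
        return 1.0 + (BENEFIT if has_trait else 0.0) - (COST if has_gene else 0.0)

    pop = [random.random() < 0.5 for _ in range(POP)]  # gene present / absent in each individual
    for _ in range(GENS):
        weights = [fitness(g) for g in pop]
        pop = random.choices(pop, weights=weights, k=POP)                  # selection
        pop = [(not g) if random.random() < MUTATION else g for g in pop]  # mutation
    return sum(pop) / POP

print("gene frequency, trait NOT supplied by environment:", evolve(False))  # tends to stay high
print("gene frequency, trait supplied by environment:    ", evolve(True))   # tends to drop low
```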
It sounds like we’ve arrived at the crux of my optimism: you think that for a system to aim for x, it essentially needs to be the entire environment, and that the environment largely dictates human values, whereas I think human values are less dependent on the environment and far more dependent on the genome + learning process. Equivalently, I place much more emphasis on humans’ internals as the main contributor to values, while you place much more emphasis on the external environment than on internals like the genome or learning process.
This could be disentangled into two cruxes:
Where are human values generated?
How cheap is it to specify values? Or, alternatively, how weak do our priors need to be to encode values (if values are encoded internally)?
I’d expect my answers to be: mostly internal, i.e. the genome + learning process with a little help from the environment, on the first question, and relatively cheap to specify values on the second. You’d probably answer that the environment basically sets the values, with little or no help from the internals of humans, on the first question, and that specifying values is very expensive on the second.
For some of my reasoning on this, I’d point you to posts like these:
https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for
(This basically argues that the critic in the brain generates the values.)
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome
(The genomic prior can’t be strong, because it has massive limitations in what it can encode.)
The central crux really isn’t where values are generated. That’s a more or less trivial aside. (though my claim was simply that it’s implausible the values aimed for would be entirely determined by genome + learning process; that’s a very weak claim; 98% determined is [not entirely determined])
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on what values are achieved, is human behaviour.
These things must agree with one another: the learning process that produced human values produces human values. From an alignment-difficulty perspective, that’s enough to conclude that there’s nothing to learn here.
An argument of the form [f(x) == f(x), therefore y] is invalid. f(x) might be interesting for other reasons, but that does nothing to rescue the argument.
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on what values are achieved, is human behaviour.
That’s our disagreement: we have more information than that. I agree human behavior plays a role in my evidence base, but I have more evidence than behavior alone.
In particular, I’m using results from both ML/AI and human brain studies to inform my conclusion.
Basically, my claim is that [f(x) == f(y), therefore z].