This still seems like the same error: what evidence do we have that tells us the “values the innate reward system put in place”? We have behaviour.
We don’t know that [system aimed for x and got x].
We know only [there’s a system that tends to produce x].
We don’t know the “values of the innate reward system”.
The reason I’m (thus far) uninterested in a story about the mechanism is that there’s nothing interesting to explain. You only get something interesting if you assume your conclusion: if you assume, without justification, that the reward system was aiming for x and got x, you might find it interesting to consider how that’s achieved. But this doesn’t give you evidence for the assumption you used to motivate the story in the first place.
In particular, I find it implausible that there’s a system that does aim for x and get x (unless the ‘system’ is the entire environment):
If there are environmental regularities that tend to give you elements of x without your needing to encode them explicitly, those regularities will tend to be ‘used’, since you get them for free. There’s no selection pressure to encode or preserve those elements of x.
If you want to sail quickly, you take advantage of the currents.
So I don’t think there’s any reasonable sense in which there’s a target being hit.
If a magician has me select a card, looks at it, then tells me that’s exactly the card they were aiming for me to pick, I’m not going to spend energy working out how the ‘trick’ worked.
It sounds like we’ve gotten to the crux of my optimism: you think that for a system to aim for x, it essentially needs to be the entire environment, and that the environment largely dictates human values, whereas I think human values are less dependent on the environment and far more dependent on the genome + learning process. Put another way, I place much more emphasis on humans’ internals as the main contributor to values, while you place much more emphasis on the external environment than on internals like the genome or learning process.
This could be disentangled into two cruxes:
1. Where are human values generated?
2. How cheap is it to specify values (or, alternatively, how weak do our priors need to be to encode values, if values are encoded internally)?
And I’d expect my answers to be mostly internal on the first question (the genome + learning process, with a little help from the environment) and relatively cheap to specify values on the second, whereas you’d probably answer that the environment basically sets the values, with little or no help from human internals, on the first question, and that values are very expensive to specify on the second.
For some of my reasoning on this, I’d recommend reading posts like these:
https://www.lesswrong.com/posts/HEonwwQLhMB9fqABh/human-preferences-as-rl-critic-values-implications-for
(Basically argues that the critic in the brain generates the values)
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome
(The genomic prior can’t be strong, because it has massive limitations in what it can encode).
The central crux really isn’t where values are generated. That’s a more or less trivial aside. (though my claim was simply that it’s implausible the values aimed for would be entirely determined by genome + learning process; that’s a very weak claim; 98% determined is [not entirely determined])
The crux is the tautology issue: I’m saying there’s nothing to explain, since the source of information we have on [what values are being “aimed for”] is human behaviour, and the source of information we have on what values are achieved, is human behaviour.
These things must agree with one another: the learning process that produced human values produces human values. From an alignment difficulty perspective, that’s enough to conclude that there’s nothing to learn here.
An argument of the form [f(x) == f(x), therefore y] is invalid.
f(x) might be interesting for other reasons, but that does nothing to rescue the argument.
That’s our disagreement: I think we have more information than that. I agree human behavior plays a role in my evidence base, but it’s not the only evidence I have.
In particular, I’m using results from both ML/AI and human brain studies to inform my conclusion.
Basically, my claim is that [f(x) == f(y), therefore z].
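The contrast between the two argument forms can be made concrete with a toy sketch (the `infer_values` estimator and its outputs are hypothetical stand-ins, not a real model):

```python
def infer_values(evidence_source):
    """Hypothetical estimator of the values a system is 'aiming for',
    read off from one source of evidence. The table is a made-up toy."""
    table = {
        "behaviour": "prosocial-ish",
        "brain studies": "prosocial-ish",
    }
    return table[evidence_source]

# The objected-to form, [f(x) == f(x), therefore y]: both sides read off
# the SAME evidence, so the equality holds for any estimator whatsoever
# and establishes nothing.
assert infer_values("behaviour") == infer_values("behaviour")

# The claimed form, [f(x) == f(y), therefore z]: two INDEPENDENT evidence
# sources. This equality could have failed, so its holding is genuine
# (if weak) evidence.
assert infer_values("behaviour") == infer_values("brain studies")
```

The first assertion is vacuous; the second is only informative to the degree that the two evidence sources really are independent.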