[Please read the OP before voting. Special voting rules apply.]
Human value is not complex, wireheading is the optimal state, and Fun Theory is mostly wrong.
What would you have to see to convince you otherwise?
I think it would take an a priori philosophical argument, rather than empirical evidence.
Wouldn’t cognitive science or neuroscience be sufficient to falsify such a theory? All we really have to do is show that “good life”, as seen from the inside, does not correspond to maximized happy-juice or dopamine-reward.
The most that would show is what humans tend to prefer, not what they should prefer.
You’re going to have to explain what meta-ethical view you hold such that “prefer on reflection given full knowledge and rationality” and “should prefer” are different.
I don’t think neuroscience would tell you what you’d prefer on reflection given full knowledge and rationality.
Sufficiently advanced cognitive science definitely will, though.
I’m skeptical of that.
I can think of something I prefer, on reflection, against wireheading. Now what?
There are a lot of things people are capable of preferring that aren’t pleasure; the question is whether that’s what they should prefer.
Awfully presumptuous of you to tell people what they should prefer.
Why? We do this all the time, when we advise people to do something different from what they’re currently doing.
No, we don’t. That’s making recommendations as to how they can attain their preferences. That you don’t seem to understand this distinction is concerning. Instrumental and terminal values are different.
My position is in line with that—people are wrong about what their terminal values are, and they should realize that their actual terminal value is pleasure.
Why is my terminal value pleasure? Why should I want it to be?
Fundamentally, because pleasure feels good and preferable, and it doesn’t need anything additional (such as conditioning through social norms) to make it desirable.
Why should I desire what you describe? What’s wrong with values more complex than a single transistor?
Also, naturalistic fallacy.
It’s not a matter of what you should desire, it’s a matter of what you’d desire if you were internally consistent. Theoretically, you could have values that weren’t pleasure, such as if you couldn’t experience pleasure.
Also, the naturalistic fallacy isn’t a fallacy, because “is” and “ought” are bound together.
Why is the internal consistency of my preferences desirable, particularly if it would lead me to prefer something I am rather emphatically against?
Why should the way things are be the way things are?
(Note: Being continuously downvoted is making me reluctant to continue this discussion.)
One reason to be internally consistent is that it prevents you from being Dutch booked. Another is that it lets you coherently get the most of what you want, without your preferences contradicting each other.
As far as preferences and motivation are concerned, any account of how things should be must appeal to preferences as they are, or at least as they would be if they were internally consistent.
Retracted: Dutch booking has nothing to do with preferences; it refers entirely to doxastic probabilities.
I very much disagree. I think you’re couching this deontological moral stance as something more than the subjective position that it is. I find your morals abhorrent, and your normative statements regarding others’ preferences to be alarming and dangerous.
You can be Dutch booked with preferences too. If you prefer A to B, B to C, and C to A, I can make money off of you by offering a circular trade to you.
Only if I’m unaware that such a strategy is taking place. Even if I were aware, I am a dynamic system evolving in time, and I might be perfectly happy with the expenditure per utility shift.
Unless I was opposed to that sort of arrangement, I find nothing wrong with that. It is my prerogative to spend resources to satisfy my preferences.
That’s exactly the problem—you’d be happy with the expenditure per shift, but every time a full cycle is made, you’d be worse off. If you start out with A and $10, pay me a dollar to switch to B, another dollar to switch to C, and a third dollar to switch back to A, you end up with A and $7, worse off than you started, despite being satisfied with each transaction. That’s the cost of inconsistency.
And 3 utilons. I see no cost there.
But presumably you don’t get utility from switching as such, you get utility from having A, B, or C, so if you complete a cycle for free (without me charging you), you have exactly the same utility as when you started, and if I charge you, then when you’re back to A, you have lower utility.
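To make the money pump concrete, here is a minimal sketch (my illustration; the $10 and $1-per-swap figures come from the exchange above, the code and the helper name money_pump are mine): the agent pays for each swap it locally prefers along the cycle A to B to C to A, and a full cycle returns it to its exact starting state, minus the fees.

```python
# Minimal sketch of the money pump described above (illustrative, not from the thread):
# an agent with cyclic preferences pays $1 for each swap it locally prefers,
# and ends a full cycle back at A with strictly less money.
CYCLE = ["A", "B", "C"]

def money_pump(start_item: str, start_money: float, fee: float, swaps: int):
    item, money = start_item, start_money
    for _ in range(swaps):
        item = CYCLE[(CYCLE.index(item) + 1) % len(CYCLE)]  # the locally "preferred" swap
        money -= fee                                        # price paid for the swap
    return item, money

# Three paid swaps return the agent to A, the exact starting state, but $3 poorer.
print(money_pump("A", 10.0, 1.0, 3))  # ('A', 7.0)
```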
If I have utility in the state of the world, as opposed to the transitions between A, B, and C, I don’t see how it’s possible for me to have cyclic preferences, unless you’re claiming that my utility doesn’t have ordinality for some reason. If that’s the sort of inconsistency in preferences you’re referring to, then yes, it’s bad, but I don’t see how ordinal utility necessitates wireheading.
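A rough sketch of the point in the previous comment (my framing; representable_by_state_utility is a hypothetical helper, not an established API): a utility function over states induces an ordering, so strictly cyclic preferences cannot come from any such function, and checking this is just cycle detection on the "strictly preferred to" graph.

```python
from graphlib import TopologicalSorter, CycleError

def representable_by_state_utility(strict_prefs: list[tuple[str, str]]) -> bool:
    """strict_prefs holds pairs (x, y) meaning x is strictly preferred to y."""
    graph: dict[str, set[str]] = {}
    for better, worse in strict_prefs:
        graph.setdefault(better, set()).add(worse)
        graph.setdefault(worse, set())
    try:
        list(TopologicalSorter(graph).static_order())
        return True   # an ordering exists, so some utility assignment realizes it
    except CycleError:
        return False  # a preference cycle: no utility over states can produce it

print(representable_by_state_utility([("A", "B"), ("B", "C")]))              # True
print(representable_by_state_utility([("A", "B"), ("B", "C"), ("C", "A")]))  # False
```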
Regarding inconsistent preferences, yes, that is what I’m referring to.
Ordinal utility doesn’t by itself necessitate wireheading (for instance, if you are incapable of experiencing pleasure), but if you can experience pleasure, then you should wirehead, because pleasure has the quale of desirability (pleasure feels desirable).
And you think that “desirability” in that statement refers to the utility-maximizing path?
I mean that pleasure, by its nature, feels utility-satisfying. I don’t know what you mean by “path” in “utility-maximizing path”.
Can you define ‘terminal values’, in the context of human beings?
Terminal values are what are sought for their own sake, as opposed to instrumental values, which are sought because they ultimately produce terminal values.
I know what terminal values are and I apologize if the intent behind my question was unclear. To clarify, my request was specifically for a definition in the context of human beings—that is, entities with cognitive architectures with no explicitly defined utility functions and with multiple interacting subsystems which may value different things (i.e., emotional vs. deliberative systems). I’m well aware of the huge impact my emotional subsystem has on my decision making. However, I don’t consider it ‘me’ - rather, I consider it an external black box which interacts very closely with that which I do identify as me (mostly my deliberative system). I can acknowledge the strong influence it has on my motivations whilst explicitly holding a desire that this not be so, a desire which would in certain contexts lead me to knowingly make decisions that would irreversibly sacrifice a significant portion of my expected future pleasure.
To follow up on my initial question, it had been intended to lay the groundwork for this followup: What empirical claims do you consider yourself to be making about the jumble of interacting systems that is the human cognitive architecture when you say that the sole ‘actual’ terminal value of a human is pleasure?
That, upon ideal rational deliberation and with all the relevant information, a person will choose to pursue pleasure as a terminal value.
I’ve got to give it to you: the “human value is not complex” point is frankly aging very well with the rise of LLMs, and one of the miracles is that you can get a reasonably good human value function without very large hacks or complicated code; it’s just learned from the data.
To put it another way, I think this contrarian opinion has been receiving Bayes points compared to a lot of other theories about how complicated human values are.
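One way to cash out “Bayes points” is as cumulative log-score. A toy illustration (the probabilities and theory labels below are invented for the example, not taken from the thread): the theory that assigned higher probability to the outcome we actually observed loses fewer log-points relative to its rival.

```python
import math

def log_score(prob_assigned_to_outcome: float) -> float:
    """Log probability the theory assigned, in advance, to what actually happened."""
    return math.log(prob_assigned_to_outcome)

# Hypothetical prior probabilities each theory might have given to
# "LLMs pick up a reasonable human value function just from training data".
theories = {
    "human value is not complex": 0.6,
    "human value is complex and fragile": 0.1,
}

for name, p in theories.items():
    print(f"{name}: log-score {log_score(p):.2f}")
# The theory with the higher assigned probability scores better,
# i.e. it "receives Bayes points" relative to the alternative.
```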
You just pointed out that what a LLM learned for even a very simple game with extensive clean data turned out to be “a bag of heuristics”: https://www.lesswrong.com/posts/LNA8mubrByG7SFacm/against-almost-every-theory-of-impact-of-interpretability-1?commentId=ykmKgL8GofebKfkCv
Alright, I have a few responses to this:
Contra OthelloGPT, in the case of GPT-4 and the new o1-preview model, the neural networks we are focusing on are deep and fairly wide, which I suspect prevents a lot of the “just find a heuristic” behavior, and I believe Chain-of-Thought scaling biases the model more towards algorithmic solutions and sequential computation over heuristics and parallel computation.
You even mentioned that possibility here too:
https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1#5K5wDoMD2YtSJvfw9
Well, it would certainly be nice if that were true, but all the interpretability research thus far has pointed to the opposite of what you seem to be taking from it. The only cases where neural nets turn out to learn a crisp, clear, extrapolable-out-many-orders-of-magnitude-correctly algorithm, verified by interpretability or formal methods to date, are not deep nets. They are tiny, tiny nets either constructed by hand or trained by grokking (which does not appear to describe any GPT-4 model, and it’s not looking good for their successors either). The bigger, deeper nets certainly get much more powerful and more intelligent, but they appear to be doing so by, well, slapping on ever more bags of heuristics at scale. Which is all well and good if you simply want raw intelligence and capability, but not good if anything morally important hinges on them reasoning correctly for the right reasons, rather than on heuristics which can break when extrapolated far enough or be manipulated by adversarial processes.
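To illustrate the distinction at stake (a toy example of mine, unrelated to the actual OthelloGPT or GPT-4 findings): a memorized lookup table behaves like a “bag of heuristics,” matching the true algorithm on the training range and silently breaking when extrapolated beyond it.

```python
# Toy contrast between memorized heuristics and a general algorithm.
def add_by_lookup(a: int, b: int, table: dict[tuple[int, int], int]) -> int:
    """Heuristic-style solution: correct only where memorized."""
    return table.get((a, b), 0)  # confidently wrong outside the memorized range

def add_by_algorithm(a: int, b: int) -> int:
    """Algorithmic solution: the same rule works at any scale."""
    return a + b

table = {(a, b): a + b for a in range(10) for b in range(10)}
print(add_by_lookup(3, 4, table), add_by_algorithm(3, 4))      # 7 7    (in distribution)
print(add_by_lookup(300, 4, table), add_by_algorithm(300, 4))  # 0 304  (extrapolation breaks the heuristic)
```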
We actually have a resolution for the thread on whether LLMs naturally learn algorithmic reasoning as they scale up with CoT versus just reasoning with memorized bags of heuristics: the answer is that there is both real reasoning, which is indicative of LLMs actually using somewhat clean algorithms, and also a lot of heuristic reasoning involved.
So we both got some things wrong, but also got some things right.
The main thing I got wrong was underestimating how much CoT for current models still involves pretty significant memorization/bags of heuristics to get correct answers, which means I have to raise my estimate of the complexity of human values, given that LLMs didn’t compress as well as I thought. The thing I got right was that sequential computation like CoT does incentivize actual noisy reasoning/algorithms to appear, though I was wrong about the strength of the effect. I was still right to be concerned that the OthelloGPT network was very wide but shallow, rather than deep and wide, which makes it harder to learn the correct algorithm.
The thread is below:
https://x.com/aksh_555/status/1843326181950828753
I wish someone were willing to do this for the o1 series of models as well.
One other relevant comment is here:
https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1#HqDWs9NHmYivyeBGk
A key crux is that I think those heuristics actually go quite far, because it’s much, much easier to learn a quite-close-to-correct model of human values with simple heuristics, and to internalize the values from the training data as the model’s own, than it is to learn useful capabilities. More generally, it’s easier to learn and internalize human values than it is to learn useful new capabilities. So even under a heuristic view of LLMs, where LLMs are basically always learning a bag of heuristics and don’t have actual algorithms, the heuristics for internalizing human values are always simpler than the heuristics for learning capabilities, because it’s easier to generate training data for human values than for any other capability.
See below for relevant points:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
Good point, though, that the claim that current LLMs are definitely learning algorithms rather than just heuristics was not supported very well by the current interpretability results/evidence. I’d argue that o1-preview is mild evidence that we will start seeing more algorithmic/search components used in AIs in the future (though, to be clear, I believe the majority of the success comes from its data being of higher quality, and only fairly little from its runtime search).