While I agree with the argument that the Superintelligence book paradigm has evidence against it, I see different problems than you do here.
For example on this point:
Notice that Bostrom imagines having to specify a formal criterion for what counts as a solution to one’s programming problem. This seems like a clear relic of a culture in which explicit utility maximizers were by far the most conceivable form advanced AI systems could take, but that’s largely changed by now. You can actually just use natural language to outline the code you want a Claude-like AI system to write, and it will do so, with a remarkable intuitive knack for “doing what you mean” (in contrast to the extreme rigidity of computer programs from decades past).
I think this is less because utility maximization is wrong, or because these systems lack utility functions, and more because LWers greatly overestimated the difficulty of Do-What-I-Mean alignment and radically underestimated how well alignment generalizes. Indeed, I'd claim that alignment generalizes much further than capabilities do, because values are easier to learn than capabilities and verification is easier than generation (illustrated with a toy example below).
I’d also say that one important implication is that our values are massively simpler and less fragile than the LW or evopsych literature suggests, a point I’ll return to later.
So edge instantiation problems were already solved circa 2022-2023, and no one noticed that.
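As a toy illustration of the verification-generation gap mentioned above (my own sketch, not something from the original discussion): checking a proposed subset-sum solution is cheap and linear in its size, while finding one by brute force means searching exponentially many subsets in the worst case.

```python
from collections import Counter
from itertools import combinations

def verify(numbers, target, candidate):
    """Cheap check: candidate is a sub-multiset of numbers that sums to target."""
    return not (Counter(candidate) - Counter(numbers)) and sum(candidate) == target

def generate(numbers, target):
    """Brute-force search over all 2^n subsets -- exponentially more work."""
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums = [3, 9, 8, 4, 5, 7]
found = generate(nums, 15)
print(found, verify(nums, 15, found))  # [8, 7] True
```

The analogy to alignment is that checking whether an output satisfies your values can be much cheaper than producing that output in the first place.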
A similar thing applies here:
Bostrom’s second threat model is that the AI would paperclip the world in the process of trying to produce a solution, for example to get rid of sources of interference, rather than as a side effect of whatever its solution actually is. This would also not come about as a result of a language model relentlessly striving to fulfill a user’s simple, natural-language request for a certain kind of computer program, contra Bostrom’s vision. If deep learning systems could be made to ruthlessly paperclip the universe, it would be as a result of some other failure-mode, like wireheading, mesa-optimization, or extremely malicious prompt engineering. It wouldn’t follow from the user mis-specifying an explicit utility function in the course of trying to use a language model normally.
Here I’d point to this passage:
I’m especially interested in safety research that treats language models as at least somewhat neuromorphic or brain-like. Here’s a whole essay I wrote on that topic; it explores analogies like predictive learning in ML and predictive processing in humans, RL in both types of systems, and consciousness as a type of context window.
If deep learning systems are sufficiently neuromorphic, we might be able to import some insights from human alignment into AI safety research. For example, why don’t most humans like wireheading? If we figured that out, it might help us ensure deep learning systems strive to avoid doing so themselves.
I actually agree with this, and I’d go further: the similarities are significant enough that insights which apply to AI can also be applied to the human brain and its value generators, and vice versa.
One of those insights is that the bitter lesson applies to human values and morals too: how your morals/values transfer to new contexts (OOD generalization) is determined far more by the data than by the algorithm or prior, because the value-learning machinery is a lot simpler and less fragile than LWers and the evolutionary psychology literature thought. More importantly, it can be unified with the algorithms that power capabilities, so the simplicity and effectiveness of capability generalization is shared by alignment generalization.
This also suggests an immediate alignment strategy: train the model on large synthetic datasets about human values, so that it learns them before it is capable of deceptive alignment or severely misaligned behavior, or use those synthetic datasets to extract a robust, much less hackable reinforcement learning value function.
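To make the data-mixing half of that strategy concrete, here is a minimal sketch (my own illustration, with made-up corpus stand-ins and an assumed mixing fraction, not an existing pipeline) of interleaving a synthetic human-values corpus into an ordinary pretraining stream, so value-relevant data is present from the start of training:

```python
import random

def mix_pretraining_stream(capability_docs, values_docs, values_fraction=0.05, seed=0):
    """Interleave a synthetic human-values corpus into a pretraining stream at a
    fixed fraction, so the model sees value-relevant data throughout training
    rather than only at a late fine-tuning stage."""
    rng = random.Random(seed)
    for doc in capability_docs:
        if rng.random() < values_fraction:
            yield rng.choice(values_docs)  # e.g. a curated dialogue about human values
        yield doc

# Toy usage with in-memory stand-ins for the two corpora.
caps = [f"capability document {i}" for i in range(1000)]
vals = ["synthetic dialogue about honesty", "synthetic dialogue about harm avoidance"]
mixed = list(mix_pretraining_stream(caps, vals, values_fraction=0.05))
print(sum(doc.startswith("synthetic") for doc in mixed), "values docs injected")
```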
I have a thread here on some of the similarities, and @Bogdan Ionut Cirstea also pointed out a very important paper which, if its claims hold up, is significant evidence for @johnswentworth’s Natural Abstraction hypothesis.
To be clear, I already think something like the Natural Abstraction hypothesis has quite a bit of evidence, because of the papers on the similarities between the human brain and DL:
https://x.com/SharmakeFarah14/status/1837528997556568523
https://x.com/BogdanIonutCir2/status/1837653632138772760
https://phillipi.github.io/prh/
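For a concrete handle on what “similar representations” means in this line of work, here is a minimal sketch (my own, not taken from the linked papers) of linear CKA, one standard metric for comparing two systems’ activations on the same stimuli:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices.
    X: (n, d1) and Y: (n, d2) are the responses of two systems to the same
    n stimuli. Returns a similarity score in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return num / den

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 64))                    # "system A" activations
rotation = np.linalg.qr(rng.normal(size=(64, 64)))[0]
print(linear_cka(base, base @ rotation))             # ~1.0: same features, different basis
print(linear_cka(base, rng.normal(size=(500, 64))))  # much lower: unrelated features
```

The brain-vs-DL comparison papers use richer variants of this kind of representational comparison, but the basic move is the same: compare responses to shared stimuli while ignoring the particular basis each system uses.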
At a broad level, I think the failure here is less a matter of utility maximization being false, and more a failure of LWers to believe that the bitter lesson applies to human values and morals too, combined with radically overestimating how well the evolutionary psychology literature predicts what human values are, compared to a more bitter-lesson-style view in which the brain is a Universal Learning Machine:
https://www.lesswrong.com/posts/9Yc7Pp7szcjPgPsjf/the-brain-as-a-universal-learning-machine
Could you say more about what you mean by this?
Basically, I mean that the morals/values a human has are by and large a function of their data sources in the world and their compute resources, not of their algorithms/priors/genetics, because those priors are very simple generators of value.
Another way to say it is that if you know their priors/genes, you know very little about their values, but knowing what data they received would give you far more information about what they value.
It seems to me that all the more basic desires (“values”), e.g. the lower layers of Maslow’s hierarchy of needs, are mainly determined by heritable factors, because they are relatively stable across cultures. So presumably you are talking about “higher” values being a function of “data sources in the world”? I.e. of nurture rather than nature?
Another point I’d like to raise is that values (in the sense of desires/goals) are arguably quite different from morals. First, morals are more general than desires. Extraterrestrials could also come up with a familiar theory of, say, preference utilitarianism, while not sharing several of our desires, e.g. for eating chocolate or for having social contacts. Indeed, theories of ethics like utilitarianism or Kantian deontology “abstract away” from specific desires by coming up with more general principles which are independent of concrete things individuals may want. Second, it is clearly possible and consistent for someone (e.g. a psychopath) to want X without believing that X is morally right. Conversely, philosophers arguing for some theory of ethics don’t necessarily adhere perfectly to the principles of this system, in the same way in which a philosopher arguing for a theory of rationality isn’t necessarily perfectly rational himself.
I agree there probably are some heritable values, though my main disagreement here is that I think the set of primitive values is quite a bit smaller than you might expect.
Though be warned: heritability doesn’t actually answer our question, because the way laymen interpret it is pretty wrong:
https://www.lesswrong.com/posts/YpsGjsfT93aCkRHPh/what-does-knowing-the-heritability-of-a-trait-tell-me-in
I probably should have more clearly separated the formal ethical theories people describe, which you call morals, from what their values actually are.
I was always referring to values when I was talking about morals.
You are correct that someone describing a moral theory doesn’t mean they actually agree with or implement it.
I still think that if you had the degree of control over a human that an ML researcher has over an AI today, which is a lot, you could brainwash them into holding ~arbitrary values, and that capability would become the central technology of political and social life.