There is a small but possibly significant caveat to using large language models to resolve this problem: they operate only on text, on descriptions of behaviour and goals. If we used this approach to get an AI to learn human values, we would need to ensure that the textual symbols were actually grounded. It does us little good if the AI has a great textual understanding of “ensure human flourishing”, but doesn’t mean the same thing as us by “human” and “flourishing”.
I don’t think there’s actually an asterisk. My naive/uninformed opinion is that the idea that LLMs don’t actually learn a map of the world is very silly.
Language is a model of our underlying reality; “dogs are mammals” occurs more frequently in text than “dogs are reptiles” because dogs are in actuality mammals. That statistical feature of text corresponds to an empirical feature of underlying reality. I tend to think language is actually a pretty rich model of the world humans inhabit and interact with.
I expect symbol grounding to be basically a non-problem for sufficiently capable LLMs (I’m not even sure it’s a significant hurdle for current LLMs).
I think sufficiently powerful LLMs trained on humanity’s text corpus will learn rich and comprehensive models of human values.
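To make the “statistical feature of text” point concrete, here’s a minimal illustrative sketch. It assumes the Hugging Face transformers library and the small off-the-shelf GPT-2 checkpoint (not any particular system under discussion), and compares the average per-token log-probability a language model assigns to a true statement versus a false one.

```python
# Illustrative sketch: a language model should assign higher probability
# to "Dogs are mammals." than to "Dogs are reptiles.", because the text
# it was trained on reflects the underlying fact.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def avg_log_prob(text: str) -> float:
    # With labels=input_ids the model returns the mean cross-entropy
    # (negative log-likelihood) over predicted tokens.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item()

print(avg_log_prob("Dogs are mammals."))   # expected: higher
print(avg_log_prob("Dogs are reptiles."))  # expected: lower
```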
And then there’s the problem that we don’t have a definition of “human” and “flourishing” across all future situations and scenarios. We need the AI to extrapolate these concepts similarly to how we would, and not fall into dangerous edge cases.
Insomuch as the concepts we wish to extrapolate are natural abstractions, they should extrapolate well.
Again, I perhaps naively don’t expect this to be a significant hurdle in practice.
I recognise that “these problems will be easy” isn’t necessarily very valuable feedback. But I do think the case that we should expect them to be genuine hurdles is neither obvious nor clearly established.
I largely agree, though of course even human language use leaves many subtle nuances of words like “flourishing” underspecified.
If anything, language seems more useful than other modalities for learning about how the real world works. E.g., current video models completely fail to grasp basic physical intuitions that text-davinci-003 nails just fine.
Yeah, but that’s because we don’t have a unified concept of flourishing. There’s probably an intersection of popular flourishing concepts, or “the minimal latents of flourishing”, etc., but “flourishing” does mean different things to different people.
That’s why I think paretopia is a better goal than utopia.
(Pareto improvements that everyone can get behind.)
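(As a toy illustration of that criterion, with made-up utility numbers: an outcome is a Pareto improvement if nobody is worse off and at least one person is strictly better off.)

```python
# Toy sketch of the Pareto-improvement criterion, with made-up numbers.
def is_pareto_improvement(utils_before, utils_after):
    """True if no one is worse off and someone is strictly better off."""
    pairs = list(zip(utils_before, utils_after))
    return all(b >= a for a, b in pairs) and any(b > a for a, b in pairs)

status_quo = [3, 5, 2]   # hypothetical per-person utilities
proposal   = [4, 5, 2]   # one person gains, nobody loses
print(is_pareto_improvement(status_quo, proposal))  # True
```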
I don’t think there’s actually an asterisk. My naive/uninformed opinion is that the idea that LLMs don’t actually learn a map of the world is very silly.
The algorithm might have a correct map of the world, but if its goals are phrased in terms of words, there will be pressure to push those words away from their correct meanings. “Ensure human flourishing” is much easier if you can slide those words towards other meanings.
This is only the case if the system that is doing the optimisation is in control of the system that provides the world model/does the interpretation. Language models don’t seem to have an incentive to push words away from their correct meanings. They are not agents and don’t have goals beyond their simulation objective (insomuch as they are “inner aligned”).
If the system that’s optimising for human goals doesn’t control the system that interprets said goals, I don’t think an issue like this will arise.
If the system that’s optimising is separate from the system that provides the linguistic interpretation, then there’s a huge issue with the optimising system manipulating or fooling the linguistic system: another kind of “symbol grounding failure”.
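For concreteness, here’s a schematic sketch of the separation being discussed, and of the worry just raised. propose and evaluate are hypothetical placeholders (a plan generator and a frozen goal-interpreting model), not real APIs; the point is that the optimiser never touches the interpreter’s weights, yet strong search against it can still exploit its misinterpretations.

```python
# Schematic sketch: the optimiser searches over plans; a frozen evaluator
# (e.g. a language model interpreting "ensure human flourishing") scores them.
from typing import Callable, List

def choose_plan(propose: Callable[[], str],
                evaluate: Callable[[str], float],
                n_candidates: int = 100) -> str:
    # The optimiser cannot rewrite the evaluator; it only selects among
    # candidate plans. But with enough candidates, selection pressure
    # favours plans that exploit the evaluator's misreadings of the goal,
    # i.e. the "fooling the linguistic system" failure described above.
    candidates: List[str] = [propose() for _ in range(n_candidates)]
    return max(candidates, key=evaluate)
```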