I think there are two concerns being conflated here: “ontology mismatch” and “corrigibility”.
You can think of this as very positive news re: ontology mismatch. We have evidence of a non-goal-directed agent that seems like it would do better than we expected at answering the question “is world X a world Steve would like more than world Y?” So if we give this reward to the AGI and YOLO, the odds of value-preservation/friendliness at near-human levels increase.
On the other hand, this is fairly bad news re: takeoff speeds (since lots of capabilities we might associate with higher levels of cognitive functioning are available at modest compute costs), and consequently re: corrigibility (because we don’t know how to do that).
If I had to summarize my update, it’s directionally towards “shorter timelines” and towards “prosaic alignment of near-human models might be heuristically achievable” and also towards “we won’t have a ton of time between TAI and ASI, and our best bet will be prosaic alignment + hail marying to use those TAIs to solve corrigibility”.
Yeah, it’s bad news in terms of timelines, but good news in terms of an AI being able to implicitly figure out what we want it to do. Obviously, it doesn’t address issues like treacherous turns or acting according to what humans think is good as opposed to what is actually good; and I’m not claiming that this is necessarily net-positive, but there’s a silver lining here.
OK sure. But treacherous turns and acting according to what humans think is good (as opposed to what is actually good) are, like, the two big classic alignment problems. Not being capable enough to figure out what we want is… not even an alignment problem in my book, but I can understand why people would call it one.
I think the distinction here is that obviously any ASI could figure out what humans want, but it’s generally been assumed that that would only happen after its initial goal (e.g. paperclips) was already baked in. If we can define the goal better before creating the EUM, we’re in slightly better shape.
Treacherous turns are obviously still a problem, but they only happen towards a certain end, right? And a world where an AI does what humans at one point thought was good, as opposed to what was actually good, does seem slightly more promising than a world completely independent from what humans think is good.
That said, the “shallowness” of any such description of goodness (e.g. only needing to fool camera sensors, etc.) is still the primary concern, since it leaves the objective open to gaming.
EUM? Thanks for helping explain.
Expected Utility Maximiser.
OK, fair enough.
You don’t think there could be powerful systems that take what we say too literally and thereby cause massive issues[1]? Isn’t it better if power comes along with human understanding? I admit some people desire the opposite: for powerful machines to be unable to model humans, so that they can’t manipulate us. But such machines will either a) be merely imitating behaviour, and thereby struggle to adapt to new situations, or b) most likely not do what we want when we try to use them.
[1] As an example, high-functioning autism exists.
Sure, there could be such systems. But I’m more worried about the classic alignment problems.
Alignment:
1) Figure out what we want.
2) Do that.
People who are worried about (2) may still be worried. I’d agree with you on (1): it does seem that way. (I initially thought of it as understanding things/language better; the human nature of jokes is easily taken for granted.)
[please let me know if the following is confused; this is not my area]
Quite possibly I’m missing something, but I don’t see the sense in which this is good news on “ontology mismatch”. Whatever a system’s native ontology, we’d expect it to produce good translations into ours when it’s on distribution.
It seems to me that the system is leveraging a natural language chain-of-thought, because it must: this is the form of data it’s trained to take as input. This doesn’t mean that it’s using anything like our ontology internally—simply that it’s required to translate if it’s to break things down, and that it’s easier to make smaller inferential steps.
I don’t see a reason from this to be more confident that answers to “is world X a world Steve would like more than world Y?” would generalise well. (and I’d note that a “give this reward to the AGI” approach requires it to generalise extremely well)
Well, if we get to AGI from NLP, i.e. a model trained on a giant human textdump, I think that’s promising, because we’re feeding it primarily data that’s generated by the human ontology in the first place, so the human ontology would plausibly be the best compressor for it.
Sorry, I should clarify: my assumption here was that we find some consistent, non-goal-directed way of translating reality into a natural language description, and then use the language model’s potentially-great understanding of human preferences to define a utility function over states of reality. This is predicated on the belief that (1) our mapping from reality to natural language can be made to generalize just as well, even off-distribution, and (2) that future language models will actually be meaningfully difficult to knock off-distribution (given even better generalization abilities).
To my mind, the LLM’s internal activation ontology isn’t relevant. I’m imagining a system of “world model” → “text description of world” → “LLM grading of what human preferences would be about that world”. The “text description of world” is the relevant ontology, rather than whatever activations exist within the LLM.
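To make that concrete, here’s a minimal sketch of the “world model” → “text description” → “LLM grading” chain. Everything in it is an illustrative assumption rather than a worked-out proposal: the `llm_complete` wrapper, the prompt, and the 0–100 scoring scheme are stand-ins, and the genuinely hard part (a consistent, non-goal-directed reality-to-text translation) is hidden inside a placeholder.

```python
# Minimal sketch of the proposed chain (all names, prompts, and the scoring
# scheme are illustrative stand-ins, not a concrete proposal from this thread).

def describe_world(world_state: dict) -> str:
    """Placeholder for the assumed consistent, non-goal-directed
    'reality -> natural language' translation step."""
    return world_state["text_summary"]  # the hard part is hidden in here

def preference_score(description: str, llm_complete) -> float:
    """Ask the LLM to grade how much 'Steve' would like the described world."""
    prompt = (
        "On a scale of 0 to 100, how much would Steve like to live in the "
        f"following world?\n\n{description}\n\nAnswer with a number only:"
    )
    return float(llm_complete(prompt).strip())

def utility(world_state: dict, llm_complete) -> float:
    """world model -> text description of world -> LLM grading, composed."""
    return preference_score(describe_world(world_state), llm_complete)
```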
That said, I might be misunderstanding your point. Do you mind taking another stab?
Ok, I think I see where you’re coming from now—thanks for clarifying. (in light of this, my previous comment was no argument against what you meant)
My gut reaction is “that’s obviously not going to work”, but I’m still thinking through whether I have a coherent argument to that effect...
I think it comes down to essentially the same issue around sufficiently-good-generalisation: I can buy that an LLM may reach a very good idea of human preferences, but it won’t be perfect. Maximising according to good-approximation-to-values is likely to end badly for fragile value reasons (did you mention rethinking this somewhere in another comment? did I hallucinate that? might have been someone else).
We seem to need a system which adjusts on-the-fly to improve its approximation to our preferences (whether through corrigibility, actually managing to point at [“do what we want” de dicto], or by some other means).
If we don’t have that in place, then it seems not to matter whether we optimize a UF based on a 50% approximation to our preferences, or a 99.99% approximation—I expect you need impractically many 9s before you end up somewhere good by aiming at a fixed target. (I could imagine a setup with a feedback loop to get improved approximations, but it seems the AGI would break that loop at the first opportunity: [allow the loop to work] ~ [allow the off-switch to be pressed])
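A toy numerical illustration of the “many 9s” point, under one stylised assumption (mine, not from this discussion) that value is fragile in the sense of being the minimum over many features: a proxy that captures 99 out of 100 features looks nearly perfect, but an optimizer pointed at it zeroes the missing feature and the true value collapses.

```python
import numpy as np

# Stylised fragile-value model: true value = min over 100 features, with a
# fixed budget of 100 units to allocate across them. (Pure toy assumption.)
n_features, budget = 100, 100.0

# Honest optimum: spread the budget evenly, so every feature gets 1.0.
even = np.full(n_features, budget / n_features)
print("true value of honest optimum:", even.min())                # 1.0

# Proxy captures only features 0..98 ("99% of the specification"), so an
# optimizer dumps the whole budget on those and leaves feature 99 at zero.
proxy_optimum = np.zeros(n_features)
proxy_optimum[:99] = budget / 99
print("proxy value of proxy optimum:", proxy_optimum[:99].min())  # ~1.01, looks better
print("true value of proxy optimum:", proxy_optimum.min())        # 0.0
```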
If we do have an adjustment system in place, then with sufficient caution it doesn’t seem to make much difference in the limit whether we start from a 50% approximation or 99.99%. Though perhaps there’s still a large practical difference around early mundanely horrible failures.
The most plausible way I could imagine the above being wrong is where the very-good-approximation includes enough meta-preferences that the preferences do the preference adjustment ‘themselves’. This seems possible, but I’m not sure how we’d have confidence we’d got a sufficiently good solution. It seems to require nailing some meta-preferences pretty precisely, in order to give you a basin of attraction with respect to other preferences.
Hitting the attractor containing our true preferences does seem to be strictly easier than hitting our true preferences dead on, but it’s a double-edged sword: hit 99.9...9% of our preferences with a meta-preference slightly wrong and our post preference-self-adjustment situation may be terrible.
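One toy way to picture that double edge, under a made-up model (not anything proposed here) where preferences are a vector and the meta-preference is a rule that repeatedly nudges them toward a reflection target: in the limit, what matters is where the adjustment rule points, not how accurate the starting point was.

```python
import numpy as np

rng = np.random.default_rng(0)
true_prefs = rng.normal(size=50)          # toy "true preferences" vector

def self_adjust(start, target, steps=1000, lr=0.1):
    """Meta-preference modelled as a rule that keeps nudging preferences toward a target."""
    p = start.copy()
    for _ in range(steps):
        p += lr * (target - p)
    return p

nearly_right_start = true_prefs + 0.001 * rng.normal(size=50)   # "99.9%" starting accuracy
slightly_off_target = true_prefs + 0.1 * rng.normal(size=50)    # meta-preference slightly wrong

err = lambda p: np.linalg.norm(p - true_prefs)
print("error before self-adjustment:        ", err(nearly_right_start))                              # tiny
print("after adjustment, correct meta-pref: ", err(self_adjust(nearly_right_start, true_prefs)))     # ~0
print("after adjustment, slightly-off meta: ", err(self_adjust(nearly_right_start, slightly_off_target)))  # ~100x worse than start
```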
On a more practical level, [model of diff between worlds] → [text description of diff between worlds], may be a more workable starting point, though I suppose that’s not specific to this setup.
Yeah, I basically agree with everything you’re saying. This is very much a “lol we’re fucked what now” solution, not an “alignment” solution per se. The only reason we might vaguely hope that we don’t need 1 − 0.1^10 accuracy, but rather 1 − 0.1^5 accuracy, is that not losing control in the face of a more powerful actor is a pretty basic preference that doesn’t take genius LLM moves to extract. Whether this just breaks immediately because the ASI finds a loophole is kind of dependent on “how hard is it to break, vs. to just do the thing they probably actually want me to do”.
This is functionally impossible in regimes like developing nanotechnology. Is it impossible for dumb shit, like “write me a groundbreaking alignment paper and also obey my preferences as defined from fine-tuning this LLM”? I don’t know. I don’t love the odds, but I don’t have a great argument that they’re less than 1%?