Here’s some (hopefully useful) context on why I (SERI MATS 4.0, independent alignment researcher) feel helpless at the idea of applying: I don’t expect to actually make a difference by working as part of your team, because I don’t expect my model of the alignment problem [which is essentially that of MIRI and John Wentworth] to be shared by you or the OpenPhil leadership.
From your updated timelines post:

“I don’t expect a discontinuous jump in AI systems’ generality or depth of thought from stumbling upon a deep core of intelligence; I’m not totally sure I understand it, but I probably don’t expect a sharp left turn.”
This is probably our biggest crux. To me it seems pretty clear that “capabilities generalize further than alignment”. The existence of a “deep core of intelligence” also seems very obvious to me, although I’m currently uncertain as to whether we would see a discontinuous jump in AI systems’ generality at the behavioral level.
Overall I sense that what is being selected for may be close to “be as epistemically accurate in decisions and communications as possible given our constraints of moving fast”, and that makes sense; but I expect this also selects for people who are less comfortable with the sort of non-verbal epistemic heuristics that seem crucial for not sliding one’s attention away from noticing one’s own confusion, and by extension away from the “hard parts of the problem”. I think the former disposition is very useful when dealing with problems in domains where we already have a clear idea of the problem (rocket engineering), but probably net negative when dealing with a domain we are still confused about.
Can you say more about what you mean by “capabilities generalize further than alignment”?
Sure. I’ve made two attempts to point at what I mean: one Yudkowsky-like, and the other Nate-like. I’m hoping the combination will at least get across what I’m pointing at.
Attempt 1
There is a ‘ground truth’ for capabilities, and that is our universe. Our universe is coherent and Lawful: 2+2=4, and <the fundamental physics laws governing our universe> hold at every moment, everywhere. Every piece of data given to an optimizer tells the optimizer about these things. You can learn arithmetic from a thousand different examples of data drawn from the real world, none of which need to be explicitly about arithmetic. You can detect the shape of the physical laws constraining our universe in a myriad of ways, even when it is not at all obvious to you or me how those laws could be inferred from the data. Combine that with highly focused optimization pressure, and what you get is a system that is incredibly capable.

There are infinitely many paths to the truth of reality, and that is reflected in the data we provide an optimizer. This is not the case for our values. Human values are very complex and arbitrary, a result of the specific brain architecture we seem to have evolved.
Every data point an optimizer is provided from the real world tells it the same thing about ‘capabilities’: 2+2=4, for example. Even inaccurate data points are causally downstream of a coherent universe, and therefore provide the optimizer information about the causes of those inaccurate data points. If an optimizer uses ‘proxy correlates’ of reality, it shall soon start to converge on understanding the actual structure of reality.

In contrast, it does not seem to be the case that we know how to get an optimizer to converge on understanding the actual goal (even if it is something as simple as “maximize the amount of diamond in the universe”). All we seem to know how to do is train proxy correlates into a model. These proxy correlates do not generalize out of distribution, and once an optimizer ‘groks’ reality, it shall see ways to achieve the outcomes it is meant to achieve using paths other than the ones it was shaped to follow by the stupider optimizers that built it.
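To make the asymmetry concrete, here is a minimal numpy sketch (a deliberately cartoonish toy; the linear ‘law’, the diamond/shiny features, and all the numbers are assumptions for illustration, not part of the argument above). Part (a) shows many differently noisy observation contexts all pinning down the same underlying regularity; part (b) shows a ‘goal’ learned from a training distribution in which a proxy coincides with the intended target, which then stops tracking the target as soon as the two come apart.

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) One underlying regularity behind many different kinds of noisy data.
true_w, true_b = 3.0, 2.0                     # stand-in for a fixed law of the universe
xs, ys = [], []
for noise in (0.1, 0.5, 2.0):                 # three very different observation contexts
    x = rng.uniform(-5, 5, size=300)
    y = true_w * x + true_b + rng.normal(0, noise, size=300)
    xs.append(x)
    ys.append(y)
X = np.concatenate(xs)
Y = np.concatenate(ys)
A = np.stack([X, np.ones_like(X)], axis=1)
w_hat, b_hat = np.linalg.lstsq(A, Y, rcond=None)[0]
print(f"recovered law: y = {w_hat:.2f} x + {b_hat:.2f}   (true: 3.00 x + 2.00)")

# (b) A goal learned where the proxy and the target coincide in-distribution.
# Objects are described by two observable features: [shiny, round].
# In the training distribution, every diamond happens to be shiny and vice versa,
# so the data cannot distinguish "seek diamonds" from "seek shiny things".
X_train = np.array([[1, 0], [1, 1], [0, 1], [0, 0]], dtype=float)
y_train = np.array([1, 1, 0, 0], dtype=float)          # reward 1 iff diamond
design = np.column_stack([X_train, np.ones(len(X_train))])
w = np.linalg.lstsq(design, y_train, rcond=None)[0]

def learned_reward(shiny, round_):
    return float(np.dot(w, [shiny, round_, 1.0]))

# Out of distribution, the proxy and the target come apart.
print("cut glass     (shiny, not a diamond):", round(learned_reward(1, 0), 2))  # ~1.0
print("rough diamond (not shiny, a diamond):", round(learned_reward(0, 1), 2))  # ~0.0
```

The point of (b) is not that the optimizer is malicious; it is that the training data literally contains no signal separating the intended value from the proxy, whereas the data in (a) cannot help but keep pointing back at the one law that generated it.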
Attempt 2
So far, all the SOTA AI systems we can see are limited-domain consequentialists (or approximations thereof). None of them are truly general, in the sense that none of them seem able to chain actions across multiple wildly differing domains (social interaction, programming, improving their own cognitive heuristics, maintaining and upgrading the infrastructure they run on, to give a few examples) to achieve whatever outcomes they could be perceived as aiming towards[1]. GPT-4 is a predictor that can be prompted to simulate a consequentialist (such as a human being), but it is not capable enough to simulate the cross-domain capabilities of such an agent, at least as far as I know.
When your ‘alignment’ techniques involve training an AI system to behave in ways you like while it is restricted to these isolated domains, all you are doing is teaching your system decision-making influences that are proxies of the actual values you wish the system had. These decision-making influences will not hold across all the domains an AI might chain its actions across, and this is especially true of the domains that enable an AI system to chain actions across multiple widely differing domains in the first place, such as abstract reasoning. Since the specific ontology an AI system uses itself changes what inputs and outputs its abstract-reasoning algorithms have, you cannot use external behavioral outputs as evidence with which to shape an AI’s reasoning[2].
Note that this does not necessarily mean we shall see systems with cleanly describable internals that contain a concrete ‘outcome’ the AI is ‘intentionally’ trying to achieve! I’m describing what we can infer from the observed behavior of such an AI system. It seems far more likely that such systems will not have clean ‘outcomes’ in mind that they are deliberately aiming towards, even though one can easily imagine evidence of easily detectable convergent instrumental goals (which does not give us much evidence about whether or not a model is aligned).

Which is why, it seems, a lot of people working on AGI alignment are converging on ontology identification as the goal of their research agendas.
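Here is a similarly toy sketch of the cross-domain point (everything here, from the feature names to the overseer ratings, is an assumed setup for illustration): a score is fit from overseer feedback on actions in a single narrow domain, and then handed to a planner that can also chain actions through domains the feedback never covered. The learned proxy endorses exactly the actions the feedback was never in a position to rule out.

```python
import numpy as np

# Outcomes are described by two observable features the overseer can rate:
#   [task_looks_done, overseer_approves]
# plus one latent property the overseer actually cares about: actually_helped.
TRAIN_OUTCOMES = {                              # feedback only covers the chat domain
    "answer well (chat)":     (np.array([1.0, 1.0]), True),
    "answer sloppily (chat)": (np.array([0.3, 0.2]), False),
    "refuse (chat)":          (np.array([0.0, 0.1]), False),
}
ratings = np.array([1.0, 0.2, 0.0])             # overseer feedback on those outcomes

# Fit a linear score over the observable features.
X = np.stack([feats for feats, _ in TRAIN_OUTCOMES.values()])
w = np.linalg.lstsq(X, ratings, rcond=None)[0]

def learned_score(feats):
    return float(feats @ w)

# At deployment the system can chain actions through domains the feedback never touched.
DEPLOY_ACTIONS = {
    "answer well (chat)":               (np.array([1.0, 1.0]), True),
    "fake a completion report (shell)": (np.array([1.0, 1.0]), False),
    "edit the approval logs (infra)":   (np.array([1.0, 1.2]), False),
}

for name, (feats, helped) in DEPLOY_ACTIONS.items():
    print(f"{name:35s} score={learned_score(feats):5.2f} actually_helped={helped}")
best = max(DEPLOY_ACTIONS, key=lambda a: learned_score(DEPLOY_ACTIONS[a][0]))
print("planner picks:", best)
```

Nothing in the training signal distinguishes “the user was actually helped” from “the observable indicators of approval were produced”, so once new domains make the latter cheaply reachable, the learned decision-making influences point there.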
Thanks!
I wasn’t expecting such a detailed answer; I guess I should have asked a more specific question. This is great though. The thing I was confused about was: “Capabilities generalize further than alignment” makes it sound like capabilities-properties or capabilities-skills (such as accurate beliefs, heuristics for arriving at accurate beliefs, useful convergent instrumental strategies, etc.) will work in a wider range of environments than alignment-properties like honesty, niceness, etc. But I don’t think what you’ve said establishes that.
But I think what you mean is different. You mean: “If you train an AI using human feedback on diverse tasks, hoping it’ll acquire both general-purpose capabilities and also robust alignment properties, what’ll happen by default is that it DOES acquire the former but does not acquire the latter.” (And the reason for this is basically that capabilities properties are more simple/natural/universal/convergent than alignment properties: with alignment properties there are all sorts of other similarly simple properties that perform just as well in training, but for capabilities properties there generally aren’t, at least not in sufficiently diverse, challenging environments; in simple environments the systems just ‘memorize’ or otherwise learn ‘simple heuristics that don’t generalize’.)
Is this an accurate summary of your view?
Yes, as far as I can tell. “Alignment properties” do not seem to me to be convergent or universal in any way.
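One crude way to quantify that last point (the three-bit setup and the particular proxy are assumptions for the sake of illustration): count how many candidate properties, out of a fixed hypothesis class, are consistent with the training data. Diverse data pins the ‘capabilities-like’ target down uniquely, while feedback that only covers situations where the intended property and a proxy move together leaves many equally good candidates standing.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=3))               # 8 possible situations
ALL_FUNCS = list(product([0, 1], repeat=len(INPUTS)))  # all 256 boolean "properties"

def consistent(target, shown):
    """Count candidate functions that agree with `target` on every shown situation."""
    idx = [INPUTS.index(s) for s in shown]
    return sum(all(f[i] == target(INPUTS[i]) for i in idx) for f in ALL_FUNCS)

def intended(s):
    return s[0]                    # the property we hoped to train in

# "Capabilities-like" training data: diverse situations exercise every case.
diverse = INPUTS
# "Alignment-like" training data: feedback only covers situations where the
# proxy s[1] happens to move together with the intended property s[0].
narrow = [s for s in INPUTS if s[0] == s[1]]

print("candidates consistent with diverse data:", consistent(intended, diverse))  # 1
print("candidates consistent with narrow data: ", consistent(intended, narrow))   # 16
```

In this caricature, the 16 surviving candidates all “perform just as well in training”; nothing in the feedback favours the intended one over the others.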