I’m not sure if I fall into the bucket of people you’d consider this to be an answer to. I do think there’s something important in the region of LLMs that, by vibes if not explicit statements of contradiction, seems incompletely propagated in the agent-y discourse even though it fits fully within it. I think I at least have a set of intuitions that overlap heavily with some of the people you are trying to answer.
In case it’s informative, here’s how I’d respond to this:
Well, I claim that these are more-or-less the same fact. It’s no surprise that the AI falls down on various long-horizon tasks and that it doesn’t seem all that well-modeled as having “wants/desires”; these are two sides of the same coin.
Mostly agreed, with the capability-related asterisk.
Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one’s plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.
Agreed in the spirit that I think this was meant, but I’d rephrase this: a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target will tend to be better at reaching that target than a system that doesn’t.
That’s subtly different from individual systems having convergent internal reasons for taking the same path. This distinction mostly disappears in some contexts, e.g. selection in evolution, but it is meaningful in others.
If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I’ll say it “wants” that outcome “in the behaviorist sense”.
I think this frame is reasonable, and I use it.
it’s a little hard to imagine that you don’t contain some reasonably strong optimization that strategically steers the world into particular states.
Agreed.
that the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular.
Agreed.
“AIs need to be robustly pursuing some targets to perform well on long-horizon tasks”, but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist-goal is very unlikely to be the exact goal the programmers intended, rather than (e.g.) a tangled web of correlates.
Agreed for a large subset of architectures. Any training involving the equivalent of extreme optimization for sparse/distant reward in a high dimensional complex context seems to effectively guarantee this outcome.
So, maybe don’t make those generalized wrench-removers just yet, until we do know how to load proper targets in there.
Agreed, don’t make the runaway misaligned optimizer.
I think there remains a disagreement hiding within that last point, though. I think the real update from LLMs is:
We have a means of reaching extreme levels of capability in systems that don’t necessarily exhibit preferences over external world states. Such preferences can be elicited, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn’t route through the world.
It’s remarkably easy to elicit this form of extreme capability and use it to guide itself. This isn’t some incidental detail; it arises from the core process that the model learned to implement.
That core process is learned reliably because the training process that yielded it leaves no room for anything else. It’s not a sparse/distant reward target; it is a profoundly constraining and informative target.
In other words, a big part of the update for me was in having a real foothold on loading the full complexity of “proper targets.”
I don’t think what we have so far constitutes a perfect and complete solution: the nice properties could be broken, paradigms could shift and blow up the golden path, it doesn’t rule out doom, and so on. But diving deeply into this has made many convergent-doom paths appear dramatically less likely to Late2023!porby than to Mid2022!porby.
So, I agree with most of your points Porby, and like your posts and theories overall… but I fear that the path towards a safe AI you outline is not robust to human temptation. I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and it also seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And I do think that these things are the case. I think that a carefully designed and protected secret research group with intense oversight could follow your plan, and that if they do, there is a decent chance that your plan works out well. I think that a mish-mash of companies and individual researchers acting with little effective oversight will almost certainly fall off the path, and that even having most people adhering to the path won’t be enough to stop catastrophe once someone has defected.
I also think that misuse can lead more directly to catastrophe, through e.g. terrorists using a potent goal-agnostic AI to design novel weapons of mass destruction. So in a world with increasingly potent and unregulated AI, I don’t see how to have much hope for humanity.
And I also don’t see any easy way to do the necessary level of regulation and enforcement. That seems like a really hard problem. How do we prevent ALL of humanity from defecting when defection becomes cheap, easy-to-hide, and incredibly tempting?
While this probably isn’t the comment section for me to dump screeds about goal agnosticism, in the spirit of making my model more legible:
I think that if it is easy and obvious how to make a goal-agnostic AI into a goal-having AI, and it also seems like doing so will grant tremendous power/wealth/status to anyone who does so, then it will get done. And I do think that these things are the case.
Yup! The value I assign to goal agnosticism—particularly as implemented in a subset of predictors—is in its usefulness as a foundation to build strong non-goal agnostic systems that aren’t autodoomy. The transition out of goal agnosticism is not something I expect to avoid, nor something that I think should be avoided.
I think that a mish-mash of companies and individual researchers acting with little effective oversight will almost certainly fall off the path, and that even having most people adhering to the path won’t be enough to stop catastrophe once someone has defected.
I’d be more worried about this if I thought the path was something that required Virtuous Sacrifice to maintain. In practice, the reason I’m as optimistic (nonmaximally pessimistic?) as I am is that I think there are pretty strong convergent pressures to stay on something close enough to the non-autodoom path.
In other words, if my model of capability progress is roughly correct, then there isn’t a notably rewarding option to “defect” architecturally/technologically that yields greater autodoom.
With regard to other kinds of defection:
I also think that misuse can lead more directly to catastrophe, through e.g. terrorists using a potent goal-agnostic AI to design novel weapons of mass destruction. So in a world with increasingly potent and unregulated AI, I don’t see how to have much hope for humanity.
Yup! Goal agnosticism doesn’t directly solve misuse (broadly construed), which is part of why misuse makes up ~80% of my p(doom).
And I also don’t see any easy way to do the necessary level of regulation and enforcement. That seems like a really hard problem. How do we prevent ALL of humanity from defecting when defection becomes cheap, easy-to-hide, and incredibly tempting?
If we muddle along deeply enough into a critical risk period slathered in capability overhangs that TurboDemon.AI v8.5 is accessible to every local death cult and we haven’t yet figured out how to constrain their activity, yup, that’s real bad.
Given my model of capability development, I think there are many incremental messy opportunities to act that could sufficiently secure the future over time. Given the nature of the risk and how it can proliferate, I view it as much harder to handle than nukes or biorisk, but not impossible.