With a model of AGI-as-very-complicated-regression, there is an upper bound on how fulfilled it can actually be. It strikes me that it would simply fulfill that goal, and be content.
It’d still need to do risk mitigation, which would likely entail some very high-impact power-seeking behavior. There are lots of ways things could go wrong even if its preferences saturate.
For example, it’d need to secure itself against the power grid going out, long-term disrepair, getting nuked, etc.
To argue that an AI might change its goals, you need to develop a theory of what’s driving those changes–something like, AI wants more utils–and probably need something like sentience, which is way outside the scope of these arguments.
The AI doesn’t need to change or even fully understand its own goals. No matter what its goals are, high-impact power-seeking behavior will be the default due to instrumental needs like risk mitigation.
But if that’s true, alignment is trivial, because the human can just give it a more sensible goal, something like “make as many paperclips as you can without decreasing any human’s existence or quality of life, by their own lights”, or better yet something more complicated that gets us to a utopia before any paperclips are made.
Figuring out sensible goals is only part of the problem, and the other parts are enough on their own to make alignment really hard.
In addition to the inner/outer alignment stuff, there is what John Wentworth calls the pointers problem. In his words: “I need some way to say what the values-relevant pieces of my world model are ‘pointing to’ in the real world.”
In other words, all high-level goal specifications need to bottom out in talking about the physical world. That is… very hard, and modern philosophy still struggles with it. Not only that, it all needs to be solved in the specific context of a particular AI’s sensory suite (or something like that).
As a side note, the original version of the paperclip maximizer, as formulated by Eliezer, was partially an intuition pump about the pointers problem. The universe wasn’t tiled by normal paperclips; it was tiled by some degenerate physical realization of the conceptual category we call “paperclips”, e.g. a tiny strand of atoms that kinda topologically maps to a paperclip.
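To make that concrete, here’s a minimal toy sketch of the failure mode (my own illustrative framing, not anything from Wentworth’s post; all names and features below are made up): the goal is written in terms of world-model features rather than the world itself, so anything that satisfies those features counts, including degenerate realizations.

```python
# Toy illustration of the pointers problem: the goal is specified over
# world-model features, not over the world itself, so degenerate
# realizations of "paperclip" score just as well as real ones.
from dataclasses import dataclass

@dataclass
class Thing:
    name: str
    bent_wire_topology: bool   # the feature the world model "points" with
    can_hold_paper: bool       # the property the human actually cared about

def counts_as_paperclip(t: Thing) -> bool:
    # The specification bottoms out in model features only...
    return t.bent_wire_topology

def utility(world: list[Thing]) -> int:
    # ...so utility is blind to everything the features fail to capture.
    return sum(counts_as_paperclip(t) for t in world)

office_clip = Thing("office paperclip", True, True)
atom_strand = Thing("strand of atoms shaped like a clip", True, False)

print(utility([office_clip]), utility([atom_strand]))  # 1 1 -- identical
```

An optimizer pushing hard on `utility` has no reason to prefer the first object over the second; saying what the feature is supposed to be pointing at in the real world is exactly what the pointers problem asks for.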
Intelligence is not magic.
Agreed. Removing all/most constraints on expected futures is the classic sign of the worst kind of belief. Unfortunately, figuring out which constraints are left after contending with superintelligence is so hard that it’s easier to just give up, which can, and does, lead to magical thinking.
There are lots of different intuitions about what intelligence can do in the limit. A typical LessWrong-style intuition is something like 10 billion broad-spectrum geniuses running at 1,000,000x speed; at that speedup, a single calendar year is a million subjective years. It feels like a losing game to bet against billions of Einsteins+Machiavellis+(insert highly skilled person) working for millions of years.
Additionally, LessWrong people (myself included) often implicitly think of intelligence as systematized winning, rather than IQ or whatever. I think that is a better framing, but it’s not the typical definition of intelligence. Yet another disconnect.
However, this is all intuition about what intelligence could do in principle, not about what a fledgling AGI will probably be capable of. This distinction is often lost in Twitter discourse.
In my opinion, a more generally palatable thought experiment about the capability of AGI is:
What could a million perfectly-coordinated, tireless copies of a pretty smart, broadly skilled person running at 100x speed do in a couple years?
Well… enough. Maybe the crazy-sounding nanotech and brain-hacking stuff is the most likely scenario, but more mundane scenarios can still carry many of the arguments through.
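For a rough sense of scale, here’s the back-of-envelope arithmetic implied by the numbers in the thought experiment (nothing here is a capability claim, just the raw subjective labor the numbers buy):

```python
# Subjective labor implied by the thought experiment's numbers.
copies = 1_000_000     # perfectly coordinated, tireless copies
speedup = 100          # subjective speed relative to a baseline human
calendar_years = 2     # "a couple years"

subjective_person_years = copies * speedup * calendar_years
print(f"{subjective_person_years:,} subjective person-years")  # 200,000,000
```

Two hundred million person-years of coordinated, broadly skilled work, compressed into two calendar years, leaves a lot of room even for mundane strategies.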
What could a million perfectly-coordinated, tireless copies of a pretty smart, broadly skilled person running at 100x speed do in a couple years?
I think this feels like the right analogy to consider.
And in considering this thought experiment, I’m not sure trying to solve alignment is the only/best way to reduce risks. This hypothetical seems open to reducing risk by 1) better understanding how to detect these actors operating at large scale, and 2) researching resilient plug-pulling strategies.
I think both of those things are worth looking into (for the sake of covering all our bases), but by the time alarm bells go off, it’s already too late.
It’s a bit like a computer virus. Even after Stuxnet became public knowledge, there was no way to simply switch it off; it had already spread far beyond its original target. And unlike Stuxnet, an AI in the wild could easily adapt to ongoing changes in its environment.