We can argue over the hidden complexity of wishes, but it’s very obvious that there’s at least a good chance the populace would survive, so long as humans are the ones giving the AGI its goal.
Quite likely, depending on how you specify goals relating to humans[1], though it could wind up quite dystopic due to that hidden complexity.
Here, we arrive at the second argument. AGI will understand its own code perfectly, and so be able to “wirehead” by changing whatever its goals are so that they can be maximized to an even greater extent.
I don’t think wireheading is a common argument in fact? It doesn’t seem like it would be a crux issue on doom probability, anyway. Self-modification on the other hand is very important since it could lead to expansion of capabilities.
It strikes me that it would simply fulfill that goal, and be content.
I think this is a valid point—it will by default carry out something related to the goals it is trained to do[2], albeit with some mis-specification and mis-generalization or whatever, and I agree that mesa-optimizers are generally overrated. However I don’t think the following works to support that point:
If we are talking about an opaque black box, how can you be >90% confident about what it contains?
I think Eliezer would turn that around on you and ask how you are so confident that the opaque black box has goals that fall into the narrow set that would work out for humans, when a vastly larger set would not?
There is another, more important, objection here. So far, we have talked about “tiling the universe” and turning human atoms into GPUs as though that’s easily attainable given enough intelligence. I highly doubt that’s actually true. Creating GPUs is a costly, time-consuming task. Intelligence is not magic.

Eliezer writes that he thinks a superintelligence could “hack a human brain” and “bootstrap nanotechnology” relatively quickly. This is an absolutely enormous call and seems very unlikely. You don’t know that human brains can be hacked using VR headsets; it has never been demonstrated that it’s possible and there are common sense reasons to think it’s not. The brain is an immensely complicated, poorly-understood organ. Applying a lot of computing power to that problem is very unlikely to yield total mastery of it by shining light in someone’s eyes.

Nanotechnology, which is basically just moving around atoms to create different materials, is another thing that he thinks compute is definitely able to just solve and be able to recombine atoms easily. Probably not. I cannot think of anything that was invented by a very smart person sitting in an armchair considering it. Is it possible that over years of experimentation like anyone else, an AGI could create something amazingly powerful? Yes. Is that going to happen in a short period of time (or aggressively all at once)? Very unlikely.

Eliezer says he doesn’t think intelligence is magic, and understands that it can’t violate the laws of physics, but seemingly thinks that anything that humans think might potentially be possible but is way beyond our understanding or capabilities can be solved with a lot of intelligence. This does not fit my model of how useful intelligence is.
Intelligence is not magic? Tell that to the chimpanzees...
We don’t really know how much room there is for software-level improvement; if it’s large, self-improvement could create far superhuman capabilities on existing hardware. And with great intelligence comes great capabilities:
- It will be superhumanly good at persuading humans, even if that doesn’t lead to exactly “hack a human brain”
- I think at least a substantial minority of humans might side with even an openly misaligned AI if they are convinced it will win. Through higher bandwidth and unified command, the AI would be able to coordinate its supporters much better than the opponents can coordinate, and it could actively disrupt or subvert nominally opposing organizations through its agents
- Regarding experimentation, an AI may be able to substitute simulation. Its imagination need not be constrained by a human’s meagre working memory
- These are just a few examples that mere human-level intelligence can think of. A superintelligence will likely have more options: ones it can think of that I haven’t
Moreover, even if these things don’t work that way and we get a slow takeoff, that doesn’t necessarily save humanity. It just means that it will take a little longer for AI to be the dominant form of intelligence on the planet. That still sets a deadline to adequately solve alignment.
As I said before, I’m very confused about how you get to >90% chance of doom given the complexity of the systems we’re discussing.
As alluded to before, there are more ways for the AI to kill us than not to kill us.
My own doom percentage is lower than this, though not because of any disagreement with >90% doomers that we are headed for doom (at least dystopia, if not extinction) if capabilities continue to advance without alignment theory also doing so. I just think the problems are soluble.
III. The way the material I’ve interacted with is presented will dissuade many, probably most, non-rationalist readers
I think that this leads to the conclusion that some 101-level version could be made, and promoted for outreach purposes rather than the more advanced stuff. But that depends on outreach actually occurring—we still need to have the more advanced discussions, and those will provide the default materials if the 101-stuff doesn’t exist or isn’t known.
Further, I think the whole “>90%” business is overemphasized by the community. It would be more believable if the argument were watered down into, “I don’t see how we avoid a catastrophe here, but there are a lot of unknown unknowns, so let’s say it’s 50 or 60% chance of everyone dying”. This is still a massive call, and I think more in line with what a lot of the community actually believes. The emphasis on certainty-of-doom as opposed to just sounding-the-alarm-on-possible-doom hurts the cause.
Yes, I do think that’s more in line with what a lot of the community actually believes, including me. But, I’m not sure why you’re saying in that case that “the community” overemphasizes >90%? Do you mean to say, for example, that certain members of the community (e.g. Eliezer) overemphasize >90%, and you think that those members are too prominent, at least from the perspective of outsiders?
I think, yes, perhaps Eliezer could be a better ambassador for the community, or it would be better if someone else better suited to that role took it on more. I don’t know if this is a “community” issue though?
I think Eliezer might be imagining that everything including goals relating to humans would ultimately be defined in relation to fundamental descriptions of the universe, because Solomonoff or something, and I would think such a definition would lead to certain doom unless unrealistically precise.
But IMO things like human values will have a large influence on an AI’s input data, such that the AI should likely naturally abstract them (“grounding” in the input data but not necessarily in fundamental descriptions), so humans can plug in to those abstractions either directly or indirectly. I think it should be possible to safeguard against the AI redefining these abstractions under self-modification in terms that would undermine satisfying the original goals, and in any case I am skeptical that an optimal limited-compute Solomonoff approximator defines everything only in terms of fundamental descriptions at achievable levels of compute. Thus, I agree more with you than with my imagining of Eliezer on this point. But maybe I am mis-imagining Eliezer.
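For readers unfamiliar with the reference: “because Solomonoff” gestures at Solomonoff induction, which weights every hypothesis (program) for the data by its description length on a universal machine. Roughly, and this is just the textbook definition rather than anything specific to Eliezer’s or my framing:

\[
M(x) \;=\; \sum_{p \,:\, U(p) \text{ begins with } x} 2^{-|p|},
\]

where \(U\) is a universal prefix machine and the sum ranges over programs \(p\) whose output begins with \(x\). The worry above is that goals defined through such program-level (“fundamental”) descriptions may not line up with the human-level abstractions we actually care about unless specified with unrealistic precision.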
A potentially crux-y issue that I also note is that Eliezer, I think, thinks we are stuck with what we get from the initial definition of goals in terms of human values, due to consequentialism (in his view) being a stable attractor. I think he is wrong on consequentialism[3] (about the attractor part, or at least the size of its attractor basin, though the stable part is right) and that self-correcting alignment is feasible.
However I do have concerns about agents arising from mostly tool-ish AI, such as:
- takeover of a language model by agentic simulacra
- a person uses a language model’s coding capabilities to make a bootstrapped agent (a sketch of what I mean follows this list)
- a person asks an oracle AI what it could do to achieve some effect in the world, and its response includes insufficiently sanitized raw output (such as a rewrite of its own code) that achieves that
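To make the second concern concrete, here is a minimal sketch, entirely my own illustration rather than anything from the original discussion, of how a tool-like language model can be wrapped into a “bootstrapped agent”: a loop that asks the model for the next action and then executes it. `query_model` is a hypothetical placeholder, not any real API.

```python
import subprocess

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to some language model; returns text."""
    raise NotImplementedError  # placeholder only; no particular model assumed

def run_agent(goal: str, max_steps: int = 10) -> list[tuple[str, str]]:
    """Repeatedly ask the model for a shell command toward `goal`, then run it."""
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Steps so far: {history}\n"
            "Reply with the next shell command to run, or DONE if finished."
        )
        action = query_model(prompt).strip()
        if action == "DONE":
            break
        # This execution step is what turns a passive text predictor into an
        # agent acting on the world; the loop, not the model, supplies the agency.
        result = subprocess.run(action, shell=True, capture_output=True, text=True)
        history.append((action, result.stdout[:500]))
    return history
```

Nothing about the model itself has to change for this to happen; the agency comes from the scaffolding a person wraps around it.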
Note that these are downstream, not upstream, of the AIs fulfilling their intended goals. I’m somewhat less concerned about agents arising upstream, or about direct unintended agentification at the level of the original goals, but note that agentiness is something people will be pushing for, for capability reasons, and once a self-modifying AI is expressing agentiness at one point in time it will tend to self-modify, if it can, to follow that objective more consistently.
And by consequentialism, I really do mean consequentialism (goals directed at specific world states) and not utility functions, which are often conflated with consequentialism in this community. Non-consequentialist utility functions are fine in my view! Note that the VNM theorem has the form (consequentialism (+ rationality) → utility function) and does not imply that consequentialism is rational.
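Schematically (my paraphrase of the standard statement, not a formalization from the comment):

\[
\underbrace{\text{preferences over lotteries on outcomes}}_{\text{consequentialism}}
\;+\;
\underbrace{\text{completeness, transitivity, continuity, independence}}_{\text{rationality axioms}}
\;\Longrightarrow\;
\exists\, u:\ A \succeq B \iff \mathbb{E}[u(A)] \ge \mathbb{E}[u(B)].
\]

The implication only runs left to right: the theorem takes a ranking of outcome-lotteries as a premise, so it cannot establish that an agent must be a consequentialist in the first place.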
Moreover, even if these things don’t work that way and we get a slow takeoff, that doesn’t necessarily save humanity. It just means that it will take a little longer for AI to be the dominant form of intelligence on the planet. That still sets a deadline to adequately solve alignment.
If a slow takeoff is all that’s possible, doesn’t that open up other options for saving humanity besides solving alignment?
I imagine far more humans will agree p(doom) is high if they see AI isn’t aligned and it’s growing to be the dominant form of intelligence that holds power. In a slow takeoff, people should be able to realize this is happening and effect non-alignment-based solutions (like bombing compute infrastructure).
Intelligence is indeed not magic. None of the behaviors that you display that are more intelligent than a chimpanzee’s behaviors are things you have invented. I’m willing to bet that virtually no behavior that you have personally come up with is an improvement. (That’s not an insult, it’s simply par for the course for humans.) In other words, a human is not smarter than a chimpanzee.
The reason humans are able to display more intelligent behavior is because we’ve evolved to sustain cultural evolution, i.e., the mutation and selection of behaviors from one generation to the next. All of the smart things you do are a result of that slow accumulation of behaviors, such as language, counting, etc., that you have been able to simply imitate. So the author’s point stands that you need new information from experiments in order to do something new, including new kinds of persuasion.