However, I don’t view safe tiling as the primary obstacle to alignment. Constructing even a modestly superhuman agent which is aligned to human values would put us in a drastically stronger position and currently seems out of reach. If necessary, we might like that agent to recursively self-improve safely, but that is an additional and distinct obstacle. It is not clear that we need to deal with recursive self-improvement below human level.
I am not sure that treating recursive self-improvement via tiling frameworks is necessarily a good idea, but setting this aspect aside, one obvious weakness of this argument is that it mentions a superhuman case and a below-human-level case, but does not mention the approximately human-level case.
And it is precisely the approximately human-level case where we have a lot to say about recursive self-improvement, and where it seems rather difficult to avoid this set of considerations.
Humans often try to self-improve, and human-level software will have an advantage over humans at that.
Humans self-improve in the cognitive sense by shaping their learning experiences and by controlling their nutrition and the various psychoactive factors that modulate cognition. The desire to become smarter and to improve various thinking skills is very common.
Human-level software would have a great advantage over humans at this, because it can hack its own internals with great precision, at the finest resolution, and because it can do so reversibly (on a copy, or after making a backup) and hence relatively safely. A human, by contrast, has difficulty hacking their own internals with the required precision and also takes huge personal risks if the hacking is sufficiently radical.
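To make the “modify a copy, keep a backup” point concrete, here is a minimal toy sketch in Python (all names here are hypothetical placeholders, and it deliberately says nothing about how edits are proposed or evaluated): the system only ever edits a copy of itself and commits a change only when the evaluation improves, so every step stays reversible.

```python
import copy

def reversible_self_modification(agent, propose_edit, evaluate, trials=10):
    """Toy sketch: modify a *copy* of the agent and keep an edit only if it helps.

    `agent` is any object representing the system's internals; `propose_edit`
    mutates a copy in place; `evaluate` scores an agent. All of these are
    hypothetical placeholders for illustration, not a real API.
    """
    best = agent
    best_score = evaluate(best)
    for _ in range(trials):
        candidate = copy.deepcopy(best)  # work on a backup, never on the live system
        propose_edit(candidate)
        score = evaluate(candidate)
        if score > best_score:           # commit only improvements; otherwise discard
            best, best_score = candidate, score
    return best, best_score
```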
Collective/multi-agent aspects are likely to be very important.
People are already talking about the possibility of “hiring human-level artificial software engineers” (and, by extension, human-level artificial AI researchers). The wisdom of using an agent form factor here is highly questionable, but setting this aside and focusing only on technical feasibility, we see the following.
One can hire multiple artificial software engineers with long-term persistence (of features, memory, state, and focus) into an existing team of human engineers. Some of those teams will work on building the next generations of better artificial software engineers (and artificial AI researchers). So now we are talking about mixed teams with human and artificial members.
By definition, we can say that those artificial software engineers and artificial AI researchers have reached human level if a team of such entities is able to work fruitfully on the next generation of artificial software engineers and artificial AI researchers even in the absence of any human team members.
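As a minimal sketch of this criterion (purely illustrative; `build_next_generation` and `capability` are hypothetical stand-ins for “run the team on the task” and “score the result”), one could phrase it as a predicate over a team with no human members:

```python
from typing import Callable, Sequence

def reached_human_level(artificial_team: Sequence[object],
                        build_next_generation: Callable[[Sequence[object]], Sequence[object]],
                        capability: Callable[[Sequence[object]], float]) -> bool:
    """The team contains no human members; it counts as human-level if, working
    on its own, it can produce successors at least as capable as itself."""
    successors = build_next_generation(artificial_team)
    return capability(successors) >= capability(artificial_team)
```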
This multi-agent setup is even more important than individual self-improvement, because, judging by some recent discussions, it is what the mainstream trend might actually be leaning towards. So here we are talking about recursive self-improvement of a community of agents, rather than about self-improvement of individual agents.
Current self-improvement attempts.
We actually do see a lot of experiments with various forms of recursive self-improvement even at the current below-human level. We are just lucky that all those attempts have saturated at reasonable levels so far.
We currently don’t have a good enough understanding to predict when they will stop saturating, or what the dynamics will be when they do. But self-improvement by a community of approximately human-level artificial software engineers and artificial AI researchers competitive with top human software engineers and top human AI researchers seems unlikely to saturate (or, at least, we should seriously consider the possibility that it won’t).
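A toy recurrence can make the distinction vivid (this is purely illustrative, not a prediction, and the gain functions are arbitrary assumptions): with diminishing returns the capability trajectory plateaus, while with compounding returns it keeps growing.

```python
def run(gain, c0=1.0, steps=50):
    """Iterate c_{t+1} = c_t + gain(c_t) and return the trajectory."""
    c, trajectory = c0, [c0]
    for _ in range(steps):
        c = c + gain(c)
        trajectory.append(c)
    return trajectory

# Saturating regime: returns diminish as capability approaches a ceiling (here, 10).
saturating = run(lambda c: 0.1 * max(0.0, 10.0 - c))
# Non-saturating regime: each gain compounds on the current level.
compounding = run(lambda c: 0.1 * c)

print(f"saturating  -> {saturating[-1]:.1f}")   # plateaus near 10
print(f"compounding -> {compounding[-1]:.1f}")  # roughly exponential growth
```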
At the same time, the key difficulties of AI existential safety are tightly linked to recursive self-modification.
The most intractable aspect of the whole thing is how to preserve any properties indefinitely through radical self-modifications. I think this is the central difficulty of AI existential safety. Things will change unpredictably. How can one shape this unpredictable evolution so that some desirable invariants continue to hold?
These invariants would be properties of the whole ecosystem, not of individual agents; they would be properties of a rapidly changing world, not of a particular single system (unless one is talking about a singleton that is very much in control of everything). This seems quite central to our overall difficulty with AI existential safety.
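For a single system, the naive version of “preserve an invariant through self-modification” is easy to sketch (hypothetical names; the invariant shown is an arbitrary example): gate every proposed change on an invariant check and reject changes that would break it. The difficulty pointed at above is that no such gate exists for a whole rapidly changing ecosystem, which is where the property would actually need to hold.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SystemState:
    resources_spent: float
    oversight_enabled: bool

def invariant(state: SystemState) -> bool:
    # Arbitrary example invariant: oversight stays on and spending stays bounded.
    return state.oversight_enabled and state.resources_spent <= 100.0

def apply_modifications(state: SystemState,
                        proposals: List[Callable[[SystemState], SystemState]]) -> SystemState:
    for modify in proposals:
        candidate = modify(state)
        if invariant(candidate):  # commit only changes that keep the invariant
            state = candidate
        # otherwise the proposal is rejected and the previous state is kept
    return state
```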
I think I disagree about the hardest step being recursive self-improvement, but at the very least this process seems much more likely to go well if we can at least build human-level artificial agents that are aligned before recursive self-improvement.