The key crux of any “semi-alignment plan for autonomous AI” is how it would behave under recursive self-improvement. (We are getting really close to having AI systems that are competent at software engineering, including software engineering for AI projects and the use of all kinds of AutoML tricks, so we might be getting close to having AI systems competent at performing AI research.)
And an AI system would like its smarter successors to co-operate with it. And it would actually like smarter successors of other AI systems to be nice as well.
So, yes, this alignment idea might be of use (at least as a part of a larger plan, or an idea to be further modified)...
That was something like what I was thinking. But I think this won’t work unless it is modified so much that it would be completely different. It’s more an idea to toss around.
I’ll start over with something else. I do think something that might have value is designing an environment that induces empathy/values/whatever, rather than directly trying to design the AI to be what you want from scratch. Environment design can be very powerful in influencing humans, but that’s in large part because we (or at least, those of us who put thought into designing environments for people) understand humans far better than we understand AI.
Like a lot of the not-ridiculously terrible and only extremely terrible plans, this kind of relies on a lot of interpretability.
Yes, I think we are looking at “seeds of feasible ideas” at this stage, not at “ready to go” ideas...
I tried to look at what it would take for super-powerful AIs:
- not to destroy the fabric of their environment together with themselves and everything
- to care about the “interests, freedom, and well-being of all sentient beings”
That’s not too easy, but might be doable in a fashion invariant with respect to recursive self-modification (and might be more feasible than more traditional approaches to alignment).
Of course, the fact that we don’t know what’s sentient and what’s not sentient does not help, to say the least ;-) But perhaps we and/or AIs and/or our collaborations with AIs might figure this out sooner rather than later...
It’s an interesting starting point...
Anyway, I did scribble a short write-up on this direction of thinking a few months ago: “Exploring non-anthropocentric aspects of AI existential safety”.