That was something like what I was thinking. But I don’t think this will work unless it’s modified so much that it’d be completely different; it’s more of an idea to toss around.
I’ll start over with something else. I do think something that might have value is designing an environment that induces empathy/values/whatever, rather than directly trying to design the AI to be what you want from scratch. Environment design can be very powerful in influencing humans, but that’s in huge part because we (or at least, those of us who put thought into designing environments for folks) understand humans far better than we understand AI.
Like a lot of the plans that are not ridiculously terrible, only extremely terrible, this one kind of relies on a lot of interpretability.
Yes, I think we are looking at “seeds of feasible ideas” at this stage, not at “ready to go” ideas...
I tried to look at what it would take for super-powerful AIs not to destroy the fabric of their environment together with themselves and everything, and to care about the “interests, freedom, and well-being of all sentient beings”.
That’s not too easy, but it might be doable in a fashion invariant with respect to recursive self-modification (and it might be more feasible than more traditional approaches to alignment).
Of course, the fact that we don’t know what’s sentient and what’s not sentient does not help, to say the least ;-) But perhaps we and/or AIs and/or our collaborations with AIs might figure this out sooner rather than later...
Anyway, I did scribble a short write-up on this direction of thinking a few months ago: Exploring non-anthropocentric aspects of AI existential safety