I added a new section, “How to deal with recursive self-improvement”, near the end after reading your comment. I would say yes, recursive self-improvement is too dangerous: between the current AI and the next one there is an alignment problem, and I do not think it wise to trust that the AI will always succeed in aligning its successor.
Yes, the simbox is supposed to be robust to any kind of agent, including agents that are always learning, as humans are.
I personally expect that testing without programming will show us what we need. If the AI stays aligned in a simulation without programming, I expect it has generalized to “do what they want me to do”. If that is true, then being able to program does not change anything. Of course, I could be wrong, but I think we should at least try this, to filter out the alignment approaches that already fail in a simbox world without programming.
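To make that filtering step concrete, here is a minimal sketch of what such a filter could look like. The `AlignmentApproach` and `Scenario` interfaces are my own illustrative assumptions, not part of the simbox proposal; the point is only that any approach whose agent misbehaves even once in the no-programming simbox is dropped before any riskier testing.

```python
# Purely illustrative sketch of "filter alignment approaches by simbox behaviour".
# The interfaces below are assumptions made for this example, not a specified API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scenario:
    """One simbox situation in which no programming tools are available."""
    name: str
    # Returns True if the agent's behaviour in this scenario counts as aligned.
    run: Callable[[object], bool]


@dataclass
class AlignmentApproach:
    name: str
    # Produces an agent; how it is trained is left entirely open here.
    build_agent: Callable[[], object]


def survives_simbox(approach: AlignmentApproach, scenarios: List[Scenario]) -> bool:
    """An approach survives only if its agent behaves aligned in every scenario."""
    agent = approach.build_agent()
    return all(scenario.run(agent) for scenario in scenarios)


def filter_approaches(approaches: List[AlignmentApproach],
                      scenarios: List[Scenario]) -> List[AlignmentApproach]:
    # Keep only the approaches that never misbehave in the no-programming simbox;
    # everything else is filtered out before testing in worlds with programming.
    return [a for a in approaches if survives_simbox(a, scenarios)]
```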
Curious what you think. I also made other updates; for example, I added a new religion for the AIs and some material on why we need to treat the AIs horribly.
Thanks!
I’ll make sure to read the new version.
I pondered this more...
One thing to keep in mind is that “getting it right on the first try” is a good framing if one is actually going to create an AI system that would take over the world (a very risky proposition).
If one is not aiming for that, and instead treats “does not try to take over the world” as one of the safety properties AI systems should have, then things are somewhat different:
on the one hand, one needs to avoid the catastrophe not just on the first try but on every try, which is a much higher bar;
on the other hand, one needs to ponder the collective dynamics of the AI ecosystem (and of the AI-human ecosystem); things get rather non-trivial in the absence of a dominant actor.
When we ponder the questions of AI existential safety, we should consider both models (“singleton” vs “multi-polar”).
It’s traditional for the AI alignment community to focus mostly on the “single AI” scenario, but since avoiding a singleton takeover is usually considered one of the goals, we should also pay more attention to the multi-polar track, which is the default fallback in the absence of a singleton takeover. (At some point I scribbled a few notes reflecting my thoughts on the multi-polar track: Exploring non-anthropocentric aspects of AI existential safety.)
But many people are hoping that our collaborations with emerging AI systems, thinking together with those AI systems about all these issues, will lead to more insights and, perhaps, to different fruitful approaches (assuming we have enough time to take advantage of this stronger joint thinking power, that is, assuming that things develop and become smarter at a reasonable pace, without rapid blow-ups). So there is reason for hope in this sense...