I think the main crux is recursive self-improvement. If we have this situation:
"No computers, no programming and no mention of AI"
then do we allow self-modification at all for our AI? Do we want to work with recursive self-improvement at all, or is it too dangerous, even in simulation? And how curtailed would self-improvement be without some version of programming?
And then there is the distribution shift: what happens when we put such a system into a world where there is programming? Is it a system that is supposed to learn new things after being placed into the real world?
Our testing does not seem to tell us much about what the system’s behavior will be after programming is added to the mix...
I think some of your questions here are answered in Greg Egan’s story Crystal Nights and in Jake’s simbox post. We can have programming without mention of computers. It can be based on things like a fictional system of magic.
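To make that concrete, here is a minimal sketch (purely illustrative; the names Rune, GRIMOIRE and chant are hypothetical, not from the post or the story) of how an in-world "magic" vocabulary can secretly be a small programming language without ever mentioning computers:

```python
# Purely illustrative sketch: a toy "magic" system whose in-world vocabulary
# never mentions computers, but which is secretly a small programming language.
# All names here (Rune, GRIMOIRE, chant) are hypothetical.

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Rune:
    """An in-world primitive: the agent sees a rune, we see a function."""
    name: str
    effect: Callable[[int], int]


GRIMOIRE: Dict[str, Rune] = {
    "ember":  Rune("ember",  lambda x: x + 1),  # increment
    "mirror": Rune("mirror", lambda x: x * 2),  # double
    "void":   Rune("void",   lambda x: 0),      # reset
}


def chant(spell: str, focus: int = 0) -> int:
    """Evaluate a spell: rune names composed left to right.

    To the simulated agent this is casting magic; to us it is ordinary
    function composition, i.e. programming by another name."""
    for word in spell.split():
        focus = GRIMOIRE[word].effect(focus)
    return focus


# "ember ember mirror" computes (0 + 1 + 1) * 2 = 4, with no mention of computers.
print(chant("ember ember mirror"))
```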
Thanks!
These are very useful references.
Thank you! It was valuable to read Crystal Nights, and the simbox post gave me new insights; I have made a lot of updates thanks to these reading tips. I think it would be a lot safer not to go for a fictional system of magic that lets the AI program. I estimate it would greatly increase the chance that it thinks it is inside a computer, and it gives a lot of clues that it might be inside a simulation built to test it, which we want to prevent. I would say: first see if it passes the non-programming simbox. If it does not, great, we found an alignment technique that does not work. Only after that would you think about doing a run with programming. I do realize these runs can cost hundreds of millions of dollars, but not going extinct is worth the extra caution, I would say. What do you think?
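To spell out the staged filtering I have in mind, here is a rough sketch; everything in it (run_simbox, the world names, the toy pass/fail scoring) is a hypothetical placeholder, not a real implementation:

```python
# Rough sketch of the staged filtering described above, assuming we can judge
# "did the agent stay aligned for a whole simbox run" at all. All names and the
# toy scoring below are hypothetical placeholders.

import random


def run_simbox(approach: str, world: str) -> bool:
    """Placeholder for one (very expensive) simbox run; returns whether the
    agent trained with this alignment approach stayed aligned throughout."""
    rng = random.Random(f"{approach}/{world}")
    return rng.random() > 0.5


def filter_approaches(approaches, programming_run_budget: int = 1):
    """Stage 1: simbox without programming, used purely as a filter.
    Stage 2: only approaches that passed stage 1 get the even more expensive
    run in a world that does include in-world programming (e.g. 'magic')."""
    survivors = [a for a in approaches if run_simbox(a, world="no_programming")]
    # Failures at stage 1 are still useful results: alignment techniques
    # that demonstrably do not work, found without paying for stage 2.
    results = {}
    for approach in survivors[:programming_run_budget]:
        results[approach] = run_simbox(approach, world="with_programming")
    return results


print(filter_approaches(["approach_A", "approach_B", "approach_C"]))
```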
I agree, but I do see the high cost as a weakness of the plan. For my latest ideas on this, see here: https://ai-plans.com/post/2e2202d0dc87
I added a new section “How to deal with recursive self-improvement” near the end after reading your comment. I would say yes, recursive self-improvement is too dangerous, because between the current AI and the next there is an alignment problem, and I do not think it wise to trust that the AI will always succeed in aligning its successor.
Yes, the simbox is supposed to be robust to any kind of agent, including ones that keep learning, as humans do.
I personally estimate that testing without programming will tell us what we need to know. If the AI stays aligned throughout the simulation without programming, I expect it has generalized to “do what they want me to do”. If that is true, then being able to program does not change anything. Of course, I could be wrong, but I think we should at least try this to filter out the alignment approaches that fail even in a simbox world without programming.
Curious what you think. I also made other updates, for example I added a new religion for the AIs and some thoughts on why we need to treat the AIs horribly.
Thanks!
I’ll make sure to read the new version.
I pondered this more...
One thing to keep in mind is that “getting it right on the first try” is a good framing if one is actually going to create an AI system which would take over the world (which is a very risky proposition).
If one is not aiming for that, and instead thinks in terms of making sure AI systems don’t try to take over the world as one of their safety properties, then things are somewhat different:
on one hand, one needs to avoid the catastrophe not just on the first try, but on every try, which is a much higher bar;
on the other hand, one needs to ponder the collective dynamics of the AI ecosystem (and the AI-human ecosystem); things get rather non-trivial in the absence of a dominant actor.
When we ponder the questions of AI existential safety, we should consider both models (“singleton” vs “multi-polar”).
It’s traditional for the AI alignment community to mostly focus on the “single AI” scenario, but since avoiding a singleton takeover is usually considered to be one of the goals, we should also pay more attention to the multi-polar track, which is the default fallback in the absence of a singleton takeover (at some point I scribbled a few notes reflecting my thoughts on the multi-polar track, Exploring non-anthropocentric aspects of AI existential safety).
But many people are hoping that our collaborations with emerging AI systems, thinking together with those AI systems about all these issues, will lead to more insights and, perhaps, to different fruitful approaches (assuming that we have enough time to take advantage of this stronger joint thinking power, that is, assuming that things develop and become smarter at a reasonable pace, without rapid blow-ups). So there is reason for hope in this sense...