I am not going to comment on strategy as I have no idea what strategies for an AI player would actually work with decent probability.
However, upvoted for posting the logs; it was interesting to read. In my opinion a summary of the events (your strategy, why it may have failed, unexpected events, etc.) would be great.
There’s a whole art dedicated to convincing people to do something they wouldn’t do otherwise: sales. The AI box is no different from a sales pitch, except most people who have attempted doing it so far (at least on LW) weren’t salesmen and thus weren’t very effective. I’m pretty sure a seasoned salesperson could get very high success rates.
One thing that can’t be overstated is the importance of knowing the psychology of the gatekeeper. Real salespeople try to get to know their victims (and I’m deliberately using the word victim here). Are they motivated by money, sex, the desire to get back with their girlfriend, etc.? It’s important to get your victim talking so they reveal their own inner selves. There are many ways to exploit this, such as sharing some bit of ‘personal’ information about yourself so they reveal something personal about themselves in return. It gives you some more information to work with, and it also builds ‘trust’ (at least, their trust in you).
An effective sales pitch has a hook (e.g. “I can cure disease forever” or “I can bring back your dead husband”), a demonstration of value (something designed to make them think you really can deliver on your promise—you have to be creative here) and then a ‘pullback’ so they think they’re at risk of losing the deal if they don’t act quickly. Then, finally, a close.
With all this said, though, the AI box experiment we play on LW is not a good demonstration of what would happen with an actual AI. It’s heavily biased in favor of failing. Consider that in a real AI box scenario, there would have been a very good reason for developing the AI in the first place, and thus there would be a strong incentive to let it out. Also, pulling the plug would represent a huge loss of investment.
I’ve played the AI box game on other forums. We designed a system to incentivise release of the AI. We randomly rolled the AI’s ethics, rolled random events with dice, and the AI offered various solutions to those problems. A certain number of accepted solutions would enable the AI to free itself. You lost points if you failed to deal with the problems, and lost a lot of points if you freed the AI and it happened to have goals you disagreed with, like the annihilation of everything.
Psychology was very important in those games, as you said. Different people have very different values, and to appeal to each person you have to know theirs.
So if you were trying to maximise total points, wouldn’t it be best to never let it out because you lose a lot more if it destroys the world than you gain from getting solutions?
What values for points make it rational to let the AI out, and is it also rational in the real-world analogue?
If you predict that there’s a 20% chance of the AI destroying the world, an 80% chance of global warming destroying the world, and a 100% chance that the AI will stop global warming if released and unmolested, then you are better off releasing the AI.
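To make that comparison concrete, here’s a minimal sketch using those numbers (the probabilities are the hypothetical ones from this comment, not real estimates, and the sketch assumes the two risks don’t otherwise interact):

```python
# Hypothetical numbers from the comment above; not real estimates.
p_ai_doom = 0.20           # chance a released AI destroys the world
p_warming_doom = 0.80      # chance global warming destroys the world if the AI stays boxed
p_ai_fixes_warming = 1.00  # chance the released AI stops global warming if left unmolested

# Probability the world survives under each choice.
survive_if_released = (1 - p_ai_doom) * p_ai_fixes_warming  # 0.80
survive_if_boxed = 1 - p_warming_doom                        # 0.20

print(survive_if_released, survive_if_boxed)  # releasing wins, 0.8 vs 0.2
```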
Or you can just give a person 6 points for achieving their goal and −20 points for releasing the AI. Even though the person knows that, in reality, the AI could destroy the world, within the game the points matter more than that, and that strongly encourages people to try negotiating with the AI.
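A rough sketch of how that scoring scheme tilts the gatekeeper’s decision (only the 6 and −20 come from the comment; the probabilities are made up for illustration, and the −20 is read as applying when the released AI’s rolled ethics turn out hostile, as in the game described upthread):

```python
GOAL_POINTS = 6        # points for achieving your goal (from the comment above)
RELEASE_PENALTY = -20  # points lost if the AI you release turns out misaligned

def expected_score(p_goal_if_released, p_goal_if_boxed, p_misaligned):
    """Expected points for releasing the AI vs. keeping it boxed."""
    released = p_goal_if_released * GOAL_POINTS + p_misaligned * RELEASE_PENALTY
    boxed = p_goal_if_boxed * GOAL_POINTS
    return released, boxed

# Made-up probabilities: releasing helps a lot with your goal, and the
# randomly rolled ethics are only occasionally hostile.
print(expected_score(p_goal_if_released=0.9, p_goal_if_boxed=0.3, p_misaligned=0.1))
# -> (3.4, 1.8): releasing scores higher in expectation, so negotiating pays off
```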
> We randomly rolled the AI’s ethics, rolled random events with dice, and the AI offered various solutions to those problems… You lost points if you failed to deal with the problems, and lost a lot of points if you freed the AI and it happened to have goals you disagreed with, like the annihilation of everything.
The whole AI box experiment is a fun pastime, and educational insofar as it teaches people to take artificial intellects seriously, but as real-world long-term “solutions” go, it is utterly useless. It’s like trying to contain nuclear weapons indefinitely, except you can build one just by having the blueprints and a couple of leased hours on a supercomputer, no scarce natural elements necessary, and having one means you win at whatever you desire (or that’s what you’d think). All the while under the increasing pressure of improving technology, ever lowering the threshold to catastrophe. When have humans abstained from playing with the biggest fire they can find?
The best case scenario for AI boxing would be that people aware of the risks (unlikely because of motivated cognition) are the first to create an AGI (not just stumbling upon one, either) and use their first-mover advantage to box the AI just long enough (just having a few months would be lucky) to poke and prod it until they’re satisfied it’s mostly safe (“mostly” because whatever predicates the code fulfills, there remains the fundamental epsilon of insecurity of whether the map actually reflects the territory).
There are so many state actors, so many irresponsible parties involved in our sociological ecosystem, with so many chances of taking a wrong step, so many biological imperatives counter to success*, that (coming full circle to my very first comment on LW years ago) the whole endeavor seems like a fool’s hope, and that only works out in Lord of the Rings.
But, as the sentient goo transforms us into beautiful paperclips, it’s nice to know that at least you tried. And just maybe we get lucky enough that the whole take-off is just slow enough, or wonky enough, for the safe design insights to matter in some meaningful sense, after all.
* E.g. one AGI researcher defecting with the design to another group (which is also claiming to have a secure AI box / some other solution) would be a billionaire for the rest of his life, that being measured in weeks most likely. Such an easy lie to tell yourself. And that’s assuming a relevant government agency wouldn’t simply take your designs without asking, if anyone of reputation tipped them off or they followed the relevant conferences (nooo, would they do that?).
We all know that AI is a risk but personally I wouldn’t worry too much. I doubt anything remotely similar to the AI box situation will ever happen. If AI happens via human enhancement many of the fears will be completely invalid.
I think it depends. If the AI happens by accident, then yes. If, however, the team believes for several months or even years ahead of time that they’re very close, and they decide to try the AI box, then they will likely have specially trained a couple of individuals specifically for the task of interacting with the AI.