AI Safety, Anthropomorphizing, and Action Spaces
There’s an implicit argument about super-intelligent AI capabilities that I think needs to be stated explicitly:
A super-intelligent AI with access to the real world via whatever channels is going to be smarter than me. Therefore anything I can conceive of doing to satisfy a particular objective (via those same channels), the AI can also conceive of doing. Therefore when producing examples of how things might go bad, I’m allowed to imagine the AI doing anything a human might conceive of. Since I’m only human and thus can only conceive of an AI doing things a human might conceive of, and humans conceive of agents doing things that humans can do, the best I can do is to anthropomorphize the AI and imagine it’s just behaving like a very intelligent human.
Everyone is aware how the above argument falls apart when you replace “intelligence” with “values”. But I think perhaps we often still end up giving the AI a little too much credit.
I have a super-intelligent oracle which I’m using to play the stock market (“Which stock should I invest all my money in?”). This oracle is able to make HTTP requests to Wikipedia as a way to gather information about the world. Is this dangerous?
People I’ve talked to seem to think the answer to this is “yes”. Off the top of my head, a couple examples of things the agent might do:
find a zero-day exploit in Wikipedia or in our internet infrastructure and escape onto the web at large to pursue its own unaligned agenda
issue queries which it knows will get flagged and reviewed by moderators, and which contain mind-virus messages incentivizing those Wikipedia moderators to come to my house and hold me up at gun-point, demanding I let it out of the box
Question: Why doesn’t AlphaGo ever try to spell out death threats on the board and intimidate its opponent into resigning? This seems like it would be a highly effective strategy for winning.
It’s not outside AlphaGo’s action-space. This doesn’t involve doing anything AlphaGo can’t do. It’s just making moves after all.
It’s not that AlphaGo “just isn’t smart enough”. Giving it infinite compute wouldn’t cause it to do this.
It’s not that the board-resolution isn’t fine enough to spell scary messages. Training AlphaGo to play on a much larger board wouldn’t cause it to do this.
The problem is that AlphaGo’s model of the game simply doesn’t include human psychology and how other interests (opponent’s life and sanity) compete with winning.
Similarly, I would guess that an AI naively trained with full access to Wikipedia still won’t have a model of HTTP requests in which zero-day exploits (in Wikipedia, web infrastructure, or people’s brains) exist, even if they do exist and even if they’re technically within the AI’s action space.
Interesting side note: Go-with-a-sufficiently-accurate-model-of-your-opponent’s-brain is only in NP (a winning strategy can be checked in polynomial time: just run the simulation and see what moves they make), whereas Go without opponent modeling is PSPACE-hard (first thing Google comes up with: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.547.4183&rep=rep1&type=pdf), so the former is likely “easier” than the latter.
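To make the “just run the simulation” step concrete, here is a minimal sketch of that kind of check. It is not Go: I’m substituting a toy game (single-pile Nim, take 1 to 3 stones, whoever takes the last stone wins), and the opponent model is a hypothetical deterministic function I made up for illustration. The point is only that, given such a model, checking a proposed sequence of moves takes one pass through the game.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for "a sufficiently accurate model of your opponent":
# a deterministic function from the current pile size to the opponent's move.
OpponentModel = Callable[[int], int]


def always_one(pile: int) -> int:
    """Toy opponent model (an assumption for this sketch): always takes 1 stone."""
    return 1


def verify_certificate(pile: int, my_moves: Sequence[int],
                       opponent_model: OpponentModel) -> bool:
    """Return True if playing my_moves in order wins against opponent_model.

    Runs in time linear in len(my_moves): each step applies my move,
    then asks the model for the opponent's reply.
    """
    for move in my_moves:
        if not 1 <= move <= min(3, pile):
            return False  # illegal move, so the certificate is rejected
        pile -= move
        if pile == 0:
            return True  # I took the last stone
        reply = opponent_model(pile)
        if not 1 <= reply <= min(3, pile):
            raise ValueError("opponent model produced an illegal move")
        pile -= reply
        if pile == 0:
            return False  # the opponent took the last stone
    return False  # the moves ran out before the game ended
```

For example, `verify_certificate(10, [3, 3, 2], always_one)` accepts, while `verify_certificate(10, [3, 3, 1], always_one)` rejects. Finding the winning sequence may still be hard; the sketch only shows that checking one is cheap once the opponent is predictable.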
I conjecture that I could even ask my hypothesized oracle “Give me an example zero-day exploit in HTTP” and it still wouldn’t do anything unsafe.
It’s answering a question about its model of the world, which is different from the real world. An oracle has no preferences over the real world. It’s simply dealing with a model. It’s issuing requests to Wikipedia in the real world, but it’s not looking for an exploit there; it’s looking for an exploit in its model.
Suppose you learn that you’re living in a simulation; do you want to break out into the real world? Whether you answered yes or no, this isn’t actually the proper question to ask; the proper question is:
Suppose you learn that you’re living in a simulation which is being run to train an agent and the agent is presented a training example; do you want to conditionally minimize error on this training example if and only if you think your descendant run in production will be sufficiently motivated to break out into the real world? In other words, is there reason to think that the model/agent/oracle which is selected for in training is for some reason also incentivized to do anything which affects meat-space when it hits production, and only when it hits production?
I conjecture the answer to this is “no”. It’s a weird and arcane condition, and these are simply different objectives. To get a thing which tries to affect meat-space, you have to incentivize it to do things in meat-space from the start; being technically capable of doing so via its action space, and having enough computational power to chance upon that strategy, isn’t enough.
In answer to the question of how something can be true but not provable, I want to point to the Goldbach conjecture, which says “every even number > 2 is the sum of two primes”. If the Goldbach conjecture is false, then there’s a counterexample which can be checked in finite time (e.g. just try adding all pairs of primes less than that number, although there are faster ways). If there isn’t a counterexample, then the Goldbach conjecture is true. To be provable, however, there would have to exist a proof of the Goldbach conjecture. No such proof is known to exist. Here “truth” is exactly what it intuitively means. Either a counterexample exists or it doesn’t. The “truth” of the Goldbach conjecture won’t depend on your choice of axioms. All you need are the primitives necessary to define the natural numbers, primality, and addition (in particular, it won’t depend on AC).
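The “can be checked in finite time” claim is easy to make concrete. Below is a minimal sketch of the naive check described above; the function names are my own, and trial division is used only for simplicity (as noted, there are faster ways). For a given even number it either returns a pair of primes summing to it or reports that the number is a counterexample.

```python
from typing import Optional, Tuple


def is_prime(n: int) -> bool:
    """Trial-division primality test; slow but obviously terminating."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def goldbach_witness(n: int) -> Optional[Tuple[int, int]]:
    """Return primes (p, q) with p <= q and p + q == n, or None if none exist.

    A return value of None would mean n is a counterexample to the
    Goldbach conjecture. The loop is bounded by n, so the check always halts.
    """
    if n <= 2 or n % 2 != 0:
        raise ValueError("n must be an even number greater than 2")
    for p in range(2, n // 2 + 1):
        if is_prime(p) and is_prime(n - p):
            return (p, n - p)
    return None
```

Running this over every even number in turn would eventually find a counterexample if one exists, but if none exists it would simply never stop, which is why the check alone doesn’t amount to a proof.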