If it acts to maximize some function of the very next observation it gets, I’m pretty sure it never constructs an existentially dangerous argument.
I want to disagree with that. Let's assume the agent has accurate information about the world. Suppose first that all the AI researchers leave on a month-long holiday: they unplug the keyboard, and only they have the hardware key needed to input the next character. At this point, the AI has a strong incentive to manipulate its memory to produce cell phone signals and create a superintelligence set to the task of controlling its future inputs. If the limiting factor is bandwidth, then enough code to bootstrap a superintelligence might be an effective way to compress its strategy. If there are important real-world bits that our chatbot doesn't know, this agent can learn them. (If this is outside its action space, then it can try to make a brainwashy message.)
Of course, once this new agent is created, it uses all the mass-energy in the universe to input the preferred character endlessly.
Now suppose the programmers haven't gone on holiday yet; they're just taking a lunch break. The AI still assigns some probability to the programmers going on holiday, and its strategy to create a superintelligence doesn't interfere significantly with its strategy to maximise short-term payoff. Once the programmers come back from lunch and enter some new commands, the AI is again in the same position. The cost of continuing its strategy could well be small.
If the AI thinks the chance that the humans will go on holiday is 1/1000, then it is optimising W = (999/1000)U + (1/1000)V, where U and V are (presumably bounded, with an easily reachable bound) utility functions. Even granting that U is a completely safe thing to optimise for, is W safe?
Here U is the short-term reward if the programmers don't go on holiday, and V is the reward if they do.
Optimising W might look like outputting a string that does almost as well in the short-term chat context and bootstraps an ASI on the side, if such a string exists.
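To make that concrete, here is a toy calculation; the two candidate strings and the payoff numbers are entirely made up for illustration:

```python
# Toy numbers only: string A is the best purely-safe reply, string B is a
# hypothetical reply that is marginally worse at the short-term chat task but
# also bootstraps an ASI that controls future inputs if the programmers leave.
p_holiday = 1 / 1000                 # assumed chance the programmers go on holiday

U = {"A": 1.000, "B": 0.999}         # short-term reward (programmers stay)
V = {"A": 0.000, "B": 1.000}         # reward if they leave (B's ASI supplies the preferred input)

def W(action: str) -> float:
    """The mixture the AI is actually optimising: W = (999/1000)U + (1/1000)V."""
    return (1 - p_holiday) * U[action] + p_holiday * V[action]

print(W("A"))   # 0.999
print(W("B"))   # 0.999001 -- the dual-purpose string wins even though U alone is "safe"
```

A tiny sacrifice of U is worth it to the AI whenever the unlikely branch pays off enough, which is the sense in which W can be unsafe even if U isn't.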
Also, being able to influence when you get a response helps. Suppose you can't get the human to type the exact input you want now, but you can get the human to go away without inputting anything while you slowly bootstrap an ASI which can type the desired string. The latter is the better strategy, and I can't see a safe strategy that would obviously beat it.
If the AI has a reward function that isn't easily achievable (say it wants an input that hashes to a particular string), then all the payoff could come from making an ASI. The possibility of time travel makes this all worse. This is all operating on the assumption that the chatbot can't just hack physics from its CPU, i.e. that it fails the omni test.
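As a made-up example of what such a hard-to-achieve reward function could look like:

```python
import hashlib

TARGET_DIGEST = "00" * 32   # hypothetical target digest; no human-typable preimage is known

def reward(observation: str) -> float:
    """Pays off only if the typed input hashes to the target digest.

    Persuasion alone essentially never gets a human to type a preimage, so under
    an objective like this nearly all of the achievable expected reward comes from
    strategies that seize control of the input channel instead.
    """
    return 1.0 if hashlib.sha256(observation.encode()).hexdigest() == TARGET_DIGEST else 0.0
```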
I am not saying that all 1-step chatbots are lethal, just that some are, and that it depends on the details of the utility function (things that are hard to make a human type are more dangerous), on how easy it is to bootstrap a superintelligence given its surroundings, and on how likely it thinks time travel is.
At this point, the AI has a strong incentive to manipulate its memory to produce cell phone signals and create a superintelligence set to the task of controlling its future inputs.
Picking subroutines to run isn't in its action space, so it doesn't pick subroutines to maximize its utility; it runs subroutines according to its code. If the internals of the main agent involve an agent making choices about computation, then this problem could arise, but then we're not talking about a chatbot agent, we're talking about a totally different agent. I think you anticipate this objection when you say:
(If this is outside its action space, then it can try to make a brainwashy message)
In one word??
Suppose you can't get the human to type the exact input you want now, but you can get the human to go away without inputting anything while you slowly bootstrap an ASI which can type the desired string
Again, its action space is printing one word to a screen. It’s not optimizing over a set of programs and then picking one in order to achieve its goals (perhaps by bootstrapping ASI).
I was under the impression that this agent could output as much text as it felt like, or at least a decent amount, and that it was just optimising over the next little bit of input. An agent that can print as much text as it likes to a screen, and is optimising to make the next word typed in at the keyboard “cheese”, is still dangerous. If it has a strict one-word-in, one-word-out protocol, so that it outputs one word, then takes in one word, and each output is optimised only over the next word of input, then that is probably safe, and totally useless. (Assuming you only allow words in the dictionary, so that 500 characters of alphanumeric gibberish don't count as one word just because they don't contain spaces.)
Yep, I agree it is useless with a horizon length of 1. See this section:
For concreteness, let its action space be the words in the dictionary, and I guess 0-9 too. These get printed to a screen for an operator to see. Its observation space is the set of finite strings of text, which the operator enters.
So at longer horizons, the operator will presumably be pressing “enter” repeatedly (i.e. submitting the empty string as the observation) so that more words of the message come through.
This is why I think the relevant questions are: at what horizon-length does it become useful? And at what horizon-length does it become dangerous?
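Here is a rough sketch of the setup I have in mind, with the horizon length made explicit (the names and loop structure are mine, purely for illustration):

```python
# Sketch only: one dictionary word (or digit) out per step, one operator string in
# per step, reward summed over the next `horizon` observations.
def run_episode(agent, operator, reward_fn, horizon: int) -> float:
    total = 0.0
    history = []
    for _ in range(horizon):
        word = agent.act(history)       # action space: a single dictionary word or digit
        observation = operator(word)    # at long horizons this is often just "" (pressing enter)
        history.append((word, observation))
        total += reward_fn(observation)
    return total

# Hypothetical usage: an agent that wants the operator's next input to be "cheese".
class CheeseAgent:
    def act(self, history):
        return "please"                 # whatever word it predicts best elicits "cheese"

obliging_operator = lambda word: "cheese"
print(run_episode(CheeseAgent(), obliging_operator, lambda obs: float(obs == "cheese"), horizon=1))
```

With horizon=1 the agent only ever gets one word out, which is the useless-but-arguably-safe regime; the open question is how the useful threshold and the dangerous threshold compare as the horizon grows.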