To be clear, I was actually not interpreting the output “at face value”. Quite the contrary: I was saying that ChatGPT gave this answer because it simply predicts the most likely next token in a dialogue between a human and an agent, and given that it was trained on AI-risk-style arguments (or sci-fi), this is the most likely output.
But this made me think of the longer-term question: what could be the consequences of training an AI on those arguments? Usually, the “instrumental goal” argument supposes that the AI is so “smart” that it would figure out on its own that “not being turned off” is instrumentally necessary. If it is trained on these types of arguments, it could “realize” this much sooner.
Btw, even though GPT doesn’t “mean” what it says, its output could still lead to actions that enact exactly what it says. For example, several current RL/agent setups use an LM’s output for high-level planning. This might continue in the near future …
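To make that concern concrete, here is a minimal sketch of what “LM output as high-level planning” can look like, assuming a hypothetical `query_lm` helper and a toy `execute_skill` controller (these names are placeholders, not any specific system’s API):

```python
# Minimal sketch: an LM proposes high-level steps as text,
# and a lower-level controller executes each step in order.

def query_lm(prompt: str) -> str:
    """Stub standing in for a real language-model call (hypothetical)."""
    # A real system would query an LM here; this canned reply is for illustration.
    return "1. locate the charger\n2. move to the charger\n3. plug in"

def execute_skill(step: str) -> bool:
    """Stub low-level controller: pretend each step succeeds."""
    print(f"executing: {step}")
    return True

def plan_and_act(goal: str) -> None:
    plan_text = query_lm(f"Give numbered steps to achieve: {goal}")
    steps = [line.split(".", 1)[1].strip()
             for line in plan_text.splitlines() if "." in line]
    for step in steps:
        if not execute_skill(step):  # the LM's text directly drives the agent's actions
            break

plan_and_act("recharge the robot")
```

The point of the sketch is just that whatever text the LM emits becomes the plan that gets executed, whether or not the model “means” it.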