Doomimir: But you claim to understand that LLMs that emit plausibly human-written text aren’t human. Thus, the AI is not the character it’s playing. Similarly, being able to predict the conversation in a bar doesn’t make you drunk. What’s there not to get, even for you?
So what?
You seem to have an intuition that if you don’t understand all the mechanisms for how something works, then it is likely to have some hidden goal and be doing its observed behaviour for instrumental reasons. E.g. the “Alien Actress”.
And that makes sense from an evolutionary perspective, where you encounter some strange intelligent creature doing some mysterious actions on the savannah. I do not think it makes sense if you specifically trained the system to have that particular behaviour by gradient descent.
I think, if you trained something by gradient descent to have some particular behaviour, the most likely result of that training is a system tightly tuned to produce that particular behaviour, using the simplest arrangement that leads to the trained behaviour.
And if the behaviour you are training it to produce doesn’t necessarily involve actually trying to pursue some long-range goal, it would be very strange, in my view, for the simplest arrangement that provides that behaviour to be one that calculates the effects of its output on the long-range future in order to determine which output to select.
Moreover, even if you tried to train it to want to have some effect on the future, I expect you would find it more difficult than expected, since it would learn various heuristics and shortcuts long before actually learning the very complicated algorithm of generating a world model, projecting it forward given the system’s outputs, and selecting the output that steers the future toward the particular goal. (To others: This is not an invitation to try that. Please don’t.)
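To make the contrast concrete, here is a toy sketch (purely illustrative; the names, dynamics, and numbers are all made up and bear no resemblance to a real training setup): a “heuristic” policy that just maps its input straight to an output, next to a “planner” that rolls a small world model forward and picks the output whose simulated long-range outcome lands closest to a goal. The only point is how much extra machinery the second arrangement needs compared to the first.

```python
# Hypothetical toy contrast: a learned shortcut vs. explicit world-model planning.

def heuristic_policy(observation: float) -> float:
    """Shortcut arrangement: map the input directly to an output.

    This stands in for what gradient descent tends to find when the trained
    behaviour doesn't require reasoning about the future.
    """
    return 0.5 * observation  # e.g. a single fitted coefficient


def planning_policy(state: float, goal: float, horizon: int = 10) -> float:
    """Consequentialist arrangement: simulate futures and steer toward a goal."""
    candidate_actions = [-1.0, 0.0, 1.0]

    def world_model(s: float, a: float) -> float:
        # Toy dynamics: each step, the chosen action nudges the state a little.
        return s + 0.1 * a

    def rollout(s: float, a: float) -> float:
        # Project the world model forward for `horizon` steps under action `a`.
        for _ in range(horizon):
            s = world_model(s, a)
        return s

    # Select the action whose simulated long-range outcome is nearest the goal.
    return min(candidate_actions, key=lambda a: abs(rollout(state, a) - goal))


if __name__ == "__main__":
    print(heuristic_policy(2.0))            # output chosen without modelling the future
    print(planning_policy(2.0, goal=5.0))   # output chosen by steering the simulated future
```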
That doesn’t mean that an AI trained by gradient descent on a task that usually doesn’t involve trying to pursue a long-range goal can never be dangerous, or that it can never have goals.
But it does mean that the danger and the goals of such a usually-non-long-range-task-trained AI, if it has them, are downstream of its behaviour.
For example, an extremely advanced text predictor might predict the text output of a dangerous agent through an advanced simulation that is itself a dangerous agent.
And if someone actually manages to train a system by gradient descent to do real-world long-range tasks (which is probably a lot easier than making a text predictor that advanced), well then...
BTW, all the above is specific to gradient descent. I do expect self-modifying agents, for example, to be much more likely to be dangerous, because actual goals lead to wanting to enhance one’s ability and inclination to pursue those goals, whereas non-goal-oriented behaviour will not be self-preserving in general.