I really don’t expect “goals” to be explicitly written down in the network. There will very likely not be a thing that says “I want to predict the next token” or “I want to make paperclips”, or even a utility function to that effect. My mental image of goals is that they are put “on top” of the model/mind/agent/person: whatever it seems to pursue, independently of its explicit reasoning.
I’m sure that I don’t understand you. GPT most likely doesn’t have “I want to predict the next token” written anywhere, because it doesn’t want to predict the next token. There’s nothing in there that will actively try to predict the next token no matter what. It’s just the thing it does when it runs.
Is it possible to have a system that, when it runs, just “actively tries to make paperclips no matter what”, but doesn’t reflect that goal in its reasoning and planning? I have a feeling that it requires God-level sophistication and knowledge of the universe to create a device that acts like that: one that just happens to act in a way that robustly maximizes paperclips while not containing anything that can be interpreted as that goal.
I found that I can’t precisely formulate why I feel that way. Maybe I’ll be able to express it in a few weeks (or I’ll find that the feeling is misguided).
No, I said that GPT does predict the next token, while probably not containing anything that can be interpreted as “I want to predict the next token”. Just as a bacterium does divide (with possible adaptive mutations) without having “be fruitful and multiply” written somewhere inside it.
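To make that concrete, here is a toy sketch of what I mean (a tiny bigram character sampler in Python; it is entirely my own illustration and nothing like GPT’s internals). It predicts next characters simply because that is what happens when the code runs; nothing in its data structures can be read as a representation of wanting to do so:

```python
import random
from collections import Counter, defaultdict

# Toy illustration (not GPT): a bigram "model" that emits the next character
# by a lookup-and-sample step. Nowhere below is there anything resembling a
# goal like "I want to predict the next token" -- prediction is just what
# happens when the code runs.

corpus = "to be or not to be that is the question"

# Count which character follows which.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_char(prev: str) -> str:
    """Sample the next character given the previous one."""
    counts = following[prev]
    chars, weights = zip(*counts.items())
    return random.choices(chars, weights=weights)[0]

# "Run" the model: it produces tokens, but contains no representation of
# wanting to produce them.
text = "t"
for _ in range(20):
    text += next_char(text[-1])
print(text)
```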
No, I certainly didn’t mean that. If the extended Church–Turing thesis holds for the macroscopic behavior of our bodies, we can indeed be represented as Turing-machine algorithms (with at most a polynomial overhead in efficiency).
What I feel, but can’t precisely convey, is that there’s a huge gulf (maybe in computational complexity) between agentic systems (those that do have an explicit internal representation of at least some of their goals) and “zombie-agentic” systems (those that act like agents with goals, but have no explicit internal representation of those goals).
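A toy contrast of what I have in mind (my own framing, in Python, over a deliberately tiny domain): two systems that emit identical actions, one by consulting an explicitly stored goal, the other by a baked-in policy table that encodes no goal anywhere.

```python
# System A: "agentic" -- it holds an explicit goal and plans against it.
GOAL_STATE = 3  # explicit internal representation of the goal

def agentic_step(state: int) -> str:
    # Pick the action that moves toward the explicitly stored goal.
    return "right" if state < GOAL_STATE else "stay"

# System B: "zombie-agentic" -- a fixed policy table with the same behavior,
# but nothing inside it that can be read off as "the goal is state 3".
ZOMBIE_POLICY = {0: "right", 1: "right", 2: "right", 3: "stay"}

def zombie_step(state: int) -> str:
    return ZOMBIE_POLICY[state]

# Both "pursue" state 3 equally well on this tiny domain...
assert all(agentic_step(s) == zombie_step(s) for s in range(4))
# ...but only System A re-targets cheaply (change GOAL_STATE), while System B
# needs its whole table rewritten. My hunch is that making a zombie policy
# robust across a huge, open-ended world is where the (computational) gulf
# would show up.
```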
How do you define the goal (or utility function) of an agent? Is it something that actually happens when the universe containing the agent evolves in its usual physical fashion? Or is it something that was somehow intended to happen when the agent is run (but may not actually happen due to circumstances and the agent’s shortcomings)?