Take a standard language model trained by minimisation of the loss function L. Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of L]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”
Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by the model, to train a new model.
Here, we have:
- started from a model whose objective function is L;
- used that model’s learnt reasoning to answer an ethics-related question;
- used that answer to obtain a model whose objective is different from L.
If we view this interaction between the language model and the human as part of a single agent, the three bullet points above are an example of an evaluation update.
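For concreteness, here is a minimal sketch of that loop in Python. Everything in it (train_model, query_model, loss_from_answer) is a hypothetical placeholder standing in for a real training and inference stack, not an actual API; the point is only the shape of the iteration.

```python
from typing import Callable

# A loss function maps (prediction, target) to a real number.
LossFn = Callable[[object, object], float]


def train_model(loss_fn: LossFn) -> dict:
    """Hypothetical stand-in: train a language model by minimising loss_fn."""
    # A real implementation would be a full training run; here we only
    # record which loss the model was trained with.
    return {"trained_with": loss_fn}


def query_model(model: dict, prompt: str) -> str:
    """Hypothetical stand-in: sample the trained model's answer to the prompt."""
    return "suggested loss: L plus an ethics-related penalty term"


def loss_from_answer(answer: str, old_loss: LossFn) -> LossFn:
    """Hypothetical stand-in: turn the model's suggestion into a usable loss.

    Interpreting the suggestion is the hard part of the proposal;
    here we simply wrap the old loss.
    """
    def new_loss(prediction, target) -> float:
        return old_loss(prediction, target)  # plus whatever the answer suggests
    return new_loss


def evaluation_update(loss_fn: LossFn) -> LossFn:
    """One iteration: train with loss_fn, ask the trained model for a more
    moral loss function, and return that suggestion for the next training run."""
    model = train_model(loss_fn)
    prompt = (
        "I am a human, you are a language model, you were trained via "
        "minimisation of this loss function: [mathematical expression of L]. "
        "If I wanted a language model whose outputs were more moral and less "
        "unethical than yours, what loss function should I use instead?"
    )
    answer = query_model(model, prompt)
    return loss_from_answer(answer, loss_fn)


if __name__ == "__main__":
    original_loss: LossFn = lambda prediction, target: 0.0  # stand-in for L
    new_loss = evaluation_update(original_loss)
    new_model = train_model(new_loss)  # this model's objective differs from L
```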
In theory, there is a way to describe this iterative process as the optimisation of a single fixed utility function. In theory, we can also describe everything as simply following the laws of physics.
I am saying that thinking in terms of changing utility functions might be a better framework.
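To make the contrast explicit, here is one possible way of writing it down; the notation ($U_t$, $R_t$, $f$, $\tau$) is mine, not something taken from the post.

$$
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\,U(\tau)\,\right]
\qquad \text{(one fixed utility function } U\text{)}
$$

$$
\pi_t \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\,U_t(\tau)\,\right],
\qquad
U_{t+1} \;=\; f\!\left(U_t,\, R_t\right)
\qquad \text{(changing utility functions)}
$$

where $\tau$ is a trajectory, $R_t$ is the output of the agent’s own reasoning at step $t$ (in the example above, the model’s answer to the ethics question), and $f$ is the evaluation update that turns the current objective and that reasoning into the next objective.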
The point about learning a safe utility function is similar. I am saying that using the agent’s reasoning to solve the agent’s problem of what to do (not only how to carry out tasks) might be a better framework.
It’s possible that there is an elegant mathematical model which would make you think: “Oh, now I get the difference between free and non-free” or “Ok, now it makes more sense to me”. Here I went for something that is very general (maybe too general, you might argue) but is possibly easier to compare to human experience.
Maybe no mathematical model would make you think the above. But then (if I understand correctly) your objection seems to go in the direction of “Why are we even considering different frameworks for agency? Let’s see everything in terms of loss minimisation”, and that stance, in my opinion, throws away too much potentially useful information.