I don’t understand the distinction you draw between free agents and agents without freedom.
If I build an expected utility maximizer with a preference for the presence of some physical quantity, that surely is not a free agent. If I build an agent with the capacity to modify the program responsible for converting states of the world into scalar utility values, I assume you would consider that a free agent.

I am reminded of E.T. Jaynes’ position on the notion of ‘randomization’, which I will summarize as “a term to describe a process we consider too hard to model, which we then consider a ‘thing’ because we named it.”
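To make the contrast concrete, here is a minimal sketch of the two architectures I have in mind (the class names, the "iron_atoms" quantity, and the choice step are all illustrative, not a proposal for a real system):

```python
from typing import Callable, List

# Toy world state: a mapping from named quantities to amounts,
# e.g. {"iron_atoms": 42.0}. Purely illustrative.
State = dict


def prefer_iron(state: State) -> float:
    """Hard-coded preference for the presence of some physical quantity."""
    return state.get("iron_atoms", 0.0)


class FixedUtilityMaximizer:
    """Expected utility maximizer: the utility function is baked in."""

    def choose(self, options: List[State]) -> State:
        return max(options, key=prefer_iron)


class SelfModifyingAgent:
    """Agent that stores its utility function as data it is able to rewrite."""

    def __init__(self, utility: Callable[[State], float]):
        self.utility = utility

    def choose(self, options: List[State]) -> State:
        return max(options, key=self.utility)

    def revise_utility(self, new_utility: Callable[[State], float]) -> None:
        # The capacity in question: replacing the program that converts
        # states of the world into scalar utility values.
        self.utility = new_utility


# The second agent can end up optimising something entirely different:
agent = SelfModifyingAgent(prefer_iron)
agent.revise_utility(lambda s: -s.get("iron_atoms", 0.0))
```

The only structural difference is that the second agent holds its utility function as data it can overwrite.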
How is this agent any more free than the expected utility maximizer, other than that I can’t conveniently extrapolate the outcome of its modifications to its utility function?
It seems to me that this only shifts the problem from “how do we find a safe utility function to maximize” to “how do we find a process by which a safe utility function is learned”, and I would argue that the latter is already a mainstream consideration in alignment.
If I have missed a key distinguishing property, I would be very interested to know.
Take a standard language model trained by minimisation of the loss function L. Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of L]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”
Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by the model, to train a new model.
Here, we have:
- started from a model whose objective function is L;
- used that model’s learnt reasoning to answer an ethics-related question;
- used that answer to obtain a model whose objective is different from L.
If we view this interaction between the language model and the human as part of a single agent, the three bullet points above are an example of an evaluation update.
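As a rough sketch of that loop (the train_model function and the model it returns are placeholders standing in for a real training run and a real inference call, not any specific library’s API; the whole thing is illustrative rather than a working pipeline):

```python
def train_model(loss_description: str):
    """Stand-in for training a language model under the given loss."""

    def model(prompt: str) -> str:
        # A real model would generate text here; this stub just returns a label.
        return f"a new loss function proposed under {loss_description}"

    return model


def evaluation_update(loss_description: str) -> str:
    """One step of the loop: the current model proposes its successor's objective."""
    model = train_model(loss_description)
    prompt = (
        "I am a human, you are a language model, you were trained via "
        f"minimisation of this loss function: {loss_description}. "
        "If I wanted a language model whose outputs were more moral and less "
        "unethical than yours, what loss function should I use instead?"
    )
    return model(prompt)


# Iterate: each model's learnt reasoning picks the objective of the next model.
loss = "L"
for _ in range(3):
    loss = evaluation_update(loss)
```

The point is only the shape of the loop: each model’s own reasoning selects the objective of the next model.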
In theory, there is a way to describe this iterative process as the optimisation of a single fixed utility function. In theory, we can also describe everything as simply following the laws of physics.
I am saying that thinking in terms of changing utility functions might be a better framework.
The point about learning a safe utility function is similar. I am saying that using the agent’s reasoning to solve the agent’s problem of what to do (not only how to carry out tasks) might be a better framework.
It’s possible that there is an elegant mathematical model which would make you think: “Oh, now I get the difference between free and non-free” or “Ok, now it makes more sense to me”. Here I went for something that is very general (maybe too general, you might argue) but is possibly easier to compare to human experience.
Maybe no mathematical model would make you think the above. But then (if I understand correctly) your objection seems to go in the direction of “Why are we even considering different frameworks for agency? Let’s see everything in terms of loss minimisation”, and that statement, in my opinion, throws away too much potentially useful information.