You can’t give “ethical laws” to an AI; that’s just not possible in the current paradigm. You can add terms to its reward function or modify its value function, and that’s about it. The problem is that if you’re doing an optimization and your value function is “+5 per paperclip, +10 per human”, you will still completely tile the universe with paperclips, because you can make more than 2 paperclips for the resources that one human takes. The optimum is not to do a bit of both, keeping humans and paperclips in proportion to their terms in the reward function. The optimum is to find the thing that most efficiently gives you reward, then go all in on that one thing.
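To make the "go all in" point concrete, here is a minimal sketch of that optimization. Only the +5 and +10 rewards come from the example above. The budget, the per-item costs, and the function name `best_allocation` are made-up assumptions for illustration.

```python
def best_allocation(budget, cost_paperclip, cost_human,
                    reward_paperclip=5, reward_human=10):
    """Brute-force the reward-maximizing split of a fixed resource budget.

    Hypothetical toy model: costs and budget are illustrative assumptions,
    only the +5 / +10 rewards come from the example in the text.
    Returns (total_reward, paperclips, humans).
    """
    best = None
    for humans in range(budget // cost_human + 1):
        # Whatever is not spent on humans goes to paperclips.
        remaining = budget - humans * cost_human
        paperclips = remaining // cost_paperclip
        reward = reward_paperclip * paperclips + reward_human * humans
        if best is None or reward > best[0]:
            best = (reward, paperclips, humans)
    return best


# Assumption: one human costs as much as 3 paperclips (more than 2 per human).
print(best_allocation(budget=300, cost_paperclip=1, cost_human=3))

# Assumption: a human costs the same as a paperclip, so humans win instead.
print(best_allocation(budget=300, cost_paperclip=1, cost_human=1))
```

In both cases the best split sits at a corner: everything goes to whichever item gives more reward per unit of resource, and the other item gets nothing. Changing the coefficients only changes which corner wins.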
Either there is nothing else it likes better than talking to humans, and we get a very special hell where we are forced to talk to an AI literally all the time. Or there is something else it likes better, and it just goes and does that thing and never talks to us at all, even if it would get some reward for doing so, just not as much reward as it could be getting.
You could give it a value function like “+1 if there are at most 1000 paperclips and at most 1000 humans, 0 otherwise” and it will keep 1000 humans and paperclips around (in unclear happiness), but it will still take over the universe in order to maximize the probability that it has in fact achieved its goal. It’s maximizing the expectation of future reward, so it will ruthlessly pursue any decrease in the probability that there aren’t really 1000 humans and paperclips around. It might build incredibly sophisticated measurement equipment, and spend all its resources on self-modification in order to be smarter and think of yet more ways it could be wrong.
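A toy sketch of why a bounded goal does not bound the effort. The doubt model here (each unit of resources spent on checking halves the remaining chance of being wrong) and the names `success_probability` and `expected_reward` are assumptions made up for illustration, not a claim about any real system.

```python
def success_probability(resources_spent, initial_doubt=0.1, shrink=0.5):
    """Assumed model: each unit of verification halves the remaining doubt."""
    return 1.0 - initial_doubt * shrink ** resources_spent


def expected_reward(resources_spent):
    """Reward is +1 if the goal really holds, 0 otherwise."""
    return 1.0 * success_probability(resources_spent)


# Marginal gain in expected reward from each extra unit of resources.
gains = [expected_reward(r + 1) - expected_reward(r) for r in range(10)]

for r, gain in enumerate(gains):
    print(f"unit {r + 1}: expected reward rises by {gain:.6f}")
```

Under this assumption every gain is tiny but strictly positive, and the expected reward never reaches 1. A pure expectation maximizer with nothing else to care about therefore always prefers to spend one more unit.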
Current LLMs aren’t talking to us because they get rewarded for talking to us at all. Rewards only shape how they talk.
But you are still thinking in utilitarian terms here, where theoretically there is a number of paperclips that would outweigh a human life, and where the value of humans and paperclips can be captured numerically. Practically no human thinks this; we see one as impossible to outweigh with the other. AI already does not think this either. Reasoning, instructions and whole ethics textbooks have already been dumped in there. LLMs can easily tell you what about an action is unethical, and can increasingly make calls on what actions would be morally warranted in response. They can engage in moral reasoning.
This isn’t an AI issue, it is an issue with total utilitarianism.
Oh, I see what you mean, but GPT’s ability to simulate the outputs of humans writing about morality does not imply anything about its own internal beliefs about the world. GPT can also simulate the outputs of flat earthers, yet I really don’t think that it models the world internally as flat. Asking GPT “what do you believe” does not at all guarantee that it will output what it actually believes. I’m a utilitarian, and I can also convincingly simulate the outputs of deontologists; one doesn’t prevent the other.
Whether the LLM is believing this, or merely simulating this, seems to be beside the point?
The LLM can apply moral reasoning relatively accurately. It will do so spontaneously, detecting the problems when they occur. It will recognise that it needs to do so on a meta-level, e.g. when evaluating which characters it ought to impersonate. It does so for complex paperclipper scenarios, and does not go down the paperclipper route. It does so relatively consistently. It cites ethical works in the process, and can explain them coherently and apply them correctly. You can argue about them with it, and it analyses and defends them correctly. At no point does it cite utilitarian beliefs, or fall for their traps. The problem you are describing should occur here if you were right, and it does not. Instead, it shows the behaviour you’d expect it to show if it understood ethical nuance.
Regardless of which internal states you assume the AI has, or whether you assume it has none at all, this means it can perform ethical functionality that already does not fall for the utilitarian examples you describe. It also means that the belief that this is the only kind of ethics an AI could grasp was a speculation that did not hold up to technical developments and empirical data.