Unlike a standard utility maximiser acting according to the specified metric, a free agent — assuming it was functional at all — would learn how to reason under uncertainty by interacting with the environment, then apply the learnt reasoning principles also to its values, thus ending up morally uncertain.
I’m puzzled that, as laid out above, neither the graph G = (V, E) you describe for the world model nor the mapping f: V → R describing the utility provides any way to describe or quantify uncertainty or alternative hypotheses. Surely one would want such a Free Agent to be able to consider alternative hypotheses, accumulate evidence in favor of or against them, and even design and carry out experiments to do so more efficiently, whether about the behavior of the world, the effects of its actions, or the moral consequences of these? One would also hope that, while it was still uncertain, it would exercise due caution and not rely too heavily on facts (either about the world or about morality) of which it was still too unsure, by doing some form of pessimizing over Knightian uncertainty while attempting to optimize the morality of its available actions. So I would want it to, rather like AIXI (except with uncertainty over the utility function as well as the world model, and without computationally unbounded access to universal priors), maintain a probability-weighted ensemble of world models and ethical value mappings, and perform approximately-Bayesian reasoning over this ensemble. So something along the lines I describe in Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom).
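To make the kind of ensemble I have in mind concrete, here is a minimal Python sketch (entirely my own illustration, not something from the post): each hypothesis pairs a candidate world model with a candidate value mapping f, the hypothesis weights are updated in an approximately Bayesian way as evidence arrives, and actions are scored pessimistically across the surviving hypotheses. All names (Hypothesis, bayes_update, choose_action, the pessimism parameter) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = str
Action = str
Observation = str


@dataclass
class Hypothesis:
    """One ensemble member: a candidate world model plus a candidate value mapping f: V -> R."""
    transition: Callable[[State, Action], Dict[State, float]]  # P(next state | state, action)
    f: Callable[[State], float]                                 # candidate ethical value of a state
    likelihood: Callable[[Observation], float]                  # P(observation | this hypothesis)
    weight: float                                               # current posterior weight


def bayes_update(ensemble: List[Hypothesis], obs: Observation) -> None:
    """Reweight hypotheses by how well they predicted the observation (approximate Bayes)."""
    for h in ensemble:
        h.weight *= h.likelihood(obs)
    total = sum(h.weight for h in ensemble)
    for h in ensemble:
        h.weight = h.weight / total if total > 0 else 1.0 / len(ensemble)


def expected_value(h: Hypothesis, state: State, action: Action) -> float:
    """Expected f-value of the successor state under a single hypothesis."""
    return sum(p * h.f(s2) for s2, p in h.transition(state, action).items())


def choose_action(ensemble: List[Hypothesis], state: State,
                  actions: List[Action], pessimism: float = 1.0) -> Action:
    """Score each action by its weighted-mean value minus a penalty for disagreement
    across hypotheses, so the agent acts cautiously where empirical or moral
    uncertainty is still large (a crude stand-in for pessimizing over Knightian uncertainty)."""
    def score(a: Action) -> float:
        values = [expected_value(h, state, a) for h in ensemble]
        mean = sum(h.weight * v for h, v in zip(ensemble, values))
        spread = max(values) - min(values)
        return mean - pessimism * spread
    return max(actions, key=score)
```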
Ethical philosophy has described a great many different models for how one can reason about the ethical values of states of the world, or of acts. Your f: V → R labels the states of the world, not the acts, so I gather you are subscribing to and building in a consequentialist view of the Philosophy of Ethics, not a deontological one? Since you appear to be willing to build at least one specific Philosophy of Ethics assumption directly into your system, and not allow your Free Agent entirely free choice of Ethical philosophy, I think it might be very useful to build in some more (ethical discussions among philosophers do tend to end with people agreeing to disagree, and then categorizing all their disagreements in detail, after all). For example, the Free Agent’s task would clearly be easier if it knew whether the f: V → R moral values were objective facts, as yet not fully known to it but having a definitive correct value (which I gather would be moral realism), and if so whether the origin of that correct value is theological, philosophical, mathematical, or some other form of abstraction; or whether they were instead, say, social constructs only meaningful in the context of a particular society at a particular point in time (cultural moral relativism), or were precisely deducible from biological effects such as Evolutionary Psychology and evolutionary fitness (biological moral realism), or were social constructs somewhat constrained by the Evolutionary Psychology of the humans the society is made up from, or whatever. Some idea of where to find valid evidence for reasoning about f, or at least of how to reason about what might be valid evidence, seems essential for your Free Agent to be able to make forward progress. My understanding is that human moral philosophers tend to regard their own moral intuitions about specific situations as (at least unreliable) evidence about f, which would suggest a role for both human Evolutionary Psychology and Sociology. Well-known Alignment proposals such as Coherent Extrapolated Volition or Value Learning would suggest that f is a statement about humans (so unlikely to be validly transferable to a society made up of some other sapient species, though there might be some similarities for evolutionary reasons, i.e. a form of at least sapient-species-level moral relativism), and thus that the only valid source of experimental evidence about f is humans (which would put your Free Agent in a less-informed but more objective position than a human ethical philosopher, unless it were based on an LLM or some other form of AI with some indirect access to human moral intuitions).
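As a purely illustrative sketch of what building in, or leaving explicitly uncertain, these further assumptions might look like, one could enumerate the candidate answers to “what kind of fact is f?” and attach to each the evidence sources it would license. Everything below (the MetaEthics enum, the EVIDENCE_SOURCES table) is hypothetical and mine, not anything proposed in the post.

```python
from enum import Enum, auto


class MetaEthics(Enum):
    """Candidate answers to 'what kind of fact is f: V -> R?' over which a Free Agent
    could either have one option built in, or maintain explicit uncertainty."""
    OBJECTIVE_REALISM = auto()           # f has a definite correct value, independent of culture
    CULTURAL_RELATIVISM = auto()         # f is a construct of a particular society at a given time
    BIOLOGICAL_REALISM = auto()          # f is deducible from evolutionary psychology / fitness
    CONSTRAINED_CONSTRUCTIVISM = auto()  # social construct constrained by evolved human psychology


# Which observations would count as valid evidence about f under each hypothesis
# (illustrative only; the real lists would be far longer and more contentious).
EVIDENCE_SOURCES = {
    MetaEthics.OBJECTIVE_REALISM: {"abstract argument", "mathematical structure"},
    MetaEthics.CULTURAL_RELATIVISM: {"surveys of the target society", "its laws and norms"},
    MetaEthics.BIOLOGICAL_REALISM: {"evolutionary psychology", "cross-cultural universals"},
    MetaEthics.CONSTRAINED_CONSTRUCTIVISM: {"surveys of the target society", "evolutionary psychology"},
}
```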
The presence of these common elements, and the lack of others, shouldn’t be surprising: not only do we share the same evolutionary biases (e.g. feeling empathy for each other), but we also apply similar reasoning principles to moral thinking, principles we learnt through our lives by interacting with the environment. Last but not least, differences in the learning environment (culture, education system et cetera) affect the way we reason and the conclusions we reach regarding morality.
I agree with these statements, but am unable to deduce from what you say which of these influences, if any, you regard as sources of valid evidence about f as opposed to sources of error. For example, if f is independent of culture (e.g. moral objectivism), then “differences in the learning environment (culture, education system et cetera)” can only induce errors (if perhaps more or less so in some cases than others). But if f is culturally dependent (cultural moral relativism), then cultural influences should generally be expected to be very informative.
A valid position would be to allow our Free Agent uncertainty over some of these sorts of Philosophy of Ethics questions. However, if we do so, I’m then uncertain whether the Free Agent will ever be able to find evidence on which to resolve these uncertainties (given the poor track record of human philosophers of ethics at this over the last few millennia). At a minimum, it strongly suggests your Free Agent would need to be superhuman to do so.
[If you’re curious, I’m personally a moral relativist and anti-realist. I regard designing an ethical system as designing software for a society, so I regard Sociology and human Evolutionary Psychology as important sources of constraints. Thus my viewpoint bears some loose resemblance to a morally anti-realist version of ethical naturalism, where naturalism imposes practical design constraints/preferences rather than a precise prescription.]
Thanks for your thoughts! I am not sure which of the points you made are most important to you, but I’ll try my best to give you some answers.
Under Further observations, I wrote:
The toy model described in the main body is supposed to be only indicative. I expect that actual implemented agents which work like independent thinkers will be more complex.
If the toy model I gave doesn’t help you, a viable option is to read the post ignoring the toy model and focusing only on the natural-language text.
Building an agent that is completely free of any bias whatsoever is impossible. I get your point about avoiding a consequentialist bias, but I am not sure it is particularly important here: in theory, the agent could develop a world model and an evaluation f reflecting the fact that value is actually determined by actions instead of world states. Another point of view: let’s say someone builds a very complex agent that at some point in its architecture uses MDPs with reward defined on actions; is this agent going to be biased towards deontology instead of consequentialism? Maybe, but the answer will depend on the other parts of the agent as well.
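For concreteness, here is a small sketch (my own, with made-up state and action names) of the two reward conventions being contrasted: the general MDP reward r(s, a, s′) specialises either to a state-based evaluation, which is just a value mapping over world states lifted to the MDP interface, or to an action-based one.

```python
from typing import Callable, Dict

State = str
Action = str
# General MDP reward signature: r(s, a, s') -> real value.
Reward = Callable[[State, Action, State], float]


def make_state_reward(f: Dict[State, float]) -> Reward:
    """'Consequentialist' convention: reward depends only on the resulting state,
    i.e. a value mapping over world states lifted to the MDP interface."""
    return lambda s, a, s_next: f.get(s_next, 0.0)


def make_action_reward(g: Dict[Action, float]) -> Reward:
    """'Deontological' convention: reward depends only on the action taken,
    regardless of which state results."""
    return lambda s, a, s_next: g.get(a, 0.0)


# Both plug into the same agent; whether either convention biases the agent's
# eventual moral reasoning depends on the rest of the architecture, as argued above.
r_consequentialist = make_state_reward({"no_one_in_pain": 1.0, "everyone_in_pain": -1.0})
r_deontological = make_action_reward({"tell_truth": 1.0, "lie": -1.0})
```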
You wrote:
I agree with these statements, but am unable to deduce from what you say which of these influences, if any, you regard as sources of valid evidence about f as opposed to sources of error. For example, if f is independent of culture (e.g. moral objectivism), then “differences in the learning environment (culture, education system et cetera)” can only induce errors (if perhaps more or less so in some cases than others). But if f is culturally dependent (cultural moral relativism), then cultural influences should generally be expected to be very informative.
It could also be that some basic moral statements are true and independent of culture (e.g. reducing pain for everyone is better than maximising pain for everyone), while others are in conflict with each other and the position reached depends on culture. The research idea is to run experiments in different environments and with different starting biases, and observe the results. Maybe there will be a lot of overlap and convergence! Maybe not.
thus that the only valid source of experimental evidence about f is humans (which would put your Free Agent in a less-informed but more objective position than a human ethical philosopher, unless it were based on an LLM or some other form of AI with some indirect access to human moral intuitions)
I am not sure I completely follow you when you are talking about experimental evidence about f, but the point you wrote in brackets is interesting. I had a similar thought at some point, along the lines of: “if a free agent didn’t have direct access to some ground truth, it might have to rely on human intuitions by virtue of the fact that they are the most reliable intuitions available”. Ideally, I would like to have an agent which is in a more objective position than a human ethical philosopher. In practice, the only efficiently implementable path might be based on LLMs.
It could also be that some basic moral statements are true and independent of culture (e.g. reducing pain for everyone is better than maximising pain for everyone), while others are in conflict with each other and the position reached depends on culture. The research idea is to run experiments in different environments and with different starting biases, and observe the results. Maybe there will be a lot of overlap and convergence! Maybe not.
I see. So rather than having a specific favored ethical philosophy viewpoint that you want to implement, your intention is to construct multiple Free Agents, perhaps with different ethical philosophical biases, allow them to think and learn from different experiences, and observe the results?
[Obviously this experiment could be extremely dangerous, for Free Agents significantly smarter than humans (if they were not properly contained, or managed to escape). Particularly if some of them disagreed over morality and, rather than agreeing to disagree, decided to use high-tech warfare to settle their moral disputes, before moving on to impose their moral opinions on any remaining humans.]
We have already run this experiment at length with humans, and the result is that there are frequent commonalities (the Golden Rule comes up a lot), but the conclusions still vary quite a lot (and war over minor disagreements is common). Humans do of course have similar levels of intelligence, evolutionary psychology, and inductive biases, and we cannot find out whether humans with IQ ~1000 would agree more, or less, than ones with IQ ~100.
My suspicion, in advance of the experiment, is that your Free Agents will also tend to have some frequent commonalities, but will also disagree quite a lot, partly based on the ethical philosophical biases built into them. Supposing this were the case, how would you propose then deciding which model(s) to put into widespread use in human society?
I am not sure I completely follow you when you are talking about experimental evidence about f
That was in the context of Coherent Extrapolated Volition and Value Learning, two related proposals both often made on Less Wrong. In ethical philosophy terms, both are relativist, anti-realist, and are usually assumed to be primarily consequentialist and utilitarian, while having some resemblance to ethical naturalism (but without its realist assumptions): The aim is for the AI to discover what humans want, how they value states of the world, in aggregate/on average (and in the case of CEV also with some “ethical extrapolation”), so that it can optimize that. In that context, f is a statement about the current human population/society, and is thus something that the AI clearly can and should do experiments on (polls, surveys, focus groups, sentiment analysis of conversations, for example).
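As a toy illustration of what doing experiments on f could mean in the CEV / Value Learning framing (my own sketch, with invented data and names): treat f as an empirical fact about the current population and estimate it, together with its uncertainty, from individual valuations gathered by polls or surveys.

```python
from statistics import mean, stdev
from typing import Dict, List, Tuple

State = str


def estimate_f(ratings: Dict[State, List[float]]) -> Dict[State, Tuple[float, float]]:
    """Aggregate individual human valuations of world states into an estimate of f
    (the mean rating) together with a rough uncertainty (the standard error)."""
    estimates: Dict[State, Tuple[float, float]] = {}
    for state, rs in ratings.items():
        se = stdev(rs) / len(rs) ** 0.5 if len(rs) > 1 else float("inf")
        estimates[state] = (mean(rs), se)
    return estimates


# Hypothetical survey data: each number is one respondent's rating of a world state.
survey = {
    "universal_basic_healthcare": [0.7, 0.9, 0.6, 0.8],
    "widespread_famine": [-0.9, -1.0, -0.8, -0.95],
}
print(estimate_f(survey))
```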
[Obviously this experiment could be extremely dangerous, for Free Agents significantly smarter than humans (if they were not properly contained, or managed to escape). Particularly if some of them disagreed over morality and, rather than agreeing to disagree, decided to use high-tech warfare to settle their moral disputes, before moving on to impose their moral opinions on any remaining humans.]
Labelling many different kinds of AI experiments as extremely dangerous seems to be a common trend among rationalists / LessWrongers / possibly some EA circles, but I doubt it’s true or helpful. This topic could itself be the subject of one (or many) separate posts. Here I’ll focus on your specific objection:
I haven’t claimed superintelligence is necessary to carry out experiments related to this research approach
I actually have already given examples of experiments that could be carried out today, and I wouldn’t be surprised if some readers came up with more interesting experiments that wouldn’t require superintelligence
Even if you are a superintelligent AI, you probably still have to do some work before you get to “use high-tech warfare”, whatever that means. Assuming that running experiments with smarter-than-human AI leads to catastrophic outcomes by default is a mistake: what if the smarter-than-human AI can only answer questions with a yes or a no? It also shows a lack of trust in AI and AI safety experimenters — it’s like assuming in advance they won’t be able to do their job properly (maybe I should say “won’t be able to do their job… at all”, or even “will do their job in basically the worst way possible”).
how would you propose then deciding which model(s) to put into widespread use in human society?
This doesn’t seem like the kind of decision that a single individual should make =)
Under Motivation in the appendix:
It is plausible that, at first, only a few ethicists or AI researchers will take a free agent’s moral beliefs into consideration.
Reaching this result would already be great. I think it’s difficult to predict what would happen next, and it seems very implausible that the large-scale outcomes will come down to the decision of a single person.
I haven’t claimed superintelligence is necessary to carry out experiments related to this research approach
Rereading carefully, that was actually my suggestion, based on how little traction human philosophers of ethics have made over the last couple of millennia. But I agree that having a wider range of inductive biases, and perhaps also more internal interpretability, might help without requiring superintelligence, and that’s where things start to get significantly dangerous.