I think the thought experiment that you propose is interesting, but it doesn’t isolate the different factors that may contribute to people’s intuitions. For example, it doesn’t distinguish between worries about making individual people powerful because of their values (e.g. they are selfish or sociopathic) and worries due to their decision-making processes. I think this is important because it seems likely that “amplifying” someone won’t fix value-based issues, but possibly will fix decision-making issues. If I had to propose a candidate crux, it would be more along the lines of how much of alignment can be solved by using a learning algorithm to learn solutions vs. how much of the problem needs to be solved “by hand” and understood on a deep level rather than learned. Along those lines, I found the postscript to Paul Christiano’s article on corrigibility interesting.
I think values and decision-making processes can’t be disentangled, because people’s values often stem from their decision-making processes. For example, someone might be selfish because they perceive the whole world to be selfish and uncaring, and so they act selfishly and uncaringly by default. This default behavior might cause the world to act selfishly and uncaringly toward them, further reinforcing their perception. If they fully understood that this was happening (rather than the world just being fundamentally selfish), they might experiment with acting more generously toward the rest of the world, observe the rest of the world act more generously in return, and in turn stop being selfish entirely.
In general I expect amplification to improve decision-making processes substantially, but in most cases to not improve them enough. For example, it’s not clear that amplifying someone will cause them to notice that their own policy of selfishness is locking them into a fixed point that they could “Löb out of” into a more preferable fixed point. I expect this to be particularly unlikely if, e.g., they believe their object-level values to be fixed and immutable, which might result in a fairly pernicious epistemic pit.
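To make the fixed-point picture concrete, here is a toy sketch (entirely my own construction; the “mirror world” dynamics and all the names are made up for illustration) of how an agent that treats its observations as fixed facts about the world stays stuck at one fixed point, while an agent that suspects the feedback loop can move to a better one:

```python
# Toy illustration only: a world that simply mirrors an agent's default policy
# back at it has two self-consistent fixed points, and an agent that treats its
# observations as immutable facts never discovers the better one.

def world_response(agent_action: str) -> str:
    """A hypothetical 'mirror' world: it treats the agent the way the agent treats it."""
    return agent_action  # "selfish" begets "selfish", "generous" begets "generous"

def naive_agent(observed_world: str) -> str:
    """Matches whatever it has observed, so it stays at whichever fixed point it starts in."""
    return observed_world

def experimenting_agent(observed_world: str) -> str:
    """Suspects the mirror structure and tries generosity despite past observations."""
    return "generous"

def run(agent, start_observation: str, steps: int = 5) -> str:
    obs = start_observation
    for _ in range(steps):
        action = agent(obs)
        obs = world_response(action)
    return obs

print(run(naive_agent, "selfish"))          # stays at the "selfish" fixed point
print(run(experimenting_agent, "selfish"))  # moves to the "generous" fixed point
```

The point is only that both behaviors are self-consistent, so nothing in the naive agent’s experience ever pushes it out of the worse equilibrium.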
My intuition is that most decision-making processes have room for subtle but significant improvements, that most people won’t realize these improvements upon amplification, and that failing to make these improvements will result in catastrophic amounts of waste. As another example, it seems quite plausible to me that:
the vast majority of human-value-satisfaction (e.g. human flourishing, or general reduction of suffering) comes from acausally trading with distant superintelligences.
most people will never care about (or even realize) acausal trade, even upon amplification.
I think decision making can have an impact on values, but that this depends on the design of the agent. In my comment, by values I had in mind something like “the thing that the agent is maximizing”. We can imagine an agent, like the paperclip maximizer, whose “decision making” ability doesn’t change its values. Is this agent in an “epistemic pit”? I think the agent is in a “pit” from our perspective, but it’s not clear that the pit is epistemic. One could model the paperclip maximizer as an agent whose epistemology is fine but that simply values different things than we do. In the same way, people could be worried about amplifying humans because they are worried that those humans will get stuck in non-epistemic pits rather than epistemic pits. For example, I think the discussion in other comments related to slavery partly has to do with this issue.
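To illustrate the sense of “values” I have in mind, here is a minimal sketch (my own toy construction, with hypothetical names) of an agent whose objective is fixed: improving its decision making, here just a deeper brute-force search, changes which plan it finds but never changes what it is maximizing.

```python
# Toy sketch: the agent's "values" are a fixed objective function. Better
# "decision making" (a longer planning horizon) changes which plan it finds,
# but the thing being maximized stays the same.

from itertools import product

ACTIONS = ["make_clip", "build_factory"]

def paperclips_made(plan) -> int:
    """The agent's fixed objective: plans are only ever scored by paperclip count."""
    return sum(2 if step == "build_factory" else 1 for step in plan)

def choose_plan(horizon: int):
    """Improved decision making = searching over longer plans; the objective is untouched."""
    return max(product(ACTIONS, repeat=horizon), key=paperclips_made)

print(choose_plan(1))  # weak planner, same values
print(choose_plan(3))  # stronger planner, same values
```

On this model, no amount of extra search pulls the agent out of its “pit”, because the pit isn’t epistemic in the first place.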
The extent to which a human’s “values” will improve as a result of improvements in their “decision making” seems to me to depend on having a psychological model of humans, and that model won’t necessarily be a good model of an AI. As a result, people may agree or disagree in their intuitions about amplification as applied to humans because of similarities or differences in their psychological models of those humans, while their agreement or disagreement about amplification as applied to AI may stem from different factors. In this sense, I’m not sure that differing intuitions about amplifying humans are necessarily a crux for differences about amplifying AIs.
In general I expect amplification to improve decision-making processes substantially, but in most cases to not improve them enough.
To me, this seems like a good candidate for something close to a crux between “optimistic” alignment strategies (in the vein of amplification/distillation) and “pessimistic” alignment strategies (in the vein of agent foundations). I see it like this: the “optimistic” approach is optimistic that certain aspects of metaphilosophical competence can be learned during the learning process, or else addressed by fairly simple adjustments to the design of the learning process, whereas the “pessimistic” approach doubts that solutions to certain issues can be learned, and so holds that we need to do a lot of hard work to arrive at solutions we understand on a deep level before we can align an agent. I’m not sure which is correct, but I do think this is a critical difference in the rationales underlying these approaches.