I mean, I’m not making a strong claim that we should punish an AGI for being deceptive and that will definitely indirectly lead to an AGI with an endorsed desire to be non-deceptive. There are a lot of things that can go wrong there. To pick one example, we’re also simultaneously punishing the AGI for “getting caught”. I hope we can come up with a better plan than that, e.g. a plan that “finds” the AGI’s self-concept using interpretability tools, and then intervenes on meta-preferences directly. I don’t have any plan for that, and it seems very hard for various reasons, but it’s one of the things I’ve been thinking about. (My “plan” here, such as it is, doesn’t really look like that.)
Hmm, reading between the lines, I wonder if your intuitions are sensing a more general asymmetry where rewards are generally more likely to lead to reflectively-endorsed preferences than punishments? If so, that seems pretty plausible to me, at least other things equal.
Mechanistically: If “nanotech” has positive valence, then “I am working on nanotech” would inherit some positive valence too (for details, see the paragraph that starts “slightly more detail”), as would brainstorming how to get nanotech, etc. Whereas if “deception” has negative valence (a.k.a. is aversive), then the very act of brainstorming whether something might be deceptive would itself be somewhat aversive, again for reasons mentioned here.
This is kinda related to confirmation bias. If the idea “my plan will fail” or “I’m wrong” is aversive, then “brainstorming how my plan might fail” or “brainstorming why I’m wrong” is somewhat aversive too. So people don’t do it. It’s just a deficiency of this kind of algorithm. It’s obviously not a fatal deficiency—at least some humans, sometimes, avoid confirmation bias. Basically, I think the trained model can learn a meta-heuristic that recognizes these situations (at least sometimes) and strongly votes to brainstorm anyway.
By the same token, I think it is true that the human brain RL algorithm has a default behavior of being less effective at avoiding punishments than seeking out rewards, because, again, brainstorming how to avoid punishments is aversive, and brainstorming how to get rewards is pleasant. (And the reflectively-endorsed-desire thing is a special case of that, or at least closely related.)
This deficiency in the algorithm might get magically patched over by a learned meta-heuristic, in which case maybe a set of punishments could lead to a reflectively-endorsed preference despite the odds stacked against it. We can also think about how to mitigate that problem by “rewarding the algorithm for acting virtuously” rather than punishing it for acting deceptively, or whatever.
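To make the asymmetry concrete, here is a toy sketch, not a model of any real brain or RL system. It assumes that a thought about a concept inherits a fixed fraction of that concept's valence, that brainstorming only happens when the resulting valence is positive, and that a learned meta-heuristic shows up as a flat bonus vote in aversive cases. The names `INHERIT`, `Agent`, and `meta_heuristic_bonus` are all made up for illustration.

```python
from dataclasses import dataclass

# Assumed fraction of a concept's valence that a thought about it inherits.
INHERIT = 0.5


@dataclass
class Agent:
    # Hypothetical learned "brainstorm anyway" vote; 0.0 means no meta-heuristic.
    meta_heuristic_bonus: float = 0.0

    def brainstorm_valence(self, concept_valence: float) -> float:
        """Valence of the act of brainstorming about a concept."""
        inherited = INHERIT * concept_valence
        # The meta-heuristic only kicks in when the topic is aversive.
        bonus = self.meta_heuristic_bonus if concept_valence < 0 else 0.0
        return inherited + bonus

    def will_brainstorm(self, concept_valence: float) -> bool:
        """The agent pursues a line of thought only if it feels net positive."""
        return self.brainstorm_valence(concept_valence) > 0.0


naive = Agent()
patched = Agent(meta_heuristic_bonus=1.0)

# Rewarding topic (e.g. "nanotech"): both agents happily brainstorm.
reward_topic = 1.0
# Aversive topic (e.g. "is this deceptive?"): only the patched agent does.
aversive_topic = -1.0
```

In this toy setup, `naive.will_brainstorm(aversive_topic)` is false while `patched.will_brainstorm(aversive_topic)` is true, which is the "patched over by a learned meta-heuristic" story in miniature. Whether a real trained model learns a large enough bonus is exactly the open question.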
(NB: I called that aspect of the brain algorithm a “deficiency” rather than “flaw” or “bug” because I don’t think it’s fixable without losing essential aspects of intelligence. I think the only way to get a “rational” AGI without confirmation bias etc. is to have the AGI read the Sequences, or rediscover the same ideas, or whatever, same as us humans, thus patching over all the algorithmic quirks with learned meta-heuristics. I think this is an area where I disagree with Nate & Eliezer.)
Anyway, I appreciate your comment!!
Yeah, thanks for engaging with me! You’ve definitely given me some food for thought, which I will probably go and chew on for a bit, instead of immediately replying with anything substantive. (The thing about rewards being more likely to lead to reflectively endorsed preferences feels interesting to me, but I don’t have fully put-together thoughts on that yet.)