In a similar sense to how the agency you can currently write down about your system is probably not the real agency: even if you do manage to write down a system whose agency really is pointed in the direction that the agency of a human wants, that human is still part of the current organizational structures in society. Those organizational structures implement supervisor trees and competition networks, which means there appears to be more success available if the human uses their AI to participate in the competition networks better, and thus Goodharts whatever metrics are being competed on, probably related to money somehow.
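A minimal toy sketch of that Goodharting failure mode; the functions and numbers are made up purely for illustration: an optimizer selected only on the proxy metric keeps pushing the proxy up long after the thing the human actually cared about has peaked and started to fall.

```python
import random

# Toy illustration only: "proxy_score" stands in for whatever metric the
# competition rewards, "true_value" for what the human actually wanted.
# They agree under light optimization pressure and come apart under heavy pressure.

def proxy_score(x: float) -> float:
    return x                      # the measured metric rises monotonically with effort

def true_value(x: float) -> float:
    return x - 0.05 * x * x       # peaks at x = 10, then declines

random.seed(0)
x = 0.0
for _ in range(100):
    candidate = x + random.uniform(0.0, 1.0)
    if proxy_score(candidate) > proxy_score(x):   # selection happens purely on the proxy
        x = candidate

print(f"proxy score: {proxy_score(x):6.1f}")      # still climbing
print(f"true value:  {true_value(x):6.1f}")       # far past its peak, now strongly negative
```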
If your AI isn’t able to provide the wisdom needed to get a human from “inclined to accidentally use an obedient, powerful AI to destroy the world, despite this human’s verbal statements of intention to themselves” to “inclined to successfully execute on good intentions and achieve the interorganizational behaviors that make things better”, then I claim you’ve failed at the technical problem anyway, even though you succeeded at building an obedient AI.
If everyone tries to win at the current games (in the technical, game-theoretic sense of the word), everyone loses, including the highest-scoring players. The current societal layout has a lot of games where you can win to some degree in the short term, but where it seems to me the only long-term winning move is not to play, and instead to try to invent a way to jump into another game. Unfortunately it seems to me that humans are RLed pretty hard by doing a lot of playing of these games, and so having a powerful AI in front of them is likely to get most humans trying to win at those games. Pick an organization that you expect to develop powerful AGI: do you expect the people in that org to be able to think outside the framework of current society enough for their marginal contribution to push towards a better world, when the size of that contribution suddenly gets very large?
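And a minimal payoff sketch of the “everyone tries to win, everyone loses” structure; the numbers are arbitrary and only the arms-race shape matters: competing buys each player a private edge while imposing a cost on all players, so the all-compete outcome is worse for everyone than the no-game outcome, yet unilaterally stepping out is worse still.

```python
# Arms-race-shaped toy payoffs (arbitrary numbers; only the structure matters).

def payoff(my_effort: int, total_effort: int) -> float:
    private_edge = 3 * my_effort    # what competing harder buys you relative to others
    shared_cost = 2 * total_effort  # what the whole race costs everyone
    return 10 + private_edge - shared_cost

n = 5
print([payoff(1, n) for _ in range(n)])   # everyone competes: 3.0 each
print([payoff(0, 0) for _ in range(n)])   # nobody competes: 10.0 each
print(payoff(0, n - 1))                   # lone non-competitor among competitors: 2.0
# Competing is individually rational (3 > 2), yet the all-compete equilibrium (3 each)
# is worse for every player, highest scorer included, than not playing at all (10 each).
```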
I found your reply really interesting.
Because I find it so interesting and want to understand it: What does the “RLed” in “Unfortunately it seems to me that humans are RLed pretty hard by doing a lot of playing of these games” mean? That term is not familiar to me.
Reinforcement learning.

Like Seth said, I just mean reinforcement learning. Described in more typical language: people take their feelings of success from whether they’re winning at the player-vs-environment and player-vs-player contests one encounters in everyday life, and opportunities to change what contests are possible are unfamiliar. I also think there are decision theory issues[1] humans have. And then, of course, people do in fact have different preferences and moral values. But even among people where neither issue is in play, I think people have pretty bad self-misalignment as a result of taking what-feels-good-to-succeed-at feedback from circumstances that train them into habits which work well in the original context, and which typically fail badly to produce useful behavior in contexts like “you can massively change things for the better” (there’s a toy sketch of this after the footnote). “Being prepared for unreasonable success” is a common phrase referring to this issue, I think.

[1] In case this is useful context: a decision theory is a small mathematical expression which roughly expresses “what part of past, present, and future do you see as the you-which-decides-together”, or, stated slightly more technically, the expression that defines how you consider counterfactuals when evaluating possible actions you “could [have] take[n]”. I’m pretty sure humans have some native one, and it’s not exactly any of the ones that are typically discussed, but rather something vaguely in the direction of active inference, though people vary in which of the typically discussed ones they approximate. The commonly discussed ones around these parts are things like EDT, CDT, and the LDTs { FDT, UDT, LIDT, … }.
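To make the “RLed pretty hard by doing a lot of playing of these games” point concrete, here is a minimal reinforcement-learning sketch; the two-action framing, payoffs, and environment shift are my own toy assumptions, not anything from the thread: an agent whose value estimates were trained where the familiar competition was the rewarding thing keeps choosing it out of habit even after the context changes and a much bigger lever becomes available.

```python
import random

# Toy value-learning agent. Action 0: play the familiar competition.
# Action 1: try to change what game is being played.
random.seed(0)
values = [0.0, 0.0]          # learned value estimate per action
lr, epsilon = 0.1, 0.1

def update(action: int, reward: float) -> None:
    values[action] += lr * (reward - values[action])   # standard incremental update

# Phase 1: the original context. Winning the familiar game pays off reliably;
# trying to change the game pays nothing here.
for _ in range(5000):
    a = random.randrange(2) if random.random() < epsilon else values.index(max(values))
    r = (1.0 if random.random() < 0.6 else 0.0) if a == 0 else 0.0
    update(a, r)
print(f"value estimates after training: {values}")

# Phase 2: the context changes -- action 1 suddenly pays far more -- but the agent
# now acts purely from its trained habit (no exploration).
choices = []
for _ in range(1000):
    a = values.index(max(values))
    choices.append(a)
    update(a, 0.5 if a == 0 else 10.0)
print(f"post-shift steps still spent on the familiar game: {choices.count(0) / len(choices):.0%}")
# Nothing in training ever made action 1 feel like success, so the agent never
# samples it after the shift and the old habit persists indefinitely.
```

Re-enabling exploration in the second phase would eventually correct this; the sketch is only meant to show how reward taken from the original context shapes the default behavior.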