[note: my reply is to the parts of your point that are not sensitive to the topic of the OP.]
ultimately it’s a generalization quality question: will the weights selected by the gradients of the typical losses we use, weights which generalize well from train to test and into initial deployment, also be resilient to weird future scenarios where the model has an opportunity to do something the original designers did not expect? or something they did expect, but would have liked to be able to guarantee the model would take real-life-game-theoretically-informed actions to avert? the youtube recommender is an example of the latter: it makes google’s own employees less productive due to its addictive qualities, but it cannot simply be made non-addictive, because of the game-theoretic landscape of having to compete against tiktok, which is heavily optimized for addictiveness.
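(as a purely illustrative toy, not part of the original point: a small numpy sketch of the gap between in-distribution generalization and generalization to a shifted deployment regime. the function and numbers are made up; the point is just that a fit that looks fine on held-out data from the training distribution can fall apart in a regime the gradients never saw.)

```python
# toy sketch (illustrative only): a fit that generalizes well from train to an
# i.i.d. test set can still degrade badly once the input distribution shifts.
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # the "real world" relationship: roughly linear near 0, very nonlinear elsewhere
    return np.sin(x)

# train and test sets drawn from the same narrow distribution
x_train = rng.uniform(-1.0, 1.0, 200)
x_test = rng.uniform(-1.0, 1.0, 200)
y_train = true_fn(x_train) + rng.normal(0.0, 0.05, x_train.shape)
y_test = true_fn(x_test) + rng.normal(0.0, 0.05, x_test.shape)

# the "weird future scenario": inputs far outside anything seen in training
x_shift = rng.uniform(3.0, 5.0, 200)
y_shift = true_fn(x_shift)

# fit a simple linear model; on [-1, 1] sin(x) is close to linear, so it looks fine
slope, intercept = np.polyfit(x_train, y_train, 1)

def predict(x):
    return slope * x + intercept

def mse(x, y):
    return float(np.mean((predict(x) - y) ** 2))

print(f"train mse:   {mse(x_train, y_train):.4f}")  # small
print(f"test mse:    {mse(x_test, y_test):.4f}")    # still small: same distribution
print(f"shifted mse: {mse(x_shift, y_shift):.4f}")  # large: a regime the loss never selected for
```

the analogy is loose, but it’s the same shape of question: the loss only constrains behavior in the situations it actually covered.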
Just to add something to this: both YouTube and TikTok are forced by Moloch into this “max addictiveness” loop. That means that if one of the companies has an ulterior motive (perhaps Google wants to manipulate future legislation, or TikTok wants sympathy for the politics of its host government), serving that motive costs the company revenue. How much revenue can the company afford to lose? Its “profit margin” tells you that.
It makes me wonder whether there is some way to make Moloch work for us, or whether we could measure how much an AI can afford to betray us by checking its “profit margin”.
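(A rough sketch of that “profit margin as a ceiling on affordable deviation” arithmetic; the figures and names below are invented, purely to make the bookkeeping explicit.)

```python
# back-of-the-envelope sketch with invented figures: the "slack" a competitor in
# the max-addictiveness race can spend on an ulterior motive before it stops
# being profitable is roughly its profit margin.

def affordable_deviation(revenue: float, costs: float) -> float:
    """Fraction of revenue that could be sacrificed to a non-revenue motive
    before profit hits zero, i.e. the profit margin."""
    profit = max(revenue - costs, 0.0)
    return profit / revenue

# hypothetical numbers, not anyone's real financials
platform_a = affordable_deviation(revenue=100e9, costs=75e9)  # 25% margin
platform_b = affordable_deviation(revenue=100e9, costs=97e9)  # 3% margin

print(f"platform A could divert up to {platform_a:.0%} of revenue to other goals")
print(f"platform B could divert up to {platform_b:.0%} of revenue to other goals")

# the analogous measurement for an AI would be its slack over what bare
# competition forces it to do: an upper bound on how much optimization it can
# afford to spend on goals its operators did not intend.
```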
right. in some sense, the societal-scale version of this problem is effectively the moloch alignment problem. the problem is that, to a significant extent, no one is in control of moloch at all; many orgs try to set limits on it, but those approaches are mostly not working. at a societal scale, individual sacrifices to moloch are like a mold growing around attempts to contain it, and the attempts to contain it are themselves usually also sort of sacrifices to moloch, just aimed a little differently. and the core of the societal-alignment-in-the-face-of-ai problem is: what happens if that mold’s growth rate speeds up a fuckton?
work on how to coordinate would be good, but then the issue is that the natural coordination groups tend to be the powerful coordinating against the powerless (price-fixing collusion, where we call the coordinating group a cartel). we need a whole-of-humanity coordination group, and there are various philosophies around about how to achieve that, but mostly it seems like we just don’t know how to do it, and need to, you know, solve incentive balancing for good real quick now. I’m a fan of the thing the wishful thinkers on the topic wish for, but I don’t think they know how to actually stop the growth of the mold without simply cutting away the good thing the mold grows on top of.