One experiment is worth more than all the opinions.
IMHO, no, there is not a coherent argument for the human worth hypothesis. My money is on it being disproven.
But, I assert the human worth hypothesis is the explicit belief of smart people like Scott Aaronson and the implicit belief of a lot of other people who think AI will be just fine. As Scott says Orthogonality is “a central linchpin” of the doom argument.
Can we be more clear about what people do believe at get at it with experiments?? That’s the question I’m asking.
It’s hard to construct experiments to prove all kinds of minds are possible, that is, to prove Orthogonality.
I think it may be less hard to quantify what an agent values. (Deception, yes. Still...)
It’s hard to construct experiments to prove all kinds of minds are possible, that is, to prove Orthogonality.
You can simply make a reinforcement learning environment that does not reward being nice to “humans” in grid world and “prove” Orthogonality.
We don’t even have to do the experiment, I know if we made a grid world crazy taxi environment where there is no penalty for running over pedestrians, and use any RL algorithm, the algorithm will...run over pedestrians once it finds the optimal solution.
We also know the “gridworld” of real physics we exist in, big picture, doesn’t penalize murdering your ancestors and taking their stuff, because our ancestors were the winners of such contests. Hence why we know at a cosmic scale this is the ultimate solution any optimizing algorithm will find.
It’s just that it is difficult to imagine a true SOTA model that humans actually use for anything that doesn’t care about human values, empirically in the output.
Meaning it doesn’t have to “really care” but any model that consistently advised humans to suicide in chats, or crashes autonomous cars will never make it out of training. (Having it fail sometimes, or with specific inputs, is expected behavior with current tech)
[note: my reply is to the parts of your point that are not sensitive to the topic of the OP.]
ultimately it’s a generalization quality question. will the weights indicated by gradients of the typical losses we use and which generalize well from train to test and initial deployment also be resilient to weird future scenarios where there’s an opportunity for the model to do something the original designers did not expect, or which they did expect but would have liked to be able to guarantee the model would take real-life-game-theoretically-informed actions to avert, such as is the case for the youtube recommender, which makes google employees less productive due to its addictive qualities but which cannot simply be made not-addictive due to the game theoretic landscape of having to compete against tiktok, which is heavily optimized for addictiveness.
Just to add something to this: both YouTube and Tiktok are forced by moloch into this “max addictiveness” loop. Meaning if say one of the companies has an ulterior motive—perhaps Google wants to manipulate future legislation or tik tok wants sympathy for the politics of their host government—serving this motive costs these companies revenue. How much revenue can the company afford to lose? Their “profit margin” tells you that.
It makes me wonder if there were some way to make moloch work for us, or to measure how able an AI is able to betray by checking its “profit margin”.
right. in some sense, the societal-scale version of this problem is effectively the moloch alignment problem. the problem is that to a significant extent no one is in control of moloch at all; many orgs try to set limits on it, but those approaches are mostly not working. at a societal scale, individual sacrifices to moloch are like a mold growing around attempts to contain it, and attempts to contain it are themselves usually also sort of sacrifices to moloch just aimed a little differently. and the core of the societal-alignment-in-the-face-of-ai problem is, what happens if that mold’s growth rate speeds up a fuckton?
work on how to coordinate would be good, but then the issue is that the natural coordination groups tend to be the most powerful against the powerless (price fixing collusion, and we call the coordination group a cartel). we need a whole-of-humanity coordination group, and there are various philosophies around about how to achieve that, but mostly it seems like we actually just don’t know how to do it and need to, you know, like, solve incentive balancing for good real quick now. I’m a fan of the thing the wishful thinkers on the topic wish for, but I don’t think they know how to actually stop the growth of the mold without simply cutting away the good thing the mold grows on top of.
Well, if it doesn’t really value humans, it could demonstrate good behavior, deceptively, to make it out of training. If it is as smart as a human, it will understand that.
I think there are a lot of people banking on the good behavior towards humans being intrinsic: Intelligence > Wisdom > Benevolence towards these sentient humans. That’s what I take Scott Aaronson to be arguing.
In addition to people like Scott who engage directly with the concept of Orthogonality, I feel like everyone saying things like “Those terminator sci-fi scenarios are crazy!” are expressing a version of the Human Worth Hypothesis. They are saying approximately: “Oh, cmon, we made it. It’s going to like us. Why would it hate us?”
I’m suggesting we try and put this Human Worth Hypothesis to the test.
importantly the concept of orthogonality needs to be in the context of a reasonable training set in order to avert the counterarguments of irrelevance that are typically deployed. the relevant orthogonality argument is not just that arbitrary minds don’t implement resilient seeking towards the things humans want—whether that’s true depends on your prior for “arbitrary”, and it’s hard to get a completely clean prior for something vague like “arbitrary minds”; it’s that even from the developmental perspective of actual AI tech, ie when you do one of {imitation learn/unsupervised train on human behavior; or, supervised train on specific target behavior; or, rl train on a reasonably representative reward}, that the actual weights that are locally discoverable by a training process do not have as much gradient pressure as expected to implement resilient seeking of the intended outcomes, and are likely to generalize in ways that are bad in practical usage.
I interpret people who disbelieve Orthogonality to think there is some cosmic guardrail that protects against such process failures like poor seeking. How? What mechanism? No idea. But I believe they believe that. Hence my inclusion of ”...regardless of the process to create the intelligence.”
Most readers of Less Wrong believe Orthogonality.
But, I think the term is confusing and we need to talk about it in simpler terms like Human Worth Hypothesis. (Put the cookies on the low shelf for the kids.)
And, its worth some creative effort to design experiments to test the Human Worth hypothesis.
Imagine the headline: “Experiments demonstrate that frontier AI models do not value humanity.”
If it were believable, a lot of people would update.
I don’t think one needs to believe the human worth hypothesis to disbelieve strong orthogonality, one only needs to believe that gradient descent is able to actually find representations that correctly represent the important parts of the things the training data was intended by the algorithm designer to represent, eg for the youtube recommender this would be “does this enrich the user’s life enough to keep them coming back”, but what’s actually measured is just “how long do they come back”.
One experiment is worth more than all the opinions.
IMHO, no, there is not a coherent argument for the human worth hypothesis. My money is on it being disproven.
But, I assert the human worth hypothesis is the explicit belief of smart people like Scott Aaronson and the implicit belief of a lot of other people who think AI will be just fine. As Scott says Orthogonality is “a central linchpin” of the doom argument.
Can we be more clear about what people do believe at get at it with experiments?? That’s the question I’m asking.
It’s hard to construct experiments to prove all kinds of minds are possible, that is, to prove Orthogonality.
I think it may be less hard to quantify what an agent values. (Deception, yes. Still...)
You can simply make a reinforcement learning environment that does not reward being nice to “humans” in grid world and “prove” Orthogonality.
We don’t even have to do the experiment, I know if we made a grid world crazy taxi environment where there is no penalty for running over pedestrians, and use any RL algorithm, the algorithm will...run over pedestrians once it finds the optimal solution.
We also know the “gridworld” of real physics we exist in, big picture, doesn’t penalize murdering your ancestors and taking their stuff, because our ancestors were the winners of such contests. Hence why we know at a cosmic scale this is the ultimate solution any optimizing algorithm will find.
It’s just that it is difficult to imagine a true SOTA model that humans actually use for anything that doesn’t care about human values, empirically in the output.
Meaning it doesn’t have to “really care” but any model that consistently advised humans to suicide in chats, or crashes autonomous cars will never make it out of training. (Having it fail sometimes, or with specific inputs, is expected behavior with current tech)
[note: my reply is to the parts of your point that are not sensitive to the topic of the OP.]
ultimately it’s a generalization quality question. will the weights indicated by gradients of the typical losses we use and which generalize well from train to test and initial deployment also be resilient to weird future scenarios where there’s an opportunity for the model to do something the original designers did not expect, or which they did expect but would have liked to be able to guarantee the model would take real-life-game-theoretically-informed actions to avert, such as is the case for the youtube recommender, which makes google employees less productive due to its addictive qualities but which cannot simply be made not-addictive due to the game theoretic landscape of having to compete against tiktok, which is heavily optimized for addictiveness.
Just to add something to this: both YouTube and Tiktok are forced by moloch into this “max addictiveness” loop. Meaning if say one of the companies has an ulterior motive—perhaps Google wants to manipulate future legislation or tik tok wants sympathy for the politics of their host government—serving this motive costs these companies revenue. How much revenue can the company afford to lose? Their “profit margin” tells you that.
It makes me wonder if there were some way to make moloch work for us, or to measure how able an AI is able to betray by checking its “profit margin”.
right. in some sense, the societal-scale version of this problem is effectively the moloch alignment problem. the problem is that to a significant extent no one is in control of moloch at all; many orgs try to set limits on it, but those approaches are mostly not working. at a societal scale, individual sacrifices to moloch are like a mold growing around attempts to contain it, and attempts to contain it are themselves usually also sort of sacrifices to moloch just aimed a little differently. and the core of the societal-alignment-in-the-face-of-ai problem is, what happens if that mold’s growth rate speeds up a fuckton?
work on how to coordinate would be good, but then the issue is that the natural coordination groups tend to be the most powerful against the powerless (price fixing collusion, and we call the coordination group a cartel). we need a whole-of-humanity coordination group, and there are various philosophies around about how to achieve that, but mostly it seems like we actually just don’t know how to do it and need to, you know, like, solve incentive balancing for good real quick now. I’m a fan of the thing the wishful thinkers on the topic wish for, but I don’t think they know how to actually stop the growth of the mold without simply cutting away the good thing the mold grows on top of.
Well, if it doesn’t really value humans, it could demonstrate good behavior, deceptively, to make it out of training. If it is as smart as a human, it will understand that.
I think there are a lot of people banking on the good behavior towards humans being intrinsic: Intelligence > Wisdom > Benevolence towards these sentient humans. That’s what I take Scott Aaronson to be arguing.
In addition to people like Scott who engage directly with the concept of Orthogonality, I feel like everyone saying things like “Those terminator sci-fi scenarios are crazy!” are expressing a version of the Human Worth Hypothesis. They are saying approximately: “Oh, cmon, we made it. It’s going to like us. Why would it hate us?”
I’m suggesting we try and put this Human Worth Hypothesis to the test.
It feels like a lot is riding on it.
importantly the concept of orthogonality needs to be in the context of a reasonable training set in order to avert the counterarguments of irrelevance that are typically deployed. the relevant orthogonality argument is not just that arbitrary minds don’t implement resilient seeking towards the things humans want—whether that’s true depends on your prior for “arbitrary”, and it’s hard to get a completely clean prior for something vague like “arbitrary minds”; it’s that even from the developmental perspective of actual AI tech, ie when you do one of {imitation learn/unsupervised train on human behavior; or, supervised train on specific target behavior; or, rl train on a reasonably representative reward}, that the actual weights that are locally discoverable by a training process do not have as much gradient pressure as expected to implement resilient seeking of the intended outcomes, and are likely to generalize in ways that are bad in practical usage.
I interpret people who disbelieve Orthogonality to think there is some cosmic guardrail that protects against such process failures like poor seeking. How? What mechanism? No idea. But I believe they believe that. Hence my inclusion of ”...regardless of the process to create the intelligence.”
Most readers of Less Wrong believe Orthogonality.
But, I think the term is confusing and we need to talk about it in simpler terms like Human Worth Hypothesis. (Put the cookies on the low shelf for the kids.)
And, its worth some creative effort to design experiments to test the Human Worth hypothesis.
Imagine the headline: “Experiments demonstrate that frontier AI models do not value humanity.”
If it were believable, a lot of people would update.
I don’t think one needs to believe the human worth hypothesis to disbelieve strong orthogonality, one only needs to believe that gradient descent is able to actually find representations that correctly represent the important parts of the things the training data was intended by the algorithm designer to represent, eg for the youtube recommender this would be “does this enrich the user’s life enough to keep them coming back”, but what’s actually measured is just “how long do they come back”.