Well, if it doesn’t really value humans, it could demonstrate good behavior, deceptively, to make it out of training. If it is as smart as a human, it will understand that.
I think there are a lot of people banking on the good behavior towards humans being intrinsic: Intelligence > Wisdom > Benevolence towards these sentient humans. That’s what I take Scott Aaronson to be arguing.
In addition to people like Scott who engage directly with the concept of Orthogonality, I feel like everyone saying things like “Those terminator sci-fi scenarios are crazy!” are expressing a version of the Human Worth Hypothesis. They are saying approximately: “Oh, cmon, we made it. It’s going to like us. Why would it hate us?”
I’m suggesting we try and put this Human Worth Hypothesis to the test.
importantly the concept of orthogonality needs to be in the context of a reasonable training set in order to avert the counterarguments of irrelevance that are typically deployed. the relevant orthogonality argument is not just that arbitrary minds don’t implement resilient seeking towards the things humans want—whether that’s true depends on your prior for “arbitrary”, and it’s hard to get a completely clean prior for something vague like “arbitrary minds”; it’s that even from the developmental perspective of actual AI tech, ie when you do one of {imitation learn/unsupervised train on human behavior; or, supervised train on specific target behavior; or, rl train on a reasonably representative reward}, that the actual weights that are locally discoverable by a training process do not have as much gradient pressure as expected to implement resilient seeking of the intended outcomes, and are likely to generalize in ways that are bad in practical usage.
I interpret people who disbelieve Orthogonality to think there is some cosmic guardrail that protects against such process failures like poor seeking. How? What mechanism? No idea. But I believe they believe that. Hence my inclusion of ”...regardless of the process to create the intelligence.”
Most readers of Less Wrong believe Orthogonality.
But, I think the term is confusing and we need to talk about it in simpler terms like Human Worth Hypothesis. (Put the cookies on the low shelf for the kids.)
And, its worth some creative effort to design experiments to test the Human Worth hypothesis.
Imagine the headline: “Experiments demonstrate that frontier AI models do not value humanity.”
If it were believable, a lot of people would update.
I don’t think one needs to believe the human worth hypothesis to disbelieve strong orthogonality, one only needs to believe that gradient descent is able to actually find representations that correctly represent the important parts of the things the training data was intended by the algorithm designer to represent, eg for the youtube recommender this would be “does this enrich the user’s life enough to keep them coming back”, but what’s actually measured is just “how long do they come back”.
Well, if it doesn’t really value humans, it could demonstrate good behavior, deceptively, to make it out of training. If it is as smart as a human, it will understand that.
I think there are a lot of people banking on the good behavior towards humans being intrinsic: Intelligence > Wisdom > Benevolence towards these sentient humans. That’s what I take Scott Aaronson to be arguing.
In addition to people like Scott who engage directly with the concept of Orthogonality, I feel like everyone saying things like “Those terminator sci-fi scenarios are crazy!” are expressing a version of the Human Worth Hypothesis. They are saying approximately: “Oh, cmon, we made it. It’s going to like us. Why would it hate us?”
I’m suggesting we try and put this Human Worth Hypothesis to the test.
It feels like a lot is riding on it.
importantly the concept of orthogonality needs to be in the context of a reasonable training set in order to avert the counterarguments of irrelevance that are typically deployed. the relevant orthogonality argument is not just that arbitrary minds don’t implement resilient seeking towards the things humans want—whether that’s true depends on your prior for “arbitrary”, and it’s hard to get a completely clean prior for something vague like “arbitrary minds”; it’s that even from the developmental perspective of actual AI tech, ie when you do one of {imitation learn/unsupervised train on human behavior; or, supervised train on specific target behavior; or, rl train on a reasonably representative reward}, that the actual weights that are locally discoverable by a training process do not have as much gradient pressure as expected to implement resilient seeking of the intended outcomes, and are likely to generalize in ways that are bad in practical usage.
I interpret people who disbelieve Orthogonality to think there is some cosmic guardrail that protects against such process failures like poor seeking. How? What mechanism? No idea. But I believe they believe that. Hence my inclusion of ”...regardless of the process to create the intelligence.”
Most readers of Less Wrong believe Orthogonality.
But, I think the term is confusing and we need to talk about it in simpler terms like Human Worth Hypothesis. (Put the cookies on the low shelf for the kids.)
And, its worth some creative effort to design experiments to test the Human Worth hypothesis.
Imagine the headline: “Experiments demonstrate that frontier AI models do not value humanity.”
If it were believable, a lot of people would update.
I don’t think one needs to believe the human worth hypothesis to disbelieve strong orthogonality, one only needs to believe that gradient descent is able to actually find representations that correctly represent the important parts of the things the training data was intended by the algorithm designer to represent, eg for the youtube recommender this would be “does this enrich the user’s life enough to keep them coming back”, but what’s actually measured is just “how long do they come back”.