I interpret people who disbelieve Orthogonality to think there is some cosmic guardrail that protects against process failures such as poor seeking. How? What mechanism? No idea. But I believe they believe that. Hence my inclusion of “...regardless of the process to create the intelligence.”
Most readers of Less Wrong believe Orthogonality.
But I think the term is confusing, and we need to talk about it in simpler terms, like the Human Worth Hypothesis. (Put the cookies on the low shelf for the kids.)
And it’s worth some creative effort to design experiments to test the Human Worth Hypothesis.
Imagine the headline: “Experiments demonstrate that frontier AI models do not value humanity.”
If it were believable, a lot of people would update.
I don’t think one needs to believe the Human Worth Hypothesis to disbelieve strong orthogonality. One only needs to believe that gradient descent can actually find representations that correctly capture the important parts of what the algorithm designer intended the training data to represent. E.g., for the YouTube recommender, the intended target would be “does this enrich the user’s life enough to keep them coming back”, but what’s actually measured is just “how long do they come back”.