We quantify the human’s local preferences by asking “Look at the person you actually became. How happy are you with that person? Quantitatively, how much of your value was lost by replacing yourself with that person?” This gives us a loss on a scale from 0% (perfect idealization, losing nothing) to 100% (where all of the value is gone). Most of these losses will be exceptionally small, especially if we look at a short period like an hour.
Eventually, once the human becomes wise enough to totally epistemically dominate the original AI, they can assign a score to the AI’s actions. To make life simple for now, let’s ignore negative outcomes and just describe value as a scalar from 0% (barren universe) to 100% (all of the universe is used in an optimal way). Or we might use this “final scale” in a different way (e.g. to evaluate the AI’s actions rather than actually assessing outcomes, assigning high scores to corrigible and efficient behavior and somehow quantifying deviations from that ideal).
The utility is the product of all of these numbers, with each local loss counted as the fraction of value retained (one minus the loss).
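As a minimal sketch of that product, assuming each local loss enters as a retained fraction (1 - loss) and the final score is a fraction in [0, 1]. The function name `utility` and its parameters are hypothetical, chosen only for illustration:

```python
from math import prod


def utility(local_losses, final_score):
    """Multiply the retained-value fractions (1 - loss) by the final score.

    local_losses: per-period losses in [0, 1] (0 = nothing lost, 1 = all value gone).
    final_score:  final judgment in [0, 1] (0 = barren universe, 1 = optimal use).
    Assumption: losses are converted to retained fractions before multiplying.
    """
    for x in [*local_losses, final_score]:
        if not 0.0 <= x <= 1.0:
            raise ValueError(f"expected a value in [0, 1], got {x}")
    return final_score * prod(1.0 - loss for loss in local_losses)


# Three short periods with tiny losses, followed by a final score of 0.9.
print(utility([0.001, 0.002, 0.0005], 0.9))
```

Because every factor lies in [0, 1], many tiny local losses barely move the result, while a single period in which all value is lost drives the whole product to zero.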
If I follow correctly, the first step requires the humans to evaluate the output of narrow value learning, until this output becomes good enough to become universal with regard to the original AI and supervise it? I’m not sure I get why the AI wouldn’t be incentivized to tamper with the narrow value learning, à la Predict-o-matic? Depending on certain details (like maybe the indescribable hellworld hypothesis), maybe the AI can introduce changes to the partial imitations/deliberations that end up hidden and compounding until the imitations epistemically dominate the AI, and then they ask it to do simple stuff.
The hope is that any tampering large enough to corrupt the human’s final judgment would get a score of ~0 in the local value learning. 0 is the “right” score since the tampered human by hypothesis has lost all of the actual correlation with value. (Note that at the end you don’t need to “ask it to do simple stuff”; you can just directly assign a score of 1.)
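To see why one ~0 local score is enough, the product can be written out explicitly. The notation here is an assumption for illustration and not from the post: $\ell_t$ is the local loss at step $t$, $s$ is the final score, and each local factor is read as the retained fraction $1 - \ell_t$.

```latex
U = s \prod_{t=1}^{T} (1 - \ell_t), \qquad s \in [0,1],\ \ell_t \in [0,1]
\quad\Longrightarrow\quad
U \le 1 - \ell_k \ \text{ for every step } k .
```

So if tampering at step $k$ is detected and scored as $\ell_k \approx 1$, then $U \approx 0$ no matter what score the corrupted human assigns later.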
This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversight process also needs to understand that. And that’s what we are trying to do with all of this business about training AIs to answer questions honestly. The point is just that you don’t have to clear up any of the ambiguity about what the human wants, you just have to be able to detect someone tampering with deliberation. (And the operationalization of tampering doesn’t have to be so complex.)
(I’m not sure this made much sense. I have a draft of a related comment that I’ll probably post soon, but overall I expect to just leave this as not-making-much-sense for now.)
This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversight process also needs to understand that. And that’s what we are trying to do with all of this business about training AIs to answer questions honestly. The point is just that you don’t have to clear up any of the ambiguity about what the human wants, you just have to be able to detect someone tampering with deliberation. (And the operationalization of tampering doesn’t have to be so complex.)
So you want a sort of partial universality sufficient to bootstrap the process locally (without requiring an understanding of our values in fine detail), giving us enough time for a deliberation that would epistemically dominate the AI in a global sense (and get our values right)?
If that’s about right, then I agree that having this would make your proposal work, but I still don’t know how to get it. I need to read your previous posts on answering questions honestly.
You basically just need full universality / epistemic competitiveness locally. This is just getting around “what are values?”, not the need for competitiveness. Then the global thing is also epistemically competitive, and it is able to talk about e.g. how our values interact with the alien concepts uncovered by our AI (which we want to reserve time for since we don’t have any solution better than “actually figure everything out ‘ourselves’”).
Almost all of the time I’m thinking about how to get epistemic competitiveness for the local interaction. I think that’s the meat of the safety problem.