The hope is that a tampering large enough to corrupt the human’s final judgment would get a score of ~0 in the local value learning. 0 is the “right” score, since the tampered human has by hypothesis lost all of the actual correlation with value. (Note that at the end you don’t need to “ask it to do simple stuff”; you can just directly assign a score of 1.)
This hope does require the local oversight process to be epistemically competitive with the AI, in the sense that e.g. if the AI understands something subtle about the environment dynamics then the oversight process also needs to understand that. And that’s what we are trying to do with all of this business about training AIs to answer questions honestly. The point is just that you don’t have to clear up any of the ambiguity about what the human wants, you just have to be able to detect someone tampering with deliberation. (And the operationalization of tampering doesn’t have to be so complex.)
(I’m not sure if this made too much sense. I have a draft of a related comment that I’ll probably post soon, but overall I expect to just leave this as not-making-much-sense for now.)
So you want a sort of partial universality sufficient to bootstrap the process locally (while not requiring an understanding of our values in fine detail), giving us enough time for a deliberation that would epistemically dominate the AI in a global sense (and get our values right)?
If that’s about right, then I agree that having this would make your proposal work, but I still don’t know how to get it. I need to read your previous posts on answering questions honestly.
You basically just need full universality / epistemic competitiveness locally. This is just getting around “what are values?”, not the need for competitiveness. Then the global thing is also epistemically competitive, and it is able to talk about e.g. how our values interact with the alien concepts uncovered by our AI (which we want to reserve time for, since we don’t have any solution better than “actually figure everything out ‘ourselves’”).
Almost all of the time I’m thinking about how to get epistemic competitiveness for the local interaction. I think that’s the meat of the safety problem.