I’ll add another one to the list: “Human-level knowledge/human simulator”
Max Nadeau helped clarify some ways in which this framing introduced biases into my and others’ models of ELK and scalable oversight. Knowledge is hard to define and our labels/supervision might be tamperable in ways that are not intuitively related to human difficulty.
Different measurements of human difficulty only correlate at about 0.05 to 0.3, suggesting that human difficulty might not be a very meaningful concept for AI oversight, or that our current datasets for experimenting with scalable oversight don’t contain large enough gaps in difficulty to make meaningful measurements.
I’ll add another one to the list: “Human-level knowledge/human simulator”
Max Nadeau helped clarify some ways in which this framing introduced biases into my and others’ models of ELK and scalable oversight. Knowledge is hard to define and our labels/supervision might be tamperable in ways that are not intuitively related to human difficulty.
Different measurements of human difficulty only correlate at about 0.05 to 0.3, suggesting that human difficulty might not be a very meaningful concept for AI oversight, or that our current datasets for experimenting with scalable oversight don’t contain large enough gaps in difficulty to make meaningful measurements.