Inner-monologue approaches to safety, in the new skin of ‘process supervision’, are popular now, so it might be good for me to pull out one point and expand on it. ‘Process supervision’ does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other: achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot understand in full detail; transcripts which accurately convey the fallible, flawed concepts and reasoning of a process that is not capable; or transcripts of a capable process which you do understand, but which are not what it actually thought (because they are misleading, wrong, or shallow ‘lies to children’ sorts of explanations).
If you want the best capability, you cannot force all cognition through the bottleneck of human-legible text with zero steganography, non-robust features, or overloading, because short text transcripts are not capable of representing or computing most of the desired tasks. Let’s take a concrete example: chess endgame tablebases. There are no human-interpretable representations of arbitrary tablebase ‘decisions’, because there are far too many possible positions and there is no simple short explanation for all of them. (Similarly, the probes of AlphaZero’s superhuman chess knowledge show that there is some overlap, but there is still a lot going on in AZ’s understanding of a position that is apparently not captured anywhere close to 100% by standard chess concepts. Which is not too surprising, because if human grandmasters had exactly the right concepts, they wouldn’t pick the wrong move half the time.) There is no interpretable ‘inner monologue’ of an endgame database any more than there is an interpretable inner monologue of quadrillions of gas molecules bouncing around in a container; they are simply brute facts, and if you are unsatisfied with the vague, lossy, high-level abstraction of ‘it needs to get the rook into play’ or ‘it’s hot’, then there is nothing you can do but follow the exact computation or physics for however many trillions of steps it takes.
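To make the tablebase point concrete, here is a minimal sketch using the python-chess library and a local copy of the Syzygy tablebases (the directory path is a placeholder): all the tablebase gives you for a position is a couple of looked-up numbers, with no accompanying rationale that could be transcribed.

```python
# Minimal sketch, assuming python-chess is installed and Syzygy tablebase
# files have been downloaded to a local directory (the path is a placeholder).
import chess
import chess.syzygy

# A 3-piece KRvK endgame position (Black: Kb3, Rb2; White: Kd1, to move).
board = chess.Board("8/8/8/8/8/1k6/1r6/3K4 w - - 0 1")

with chess.syzygy.open_tablebase("/path/to/syzygy") as tablebase:
    # The 'decision' is a brute-fact lookup over exhaustively enumerated
    # positions: win/draw/loss and distance-to-zeroing, nothing more.
    wdl = tablebase.probe_wdl(board)  # -2 .. 2: loss .. win for the side to move
    dtz = tablebase.probe_dtz(board)  # plies until the next zeroing move
    print(wdl, dtz)
```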
So, if an AI can bootstrap from human-level ‘process supervision’ to more superhuman results, as one wants, there will be a tradeoff: it will get more human-like as it approaches human level, but somewhere around that point it will start to diverge. (Another example: AlphaGo/Zero picks the same move as human experts along an inverted U-curve: it makes the same choice more often as it gets better, but then, after a point, it starts picking the human moves less often, because it has surpassed the humans.)
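For what it’s worth, the measurement behind that inverted U-curve is easy to sketch. The helper below is hypothetical and uses placeholder types rather than anything from the AlphaGo/AlphaZero papers, but it shows the quantity being plotted: the fraction of expert positions where the engine’s preferred move equals the recorded human move.

```python
# Hypothetical sketch of the move-agreement metric; the engine interface and
# position encoding are assumptions, not the papers' actual pipeline.
from typing import Callable, List, Tuple

Position = str  # placeholder: e.g. a FEN string or an SGF board encoding

def move_match_rate(
    engine_move: Callable[[Position], str],
    expert_games: List[Tuple[Position, str]],  # (position, move the expert played)
) -> float:
    """Fraction of positions where the engine picks the recorded expert move."""
    matches = sum(engine_move(pos) == expert for pos, expert in expert_games)
    return matches / len(expert_games)

# Plotting move_match_rate for successive training checkpoints is what traces
# the inverted U: agreement rises as the engine approaches human strength,
# then falls once it surpasses the experts it is being compared against.
```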
‘achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two’
I think we eventually want superhuman capabilities, but I don’t think they’re required in the near term, and in particular they’re not required to do a huge amount of AI safety research. So if we can choose the last two, and get a safe human-level AI system that way, I think it might be a good improvement over the status quo.
(The situation where labs choose not to, or are forbidden to, pursue superhuman capabilities, even though they could, is scary, but doesn’t seem impossible.)