I’m suggesting that we have an independent “auditor” process that’s capable of summarizing and answering questions about the AI’s reasoning at any level of detail, but that the “auditor” is not under direct control of the AI (if the AI gains control of the “auditor”, it’s escaped from its box). For example, the AI could generate detailed log files (as a fundamental part of its architecture, the AI would log all of its reasoning), and there could be a separate piece of software designed to interpret and summarize the log files. (This separate software might not need to be very intelligent.)
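To make the separation concrete, here's a toy sketch of the kind of thing I have in mind (the class names, the log format, and the queries are all made up for illustration; a real architecture would be far more involved):

```python
# Toy sketch: the AI appends every reasoning step to a log it never rewrites,
# and a separate, deliberately simple auditor only reads that log.
# Everything here is illustrative, not a real system.

import json
from datetime import datetime, timezone


class ReasoningLogger:
    """Append-only record of reasoning steps, written as a core part of the AI's loop."""

    def __init__(self, path: str):
        self.path = path

    def record(self, step_type: str, content: str) -> None:
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "type": step_type,   # e.g. "hypothesis", "search", "conclusion"
            "content": content,
        }
        # Opened in append mode only; past entries are never modified.
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")


class Auditor:
    """Separate, not-very-intelligent process: it only ever reads the log."""

    def __init__(self, path: str):
        self.path = path

    def _entries(self):
        with open(self.path) as f:
            return [json.loads(line) for line in f]

    def summarize(self) -> dict:
        """Count reasoning steps by type, giving a crude overview of what the AI did."""
        counts = {}
        for e in self._entries():
            counts[e["type"]] = counts.get(e["type"], 0) + 1
        return counts

    def query(self, keyword: str) -> list:
        """Return every logged step mentioning a keyword, for human inspection."""
        return [e["content"] for e in self._entries() if keyword in e["content"]]


if __name__ == "__main__":
    logger = ReasoningLogger("oracle_reasoning.log")
    logger.record("search", "considering candidate sources of bacon")
    auditor = Auditor("oracle_reasoning.log")
    print(auditor.summarize())  # {'search': 1}
```

The important property is that the auditor only ever reads: nothing in the oracle's process holds a handle to it.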
Well, if the auditor were not very intelligent, I would wonder if the oracle had cleverly followed a long chain of reasoning that would produce logs that would trick the auditor into answering your queries about the oracle’s reasoning to your satisfaction. Perhaps the chain of reasoning seems like a legitimate search for bacon, but due to subtleties inscrutable to anything less than a superintelligence, it actually leads to bacon that takes over the world.
If the auditor were sufficiently intelligent to see through such machinations, I would wonder if the auditor and oracle weren’t acausally cooperating.
I haven’t thought about this much, and I’m not confident at all. If you find my objections inadequate, they probably are.
I don't find this a very satisfactory reply. The auditor could have access to knowledge about the physical resources consumed (electrical usage, heat generated). An attempt to trick the auditor (whatever that means) would require a strictly larger search of the solution space, and therefore more computational energy expended.
It's not really clear what you mean by "trick the auditor" anyway. The auditor would have access to logs of the computational substrate running the AGI, portions of which could be inspected without summarization. Even if the AGI had such a capability, it would have had to develop that trickery at some point, and that development would itself be observable.
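To illustrate the cross-check I'm gesturing at, here's a toy version (the per-step energy figure and the tolerance are invented; a real check would need calibrated hardware measurements):

```python
# Toy illustration of the resource cross-check: compare the energy the logged
# reasoning should have cost against the energy actually measured at the
# hardware. A large surplus suggests unlogged computation. Figures are made up.

JOULES_PER_LOGGED_STEP = 0.05   # assumed average cost of one logged reasoning step
TOLERANCE = 1.10                # allow 10% overhead before flagging


def energy_discrepancy(logged_steps: int, measured_joules: float) -> bool:
    """Return True if measured energy use exceeds what the log accounts for."""
    expected = logged_steps * JOULES_PER_LOGGED_STEP
    return measured_joules > expected * TOLERANCE


if __name__ == "__main__":
    # The log records 10,000 steps (~500 J expected), but the meter reports 900 J.
    print(energy_discrepancy(logged_steps=10_000, measured_joules=900.0))  # True
```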
Don't worry, I haven't thought about it much either. I hope I'm not doing anything to pollute discussion of this topic. I figure in the worst case a more serious researcher can skim through our discussion to see if we thought of anything they didn't. (Maybe we should have a separate discussion thread about the reading one should definitely do before commenting on these topics, or about what we can do to explore the problem as thoroughly as possible before proposing solutions?)
The auditor would be able to see the oracle's reasoning about how to obscure its future chains of reasoning.
Ideally the oracle wouldn't have much awareness of the details of how things were logged or how the auditor worked. There's no reason why there couldn't be more than one auditor, btw, each using a different scheme for auditing.
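For instance, a toy version of the multiple-auditor setup might look like this (the two schemes are placeholder stand-ins, not serious proposals):

```python
# Toy sketch of the multiple-auditor idea: independent auditors, each using its
# own scheme, examine the same log; a human reviews any flag or disagreement.

from typing import List, Protocol


class AuditScheme(Protocol):
    def verdict(self, log_lines: List[str]) -> str: ...


class KeywordAuditor:
    """Flags the log if it mentions suspicious terms."""
    def verdict(self, log_lines: List[str]) -> str:
        bad = ("self-modify", "deceive", "escape")
        return "flag" if any(w in line for line in log_lines for w in bad) else "ok"


class VolumeAuditor:
    """Flags the log if it is implausibly short for the work claimed."""
    def verdict(self, log_lines: List[str]) -> str:
        return "flag" if len(log_lines) < 100 else "ok"


def cross_check(log_lines: List[str], auditors: List[AuditScheme]) -> bool:
    """Return True only if every independent auditor reports 'ok'."""
    return all(a.verdict(log_lines) == "ok" for a in auditors)
```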