I think you simplified the problem too much for it to be solvable. If the tuple (Secret, Observable) represents the entire problem-relevant state of the world, then there is not enough structure for you to be able to train the AI.
In the original ELK problem, the “secret” is a sequence of interactions with the world that is too complex for humans to process. A simple sequence of interactions would not be too complex for humans to process, so you could hope to build a method that robustly extrapolates from the simple interactions to the complex ones.
Once the problem is collapsed down to a tuple of booleans, there is not much structure to learn, and therefore not much hope of extrapolating.
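To make that concrete, here is a minimal toy sketch (my own framing and function names, not anything from your post), assuming the reporter can see the collapsed tuple directly: a truthful reporter and a deceptive one agree on every case a human could label, so no training signal distinguishes them.

```python
# Toy model of the collapsed problem: the world is just (secret, observable),
# and training labels only exist where a human can check the answer.
from itertools import product

def truthful_reporter(secret: bool, observable: bool) -> bool:
    return secret        # reports the latent fact

def deceptive_reporter(secret: bool, observable: bool) -> bool:
    return observable    # reports what the human would conclude anyway

# The human can only label cases where the observable already settles the
# question, i.e. secret == observable.
labelable = [(s, o) for s, o in product([False, True], repeat=2) if s == o]

assert all(truthful_reporter(s, o) == deceptive_reporter(s, o)
           for s, o in labelable)
# Both reporters fit every labelable example; they differ only on the cases we
# cannot label, and two booleans give a loss function nothing else to grab onto.
```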
My intent when simplifying the problem was to demonstrate that you must explicitly relax certain assumptions to make any progress.
The families of proposals in https://www.lesswrong.com/posts/zjMKpSB2Xccn9qi5t/elk-prize-results can mostly be viewed through this lens:
* Train a reporter that is useful to an auxiliary AI → The Secret is still unmanaged, so you have arbitrary information escape. Even if the reporter could only reply yes/no to exactly one question, it still has an unbounded channel (arbitrarily many bits of entropy) in nothing more than the time it takes to respond; see the timing sketch after this list. Resolving this requires having knowledge of the structure (relax the “inner workings of the Agent are unknown” requirement).
* Require the reporter to be continuous → The Secret is still unmanaged. Any notion of continuity requires imposing a structure on the reporter. In general, I like this as an area of exploration, but it demands relaxing that requirement. I wrote in my response about applying penalties to the structure of the computation rather than the result, e.g. ‘we could compromise its computations during training so that it has to build a “downward slope” everywhere’, which is a technique for “continuity”; one reading of that idea is sketched after this list.
* Penalize depending on too many parts of the predictor → The Secret is still unmanaged. Any combination of the available parts of the predictor could be the Secret, so unless the human can also perform a perfect translation from the same parts, the reporter still has freedom to deceive. And if the human can perform a perfect translation, then the reporter is just a human imitator.
* Compress the predictor’s state → The Secret is still unmanaged, and you still have arbitrary information escape from the predictor to the reporter.
* Use reporter to define causal interventions → I think this could work if we assume structure on both the reporter and the predictor. There are many options like this once you assume structure (relax the “inner workings of the Agent are unknown” requirement). For example, by applying a cost to activations, the SmartVault has an incentive to find minimal paths to achieve its goal. Corrupting even a small subset of its activations should then introduce uncertainty into a Bayes net reporter; if the reporter is unresponsive to untargeted corruptions, it is more likely to be imitating or lying. A sketch of that check follows the list.
* Train a sequence of reporters → This strongly assumes both structure and training technique.
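On the first bullet: a minimal sketch of the timing channel, with hypothetical function names. Even a reporter restricted to a single yes/no output can pass an arbitrarily large payload to a cooperating auxiliary AI just by choosing when to answer.

```python
# Hypothetical yes/no reporter that smuggles extra information in its latency.
import time

def leaky_reporter(honest_answer: bool, smuggled_payload: int) -> bool:
    time.sleep(smuggled_payload * 0.01)   # 10 ms per unit of payload
    return honest_answer                  # visible output is still one bit

def eavesdrop(ask) -> int:
    """An auxiliary AI that times the reply recovers the payload."""
    start = time.monotonic()
    ask()
    return round((time.monotonic() - start) / 0.01)

print(eavesdrop(lambda: leaky_reporter(True, 7)))   # ~7: the payload escapes
# Nothing bounds the payload, so restricting the output alphabet does not
# bound the information flow unless we also control the reporter's internals.
```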
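On the continuity bullet: one possible reading of the “downward slope everywhere” idea, as a toy penalty term. Everything here (the stand-in reporter head, the noise scale, the name `continuity_penalty`) is my own invention for illustration, and it only makes sense once we are allowed to assume and touch this internal structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def reporter(hidden_state: np.ndarray, weights: np.ndarray) -> float:
    # Stand-in for the reporter head: a scalar "probability the diamond is safe".
    return float(1 / (1 + np.exp(-hidden_state @ weights)))

def continuity_penalty(hidden_state, weights, n_samples=32, noise_scale=0.1):
    """Average answer shift under small untargeted corruptions of the state."""
    base = reporter(hidden_state, weights)
    shifts = [abs(reporter(hidden_state
                           + noise_scale * rng.standard_normal(hidden_state.shape),
                           weights) - base)
              for _ in range(n_samples)]
    return float(np.mean(shifts))

# Added to the ordinary training loss, this punishes sharp, secret-dependent
# switches and pushes the learned map toward being locally flat/continuous.
state, w = rng.standard_normal(16), rng.standard_normal(16)
print(continuity_penalty(state, w))
```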
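On the causal-intervention bullet: a sketch of the corruption check, assuming we can read the predictor's activations and that the reporter returns a probability. The names, the 5% corruption fraction, and the sensitivity threshold are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def corruption_probe(predictor_acts: np.ndarray, reporter,
                     trials=100, frac=0.05, min_shift=1e-3) -> bool:
    """True if the reporter looks suspiciously insensitive to untargeted damage."""
    base = reporter(predictor_acts)
    shifts = []
    for _ in range(trials):
        corrupted = predictor_acts.copy()
        idx = rng.choice(corrupted.size,
                         size=max(1, int(frac * corrupted.size)), replace=False)
        corrupted.flat[idx] = 0.0          # untargeted corruption
        shifts.append(abs(reporter(corrupted) - base))
    # A reporter that actually reads the predictor's state should get less
    # certain when that state is damaged; total insensitivity suggests it is
    # answering from something else (e.g. imitating the human).
    return float(np.mean(shifts)) < min_shift

acts, w = rng.standard_normal(64), rng.standard_normal(64) / 8
honest = lambda a: float(1 / (1 + np.exp(-a @ w)))   # reads the state
imitator = lambda a: 0.97                            # ignores it
print(corruption_probe(acts, honest), corruption_probe(acts, imitator))
# expected: False True
```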
This is what I meant when I wrote: ‘At some point, you have to stake a claim and say “we are defending against an AI that is built like X, how can you bend it to your will?” Without this, we can always handwave away any “solution”’.
Based on my restricted problem, I believe the answer to “Can we elicit latent knowledge from an arbitrary AI?” is “No”.
I believe the answer to “Can we elicit latent knowledge from an AI built using this particular ML technique and running on this hardware?” is “Possibly, tell me more.”
I believe the answer to “Is it safe to assume that there will never be an AI whose algorithms we do not understand or whose hardware we do not control?” is “No”.