One concept I kept going back to in my head while reading was interactive protocols in complexity theory. It would seem that some way to extract the inaccessible information would be to train the model to make interactive protocol works for verification of the queries. That being said, I don’t know how to do that, and it might be ridiculously difficult.
Another point I’d like to make is that sometimes, inaccessible information is a feature, not a bug. I would far prefer a model of me that can predict when I will need medical supply, but cannot be queried for getting my thoughts. Basically, I’d like some sort of zero-knowledge proof, to keep the intuition of interactive protocols.
This might seem irrelevant to the safety issue here, but I think it shows that there are incentives for inaccessibility, and sometimes “good” incentives. So the problem might be even harder to solve than expected, because even with a solution, people won’t want it to be implemented.
Thanks for the post!
One concept I kept going back to in my head while reading was interactive protocols in complexity theory. It would seem that some way to extract the inaccessible information would be to train the model to make interactive protocol works for verification of the queries. That being said, I don’t know how to do that, and it might be ridiculously difficult.
Another point I’d like to make is that sometimes, inaccessible information is a feature, not a bug. I would far prefer a model of me that can predict when I will need medical supply, but cannot be queried for getting my thoughts. Basically, I’d like some sort of zero-knowledge proof, to keep the intuition of interactive protocols.
This might seem irrelevant to the safety issue here, but I think it shows that there are incentives for inaccessibility, and sometimes “good” incentives. So the problem might be even harder to solve than expected, because even with a solution, people won’t want it to be implemented.