Trying to think of ways that experiments can set up where AI-systems use other AI-systems as oracles/genies (1 or more being superintelligent), perhaps sometimes in “adversarial” ways. Exploring methods for asking requests and maybe finding out things about how hard/easy it is to “trick” operator (seeming to provide what they want without providing what they want) given various domains/methods.
May e.g. involve one AI asking for code-pieces for some purpose, but where other AI is to try to “hide” ways in which delivered code isn’t quite what the other AI wants.
Superintelligence may realize what’s going on (or may be going on), and act accordingly. But nontheless maaybe some useful info could be gained? 🤔
Perhaps experiments could make use of encryption in some way that prevented AGIs from doing/verifying work themselves, making it so that they would need to align the other AGI/AGIs. Encryption keys that only one AGI has could be necessary for doing and/or verifying work.
Could maybe set things up in such a way that one AGI knows it can get more reward if it tricks the other into approving faulty output.
Similar to this, but not the same: Experiment with AGI where it is set to align other AGI. For example, maybe it needs to do some tasks to do reward, but those tasks need to be done by the other AGI, and it don’t know what the tasks will be beforehand. One goal being to see methods AGI might use to align other AGI (that may then be used to align AGI-systems that are sub-systems of AGI-system, and seeing if output from this AGI converges with results from AGIs aligned by other principles).
Don’t expect that this would be that fruitful, but haven’t thought about it that much and who knows.
Trying to think of ways that experiments can set up where AI-systems use other AI-systems as oracles/genies (1 or more being superintelligent), perhaps sometimes in “adversarial” ways. Exploring methods for asking requests and maybe finding out things about how hard/easy it is to “trick” operator (seeming to provide what they want without providing what they want) given various domains/methods.
May e.g. involve one AI asking for code-pieces for some purpose, but where other AI is to try to “hide” ways in which delivered code isn’t quite what the other AI wants.
Superintelligence may realize what’s going on (or may be going on), and act accordingly. But nontheless maaybe some useful info could be gained? 🤔
Perhaps experiments could make use of encryption in some way that prevented AGIs from doing/verifying work themselves, making it so that they would need to align the other AGI/AGIs. Encryption keys that only one AGI has could be necessary for doing and/or verifying work.
Could maybe set things up in such a way that one AGI knows it can get more reward if it tricks the other into approving faulty output.
Would need to avoid suffering sub-routines.
Similar to this, but not the same: Experiment with AGI where it is set to align other AGI. For example, maybe it needs to do some tasks to do reward, but those tasks need to be done by the other AGI, and it don’t know what the tasks will be beforehand. One goal being to see methods AGI might use to align other AGI (that may then be used to align AGI-systems that are sub-systems of AGI-system, and seeing if output from this AGI converges with results from AGIs aligned by other principles).
Don’t expect that this would be that fruitful, but haven’t thought about it that much and who knows.
Would need to avoid suffering sub-routines.