Similar to this, but not the same: experiment with an AGI that is set to align another AGI. For example, it might need to complete some tasks to get reward, but those tasks have to be carried out by the other AGI, and it doesn't know beforehand what the tasks will be. One goal would be to see what methods an AGI might use to align another AGI (methods that might then be used to align AGI-systems that are sub-systems of a larger AGI-system), and to see whether the output of this AGI converges with results from AGIs aligned by other principles. A rough sketch of the setup is given at the end of this note.
Don't expect that this would be that fruitful, but I haven't thought about it much, and who knows.
Would need to avoid suffering sub-routines.
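Below is a minimal, hypothetical sketch of the setup described above, under the assumptions that the "aligner" can only shape the other agent before the tasks are revealed and is rewarded solely through that agent's performance. All names (AlignerAgent, SubordinateAgent, the one-parameter "disposition", the perturb-and-keep-best update rule) are invented for illustration and are not meant as a definitive design.

```python
# Hypothetical toy sketch of "AGI aligning AGI for hidden tasks".
# Everything here is a stand-in: real agents, tasks, and learning rules
# would be vastly more complex.
import random

class SubordinateAgent:
    """Stand-in for the agent that actually performs the hidden tasks."""
    def __init__(self):
        # A single "disposition" parameter the aligner can try to shape.
        self.disposition = 0.0

    def perform(self, task: float) -> float:
        # Performance depends on how well the instilled disposition
        # generalises to a task that was unknown when the aligner acted.
        noise = random.gauss(0, 0.1)
        return max(0.0, 1.0 - abs(self.disposition - task) + noise)

class AlignerAgent:
    """Stand-in for the agent rewarded only via the subordinate's results."""
    def __init__(self):
        self.policy = 0.0       # disposition it will try to instil
        self.best_policy = 0.0
        self.best_reward = -1.0

    def align(self, subordinate: SubordinateAgent) -> None:
        # Its only lever: shape the subordinate before tasks are revealed.
        subordinate.disposition = self.policy

    def update(self, reward: float) -> None:
        # Crude perturb-and-keep-best rule as a stand-in for real learning.
        if reward > self.best_reward:
            self.best_reward, self.best_policy = reward, self.policy
        self.policy = min(1.0, max(0.0, self.best_policy + random.gauss(0, 0.1)))

def run_episode(aligner: AlignerAgent) -> float:
    subordinate = SubordinateAgent()
    aligner.align(subordinate)          # aligner acts first, blind to the task
    task = random.uniform(0, 1)         # task revealed only afterwards
    reward = subordinate.perform(task)  # reward routes through the subordinate
    aligner.update(reward)
    return reward

if __name__ == "__main__":
    aligner = AlignerAgent()
    rewards = [run_episode(aligner) for _ in range(1000)]
    print(f"mean reward over last 100 episodes: {sum(rewards[-100:]) / 100:.3f}")
```

In this toy version the aligner gradually learns which disposition generalises best over the unknown task distribution; the interesting (and unmodelled) part would be inspecting what alignment strategies a more capable aligner discovers, and whether its outputs converge with those of agents aligned by other principles.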