Yes, it’s not a very satisfactory solution. Some alternative/complementary solutions:
- Somehow use non-transformative AI to do my mind uploading, and then have the TAI learn by inspecting the uploads. This would be great for single-user alignment as well.
- Somehow use non-transformative AI to create perfect lie detectors, and use these to enforce honesty in the mechanism. (But is it possible to detect self-deception?)
- Have the TAI learn from past data that wasn't affected by the incentives created by the TAI. (But is there enough information there?)
- Shape the TAI's prior about human values in order to rule out at least the most blatant lies.
- Some clever mechanism design I haven't thought of. The problem is that most mechanism designs rely on money, and money doesn't seem applicable here, whereas without money there are many impossibility theorems (e.g. Gibbard-Satterthwaite); see the sketch below for how money is what buys truthfulness.
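To make the money point concrete (my illustration, not part of the original comment): the Vickrey-Clarke-Groves (VCG) mechanism is the standard example of how monetary transfers make truthful reporting of values a dominant strategy. A minimal sketch with hypothetical toy valuations:

```python
def vcg(outcomes, reports):
    """VCG mechanism sketch.
    outcomes: list of outcome labels.
    reports: list of dicts, reports[i][o] = agent i's reported value for outcome o.
    Returns (chosen outcome, list of payments).
    """
    def welfare(outcome, agents):
        return sum(reports[i][outcome] for i in agents)

    everyone = range(len(reports))
    # Choose the outcome maximizing total reported value.
    chosen = max(outcomes, key=lambda o: welfare(o, everyone))

    payments = []
    for i in everyone:
        others = [j for j in everyone if j != i]
        # Welfare the others would have gotten had agent i been absent...
        best_without_i = max(welfare(o, others) for o in outcomes)
        # ...minus the welfare they actually get: agent i pays the externality
        # it imposes on everyone else. This payment is what makes
        # truth-telling a dominant strategy.
        payments.append(best_without_i - welfare(chosen, others))
    return chosen, payments

# Toy example: three agents, two candidate policies.
outcomes = ["A", "B"]
reports = [{"A": 10, "B": 0}, {"A": 0, "B": 6}, {"A": 0, "B": 7}]
choice, pay = vcg(outcomes, reports)
print(choice, pay)  # B is chosen; payments [0, 3, 4]: only pivotal agents pay
```

Drop the payment term and agents can gain by misreporting; that gap, in settings where transfers are unavailable, is exactly what the impossibility theorems formalize.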