I was thinking of the structure of Generative Adversarial Networks. Would that not apply in this case? It would involve two competing AGIs in the end, though. I’m not sure if they’d just collaborate to set both their reward functions to max, or if that would never happen due to game-theoretic considerations.
In a GAN, one network tries to distinguish real images from fake. The other network tries to produce fake images that fool the first net. Both of these are simple formal tasks.
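To make "simple formal tasks" concrete, here is the standard GAN minimax objective (Goodfellow et al., 2014), with D the discriminator, G the generator, p_data the distribution of real images and p_z a noise prior:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Every term is defined by data and network outputs alone. Nothing in it refers to what a human "really wanted", which is why it can be written down at all.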
“Exploits in the objective function” could be considered as “solutions that score highly that the programmers didn’t really intend”. The problem is that it’s hard to formalize what the programmers really intended. Given an evolutionary search for walking robots, a round robot that tumbles over might be a clever unexpected solution, or reward hacking, depending on the goals of the developers. Are the robots intended to transport anything fragile? Anything that can’t be spun and tossed upside down? Whether the tumblebot is a clever unexpected design or a reward hack depends on things that are implicit in the developers’ minds, not part of the program at all.
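A toy sketch of the tumblebot point, in Python. Everything here is made up for illustration: the `Rollout` fields, the numbers, and the 30-degree tilt limit are assumptions, not taken from any real experiment. The only thing it is meant to show is that the same candidate wins or loses depending on a constraint that exists in the developers' heads rather than in the code.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Summary of one simulated trial (all fields are hypothetical)."""
    name: str
    distance_m: float    # how far the robot travelled
    max_tilt_deg: float  # largest body tilt seen during the trial


def fitness_as_written(r: Rollout) -> float:
    # What the search actually optimises: distance only.
    return r.distance_m


def fitness_as_intended(r: Rollout, max_safe_tilt_deg: float = 30.0) -> float:
    # One possible reading of the developers' intent: the cargo must stay
    # roughly upright. The threshold is an assumption for this sketch.
    return r.distance_m if r.max_tilt_deg <= max_safe_tilt_deg else 0.0


walker = Rollout("walker", distance_m=4.0, max_tilt_deg=10.0)
tumbler = Rollout("tumblebot", distance_m=9.0, max_tilt_deg=180.0)
candidates = [walker, tumbler]

best_written = max(candidates, key=fitness_as_written)
best_intended = max(candidates, key=fitness_as_intended)

print(best_written.name)   # tumblebot
print(best_intended.name)  # walker
```

The catch is that `fitness_as_intended` only exists here because I decided, after the fact, that fragile cargo matters. If the developers only cared about getting from A to B, the tumblebot would be the right answer and the second function would be the bug.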
A lot of novice AI safety ideas look like “AI 1 has this simple, specifiable reward function. AI 2 oversees AI 1. AI 2 does exactly what we want, however hard that is to specify, and is powered by pure handwavium.”