Do you put actual humans into the simbox? If no, then isn’t that a pretty big OOD problem? Or if yes, how do you do that safely?
I think I’m skeptical that “learning to recognize agency in the world” and “maximization of other agents’ empowerment” actually exist in the form of “simple universal principles”. For example, when I see a simple animatronic robot, it gives me a visceral impression of agency, but it’s a false impression. Well, hmm, I guess that depends on your definitions. Well anyway, I’ll just say that if an AGI were maximizing the “empowerment” of any simple animatronic robots that it sees, I would declare that this AGI was doing the wrong thing.
It’s fine if you want to just finish your longer post instead of replying here. Either way, looking forward to that! :)
Humans in the simbox—perhaps in the early stages, but not required once it’s running (although human observers have a later role). But that’s mostly tangential.
One of the key ideas here, and perhaps a point of divergence from many other approaches, is that we want agents to robustly learn and optimize for other agents' values across a wide variety of agents, situations, and agent value distributions. The idea is to handle OOD by generalizing beyond specific human values. Then, once we perfect these architectures and training regimes and are satisfied with their alignment evaluations, we can deploy them safely in the real world, where they will learn and optimize for our values (safe relative to deploying new humans).
I do have a rough sketch of the essence of the mechanism I think the brain is using for value learning and altruism, and I actually found one of your articles that is related and that I can link to.
I suspect you’d agree that self-supervised prediction is a simple, powerful, and universal learning idea—strongly theoretically justified as in Solomonoff/Bayes and AIXI, etc, and clearly also a key brain mechanism. Generalized empowerment or self-improvement is similar—strongly theoretically justified, and also clearly a key brain mechanism. The former guides learning of the predictive world model, the latter guides learning of the action/planning system. Both are also optimal in a certain sense.
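To make "empowerment" concrete, here is a minimal sketch using one standard formalization: n-step empowerment as the channel capacity from an agent's action sequences to its resulting state. This is an assumption on my part about what is meant, not necessarily the "generalized empowerment" intended above. The 5x5 gridworld, `SIZE`, `ACTIONS`, `step`, and `empowerment` are all hypothetical names chosen for illustration. With deterministic dynamics the capacity reduces to log2 of the number of distinct reachable end states, which is what the code counts.

```python
import math
from itertools import product

# Hypothetical 5x5 gridworld; every name here is illustrative.
SIZE = 5
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # four moves plus "stay"


def step(state, action):
    """Deterministic transition: move if the target cell is in the grid, else stay put."""
    x, y = state
    dx, dy = action
    nx, ny = x + dx, y + dy
    if 0 <= nx < SIZE and 0 <= ny < SIZE:
        return (nx, ny)
    return state


def empowerment(state, horizon):
    """n-step empowerment in bits.

    Assumes deterministic dynamics, where the action-to-state channel
    capacity equals log2 of the number of distinct reachable end states.
    """
    reachable = set()
    # Enumerate every action sequence of the given length (brute force).
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))


center = empowerment((2, 2), horizon=2)
corner = empowerment((0, 0), horizon=2)
print(f"center: {center:.3f} bits, corner: {corner:.3f} bits")
```

The center cell scores higher than the corner because more distinct states are reachable from it. An agent optimizing this quantity for itself is pushed toward keeping its options open, and an agent optimizing it for another agent would be pushed toward keeping that agent's options open. The sketch says nothing about the harder question raised above, namely how to decide which things in the world count as agents in the first place.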
Humans’ tendency to anthropomorphize, empathize with, and act altruistically towards various animals and even hypothetical non-humans is best explained as a side effect of a very general (arguably overly general!) alignment mechanism.