My intuition is that the best way to build wise AI would be to train imitation learning agents on people whom we consider wise. If we trained imitations of people with a variety of perspectives, we could then simulate discussions between them and experiment to find the best discussion formats for such agents. This could likely get us reasonably far.
The reason I suggest imitation learning is that it would give us something we could treat as an optimisation target, which is what we require for training ML systems.
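To make that concrete, here is a minimal sketch of both pieces, assuming a causal language model fine-tuned per person: the optimisation target is ordinary behavioural cloning (cross-entropy against the person's actual utterances), and the simulated discussion is a simple round-robin loop. All the names here (`behavioural_cloning_step`, `imitator.reply`, and so on) are hypothetical placeholders, not references to any real library.

```python
import torch.nn.functional as F

def behavioural_cloning_step(imitator, input_ids, target_ids, optimiser):
    """One gradient step of imitation learning. The optimisation target is
    just the cross-entropy between the model's next-token predictions and
    what the wise person actually said (targets are the inputs shifted by
    one token, as in standard language-model training)."""
    logits = imitator(input_ids)  # assumed shape: (batch, seq_len, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten over batch and time
        target_ids.reshape(-1),
    )
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()

def simulate_discussion(imitators, topic, rounds=3):
    """One possible discussion format: a round-robin where each trained
    imitator responds in turn to the running transcript. `imitator.reply`
    is an assumed sampling interface; other formats (debate, moderated
    panel) would swap out this loop."""
    transcript = [f"Topic: {topic}"]
    for _ in range(rounds):
        for name, imitator in imitators.items():
            reply = imitator.reply("\n".join(transcript))
            transcript.append(f"{name}: {reply}")
    return transcript
```

The point of the sketch is just that both stages reduce to things we already know how to optimise and run, which is what makes imitation a tractable starting point compared to optimising for "wisdom" directly.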
Seems reasonable. I do still worry quite a bit about Goodharting, but perhaps this could be addressed with careful oversight by wise humans doing the wisdom equivalent of red teaming.
You mean it might still Goodhart towards what we think they would say? Ideally, the actual people would be involved in the process.