Seems reasonable. I do still worry quite a bit about Goodharting, but perhaps this could be reasonably addressed with careful oversight by some wise humans to do the wisdom equivalent of red teaming.
You mean it might still Goodhart to what we think they might say? Ideally, the actual people would be involved in the process.