Figure it’s worth saying this is also very much where I’m trying to take my own research in AI safety. I’d phrase it as: we are currently so hopelessly confused about what we even think we want to do to approach alignment that we can’t yet state the problem formally (although I’ve made a start at it, and I view Stuart Armstrong’s research as doing the same, even if I disagree with him on the specifics of his approach). I agree with you that things seem to be pointing us towards something, and I think we even have tastes of what the needed formalism looks like already, but there’s also a lot of work left to get us there.