I guess I’m just like, no matter what you do, you’re going to need to translate human values into AI values. The methodology you’re proposing is some kind of steering approach, where you have knobs, I’m assuming, which you can turn to emphasize or de-emphasize certain values inside of your ML system. And there’s a particular setting of these knobs which gets you human values, and your job is to figure out what that setting of the knobs is.
I think this works fine in worlds where alignment is pretty easy. This sounds a lot like Alex Turner’s current plan, but I don’t think it works well in worlds where alignment is hard. In worlds where alignment is hard, it’s not guaranteed that the AI will even have values close to your own, and you may need to intervene on your AI, or rethink how you train it, so that it ends up with values similar to yours.
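To make the “knobs” picture concrete, here is a minimal sketch, assuming the knobs are coefficients on steering directions added to a model’s activations (roughly in the spirit of activation-steering work). The value names, dimensions, and random directions are illustrative placeholders, not anyone’s actual method:

```python
import numpy as np

# Hypothetical "value" directions in the model's activation space.
# In practice these might be learned contrast vectors; here they are random stand-ins.
rng = np.random.default_rng(0)
d_model = 16
value_directions = {
    "honesty": rng.normal(size=d_model),
    "helpfulness": rng.normal(size=d_model),
}

def steer(hidden_state: np.ndarray, knobs: dict) -> np.ndarray:
    """Add each value direction, scaled by its knob setting, to a hidden state.

    Turning a knob up emphasizes that value; turning it down (or negative)
    de-emphasizes it. Alignment-by-steering amounts to searching for the
    knob settings that best reproduce human values.
    """
    steered = hidden_state.copy()
    for name, coeff in knobs.items():
        direction = value_directions[name]
        steered += coeff * direction / np.linalg.norm(direction)
    return steered

# Example: emphasize honesty, slightly de-emphasize helpfulness.
h = rng.normal(size=d_model)  # stand-in for a layer activation
h_steered = steer(h, {"honesty": 2.0, "helpfulness": -0.5})
```

The worry in the hard-alignment case is that the model may not contain directions corresponding to your values at all, in which case no setting of these coefficients recovers them.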