To add to the other comments: “the AI understands what you mean, it just doesn’t care” refers to a situation where we have failed to teach the AI to care about the things we care about. At that point, it can likely figure out what we actually wanted it to do, but it isn’t motivated to act on that.
This post describes part of a strategy for figuring out how the AI might be made to care about the same things we do: give it an internal understanding of the world that’s similar to the human one, and then ground its goals in terms of that understanding.
It’s not a complete solution (or even a complete subsolution), but rather hacking at the edges. As you mention, if things go badly it is possible for the AI to e.g. escape the box and rewire its reward function. The intended way of avoiding that would be to program it to inherently care about the same things as humans do before letting it out of the box. At that point, it would no longer be primarily motivated by the programmer’s feedback, but by its own internalized values, which would hopefully be human-friendly.