Here’s a place where I want one of those disagree buttons separate from the downvote button :P
Given a world model that contains a bunch of different ways of modeling the same microphysical state (splitting up the same world into different parts, with different saliency connections to each other, like the discussion of job vs. ethnicity and even more so), there can be multiple copies that coarsely match some human-intuitive criteria for a concept, each given a different weight by the AI. There will also be ways of modeling the world that don’t get represented much at all, and which ways get left out can depend on how you’re training the AI (and, a bit more subtly, on how you’re interpreting its parameters as a world model).
Especially because of that second part, finding good goals in an AI’s world model isn’t satisfactory if you’re just training a fixed, arbitrary AI. Your process for finding good goals needs to interact with how the AI learns its model of the world in the first place. In which case, world-model interpretability is not all we need.
I agree that the AI would only learn the abstraction layers it’d have a use for. But I wouldn’t take it as far as you do. With “human values” specifically, I agree the problem may be just that muddled, but not with the other nice targets: moral philosophy, corrigibility, DWIM should all be more concrete.
The alternative would be a straight-up failure of the NAH, I think; your assertion that “abstractions can be on a continuum” seems directly at odds with it. Which isn’t impossible, but this post is premised on the NAH working.