Yes, I too agree that planning using a model of the world does a pretty good job of capturing what we mean when we say “caring about things.”
Of course, AIs with bad goals can also use model-based planning.
Some other salient features:
Local search rather than global. Alternatively, this could be framed as regularization on plans to be close to some starting distribution. This isn’t about low impact, because we still want the AI to search well enough to find clever and novel plans; instead, it’s about avoiding extrema that are really far from the starting distribution (see the sketch after this list).
Generation of plans (or modifications of plans) using informative heuristics rather than blind search, somewhat like how MCTS is useful. These heuristics might be blind to certain ways of getting reward, especially in novel contexts they weren’t trained on, which is another sort of effective regularization.
Having a world-model that is really good at self-reflection, e.g. “If I start talking about topic X I’ll get distracted,” and connects predictions about the self to its predicted reward.
Having the goal of the AI’s search process be a good thing that we actually want, in the real-world context where we want it to happen.
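To make the first two features concrete, here is a minimal sketch of what I have in mind (the names propose, evaluate, and distance are hypothetical stand-ins, not an API from anywhere):

```python
# Toy sketch: local, heuristic-guided plan search with a penalty for drifting
# far from the starting plan. All three callables are hypothetical stand-ins.

def local_plan_search(initial_plan, propose, evaluate, distance,
                      drift_weight=1.0, steps=1000):
    """Hill-climb from initial_plan instead of globally optimizing.

    propose(plan): suggests a nearby modification (the informative heuristic,
        as opposed to blind enumeration of all possible plans).
    evaluate(plan): the agent's estimate of how good the plan is.
    distance(plan, initial_plan): penalizes plans far from the starting
        distribution; not a low-impact measure, just a way of avoiding
        extrema that are really far from where search began.
    """
    current = initial_plan
    current_score = evaluate(current) - drift_weight * distance(current, initial_plan)
    for _ in range(steps):
        candidate = propose(current)   # heuristic generation, not blind search
        score = evaluate(candidate) - drift_weight * distance(candidate, initial_plan)
        if score > current_score:      # greedy local acceptance
            current, current_score = candidate, score
    return current
```

Both ingredients act as effective regularizers: the proposal heuristic only suggests the kinds of modifications it learned to suggest, and the drift penalty keeps the search from wandering off to extrema far from the starting distribution, while still allowing clever and novel plans nearby.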
These can be mixed and matched, and are all matters of degree. I think you do a disservice by saying things like “actually, humans really care about their goals but grader-optimizers don’t,” because it sets up this supposed natural category of “grader optimizers” that are totally different from “value executors,” and it actually seems to make it harder to reason about which mechanistic properties are producing the change you care about.
Alternatively, this could be framed as regularization on plans to be close to some starting distribution. This isn’t about low impact, because we still want the AI to search well enough to find clever and novel plans; instead, it’s about avoiding extrema that are really far from the starting distribution.
I don’t think it’s naturally framed in terms of any distance metric I can think of. I think a values-agent can also end up considering some crazy impressive plans (as you might agree).
I think you do a disservice by saying things like “actually, humans really care about their goals but grader-optimizers don’t,” because it sets up this supposed natural category of “grader optimizers” that are totally different from “value executors,” and it actually seems to make it harder to reason about which mechanistic properties are producing the change you care about.
I both agree and disagree. I think that reasoning about mechanisms and not words is vastly underused in AI alignment, and endorse your pushback in that sense. Maybe I should write future essays with exhortations to track mechanisms and examples while following along.
But also I do perceive a natural category here, and I want to label it. I think the main difference between “grader optimizers” and “value executors” is that grader optimizers are optimizing plans to get high evaluations, whereas value executors find high-evaluating plans as a side effect of their cognition. That does feel pretty natural to me, although I don’t have a good intensional definition of “value executors” yet.
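As a rough sketch of that difference (grader, candidate_plans, value_shaped_policy, and world_state are hypothetical placeholders, not terms either of us has pinned down):

```python
# Toy contrast between the two kinds of agents; all names are placeholders.

def grader_optimizer(candidate_plans, grader):
    # The grader's output is the optimization target: the search is pointed
    # at whatever plan makes the evaluation come out high, including plans
    # that merely exploit the grader.
    return max(candidate_plans, key=grader)

def values_executor(world_state, value_shaped_policy):
    # Plan generation is driven by the agent's own values and heuristics; a
    # high evaluation, when it happens, is a side effect of that cognition
    # rather than the thing the search is aimed at.
    return value_shaped_policy(world_state)
```

The point of the sketch is just where the optimization pressure is aimed, not how impressive either agent’s plans end up being.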