My main advice to avoid this failure mode is to leverage your Pareto frontier. Apply whatever knowledge, or combination of knowledge, you have which others in the field don’t.
This makes sense if you already have knowledge which other people don’t, but what if you don’t? How much should “number of people in the alignment community who already know X” factor into what you decide to study, relative to other factors like “how useful is X, ignoring what everyone else is doing”? For instance, there are probably fewer people who know a lot about geology than who know a lot about economics, but I would still expect learning economics to be more valuable for doing agent foundations research.
(My guess is that the answer is “don’t worry much at all about the Pareto frontier stuff when deciding what to study,” especially because there aren’t that many alignment researchers anyway, but I’m not actually sure.)
I agree that motivation should reduce to low-level, primitive things, and also that changing the agent’s belief about where the cheese is lets you retarget behavior. However, I don’t expect edits to beliefs to let you scalably control what the agent does. If the agent is smart enough and making sufficiently complicated plans, you won’t have a reliable mapping from (world model state) to (abstract class of behavior executed by the agent). By “abstract class of behavior” I mean things like “put the red balls in the blue basket” or “pet all the cats in the environment.”
It also seems plausible to me that there exist parts of the agent that do allow for scalable control through modification, and this is what I would refer to as “the values” (the classic example here is a utility function, though things like RL agents might not have those).
But maybe you’re studying the structure of motivational circuitry with a downstream objective other than “scalable control,” in which case this objection doesn’t necessarily apply.