One key to this whole thing seems to be that “helpfulness” is not something that we can write an objective for. But I think the reason that we can’t write an objective for it is better captured by inaccessible information than by Goodhart’s law.
By “other-izer problem”, do you mean the satisficer and related ideas? I’d be interested in pointers to more “other-izers” in this cluster.
But isn’t it the case that these approaches are still doing something akin to search, in the sense that they look for any element of a hypothesis space meeting some condition (perhaps not a local optimum, but still some condition)? If so, then I think these ideas are quite different from what humans do when we design things. I don’t think we’re primarily evaluating whole elements of some hypothesis space looking for one that meets certain conditions; we’re building things up piece by piece.
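To make the contrast concrete, here’s a toy sketch (my own illustration, with made-up names, not anyone’s actual proposal): the first function searches by evaluating whole candidates from a hypothesis space against a condition, while the second assembles the answer piece by piece and never scores a complete candidate.

```python
import random

# Toy task: produce an 8-bit string whose bits sum to at least 6.

def good_enough(candidate):
    return sum(candidate) >= 6

def search_satisficer(n_bits=8, max_tries=10_000):
    # Search-like: sample whole elements of the hypothesis space and
    # return the first one that meets the condition.
    for _ in range(max_tries):
        candidate = [random.randint(0, 1) for _ in range(n_bits)]
        if good_enough(candidate):
            return candidate
    return None

def piecewise_design(n_bits=8, target=6):
    # Design-like: build the object one piece at a time by a local rule,
    # without ever evaluating whole candidates against the condition.
    solution = []
    for i in range(n_bits):
        remaining = n_bits - i
        needed = target - sum(solution)
        solution.append(1 if needed >= remaining else random.randint(0, 1))
    return solution

print(search_satisficer())
print(piecewise_design())
```

Both end up with something meeting the condition, but only the first ever “looks at” complete elements of the space.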
Well, any process that picks actions ends up equivalent to some criterion, even if only “actions likely to be picked by this process.” The deal with agents and agent-like things is that they pick actions based on their modeled consequences. Basically anything that picks actions in a different way (or, more technically, in a way that’s complicated to explain in terms of planning) is an other-izer to some degree. Though maybe this drifts from the original usage, which wanted nice properties like reflective stability, etc.
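To pin down what I mean by “pick actions based on their modeled consequences,” a minimal toy sketch (hypothetical names and made-up dynamics, purely illustrative): the agent-like chooser runs each action through a world model and scores the predicted outcome, while the other chooser uses a criterion that never consults the model at all.

```python
def world_model(state, action):
    # Predicted consequence of taking `action` in `state` (made-up dynamics).
    return state + {"add_one": 1, "add_two": 2, "do_nothing": 0}[action]

def utility(outcome):
    # Prefer outcomes near 10.
    return -abs(outcome - 10)

def consequentialist_pick(state, actions):
    # Agent-like: model consequences, then choose the best-scoring action.
    return max(actions, key=lambda a: utility(world_model(state, a)))

def habit_pick(state, actions):
    # "Other-izer"-ish: a criterion that isn't about modeled consequences.
    return sorted(actions)[0]  # e.g. always the alphabetically first action

actions = ["add_one", "add_two", "do_nothing"]
print(consequentialist_pick(7, actions))  # add_two (predicted outcome closest to 10)
print(habit_pick(7, actions))             # add_one, regardless of consequences
```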
The example of the day is language models. GPT doesn’t pick its next sentence by modeling the world and predicting the consequences. Bam, other-izer. Neither design nor search.
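Concretely, the generation loop is just “sample from a learned conditional distribution,” with no step that models the world or weighs the consequences of what gets said. A toy sketch (a hand-written bigram table standing in for the learned model, obviously not how GPT is actually implemented):

```python
import random

# Pick the next token by sampling from P(next | current token);
# nothing here evaluates consequences of the output.
bigram_probs = {
    "the": {"cat": 0.5, "dog": 0.3, "end": 0.2},
    "cat": {"sat": 0.6, "ran": 0.3, "end": 0.1},
    "dog": {"ran": 0.7, "sat": 0.2, "end": 0.1},
    "sat": {"end": 1.0},
    "ran": {"end": 1.0},
}

def generate(start="the"):
    token, output = start, [start]
    while token != "end":
        choices, weights = zip(*bigram_probs[token].items())
        token = random.choices(choices, weights=weights)[0]  # sample, don't plan
        if token != "end":
            output.append(token)
    return " ".join(output)

print(generate())
```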
Anyhow, back on topic, I agree that “helpfulness to humans” is a very complicated thing. But maybe there’s some simpler notion of “helpful to the AI” that results in design-like other-izing that loses some of the helpfulness-to-humans properties, but retains some of the things that make design seem safer than search even if you never looked at the “stories.”