Upvoted and disagreed. [1]
One thing in particular that stands out to me: The whole framing seems useless unless Premise 1 is modified to include a condition like
[...] we can select a curriculum and reinforcement signal which [...] and which makes the model highly “useful/capable”.
Otherwise, Premise 1 is trivially true: we could (e.g.) set all the model’s weights to 0.0, thereby guaranteeing the non-entrainment of any (“bad”) circuits.
I’m curious: what do you think would be a good (...useful?) operationalization of “useful/capable”?
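To make the degenerate zero-weights case concrete, here is a minimal sketch. It assumes a toy tanh MLP written in plain Python; `forward` and `zero_layers` are hypothetical names of mine, not anything from the post. With every weight and bias at 0.0, the output is the same for every input, so no circuit of any kind (good or bad) can have been entrained, and the model is also useless.

```python
import math

def forward(x, layers):
    """Forward pass of a toy tanh MLP; `layers` is a list of (weights, biases)."""
    h = list(x)
    for weights, biases in layers:
        # Each output unit is tanh(dot(row, h) + bias).
        h = [math.tanh(sum(w * v for w, v in zip(row, h)) + b)
             for row, b in zip(weights, biases)]
    return h

def zero_layers(sizes):
    """Build layers with the given sizes and every weight and bias set to 0.0."""
    return [([[0.0] * n_in for _ in range(n_out)], [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]

# Illustrative architecture (an assumption): 3 inputs, 4 hidden units, 2 outputs.
layers = zero_layers([3, 4, 2])
out_a = forward([1.0, -2.0, 0.5], layers)
out_b = forward([100.0, 7.0, -3.0], layers)
print(out_a, out_b)  # identical outputs: the model ignores its input entirely
```

This satisfies a capability-free version of Premise 1 while doing nothing useful, which is why I think some “useful/capable” condition has to be part of the premise.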
Another issue: K and epsilon might need to be unrealistically small. Once the model starts modifying itself or constructing successor models (and possibly earlier), a single strategically placed sign-flip in the model’s outputs might cause catastrophe. [2]
[1] I think writing one’s thoughts/intuitions out like this is valuable: for sharing frames/ideas, getting feedback, etc. Thus: thanks for writing it up. Separately, I think the presented frame/case is probably confused, and almost useless (at best).
[2] Although that might require the control structures (be they Shards or a utility function or w/e) of the model to be highly “localized/concentrated” in some sense. (OTOH, that seems likely to at least eventually be the case?)