rvnnt comments on TurnTrout’s shortform feed

rvnnt 14 Aug 2024 11:21 UTC
5 points
0
Upvoted and disagreed. ^[1]

One thing in particular that stands out to me: The whole framing seems useless unless Premise 1 is modified to include a condition like

[...] we can select a curriculum and reinforcement signal which [...] and which makes the model highly “useful/capable”.

Otherwise, Premise 1 is trivially true: We could (e.g.) set all the model’s weights to 0.0, thereby guaranteeing the non-entrainment of any (“bad”) circuits, but also of any useful ones.
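
To make the degenerate case concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration, not anything from the original post: the toy fully connected tanh network, the names `forward`, `zeroed_network` and `outputs_on_random_inputs`, and the choice to count biases as "weights" and zero them too. It only shows that such a model emits the same output for every input, so it trivially contains no circuits, good or bad.

```python
import math
import random


def forward(weights, biases, x):
    """Forward pass of a tiny fully connected tanh network (hypothetical toy model)."""
    activation = list(x)
    for W, b in zip(weights, biases):
        # Each output unit: tanh(dot(row, activation) + bias).
        activation = [
            math.tanh(sum(w * a for w, a in zip(row, activation)) + b_i)
            for row, b_i in zip(W, b)
        ]
    return activation


def zeroed_network(layer_sizes):
    """Build weights and biases that are all 0.0 (assumption: biases count as weights)."""
    weights = [
        [[0.0] * n_in for _ in range(n_out)]
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]
    biases = [[0.0] * n_out for n_out in layer_sizes[1:]]
    return weights, biases


def outputs_on_random_inputs(layer_sizes, n_samples=5, seed=0):
    """Evaluate the zeroed network on a few random inputs."""
    rng = random.Random(seed)
    weights, biases = zeroed_network(layer_sizes)
    return [
        forward(weights, biases, [rng.uniform(-1, 1) for _ in range(layer_sizes[0])])
        for _ in range(n_samples)
    ]
```

Calling `outputs_on_random_inputs([3, 4, 2])` returns the same all-zero vector for every sampled input, since every pre-activation is 0 and tanh(0) = 0. The model satisfies "no bad circuits" only by computing nothing at all.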

I’m curious: what do you think would be a good (...useful?) operationalization of “useful/capable”?

Another issue: K and epsilon might need to be unrealistically small. Once the model starts modifying itself or constructing successor models (and possibly earlier), a single strategically placed sign-flip in the model’s outputs might cause catastrophe. ^[2]

1. ↩︎ I think writing one’s thoughts/intuitions out like this is valuable: for sharing frames/ideas, getting feedback, etc. Thus: thanks for writing it up. Separately, I think the presented frame/case is probably confused, and almost useless (at best).
2. ↩︎ Although that might require the control structures (be they Shards or a utility function or w/e) of the model to be highly “localized/concentrated” in some sense. (OTOH, that seems likely to at least eventually be the case?)