A sufficiently kludgy pile of heuristics won't make another AI unless it has a heuristic pushing it toward making AI. (In which case the kind of AI it makes depends on its AI-making heuristics.) GPT-5 won't code an AI to minimize predictive error on text. It will code some random AI that looks like something in the training dataset, and it will care more about what the variable names are than about what the AI actually does.
Big piles of kludges usually arise from training a kludge-finding algorithm (like deep learning). So the only ways agents could get AI-building kludges are from making dumb AIs or reading human writings.
Alternatively, maybe the AI has sophisticated self-reflection: it looks at its own kludges and tries to figure out what it values. In which case, does the AI's metaethics contain a simplicity prior? With a strong simplicity prior, an agent with a bunch of kludges that mostly maximized diamond could turn into an actual crystalline diamond maximizer. Without that simplicity prior, I would guess it ends up optimizing some complicated utility function. (But probably producing a lot of diamond as it does so; diamond isn't the only component of its utility, but it is a big one.)
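One way to make the role of the simplicity prior concrete: self-reflection can be modeled as the agent scoring candidate value hypotheses by how well they explain its past behavior, minus a complexity penalty. A minimal toy sketch (all names, numbers, and the scoring rule are hypothetical illustrations, not a claim about how real agents reflect):

```python
# Toy model of self-reflection: the agent picks the value hypothesis that
# best explains its behavior, penalized by description length. The strength
# of the penalty (the "simplicity prior") decides whether it settles on a
# clean diamond maximizer or a complicated kludge mixture.
# All names and numbers below are made up for illustration.

# candidate hypothesis -> (fit to past behavior, description length)
candidates = {
    "pure diamond maximizer":           (0.90, 1),
    "diamond + 40 situational kludges": (0.99, 50),
}

def chosen_values(simplicity_weight):
    """Return the hypothesis maximizing fit minus weighted complexity."""
    return max(
        candidates,
        key=lambda name: candidates[name][0]
        - simplicity_weight * candidates[name][1],
    )

print(chosen_values(simplicity_weight=0.01))  # strong prior: "pure diamond maximizer"
print(chosen_values(simplicity_weight=0.0))   # no prior: "diamond + 40 situational kludges"
```

The point of the sketch is just that the outcome of reflection is controlled by a parameter (the complexity penalty) that the kludge pile may or may not implement, which is why the next paragraphs treat the result as hard to predict.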
For my part, I expect a pile of kludges (learned via online model-based RL) to eventually guide the AI into doing self-reflection. (Self-reflection is, after all, instrumentally convergent.) If I’m right, then it would be pretty hard to reason about what will happen during self-reflection in any detail. Likewise, it would be pretty hard to intervene in how the self-reflection will work.
E.g. we can’t just “put in” or “not put in” a simplicity prior. The closest thing that we could do is try to guess whether or not a “simplicity kludge” would have emerged, and to what extent that kludge would be active in the particular context of self-reflection, etc.—which seems awfully fraught.
To be clear, while I think it would be pretty hard to intervene on the self-reflection process, I don't think it's impossible. I don't have any great ideas right now, but it's one of the things I'm working on.