I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”
I see. Can’t speak for Quintin, but: I mostly think it won’t be present. Conditional on the motivational edifice being present, though, I expect the edifice to bid up rewarding actions and get reinforced into a substantial influence. I have a lot of uncertainty in this case; I’m hoping to work out a better mechanistic picture of how the gradients would affect such edifices.
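To make that dynamic concrete, here’s a minimal toy sketch; everything in it is invented for illustration rather than taken from the discussion. The “edifice” is modeled as a fixed bid toward one of two actions with a learnable gain g, reward arrives exactly when that action is taken, and a REINFORCE-style update is applied. Under those assumptions, the gain on the edifice only ever receives non-negative gradient, so its influence grows.

```python
# Toy sketch (invented for illustration): a 2-action softmax policy whose logits are
# a generic component plus a pre-existing "edifice" component that bids for action 1.
# We apply REINFORCE-style updates and check whether the edifice's gain g increases
# when its preferred action is the one that gets rewarded.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Fixed "edifice" direction present at initialization: it adds logit mass to action 1.
edifice_bid = np.array([0.0, 1.0])

b = np.zeros(2)   # generic, learnable logits
g = 0.1           # the edifice starts as a weak influence
lr = 0.5

for step in range(200):
    logits = b + g * edifice_bid
    probs = softmax(logits)
    a = rng.choice(2, p=probs)
    r = 1.0 if a == 1 else 0.0            # reward exactly when the edifice's action is taken

    # REINFORCE: grad of log pi(a) w.r.t. the logits is onehot(a) - probs.
    grad_logits = -probs.copy()
    grad_logits[a] += 1.0
    b += lr * r * grad_logits
    g += lr * r * grad_logits @ edifice_bid   # chain rule through logits = b + g * edifice_bid

print("edifice gain g:", round(float(g), 3))                               # grows well above 0.1
print("P(edifice's action):", round(float(softmax(b + g * edifice_bid)[1]), 3))
```

In this toy, at least, the same qualitative story holds even when the edifice starts out weak: what matters is that it reliably bids for actions that end up rewarded.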
I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”.

I think there are a range of disagreements here, but also one man’s modus ponens is another’s modus tollens: High variance in heroin-propensity implies we can optimize heroin-propensity down to negligible values with relatively few bits of optimization (if we knew what we were doing, at least).
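As a back-of-the-envelope illustration of the “relatively few bits of optimization” point (toy assumptions only, nothing here is measured): treat heroin-propensity as a scalar trait that is standard-normal across draws, and treat n bits of optimization as keeping the lowest-propensity 2^-n fraction of draws.

```python
# Toy numbers only: "heroin-propensity" as a standard-normal scalar trait across
# random draws, and n bits of optimization as keeping the best (lowest) 2**-n fraction.
import numpy as np

rng = np.random.default_rng(0)
samples = np.sort(rng.standard_normal(1_000_000))   # ascending: lowest propensity first

for bits in range(0, 11, 2):
    keep = max(1, int(len(samples) * 2.0 ** -bits))
    selected = samples[:keep]
    print(f"{bits:2d} bits of selection -> mean propensity {selected.mean():+.2f}, "
          f"worst kept {selected[-1]:+.2f}")
```

On those assumptions, even four to ten bits of selection pushes the selected mean roughly two to three-plus standard deviations into the low-propensity tail.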
Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
This isn’t obviously true to me, actually. That strategy certainly sounds quotidian, but is it truly mechanistically deficient? If we tell the early training-AGI, “Hey, if you hit the reward button, the ensuing credit assignment will drift your values by mechanisms A, B, and C,” that provides important information to the AGI. I think that’s convergently good advice across most possible values the AGI could have. (This, of course, doesn’t address the problem of whether the AGI does have good values to begin with.)
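One toy way to see the “convergently good advice” point (all of the setup below is made up for the sketch): draw a random value vector w over a handful of options, let it drift with noise, and compare the w-value of acting on w with the w-value of acting on the drifted copy. For most draws of w, drift costs value by w’s own lights, which is the sense in which “your values will drift if you press the button” is decision-relevant for almost any values the agent starts with.

```python
# Toy check: for random current values w, how much w-value is lost by
# acting on a noisily drifted copy of w instead of on w itself?
import numpy as np

rng = np.random.default_rng(0)
n_options, n_trials = 10, 10_000
losses = []

for _ in range(n_trials):
    w = rng.normal(size=n_options)              # the agent's current values over the options
    drifted = w + rng.normal(size=n_options)    # values after reward-driven drift
    losses.append(w.max() - w[np.argmax(drifted)])   # w-value forgone by acting on drifted values

losses = np.array(losses)
print("mean w-value lost to drift:", round(float(losses.mean()), 3))
print("fraction of trials where drift cost something:", round(float((losses > 0).mean()), 3))
```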
More broadly, I suspect there might be some misconception about me and other shard theory researchers. I don’t think, “Wow, humans are so awesome, let’s go ahead and ctrl+C ctrl+V for alignment.” I’m very, very against black-boxing confusion like that. I’m more thinking, “Wow, humans have pretty good general alignment properties; I wonder what the generators are for that?” I want to understand the generators for the one example we have of general intelligences acquiring values over their lifetime, and then use that knowledge to color in and reduce my uncertainty about how alignment works.