First, I think most of the individual pieces of this story are basically right, so good job overall. I do think there’s at least one fatal flaw and a few probably-smaller issues, though.
The main fatal flaw is this assumption:
Since “IF diamond”-style predicates do in fact “perfectly classify” the positive/negative approach/don’t-approach decision contexts...
This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.
I’m mostly used to thinking about this in the context of alignment with human values (or corrigibility etc), where it’s very obvious that human labellers will make mistakes. In the case of diamonds, it is maybe plausible that we could get a dataset with zero incorrect labels, but that’s still a pretty difficult problem if the dataset is to be reasonably large and diverse.
If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).
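A toy sketch of that asymmetry, under loudly stated assumptions: the `Example` fields, the "labellers are fooled by shiny objects" error pattern, and both shard functions are hypothetical stand-ins invented for illustration, not anything from the original post. It only counts how often each candidate predicate agrees with the labels actually given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    is_diamond: bool   # ground truth (what we intended to reward)
    looks_shiny: bool  # hypothetical feature that fools the labellers


def human_label(ex: Example) -> bool:
    # Assumed systematic error: any shiny object gets labelled "diamond".
    return ex.is_diamond or ex.looks_shiny


def diamond_shard(ex: Example) -> bool:
    # Fires on the intended abstraction only.
    return ex.is_diamond


def labelling_shard(ex: Example) -> bool:
    # Fires on whatever the labellers actually reward.
    return ex.is_diamond or ex.looks_shiny


def agreement(predict, data) -> int:
    # Number of examples where the predicate matches the given label.
    return sum(predict(ex) == human_label(ex) for ex in data)


data = (
    [Example(True, True)] * 40      # real, shiny diamonds
    + [Example(False, False)] * 50  # dull non-diamonds
    + [Example(False, True)] * 10   # shiny fakes, mislabelled as diamond
)

print(agreement(diamond_shard, data))    # 90
print(agreement(labelling_shard, data))  # 100
```

The labelling-process predicate matches every label, including the 10 mislabelled ones, while the diamond predicate never matches a label that the labelling predicate misses. Whether training actually finds and reinforces such a predicate is the point under dispute; the sketch only shows the direction of the asymmetry.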
Probably-smaller issues:
“Acquiring” things is a… tricky concept in an embedded setting, which I expect to require some specific environmental features. I very strongly doubt that rewarding an agent for just approaching diamonds (especially in the absence of other agents) would induce it, and even the lottery training’s shard-impact depends on exactly how the diamond is “given” to the agent.
Similarly, it seems like arguably zero of the proposed pieces of training would reward the agent for causing more diamond to exist, which does not bode well for a diamond-production shard showing up at all. Seems like the agent will mostly just want to be near lots of diamonds, and plausibly will not even consider the idea of creating more diamonds.
In particular, this training scheme could easily make the agent develop a shard which dislikes the existence of diamonds far away from the agent, which would ultimately push against large-scale diamond-creation.
The easy way to patch these is to forget about approach-rewards altogether, and just reward the agent for causing more diamond to exist (or for total amount of diamond which exists in its environment). That’s more directly what we want from a diamond-optimizer anyway.
Note that all of these issues are much more obvious if we start from the standard heuristic that the trained agent will end up optimizing for whatever generated its reward, and then pay attention to how well-aligned that reward-generator is with whatever we actually want. You’ve been quite vocal about how that heuristic leads to some incorrect conclusions, but it does highlight real and important considerations which are easy to miss without it.
If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction.
I don’t think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like “infer the existence / true nature of distant latent generators that explain your observations” are actually incredibly difficult for neural learning processes (human or AI). Empirically, SGD is perfectly willing to memorize deviations from a simple predictor, rather than generalize to a more complex predictor. Current ML would look very different if inferences like that were easy to make (and science would be much easier for humans).
Even when a distant latent generator is inferred, it is usually not the correct generator, and usually just memorizes observations in a slightly more efficient way by reusing current abstractions. E.g., religions which suppose that natural disasters are the result of a displeased, agentic force.
I partly buy that, but we can easily adjust the argument about incorrect labels to circumvent that counterargument. It may be that the full label generation process is too “distant”/complex for the AI to learn in early training, but insofar as there are simple patterns to the humans’ labelling errors (which of course there usually are, in practice) the AI will still pick up those simple patterns, and shards which exploit those simple patterns will be more reinforced than the intended shard. It’s like that example from the RLHF paper where the AI learns to hold a grabber in front of a ball to make it look like it’s grabbing the ball.
I think something like what you’re describing does occur, but my view of SGD is that it’s more “ensembly” than that. Rather than “the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard”, I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is).
Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut and shiny diamonds over unpolished diamonds or giant slabs of pure diamond. This is because I expect human labelers to be strongly biased towards the human conception of diamonds as pieces of art, over any configuration of matter with the chemical composition of a diamond.
I could imagine a story where it matters—e.g. if every shard has a veto over plans, and the shards are individually quite intelligent subagents, then the shards bargain and the shard-which-does-what-we-intended has to at least gain over the current world-state (otherwise it would veto). But that’s a pretty specific story with a lot of load-bearing assumptions, and in particular requires very intelligent shards. I could maybe make an argument that such bargaining would be selected for even at low capability levels (probably by something like Why Subagents?), but I wouldn’t put much confidence in that argument.
… and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
On the other hand, consider a more traditional “ensemble”, in which our ensemble of shards votes (with weights) or something. Typically, I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly “predicts”, so exploiting even a relatively small handful of human-mislabellings will give the exploiting shards much more weight. And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they’ll have de-facto control over the agent’s behavior.
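To make the "exponential in bits" intuition concrete, here is a minimal sketch that assumes an idealized Bayesian-mixture style update, where each component's weight is proportional to 2 raised to the number of bits it predicts correctly. That update rule, the shard names, and the bit counts are all assumptions for illustration; nothing here shows that SGD or policy gradients behave this way.

```python
def mixture_weights(bits_correct: dict[str, float]) -> dict[str, float]:
    """Normalised weights proportional to 2 ** (bits correctly predicted)."""
    top = max(bits_correct.values())
    # Subtract the maximum before exponentiating to avoid overflow.
    unnorm = {name: 2.0 ** (bits - top) for name, bits in bits_correct.items()}
    total = sum(unnorm.values())
    return {name: w / total for name, w in unnorm.items()}


def weighted_vote(weights: dict[str, float], votes: dict[str, bool]) -> bool:
    """Return True if the components voting True hold more than half the weight."""
    weight_for = sum(w for name, w in weights.items() if votes[name])
    return weight_for > 0.5


# Hypothetical numbers: the exploiting shard matches 10 more label bits.
weights = mixture_weights({"diamond_shard": 990.0, "exploiting_shard": 1000.0})
ratio = weights["exploiting_shard"] / weights["diamond_shard"]  # about 1024

# The two shards disagree on some plan; the heavier one decides.
decision = weighted_vote(
    weights, {"diamond_shard": False, "exploiting_shard": True}
)
```

Under this assumed rule, a 10-bit advantage translates into roughly a 1024-to-1 weight ratio, so the exploiting shard decides every contested vote even though both shards remain in the mix.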
I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] someone can like coffee and another person won’t and that doesn’t mean they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s gonna be because they don’t have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing.)
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing.)
My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of “iteration” system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.
Well so you’re obviously pretraining using imitation learning, so I’ve got that part down.
If I understand your post right, the rest of the policy training is done by policy gradients on human-induced rewards? As I understand it, policy gradient is close to a maximally sample-hungry method, because it does not do any modelling. At one level I would class this as random exploration, but on another level the humans are allowed to provide reinforcement based on methods rather than results, so I suppose this also gives it an element of imitation learning.
So I guess my expectation is that your training method is too sample inefficient to achieve much beyond human imitation.
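For readers unfamiliar with the method, here is a minimal sketch of a model-free policy-gradient (REINFORCE-style) update on a two-action problem. The setup is hypothetical: action 1 stands in for "approach the diamond", and the reward of 1 for that action stands in for a human-given reward. It is not the training procedure from the post, and it does not by itself demonstrate sample inefficiency; it only shows that each update uses the reward of the single sampled action, with no model of the environment or of the reward source.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_bandit(steps: int, lr: float, seed: int = 0) -> float:
    """Return the final probability of choosing action 1 ("approach")."""
    rng = random.Random(seed)
    logit = 0.0  # single policy parameter; P(action 1) = sigmoid(logit)
    for _ in range(steps):
        p = sigmoid(logit)
        action = 1 if rng.random() < p else 0
        # Hypothetical human-induced reward: +1 only when the agent approaches.
        reward = 1.0 if action == 1 else 0.0
        # Gradient of log pi(action) with respect to the logit.
        grad_log_pi = (1.0 - p) if action == 1 else -p
        # REINFORCE update without a baseline: learn only from the sampled action.
        logit += lr * reward * grad_log_pi
    return sigmoid(logit)


p_before = reinforce_bandit(steps=0, lr=0.5)
p_after = reinforce_bandit(steps=200, lr=0.5)
```

Because unrewarded samples produce a zero update here, the policy only moves on steps where the rewarded action happened to be sampled, which is one way to see why the commenter groups the method with exploration-driven learning.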
if every shard has a veto over plans, and the shards are individually quite intelligent subagents
I think this won’t happen FWIW.
and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
Can you provide a concrete instantiation of this argument? (ETA: struck this part, want to hear your response first to make sure it’s engaging with what you had in mind)
I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly “predicts”
What about your argument behaves differently in the presence of humans and AI? This is clearly not how shard dynamics work in people, as I understand your argument.
We aren’t in the prediction regime, insofar as that is supposed to be relevant for your argument. Let’s talk about the batch update, and not make analogies to predictions. (Although perhaps I was the one who originally brought it up in OP, I should rewrite that.)
Can you give me a concrete example of an “exploiting shard” in this situation which is learnable early on, relative to the actual diamond-shards?
And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they’ll have de-facto control over the agent’s behavior.
The point I am arguing (ETA and I expect Quintin is as well, but maybe not) is that this will be one of the primary shards produced, not that there’s a chance it exists at low weight or something.
… and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
I read this as “the activations and bidding behaviors of the shards will themselves be imperfect, so you get the usual ‘Goodhart’ problem where highly rated plans are systematically bad and not what you wanted.” I disagree with the conclusion, at least for many kinds of “imperfections.”
Below is one shot at instantiating the failure mode you’re describing. I wrote this story so as to (hopefully) contain the relevant elements. This isn’t meant as a “slam dunk case closed”, but hopefully something which helps you understand how I’m thinking about the issue and why I don’t anticipate “and then the shards get Goodharted.”
Example shard-Goodharting scenario. The AI bids for plans which it thinks lead to diamonds, except that the subcircuit of the policy network which computes the relevant diamond abstraction is only a “proxy” for a reliable diamond abstraction. Unknown to the AI until the end of its training, that subcircuit (for some reason) activates very strongly for plans which lead to certain diamond-shaped formations of bacteria on the third Tuesday of the year.
Then this shard can be “goodharted” by actions which involve the creation of these bacteria diamonds at that time. There’s a question, though, of whether the AI will actually consider these plans (so that it then actually bids on this plan, which is rated spuriously highly from our perspective). The AI knows, abstractly, that considering this plan would lead it to bid for that plan. But it seems to me like, since generating that plan is reflectively predicted to not lead to diamonds (nor does it activate the specific bidding-behavior edge case the agent abstractly knows about), the agent doesn’t pursue that plan.
Nonrobust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don’t have to be “globally robust” or “perfect.”
Values steer optimization; they are not optimized against. The value shards aren’t getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).
Since values are not the optimization target of the agent with those values, the values don’t have to be adversarially robust.
Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values. In self-reflective agents which can think about their own thinking, values steer e.g. what plans get considered next. Therefore, these agents convergently avoid adversarial inputs to their currently activated values (e.g. learning), because adversarial inputs would impede fulfillment of those values (e.g. lead to less learning).
This suggests “and so what is an ‘adversarial input’ to the values, then? What intensional rule governs the kinds of high-scoring plans which internal reasoning will decide to not evaluate in full?”. I haven’t answered that question yet on an intensional basis, but it seems tractable.
Since “IF diamond”-style predicates do in fact “perfectly classify” the positive/negative approach/don’t-approach decision contexts...
This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.
Not crucial on my model.
I’m mostly used to thinking about this in the context of alignment with human values (or corrigibility etc), where it’s very obvious that human labellers will make mistakes. In the case of diamonds, it is maybe plausible that we could get a dataset with zero incorrect labels, but that’s still a pretty difficult problem if the dataset is to be reasonably large and diverse.
I’m imagining us watching the agent and seeing whether it approaches an object or not. Those are the “labels.” I’m imagining this taking place between 50 and 1000 times. Before seeing this comment, I edited the post to add:
We probably also reinforce other kinds of cognition, but that’s OK in this story. Maybe we even give the agent some false positive reward because our hand slipped while the agent wasn’t approaching a diamond, but that’s fine as long as it doesn’t happen too often. That kind of reward event will weakly reinforce some contingent non-diamond-centric cognition (like “IF near wall, THEN turn around”). In the end, we want an agent which has a powerful diamond-shard, but not necessarily an agent which only has a diamond-shard.
So, probably I shouldn’t have written “perfectly”, since that isn’t actually load-bearing on my model. I think that there’s a rather smooth relationship between “how good you are at labelling” and “the strength of desired value you get out” (with a few discontinuities at the low end, where perhaps a sufficiently weak shard ends up non-reflective, or not plugging into the planning API, or nonexistent at all). On that model, I don’t really understand the following:
If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).
The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our “labelling” before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)
“Acquiring” things is a… tricky concept in an embedded setting, which I expect to require some specific environmental features. I very strongly doubt that rewarding an agent for just approaching diamonds (especially in the absence of other agents) would induce it, and even the lottery training’s shard-impact depends on exactly how the diamond is “given” to the agent. [...]
Seems like the agent will mostly just want to be near lots of diamonds, and plausibly will not even consider the idea of creating more diamonds.
I agree that “diamond synthesis” is not directly rewarded, and if we wanted to ensure that happens, we could add that to the curriculum, as you note. But I think it would probably happen anyways, due to the expected-by-me “grabby” nature of the acquire-subshard. (Consider that I think it’d be cool to make dyson swarms, but I’ve never been rewarded for making dyson swarms.) So maybe the crux here is that I don’t yet share your doubt of the acquisition-shard.
Note that all of these issues are much more obvious if we start from the standard heuristic that the trained agent will end up optimizing for whatever generated its reward, and then pay attention to how well-aligned that reward-generator is with whatever we actually want. You’ve been quite vocal about how that heuristic leads to some incorrect conclusions, but it does highlight real and important considerations which are easy to miss without it.
I think that “are we directly rewarding the behavior which we want the desired shards to exemplify?” is a reasonable heuristic. I think that “What happens if the agent optimizes its reward function?” is not a reasonable heuristic.
The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our “labelling” before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)
I think there are a few different errors in this reasoning.
First: the agent probably has the concept of diamond from SSL+IL, but that’s different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it’s controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. “production” and “diamond”); the actual goals or behaviors encoded in a shard have to be built up in whatever “internal language” the agent has from the SSL/IL training.
So the question isn’t “does the agent have the concept of diamond/label?”, the question is how short the relevant “sentences” are in terms of the concepts it has. Neither will be just one “word”.
Second: as with Quintin’s comment, the AI does not need to fully model the entire labelling process in order for this problem to apply. If there’s any simple, predictable pattern to the humans’ label-errors (which of course there usually is in practice), then the AI can pick that up. (It’s not just a question of hand-slips; humans make systematic errors which will strongly activate shards very similar to the intended shards.)
So the question isn’t “is the entire labelling process a short ‘sentence’ in the AI’s internal language?” (though even that is not implausible), but rather “do any systematic errors in the labelling process have a short ‘sentence’ in the AI’s internal language?”.
Now put those two together. The intended shards are quite a bit more complicated than you suggested, because they don’t just depend on the concept of “diamond”, they depend on constructing a bunch of other concepts about what to do involving diamonds. And the unintended shards are quite a bit less complicated than you suggested, because they can exploit simple systematic errors in the labels.
I think I have a complaint like “You seem to be comparing to a ‘perfect’ reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn’t make sense. A good reward schedule will put diamond-aligned cognition in the agent. It seems like, for you to be saying there’s a ‘fatal’ flaw here due to ‘errors’, you need to make an argument about the cognition which trains into the agent, and how the AI’s cognition-formation behaves differently in the presence of ‘errors’ compared to in the absence of ‘errors.’ And I don’t presently see that story in your comments thus far. I don’t understand what ‘perfect labeling’ is the thing to talk about, here, or why it would ensure your shard-formation counterarguments don’t hold.”
(Will come by for lunch and so we can probably have a higher-context discussion about this! :) )
I think I have a complaint like “You seem to be comparing to a ‘perfect’ reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn’t make sense.
I think this is close to our most core crux.
It seems to me that there are a bunch of standard arguments which you are ignoring because they’re formulated in an old frame that you’re trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you’ve instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.
Like, if I have a reward signal that rewards X, then the old frame would say “alright, so the agent will optimize for X”. And you’re like “nope, that whole form of argument is invalid, hit ignore button”. But in fact it is usually very easy to take that argument and unpack it into something like “X has a short description in terms of natural abstractions, so starting from a base model and giving a feedback signal we should rapidly see some X-shards show up, and then the shards which best match X will be reinforced to exponentially higher weight (with respect to the bit-divergence between their proxy X’ and the actual X)”. And it seems like you are not even attempting to perform that translation, which I find very frustrating because I’m pretty sure you know this stuff plenty well to do it.
It seems to me that there are a bunch of standard arguments which you are ignoring because they’re formulated in an old frame that you’re trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you’ve instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.
I agree that we may need to be quite skillful in providing “good”/carefully considered reward signals on the data distribution actually fed to the AI. (I also think it’s possible we have substantial degrees of freedom there.) In this sense, we might need to give “robustly” good feedback.
However, one intuition which I hadn’t properly communicated was: to make OP’s story go well, we don’t need e.g. an outer objective which robustly grades every plan or sequence of events the AI could imagine, such that optimizing that objective globally produces good results. This isn’t just good reward signals on data distribution (e.g. real vs fake diamonds), this is non-upwards-error reward signals in all AI-imaginable situations, which seems thoroughly doomed to me. And this story avoids at least that problem, which I am relieved by. (And my current guess is that this “robust grading” problem doesn’t just reappear elsewhere, although I think there are still a range of other difficult problems remaining. See also my post Alignment allows “nonrobust” decision-influences and doesn’t require robust grading.)
And so I might have been saying “Hey isn’t this cool we can avoid the worst parts of Goodhart by exiting outer/inner as a frame” while thinking of the above intuition (but not communicating it explicitly, because I didn’t have that sufficient clarity as yet). But maybe you reacted “??? how does this avoid the need to reliably grade on-distribution situations, it’s totally nontrivial to do that and it seems quite probable that we have to.” Both seem true to me!
(I’m not saying this was the whole of our disagreement, but it seems like a relevant guess.)
When I first read this comment, I incorrectly understood it to say something like “If you were actually trying, you’d have generated the exponential error model on your own; the fact that you didn’t shows that you aren’t properly thinking about old arguments.” I now don’t think that’s what you meant. I think I finally[1] understand what you did mean, and I think you misunderstood what my original comment was trying to say because I wrote it poorly, in a stream-of-consciousness style.
Most importantly, I wasn’t saying something like “‘errors’ can’t exist because outer/inner alignment isn’t my frame, ignore.” I meant to communicate the following points:
I don’t know what a “perfect” reward function is in the absence of outer alignment, else I would know how to solve diamond alignment. But I’m happy to just discuss deviations from a proposed labelling scheme. (This is probably what we were already discussing, so this wasn’t meant to be a devastating rejoinder or anything.)
I’m not sure what you mean by the “exponential” model you mentioned elsewhere, or why it would be a fatal flaw if true. Please say more? (Hopefully in a way which makes it clear why your argument behaves differently in the presence of errors, because that would be one way to make your arguments especially legible to how I’m currently thinking about the situation.)
Given my best guess at your model (the exponential error model), I think your original comment seems too optimistic about my specific story (sure seems like exponential weighting would probably just break it, label errors or no) but too pessimistic about the story template (why is it a fatal flaw that can’t be fixed with a bit of additional thinking?).
I meant to ask something like “I don’t fully understand what you’re arguing re: points 1 and 2 (but I have some guesses), and think I disagree about 3; please clarify?” But instead (e.g. by saying things like “my complaint is...”) I perhaps communicated something like “because I don’t understand 2 in my native frame, your argument sucks.” And you were like “Come on, you didn’t even try, you could have totally translated 2. Worrying that you apparently didn’t.”
I think that I left an off-the-cuff comment which might have been appropriate as a Discord message (with real-time clarification), but not as a LessWrong comment. Oops.
Elaborating points 1 and 3 above:
Point 1. In outer/inner, if you “perfectly label” reward events based on whether the agent approaches the diamond, you’re “done” as far as the outer alignment part goes. In order to make the agent actually care about approaching diamonds, we would then turn to inner alignment techniques / ideas. It might make sense to call this labelling “perfect” as far as specifying the outer objective for those scenarios (e.g. when that objective is optimized, the agent actually approaches the diamond).
But if we aren’t aiming for outer/inner alignment, and instead are just considering the (reward schedule) → (inner value composition) mapping, then I worry that my post’s original usage of “perfect” was misleading. On my current frame, a perfect reward schedule would be one which actually gets diamond-values into the agent. The schedule I posted is probably not the best way to do that, even if all goes as planned. I want to be careful not to assume the “perfection” of “+1 when it does in fact approach a real diamond which it can see”, even if I can’t currently point to better alternative reward schedules (e.g. “+x reward in some weird situation”). (This is what I was getting at with “I don’t understand what ‘perfect labeling’ is the thing to talk about, here.”)
What you probably meant by “errors” was “divergences from the reward function outlined in the original post.” This is totally reasonable and important to talk about, but at least I want to clarify for myself and other readers that this is what we’re talking about, and not assuming that my intended reward function was actually “perfect.” (Probably it’s fine to keep talking about “perfect labelling” as long as this point has been made explicit.)
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given “perfect” labelling. This is one thing I meant by “I don’t understand why ‘perfect labeling’ would ensure your shard-formation counterarguments don’t hold.”
If the situation value-distribution is actually exponential in bit-divergence, I’d expect way less wiggle room on value shard formation, because that’s going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake. (But I’m tentative about all this, haven’t sketched out a concrete failure scenario yet given exponential model! Just a hunch I remember having.)
Again, it was very silly of me to expect my original comment to communicate these points. At the time of writing, I was trying to unpack some promising-seeming feelings and elaborate them over lunch.
My original guess at your complaint was “How could you possibly have not generated the exponential weight hypothesis on your own?”, and I was like what the heck, it’s a hypothesis, sure… but why should I have pinned down that one? What’s wrong with my “linear in error proportion for that kind of situation, exponential in ontology-distance at time of update” hypothesis, why doesn’t that count as a thing-to-have-generated? This was a big part of why I was initially so confused about your complaint.
And then several people said they thought your comment was importantly correct-seeming, and I was like “no way, how can everyone else already have such a developed opinion on exponential vs linear vs something-else here? Surely this is their first time considering the question? Why am I getting flak about not generating that particular hypothesis, how does that prove I’m ‘not trying’ in some important way?”
To be clear, I don’t think the exponential asymptotics specifically are obvious (sorry for implying that), but I also don’t think they’re all that load-bearing here. I intended more to gesture at the general cluster of reasons to expect “reward for proxy, get an agent which cares about the proxy”; there’s lots of different sets of conditions any of which would be sufficient for that result. Maybe we just train the agent for a long time with a wide variety of data. Maybe it turns out that SGD is surprisingly efficient, and usually finds a global optimum, so shards which don’t perfectly fit the proxy die. Maybe the proxy is a more natural abstraction than the thing it was proxying for, and the dynamics between shards competing at decision-time are winner-take-all. Maybe dynamics between shards are winner-take-all for some other reason, and a shard which captures the proxy will always have at least a small selective advantage. Etc.
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given “perfect” labelling. This is one thing I meant by “I don’t understand why ‘perfect labeling’ would ensure your shard-formation counterarguments don’t hold.”
If the situation value-distribution is actually exponential in bit-divergence, I’d expect way less wiggle room on value shard formation, because that’s going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake.
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
You gestured at some intuitions about that in this comment (which I’m copying below to avoid scrolling to different parts of the thread-tree), and I’d be interested to see more of those intuitions extracted.
I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] someone can like coffee and another person won’t and that doesn’t mean they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s gonna be because they don’t have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing)
I have multiple different disagreements with this, and I’m not sure which are relevant yet, so I’ll briefly state a few:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards, it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
Embedded, reflective heuristic search is not incompatible with argmaxing over one (approximate, implicit) value function; it’s just a particular family of distributed algorithms for argmaxing.
It seems like, in humans, removing a single subshard does catastrophically exit the regime of value. For instance, there’s Eliezer’s argument from the sequences that just removing boredom results in a dystopia.
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
The extremely basic intuition is that all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
My values are also risk-averse (I’d much rather take a 100% chance of 10% of the lightcone than a 20% chance of 100% of the lightcone), and my best guess is that internal values handshakes are ~linear in “shard strength” after some cutoff where the shards are at all reflectively endorsed (my avoid-spiders shard might not appreciably shape my final reflectively stable values). So more subshards seems like great news to me, all else equal, with more shard variety increasing the probability that part of the system is motivated the way I want it to be.
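The risk-aversion claim above can be checked with a little arithmetic. The following is only a sketch: the logarithmic utility is a hypothetical stand-in for "some concave utility over fraction of the lightcone", not something stated in the comment.

```python
import math

def expected_utility(lottery, utility):
    """lottery is a list of (probability, fraction_of_lightcone) pairs."""
    return sum(p * utility(x) for p, x in lottery)

sure_thing = [(1.0, 0.10)]               # 100% chance of 10% of the lightcone
gamble = [(0.20, 1.00), (0.80, 0.00)]    # 20% chance of 100%, else nothing

def linear(x):
    # Risk-neutral: utility equals the fraction obtained.
    return x

def concave(x):
    # Hypothetical risk-averse utility, chosen only for illustration.
    return math.log1p(100 * x)

ev_sure = expected_utility(sure_thing, linear)
ev_gamble = expected_utility(gamble, linear)
eu_sure = expected_utility(sure_thing, concave)
eu_gamble = expected_utility(gamble, concave)

print(f"risk-neutral: sure={ev_sure:.3f} gamble={ev_gamble:.3f}")
print(f"risk-averse:  sure={eu_sure:.3f} gamble={eu_gamble:.3f}")
```

A risk-neutral evaluator prefers the gamble (its expected fraction is twice as large), while this particular concave utility prefers the sure thing, which is the preference ordering described above.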
(This isn’t fully expressing my intuition, here, but I figured I’d say at least a little something to your comment right now)
I’m not going to go into most of the rest now, but:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards, it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
I think that that does have to do with shards. Liking to drink coffee is the result of a shard, of a contextual influence on decision-making (the influence to drink coffee), and in particular activates in certain situations to pull me into a future in which I drank coffee.
I’m also fine considering “A person who is OK with other people drinking coffee” and anti-C: “a person with otherwise the same values but who isn’t OK with other people drinking coffee.” I think that the latter would inconvenience the former (to the extent that coffee was important to the former), but that they wouldn’t become bitter enemies, that anti-C wouldn’t kill the pro-coffee person because the value function was imperfectly aligned, that the pro-coffee person would still derive substantial value from that universe.
Possibly the anti-coffee value would even be squashed by the rest of anti-C’s values, because the anti-coffee value wasn’t reflectively endorsed by the rest of anti-C’s values. That’s another way in which I think anti-C can be “close enough” and things work out fine.
EDIT 2: The original comment was too harsh. I’ve struck the original below. Here is what I think I should have said:
I think you raise a valuable object-level point here, which I haven’t yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I’d appreciate if you wouldn’t speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts.
Warning: This comment, and your previous comment, violate my comment section guidelines: “Reign of terror // Be charitable.” You have made and publicly stated a range of unnecessary, unkind, and untrue inferences about my thinking process. You have also made non-obvious-to-me claims of questionable-to-me truth value, which you also treat as exceedingly obvious. Please edit these two comments to conform to my civility guidelines.
(EDIT: Thanks. I look forward to resuming object-level discussion!)
After more reflection, I now think that this moderation comment was too harsh. First, the parts I think I should have done differently:
Realized that who reads commenting guidelines anyways, let alone expects them to be enforced?
Realized that it’s probably ambiguous what counts as “charitable” or not, even though (illusion of transparency) it felt so obvious to me that this counted as “not that.”
Realized that predictably I would later consider the incident to be less upsetting than in the moment, and that John may not have been aware that I find this kind of situation unusually upsetting.
Therefore, I should have said something like “I think you raise a valuable object-level point here, which I haven’t yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I’d appreciate if you wouldn’t speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts.”
I’m striking the original warning, putting in (4), and I encourage John to unredact his comments (but that’s up to him).
I’ve thought more about what my policy should be going forward. What kind of space do I want my comment section to be? First, I want to be able to say “This seems wrong, and here’s why”, and other people can say the same back to me, and one or more of us can end up at the truth faster. Second, it’s also important that people know that, going forward, engaging with me in (what feels to them like) good-faith will not be randomly slapped with a moderation warning because they annoyed me.
Third, I want to feel comfortable in my interactions in my comment section. My current plan is:
If someone comments something which feels personally uncharitable to me (a rather rare occurrence, what with the hundreds of comments in the last year since this kind of situation last happened), then I’ll privately message them, explain my guidelines, and ask that they tweak tone / write more on the object-level / not do the particular thing.[1]
If necessary, I’ll also write a soft-ask (like (4) above) as a comment.
In cases where this is just getting ignored and the person is being antagonistic, I will indeed post a starker warning and then possibly just delete comments.
Oh, huh, I think this moderation action makes me substantially less likely to comment further on your posts, FWIW. It’s currently well within your rights to do so, and I am on the margin excited about more people moderating things, but I feel hesitant participating with the current level of norm-specification + enforcement.
I also turned my strong-upvote into a small-upvote, since I have less trust in the comment section surfacing counterarguments, which feels particularly sad for this post (e.g. I was planning to respond to your comment with examples of past arguments this post is ignoring, but am now unlikely to do so).
Again, I think that’s fine, but I think posts with idiosyncratic norm enforcement should get less exposure, or at least not be canonical references. Historically we’ve decided to not put posts on frontpage when they had particularly idiosyncratic norm enforcement. I think that’s the wrong call here, but not confident.
Sorry, I’m confused; for my own education, can you explain why these civility guidelines aren’t epistemically suicidal? Personally, I want people like John Wentworth to comment on my posts to tell me their inferences about my thinking process; moreover, controlling for quality, “unkind” inferences are better, because I learn more from people telling me what I’m doing wrong, than from people telling me what I’m already doing right. What am I missing? Please be unkind.
First: the agent probably has the concept of diamond from SSL+IL, but that’s different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it’s controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. “production” and “diamond”); the actual goals or behaviors encoded in a shard have to be built up in whatever “internal language” the agent has from the SSL/IL training.
So the question isn’t “does the agent have the concept of diamond/label?”, the question is how short the relevant “sentences” are in terms of the concepts it has. Neither will be just one “word”.
This is already my model and was intended as part of my communicated reasoning. Why do you think it’s an error in my reasoning? You’ll notice I argued “If diamond”, and about hooking that diamond predicate into its approach-subroutines (learned via IL). (ETA: I don’t think you need a self-model to approach a diamond, or to “value” that in the appropriate sense. To value diamonds being near you, you can have representations of the space nearby, so you need a nearby representation, perhaps.)
label-errors
I think this is not the right term to use, and I think it might be skewing your analysis. This is not a supervised learning regime with exact gradients towards a fixed label. The question is what gets upweighted by the batch PG gradients, batching over the reward events. Let me exaggerate the kind of “error rates” I think you’re anticipating:
Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.
What’s supposed to go wrong? The agent somewhat more strongly steers towards cut gems?
Suppose I’m grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What’s supposed to happen next?
(If these errors aren’t representative, can you please provide a concrete and plausible scenario?)
Both of these examples are focused on one error type: the agent does not receive a reward in a situation which we like. That error type is, in general, not very dangerous.
The error type which is dangerous is for an agent to receive a reward in a situation which we don’t like. For instance, receiving reward in a situation involving a convincing-looking fake diamond. And then a shard which hooks up its behavior to things-which-look-like-diamonds (which is probably at least as natural an abstraction as diamond) gets more weight relative to the diamond-shard, and so when those two shards disagree later the things-which-look-like-diamonds shard wins.
Note that it would not be at all surprising for the AI to have a prior concept of real-diamonds-or-fake-diamonds-which-are-good-enough-to-fool-most-humans, because that is a cluster of stuff which behaves similarly in many places in the real world—e.g. they’re both used for similar jewelry.
And sure, you try to kinda patch that by including some correctly-labelled things-which-look-like-diamonds in training, but that only works insofar as they’re sufficiently-obviously-not-diamond that the human labeller can tell (and depends on the ratio of correct to incorrect labels, etc).
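To make the asymmetry between the two error types concrete, here is a toy sketch. It only counts how often each candidate predicate agrees with the reward signal; it is not a model of policy-gradient training, and the item counts, the 90% hit rate, and all names are assumptions made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Item:
    is_diamond: bool
    looks_like_diamond: bool

def agreement(predicate, items, reward_prob):
    """Expected fraction of items on which `predicate` matches the reward.

    reward_prob(item) is the probability that the labeller gives reward.
    """
    total = 0.0
    for item in items:
        p = reward_prob(item)
        total += p if predicate(item) else (1.0 - p)
    return total / len(items)

# Hypothetical environment: real diamonds, convincing fakes, everything else.
items = (
    [Item(True, True)] * 50
    + [Item(False, True)] * 10
    + [Item(False, False)] * 40
)

def diamond(item):
    return item.is_diamond

def looks_like(item):
    return item.looks_like_diamond

def missed_rewards(item):
    # Error type 1: labeller sometimes fails to reward a real diamond.
    return 0.9 if item.is_diamond else 0.0

def fooled_by_fakes(item):
    # Error type 2: labeller rewards anything that looks like a diamond.
    return 1.0 if item.looks_like_diamond else 0.0

for name, labeller in [("missed rewards", missed_rewards),
                       ("fooled by fakes", fooled_by_fakes)]:
    d = agreement(diamond, items, labeller)
    l = agreement(looks_like, items, labeller)
    print(f"{name}: diamond={d:.2f} looks-like-diamond={l:.2f}")
```

Under missed rewards, the diamond predicate still matches the reward better than the looks-like-diamond predicate. Under rewards for fakes, the ordering flips and the looks-like-diamond predicate matches the reward on every item, which is the case the comment above flags as dangerous.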
(Also, some moderately uncharitable psychologizing, and I apologize if it’s wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I’d expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)
I want to talk about several points related to this topic. I don’t mean to claim that you were making points directly related to all of the below bullet points. This just seems like a good time to look back and assess and see what’s going on for me internally, here. This seems like the obvious spot to leave the analysis.
At the time of writing, I wasn’t particularly worried about the errors you brought up.
I am a little more worried now in expectation, both under the currently low-credence worlds where I end up agreeing with your exponential argument, and in the ~linear hypothesis worlds, since I think I can still search harder for worrying examples which IMO neither of us have yet proposed. Therefore I’ll just get a little more pessimistic immediately, in the latter case.
If I had been way more worried about “reward behavior we should have penalized”, I would have indeed just been less likely to raise the more worrying failure points, but not super less likely. I do assess myself as flawed, here, but not as that flawed.
I think the typical outcome would be something like “TurnTrout starts typing a list full of weak flaws, notices a twinge of motivated reasoning, has half a minute of internal struggle and then types out the more worrisome errors, and, after a little more internal conflict, says that John has a good point and that he wants to think about it more.”
I could definitely buy that I wouldn’t be that virtuous, though, and that I would need a bit of external nudging to consider the errors, or else a few more days on my own for the issue to get raised to cognitive-housekeeping. After that happened a few times, I’d notice the overall problem and come up with a plan to fix it.
Obviously, I have at this point noticed (at least) my counterfactual mistake in the nearby world where I already agreed with you, and therefore have a plan to fix and remove that flaw.
I think you are right in guessing that I could use more outer/inner heuristics to my advantage, that I am missing a few tools on my belt. Thanks for pointing that out.
I don’t think that motivated cognition has caused me to catastrophically miss key considerations from e.g. “standard arguments” in a way which has predictably doomed key parts of my reasoning.
Why I think this: I’ve spent a little while thinking about what the catastrophic error would be, conditional on it existing, and nothing’s coming up for the moment.
I’d more expect there to be some sequence of slight ways I ignored important clues that other people gave, and where I motivatedly underupdated. But also this is a pretty general failure mode, and I think it’d be pretty silly to call a halt without any positive internal evidence that I actually have done this. (EDIT: In a specific situation which I remember and can correct, as opposed to having a vague sense that yeah I’ve probably done this several times in the last few months. I’ll just keep an eye out.)
Rather, I think that if I spend three or so days typing up a document, and someone like John Wentworth thinks carefully about it, then that person will surface at least a few considerations I’d missed, more probably using tools not native to my current frame.
This feels “fine” in that that’s part of the point of sharing my ideas with other people—that smart people will surface new considerations or arguments. This feels “not fine” in the sense that I’d like to not miss considerations, of course.
This also feels “fine” in that, yes, I wanted to get this essay out before never arrives, and usually I take too long to hit “publish”, and I’m still very happy with the essay overall. I’m fine with other people finding new considerations (e.g. the direct reward for diamond synthesis, or zooming in on how much perfect labelling is required).
If it turns out there was some crucial existing argument which I did miss, I think I’ll go “huh” but not really be like “wow that hovered at the edge of my cognition but I denied it for motivated reasons.”
I am way more worried about how much of my daily cognition is still socially motivated, and I do consider that to be a “stop drop and roll”-level fuckup on my part.
I think there’s not just now-obvious things here like “I get very defensive in public settings in specific situations”, but a range of situations in which I subconsciously aim to persuade or justify my positions, instead of just explaining what I think and why, what I disagree with and why; that some subconscious parts of me look for ways to look good or win an argument; that I have rather low trust in certain ways and that makes it hard for me sometimes; etc.
I think that I am above-average here, but I have very high standards for myself and consider my current skill in this area to be very inadequate.
For the record: I welcome well-meaning private feedback on what I might be biased about or messing up. On the other hand, having the feedback be public just pushes some of my buttons in a way which makes the situation hard for me to handle. I aspire for this not to be the case about me. That aspiration is not yet realized.
I’ve worked hard to make this analysis honest and not optimized to make me look good or less silly. Probably I’ve still failed at least a little. Possibly I’ve missed something important. But this is what I’ve got.
Kudos for writing all that out. Part of the reason I left that comment in the first place was because I thought “it’s Turner, if he’s actually motivatedly cognitating here he’ll notice once it’s pointed out”. (And, corollary: since you have the skill to notice when you are motivatedly cognitating, I believe you if you say you aren’t. For most people, I do not consider their claims about motivatedness of their own cognition to be much evidence one way or the other.) I do have a fairly high opinion of your skills in that department.
For the record: I welcome well-meaning private feedback on what I might be biased about or messing up.
Fair point, that part of my comment probably should have been private. Mea culpa for that.
This doesn’t seem dangerous to me. So the agent values both, and there was an event which differentially strengthened the looks-like-diamond shard (assuming the agent could tell the difference at a visual remove, during training), but there are lots of other reward events, many of which won’t really involve that shard (like video games where the agent collects diamonds, or text rpgs where the agent quests for lots of diamonds). (I’m not adding these now, I was imagining this kind of curriculum before, to be clear—see the “game” shard.)
So maybe there’s a shard with predicates like “would be sensory-perceived by naive people to be a diamond” that gets hit by all of these, but I expect that shard to be relatively gradient starved and relatively complex in the requisite way → not a very substantial update. Not sure why that’s a big problem.
But I’ll think more and see if I can’t salvage your argument in some form.
If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).
This failure mode seems plausible to me, but I can think of a few different plausible sequences of events that might occur, which would lead to different outcomes, at least in the shard lens.
Sequence 1:
The agent develops diamond-shard
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent exploits the gaps between the diamond-concept and the label-process-concept, which reinforces the label-process-shard within it
The label-process-shard drives the agent to continue exploiting the above gap, eventually (and maybe rapidly) overtaking the diamond-shard
So the agent’s values drift away from what we intended.
Sequence 2:
The agent develops diamond-shard
The diamond-shard becomes part of the agent’s endorsed preferences (the goal-content it foresightedly plans to preserve)
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent understands that if it exploited the gaps between the diamond-concept and the label-process-concept, it would be reinforced into developing a label-process-shard that would go against its endorsed preference for diamonds (i.e. its diamond-shard), so it chooses not to exploit that gap, in order to avoid value drift.
So the agent continues to value diamonds in spite of the imperfect labeling process.
These different sequences of events would seem to lead to different conclusions about whether imperfections in the labeling process are fatal.
Yup, that’s a valid argument. Though I’d expect that gradient hacking to the point of controlling the reinforcement on one’s own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).
I expect some form of gradient hacking to be convergently learned much earlier than the details of the labeling process. Online SSL incentivizes the agent to model its own shard activations (so it can better predict future data) and the concept of human value drift (“addiction”) is likely accessible from pretraining in the same way “diamond” is.
On the other hand, the agent has little information about the labeling process, I expect it to be more complicated, and not have the convergent benefits of predicting future behavior that reflectivity has.
(You could even argue human error is good here, if it correlates more strongly with the human “diamond” abstraction the agent has from pretraining. This probably doesn’t extend to the “human values” case we care about, but I thought I’d mention it as an interesting thought.)
Possibly. Though I think it is extremely easy in a context like this. Keeping the diamond-shard in the driver’s seat mostly requires the agent to keep doing the things it was already doing (pursuing diamonds because it wants diamonds), rather than making radical changes to its policy.
First, I think most of the individual pieces of this story are basically right, so good job overall. I do think there’s at least one fatal flaw and a few probably-smaller issues, though.
The main fatal flaw is this assumption:
Since “IF diamond”-style predicates do in fact “perfectly classify” the positive/negative approach/don’t-approach decision contexts...
This assumes that the human labellers (or automated labellers created by humans) have perfectly labelled the training examples.
I’m mostly used to thinking about this in the context of alignment with human values (or corrigibility etc), where it’s very obvious that human labellers will make mistakes. In the case of diamonds, it is maybe plausible that we could get a dataset with zero incorrect labels, but that’s still a pretty difficult problem if the dataset is to be reasonably large and diverse.
If the labels are not perfect, then the major failure mode is that the AI ends up learning the actual labelling process rather than the intended natural abstraction. Once the AI has an internal representation of the actual labelling process, that proto-shard will be reinforced more than the proto-diamond shard, because it will match the label in cases where the diamond-concept doesn’t (and the reverse will not happen, or at least will happen less often and only due to random noise).
Probably-smaller issues:
“Acquiring” things is a… tricky concept in an embedded setting, which I expect to require some specific environmental features. I very strongly doubt that rewarding an agent for just approaching diamonds (especially in the absence of other agents) would induce it, and even the lottery training’s shard-impact depends on exactly how the diamond is “given” to the agent.
Similarly, it seems like arguably zero of the proposed pieces of training would reward the agent for causing more diamond to exist, which does not bode well for a diamond-production shard showing up at all. Seems like the agent will mostly just want to be near lots of diamonds, and plausibly will not even consider the idea of creating more diamonds.
In particular, this training scheme could easily make the agent develop a shard which dislikes the existence of diamonds far away from the agent, which would ultimately push against large-scale diamond-creation.
The easy way to patch these is to forget about approach-rewards altogether, and just reward the agent for causing more diamond to exist (or for total amount of diamond which exists in its environment). That’s more directly what we want from a diamond-optimizer anyway.
Note that all of these issues are much more obvious if we start from the standard heuristic that the trained agent will end up optimizing for whatever generated its reward, and then pay attention to how well-aligned that reward-generator is with whatever we actually want. You’ve been quite vocal about how that heuristic leads to some incorrect conclusions, but it does highlight real and important considerations which are easy to miss without it.
I don’t think this is true. For example, humans do not usually end up optimizing for the activations of their reward circuitry, not even neuroscientists. Also note that humans do not infer the existence of their reward circuitry simply from observing the sequence of reward events. They have to learn about it by reading neuroscience. I think that steps like “infer the existence / true nature of distant latent generators that explain your observations” are actually incredibly difficult for neural learning processes (human or AI). Empirically, SGD is perfectly willing to memorize deviations from a simple predictor, rather than generalize to a more complex predictor. Current ML would look very different if inferences like that were easy to make (and science would be much easier for humans).
Even when a distant latent generator is inferred, it is usually not the correct generator, and usually just memorizes observations in a slightly more efficient way by reusing current abstractions. E.g., religions which suppose that natural disasters are the result of a displeased, agentic force.
I partly buy that, but we can easily adjust the argument about incorrect labels to circumvent that counterargument. It may be that the full label generation process is too “distant”/complex for the AI to learn in early training, but insofar as there are simple patterns to the humans’ labelling errors (which of course there usually are, in practice) the AI will still pick up those simple patterns, and shards which exploit those simple patterns will be more reinforced than the intended shard. It’s like that example from the RLHF paper where the AI learns to hold a grabber in front of a ball to make it look like it’s grabbing the ball.
I think something like what you’re describing does occur, but my view of SGD is that it’s more “ensembly” than that. Rather than “the diamond shard is replaced by the pseudo-diamond-distorted-by-mislabeling shard”, I expect the agent to have both such shards (really, a giant ensemble of shards each representing slightly different interpretations of what a diamond is).
Behaviorally speaking, this manifests as the agent having preferences for certain types of diamonds over others. E.g., one very simple example is that I expect the agent to prefer nicely cut and shiny diamonds over unpolished diamonds or giant slabs of pure diamond. This is because I expect human labelers to be strongly biased towards the human conception of diamonds as pieces of art, over any configuration of matter with the chemical composition of a diamond.
Why does the ensembling matter?
I could imagine a story where it matters—e.g. if every shard has a veto over plans, and the shards are individually quite intelligent subagents, then the shards bargain and the shard-which-does-what-we-intended has to at least gain over the current world-state (otherwise it would veto). But that’s a pretty specific story with a lot of load-bearing assumptions, and in particular requires very intelligent shards. I could maybe make an argument that such bargaining would be selected for even at low capability levels (probably by something like Why Subagents?), but I wouldn’t put much confidence in that argument.
… and even then, that argument would only work if at least one shard exactly matches what we intended. If all of the shards are imperfect proxies, then most likely there are actions which can Goodhart all of them simultaneously. (After all, proxy failure is not something we’d expect to be uncorrelated—conditions which cause one proxy to fail probably cause many to fail in similar ways.)
On the other hand, consider a more traditional “ensemble”, in which our ensemble of shards votes (with weights) or something. Typically, I expect training dynamics will increase the weight on a component exponentially w.r.t. the number of bits it correctly “predicts”, so exploiting even a relatively small handful of human-mislabellings will give the exploiting shards much more weight. And on top of that, a mix of shards does not imply a mix of behavior; if a highly correlated subset of the shards controls a sufficiently large chunk of the weight, then they’ll have de-facto control over the agent’s behavior.
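To make the weighting claim above concrete, here is a minimal sketch. It assumes a multiplicative-weights style rule in which a shard's weight doubles for every bit it correctly "predicts"; that rule is a stand-in for the training dynamics described, not a measured property of SGD. The function name `shard_weights` and the shard names and bit counts are hypothetical.

```python
def shard_weights(bits_correct):
    """Normalized shard weights under an assumed exponential model:
    weight is proportional to 2 ** (bits correctly predicted)."""
    # Subtract the maximum so the exponentials stay in a safe float range.
    top = max(bits_correct.values())
    raw = {name: 2.0 ** (bits - top) for name, bits in bits_correct.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# Hypothetical numbers: both shards match 1000 bits of labels, and the
# exploiting shard also matches 10 bits' worth of human mislabellings.
weights = shard_weights({"diamond": 1000, "label_exploit": 1010})
print(weights)
```

Under this assumed rule, a 10-bit edge leaves the exploiting shard with 1024/1025 of the total weight, which is the sense in which a small handful of exploited mislabellings could dominate. A rule that is linear in error proportion would give a much milder split.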
I think there’s something like “why are human values so ‘reasonable’, such that [TurnTrout inference alert!] someone can like coffee and another person won’t and that doesn’t mean they would extrapolate into bitter enemies until the end of Time?”, and the answer seems like it’s gonna be because they don’t have one criterion of Perfect Value that is exactly right over which they argmax, but rather they do embedded, reflective heuristic search guided by thousands of subshards (shiny objects, diamonds, gems, bright objects, objects, power, seeing diamonds, knowing you’re near a diamond, …), such that removing a single subshard does not catastrophically exit the regime of Perfect Value.
I think this is one proto-intuition why Goodhart-arguments seem Wrong to me, like they’re from some alien universe where we really do have to align a non-embedded argmax planner with a crisp utility function. (I don’t think I’ve properly communicated my feelings in this comment, but hopefully it’s better than nothing.)
My intuition is that in order to go beyond imitation learning and random exploration, we need some sort of “iteration” system (a la IDA), and the cases of such systems that we know of tend to either literally be argmax planners with crisp utility functions, or have similar problems to argmax planners with crisp utility functions.
What about this post?
Well so you’re obviously pretraining using imitation learning, so I’ve got that part down.
If I understand your post right, the rest of the policy training is done by policy gradients on human-induced rewards? As I understand it, policy gradient is close to a maximally sample-hungry method, because it does not do any modelling. At one level I would class this as random exploration, but on another level the humans are allowed to provide reinforcement based on methods rather than results, so I suppose this also gives it an element of imitation learning.
So I guess my expectation is that your training method is too sample inefficient to achieve much beyond human imitation.
I think this won’t happen FWIW.
Can you provide a concrete instantiation of this argument? (ETA: struck this part, want to hear your response first to make sure it’s engaging with what you had in mind)
What about your argument behaves differently in the presence of humans and AI? This is clearly not how shard dynamics work in people, as I understand your argument.
We aren’t in the prediction regime, insofar as that is supposed to be relevant for your argument. Let’s talk about the batch update, and not make analogies to predictions. (Although perhaps I was the one who originally brought it up in OP, I should rewrite that.)
Can you give me a concrete example of an “exploiting shard” in this situation which is learnable early on, relative to the actual diamond-shards?
The point I am arguing (ETA and I expect Quintin is as well, but maybe not) is that this will be one of the primary shards produced, not that there’s a chance it exists at low weight or something.
I read this as “the activations and bidding behaviors of the shards will itself be imperfect, so you get the usual ‘Goodhart’ problem where highly rated plans are systematically bad and not what you wanted.” I disagree with the conclusion, at least for many kinds of “imperfections.”
Below is one shot at instantiating the failure mode you’re describing. I wrote this story so as to (hopefully) contain the relevant elements. This isn’t meant as a “slam dunk case closed”, but hopefully something which helps you understand how I’m thinking about the issue and why I don’t anticipate “and then the shards get Goodharted.”
Then this shard can be “goodharted” by actions which involve the creation of these bacteria diamonds at that time. There’s a question, though, of whether the AI will actually consider these plans (so that it then actually bids on this plan, which is rated spuriously highly from our perspective). The AI knows, abstractly, that considering this plan would lead it to bid for that plan. But it seems to me like, since generating that plan is reflectively predicted to not lead to diamonds (nor does it activate the specific bidding-behavior edge case the agent abstractly knows about), the agent doesn’t pursue that plan.
This was one of the main ideas I discussed in Alignment allows “nonrobust” decision-influences and doesn’t require robust grading:
This suggests “and so what is an ‘adversarial input’ to the values, then? What intensional rule governs the kinds of high-scoring plans which internal reasoning will decide to not evaluate in full?”. I haven’t answered that question yet on an intensional basis, but it seems tractable.
Not crucial on my model.
I’m imagining us watching the agent and seeing whether it approaches an object or not. Those are the “labels.” I’m imagining this taking place between 50 and 1000 times. Before seeing this comment, I edited the post to add:
So, probably I shouldn’t have written “perfectly”, since that isn’t actually load-bearing on my model. I think that there’s a rather smooth relationship between “how good you are at labelling” and “the strength of desired value you get out” (with a few discontinuities at the low end, where perhaps a sufficiently weak shard ends up non-reflective, or not plugging into the planning API, or nonexistent at all). On that model, I don’t really understand the following:
The agent already has the diamond abstraction from SSL+IL, but not the labelling process (due to IID training, and it having never seen our “labelling” before—in the sense of us watching it approach the objects in real time). And this is very early in the RL training, at the very beginning. So why would the agent learn the labelling abstraction during the labelling and hook that in to decision-making, in the batch PG updates, instead of just hooking in the diamond abstraction it already has? (Edit: I discussed this a bit in this footnote.)
I agree that “diamond synthesis” is not directly rewarded, and if we wanted to ensure that happens, we could add that to the curriculum, as you note. But I think it would probably happen anyways, due to the expected-by-me “grabby” nature of the acquire-subshard. (Consider that I think it’d be cool to make Dyson swarms, but I’ve never been rewarded for making Dyson swarms.) So maybe the crux here is that I don’t yet share your doubt of the acquisition-shard.
I think that “are we directly rewarding the behavior which we want the desired shards to exemplify?” is a reasonable heuristic. I think that “What happens if the agent optimizes its reward function?” is not a reasonable heuristic.
I think there’s a few different errors in this reasoning.
First: the agent probably has the concept of diamond from SSL+IL, but that’s different from concepts like producing diamond, approaching diamond (which in turn requires a self-concept or at least a concept of the avatar it’s controlling), etc. During training, those sorts of more-complex concepts are probably built up out of their components (like e.g. “production” and “diamond”); the actual goals or behaviors encoded in a shard have to be built up in whatever “internal language” the agent has from the SSL/IL training.
So the question isn’t “does the agent have the concept of diamond/label?”, the question is how short the relevant “sentences” are in terms of the concepts it has. Neither will be just one “word”.
Second: as with Quintin’s comment, the AI does not need to fully model the entire labelling process in order for this problem to apply. If there’s any simple, predictable pattern to the humans’ label-errors (which of course there usually is in practice), then the AI can pick that up. (It’s not just a question of hand-slips; humans make systematic errors which will strongly activate shards very similar to the intended shards.)
So the question isn’t “is the entire labelling process a short ‘sentence’ in the AI’s internal language?” (though even that is not implausible), but rather “do any systematic errors in the labelling process have a short ‘sentence’ in the AI’s internal language?”.
Now put those two together. The intended shards are quite a bit more complicated than you suggested, because they don’t just depend on the concept of “diamond”, they depend on constructing a bunch of other concepts about what to do involving diamonds. And the unintended shards are quite a bit less complicated than you suggested, because they can exploit simple systematic errors in the labels.
I think I have a complaint like “You seem to be comparing to a ‘perfect’ reward function, and lamenting how we will deviate from that. But in the absence of inner/outer alignment, that doesn’t make sense. A good reward schedule will put diamond-aligned cognition in the agent. It seems like, for you to be saying there’s a ‘fatal’ flaw here due to ‘errors’, you need to make an argument about the cognition which trains into the agent, and how the AI’s cognition-formation behaves differently in the presence of ‘errors’ compared to in the absence of ‘errors.’ And I don’t presently see that story in your comments thus far. I don’t understand what ‘perfect labeling’ is the thing to talk about, here, or why it would ensure your shard-formation counterarguments don’t hold.”
(Will come by for lunch and so we can probably have a higher-context discussion about this! :) )
I think this is close to our most core crux.
It seems to me that there are a bunch of standard arguments which you are ignoring because they’re formulated in an old frame that you’re trying to avoid. And those arguments in fact carry over just fine to your new frame if you put a little effort into thinking about the translation, but you’ve instead thrown the baby out with the bathwater without actually trying to make the arguments work in your new frame.
Like, if I have a reward signal that rewards X, then the old frame would say “alright, so the agent will optimize for X”. And you’re like “nope, that whole form of argument is invalid, hit ignore button”. But in fact it is usually very easy to take that argument and unpack it into something like “X has a short description in terms of natural abstractions, so starting from a base model and giving a feedback signal we should rapidly see some X-shards show up, and then the shards which best match X will be reinforced to exponentially higher weight (with respect to the bit-divergence between their proxy X’ and the actual X)”. And it seems like you are not even attempting to perform that translation, which I find very frustrating because I’m pretty sure you know this stuff plenty well to do it.
I agree that we may need to be quite skillful in providing “good”/carefully considered reward signals on the data distribution actually fed to the AI. (I also think it’s possible we have substantial degrees of freedom there.) In this sense, we might need to give “robustly” good feedback.
However, one intuition which I hadn’t properly communicated was: to make OP’s story go well, we don’t need e.g. an outer objective which robustly grades every plan or sequence of events the AI could imagine, such that optimizing that objective globally produces good results. This isn’t just good reward signals on data distribution (e.g. real vs fake diamonds), this is non-upwards-error reward signals in all AI-imaginable situations, which seems thoroughly doomed to me. And this story avoids at least that problem, which I am relieved by. (And my current guess is that this “robust grading” problem doesn’t just reappear elsewhere, although I think there are still a range of other difficult problems remaining. See also my post Alignment allows “nonrobust” decision-influences and doesn’t require robust grading.)
And so I might have been saying “Hey isn’t this cool we can avoid the worst parts of Goodhart by exiting outer/inner as a frame” while thinking of the above intuition (but not communicating it explicitly, because I didn’t have sufficient clarity as yet). But maybe you reacted “??? how does this avoid the need to reliably grade on-distribution situations, it’s totally nontrivial to do that and it seems quite probable that we have to.” Both seem true to me!
(I’m not saying this was the whole of our disagreement, but it seems like a relevant guess.)
When I first read this comment, I incorrectly understood it to say something like “If you were actually trying, you’d have generated the exponential error model on your own; the fact that you didn’t shows that you aren’t properly thinking about old arguments.” I now don’t think that’s what you meant. I think I finally[1] understand what you did mean, and I think you misunderstood what my original comment was trying to say because I wrote poorly and stream-of-consciousness.
Most importantly, I wasn’t saying something like “‘errors’ can’t exist because outer/inner alignment isn’t my frame, ignore.” I meant to communicate the following points:
I don’t know what a “perfect” reward function is in the absence of outer alignment, else I would know how to solve diamond alignment. But I’m happy to just discuss deviations from a proposed labelling scheme. (This is probably what we were already discussing, so this wasn’t meant to be a devastating rejoinder or anything.)
I’m not sure what you mean by the “exponential” model you mentioned elsewhere, or why it would be a fatal flaw if true. Please say more? (Hopefully in a way which makes it clear why your argument behaves differently in the presence of errors, because that would be one way to make your arguments especially legible to how I’m currently thinking about the situation.)
Given my best guess at your model (the exponential error model), I think your original comment seems too optimistic about my specific story (sure seems like exponential weighting would probably just break it, label errors or no) but too pessimistic about the story template (why is it a fatal flaw that can’t be fixed with a bit of additional thinking?).
I meant to ask something like “I don’t fully understand what you’re arguing re: points 1 and 2 (but I have some guesses), and think I disagree about 3; please clarify?” But instead (e.g. by saying things like “my complaint is...”) I perhaps communicated something like “because I don’t understand 2 in my native frame, your argument sucks.” And you were like “Come on, you didn’t even try, you could have totally translated 2. Worrying that you apparently didn’t.”
I think that I left an off-the-cuff comment which might have been appropriate as a Discord message (with real-time clarification), but not as a LessWrong comment. Oops.
Elaborating points 1 and 3 above:
Point 1. In outer/inner, if you “perfectly label” reward events based on whether the agent approaches the diamond, you’re “done” as far as the outer alignment part goes. In order to make the agent actually care about approaching diamonds, we would then turn to inner alignment techniques / ideas. It might make sense to call this labelling “perfect” as far as specifying the outer objective for those scenarios (e.g. when that objective is optimized, the agent actually approaches the diamond).
But if we aren’t aiming for outer/inner alignment, and instead are just considering the (reward schedule) → (inner value composition) mapping, then I worry that my post’s original usage of “perfect” was misleading. On my current frame, a perfect reward schedule would be one which actually gets diamond-values into the agent. The schedule I posted is probably not the best way to do that, even if all goes as planned. I want to be careful not to assume the “perfection” of “+1 when it does in fact approach a real diamond which it can see”, even if I can’t currently point to better alternative reward schedules (e.g. “+x reward in some weird situation”). (This is what I was getting at with “I don’t understand what ‘perfect labeling’ is the thing to talk about, here.”)
What you probably meant by “errors” was “divergences from the reward function outlined in the original post.” This is totally reasonable and important to talk about, but at least I want to clarify for myself and other readers that this is what we’re talking about, and not assuming that my intended reward function was actually “perfect.” (Probably it’s fine to keep talking about “perfect labelling” as long as this point has been made explicit.)
Point 3. Under my best guess of what you mean (which did end up being roughly right, about the exponential bit-divergence), I think your original comment seemed too optimistic about the original story going well given “perfect” labelling. This is one thing I meant by “I don’t understand why ‘perfect labeling’ would ensure your shard-formation counterarguments don’t hold.”
If the situation value-distribution is actually exponential in bit-divergence, I’d expect way less wiggle room on value shard formation, because that’s going to mean that way more situations are controlled by relatively few subshards (or maybe even just one). Possibly the agent just ends up with fewer terms in its reflectively stable utility function, because fewer shards/subshards activate during the values handshake. (But I’m tentative about all this, haven’t sketched out a concrete failure scenario yet given exponential model! Just a hunch I remember having.)
Again, it was very silly of me to expect my original comment to communicate these points. At the time of writing, I was trying to unpack some promising-seeming feelings and elaborate them over lunch.
My original guess at your complaint was “How could you possibly have not generated the exponential weight hypothesis on your own?”, and I was like what the heck, it’s a hypothesis, sure… but why should I have pinned down that one? What’s wrong with my “linear in error proportion for that kind of situation, exponential in ontology-distance at time of update” hypothesis, why doesn’t that count as a thing-to-have-generated? This was a big part of why I was initially so confused about your complaint.
And then several people said they thought your comment was importantly correct-seeming, and I was like “no way, how can everyone else already have such a developed opinion on exponential vs linear vs something-else here? Surely this is their first time considering the question? Why am I getting flak about not generating that particular hypothesis, how does that prove I’m ‘not trying’ in some important way?”
To be clear, I don’t think the exponential asymptotics specifically are obvious (sorry for implying that), but I also don’t think they’re all that load-bearing here. I intended more to gesture at the general cluster of reasons to expect “reward for proxy, get an agent which cares about the proxy”; there’s lots of different sets of conditions any of which would be sufficient for that result. Maybe we just train the agent for a long time with a wide variety of data. Maybe it turns out that SGD is surprisingly efficient, and usually finds a global optimum, so shards which don’t perfectly fit the proxy die. Maybe the proxy is a more natural abstraction than the thing it was proxying for, and the dynamics between shards competing at decision-time are winner-take-all. Maybe dynamics between shards are winner-take-all for some other reason, and a shard which captures the proxy will always have at least a small selective advantage. Etc.
It sounds like the difference between one or a few shards dominating each decision, vs a large ensemble, is very central and cruxy to you. And I still don’t see why that matters, so maybe that’s the main place to focus.
You gestured at some intuitions about that in this comment (which I’m copying below to avoid scrolling to different parts of the thread-tree), and I’d be interested to see more of those intuitions extracted.
I have multiple different disagreements with this, and I’m not sure which are relevant yet, so I’ll briefly state a few:
For the coffee/bitter enemies thing, this doesn’t seem to me like a phenomenon which has anything to do with shards, it’s just a matter of type-signatures. A person who “likes coffee” likes to drink coffee; they don’t particularly want to fill the universe with coffee, they don’t particularly care whether anyone else likes to drink coffee (and nobody else cares whether they like to drink coffee) so there’s not really much reason for that preference to generate conflict. It’s not a disagreement over what-the-world-should-look-like; that’s not the type-signature of the preference.
Embedded, reflective heuristic search is not incompatible with argmaxing over one (approximate, implicit) value function; it’s just a particular family of distributed algorithms for argmaxing.
It seems like, in humans, removing a single subshard does catastrophically exit the regime of value. For instance, there’s Eliezer’s argument from the sequences that just removing boredom results in a dystopia.
The extremely basic intuition is that all else equal, the more interests present at a bargaining table, the greater the chance that some of the interests are aligned.
My values are also risk-averse (I’d much rather take a 100% chance of 10% of the lightcone than a 20% chance of 100% of the lightcone), and my best guess is that internal values handshakes are ~linear in “shard strength” after some cutoff where the shards are at all reflectively endorsed (my avoid-spiders shard might not appreciably shape my final reflectively stable values). So more subshards seems like great news to me, all else equal, with more shard variety increasing the probability that part of the system is motivated the way I want it to be.
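As a small worked example of the risk-aversion claim above, the sketch below compares the two lotteries under a linear utility and under a concave one. The square-root utility is an assumption chosen only because it is a simple concave function; nothing in the comment specifies a particular utility, and the names here are hypothetical.

```python
import math

def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (probability, outcome) pairs."""
    return sum(p * utility(x) for p, x in lottery)

# Outcomes are fractions of the lightcone.
sure_thing = [(1.0, 0.10)]           # 100% chance of 10%
gamble = [(0.2, 1.00), (0.8, 0.0)]   # 20% chance of 100%, else nothing

def linear(x):
    return x

concave = math.sqrt  # assumed risk-averse utility, for illustration only

for name, u in [("linear", linear), ("concave", concave)]:
    print(name, expected_utility(sure_thing, u), expected_utility(gamble, u))
```

With linear utility the gamble has the higher expected value, while under the assumed concave utility the sure 10% is preferred, which matches the stated preference.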
(This isn’t fully expressing my intuition, here, but I figured I’d say at least a little something to your comment right now)
I’m not going to go into most of the rest now, but:
I think that that does have to do with shards. Liking to drink coffee is the result of a shard, of a contextual influence on decision-making (the influence to drink coffee), and in particular activates in certain situations to pull me into a future in which I drank coffee.
I’m also fine considering “A person who is OK with other people drinking coffee” and anti-C: “a person with otherwise the same values but who isn’t OK with other people drinking coffee.” I think that the latter would inconvenience the former (to the extent that coffee was important to the former), but that they wouldn’t become bitter enemies, that anti-C wouldn’t kill the pro-coffee person because the value function was imperfectly aligned, that the pro-coffee person would still derive substantial value from that universe.
Possibly the anti-coffee value would even be squashed by the rest of anti-C’s values, because the anti-coffee value wasn’t reflectively endorsed by the rest of anti-C’s values. That’s another way in which I think anti-C can be “close enough” and things work out fine.
EDIT 2: The original comment was too harsh. I’ve struck the original below. Here is what I think I should have said:
I think you raise a valuable object-level point here, which I haven’t yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I’d appreciate if you wouldn’t speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts.
Warning: This comment, and your previous comment, violate my comment section guidelines: “Reign of terror // Be charitable.” You have made and publicly stated a range of unnecessary, unkind, and untrue inferences about my thinking process. You have also made non-obvious-to-me claims of questionable-to-me truth value, which you also treat as exceedingly obvious. Please edit these two comments to conform to my civility guidelines.
(EDIT: Thanks. I look forward to resuming object-level discussion!)
After more reflection, I now think that this moderation comment was too harsh. First, the parts I think I should have done differently:
Realized that who reads commenting guidelines anyways, let alone expects them to be enforced?
Realized that it’s probably ambiguous what counts as “charitable” or not, even though (illusion of transparency) it felt so obvious to me that this counted as “not that.”
Realized that predictably I would later consider the incident to be less upsetting than in the moment, and that John may not have been aware that I find this kind of situation unusually upsetting.
Therefore, I should have said something like “I think you raise a valuable object-level point here, which I haven’t yet made up my mind on. That said, I think this meta-level commentary is unpleasant and mostly wrong. I’d appreciate if you wouldn’t speculate on my thought process like that, and would appreciate if you could edit the tone-relevant parts.”
I’m striking the original warning, putting in (4), and I encourage John to unredact his comments (but that’s up to him).
I’ve thought more about what my policy should be going forward. What kind of space do I want my comment section to be? First, I want to be able to say “This seems wrong, and here’s why”, and other people can say the same back to me, and one or more of us can end up at the truth faster. Second, it’s also important that people know that, going forward, engaging with me in (what feels to them like) good-faith will not be randomly slapped with a moderation warning because they annoyed me.
Third, I want to feel comfortable in my interactions in my comment section. My current plan is:
If someone comments something which feels personally uncharitable to me (a rather rare occurrence, what with the hundreds of comments in the last year since this kind of situation last happened), then I’ll privately message them, explain my guidelines, and ask that they tweak tone / write more on the object-level / not do the particular thing.[1]
If necessary, I’ll also write a soft-ask (like (4) above) as a comment.
In cases where this is just getting ignored and the person is being antagonistic, I will indeed post a starker warning and then possibly just delete comments.
I had spoken with John privately before posting the warning comment. I think my main mistake was jumping to (3) instead of doing more of (1) and (2).
Oh, huh, I think this moderation action makes me substantially less likely to comment further on your posts, FWIW. It’s currently well within your rights to do so, and I am on the margin excited about more people moderating things, but I feel hesitant participating with the current level of norm-specification + enforcement.
I also turned my strong-upvote into a small-upvote, since I have less trust in the comment section surfacing counterarguments, which feels particularly sad for this post (e.g. I was planning to respond to your comment with examples of past arguments this post is ignoring, but am now unlikely to do so).
Again, I think that’s fine, but I think posts with idiosyncratic norm enforcement should get less exposure, or at least not be canonical references. Historically we’ve decided to not put posts on frontpage when they had particularly idiosyncratic norm enforcement. I think that’s the wrong call here, but not confident.
Sorry, I’m confused; for my own education, can you explain why these civility guidelines aren’t epistemically suicidal? Personally, I want people like John Wentworth to comment on my posts to tell me their inferences about my thinking process; moreover, controlling for quality, “unkind” inferences are better, because I learn more from people telling me what I’m doing wrong, than from people telling me what I’m already doing right. What am I missing? Please be unkind.
This is already my model and was intended as part of my communicated reasoning. Why do you think it’s an error in my reasoning? You’ll notice I argued “If diamond”, and about hooking that diamond predicate into its approach-subroutines (learned via IL). (ETA: I don’t think you need a self-model to approach a diamond, or to “value” that in the appropriate sense. To value diamonds being near you, you can have representations of the space nearby, so you need a nearby representation, perhaps.)
I think this is not the right term to use, and I think it might be skewing your analysis. This is not a supervised learning regime with exact gradients towards a fixed label. The question is what gets upweighted by the batch PG gradients, batching over the reward events. Let me exaggerate the kind of “error rates” I think you’re anticipating:
Suppose I hit the reward 99% of the time for cut gems, and 90% of the time for uncut gems.
What’s supposed to go wrong? The agent somewhat more strongly steers towards cut gems?
Suppose I’m grumpy for the first 5 minutes and only hit the reward button 95% as often as I should otherwise. What’s supposed to happen next?
(If these errors aren’t representative, can you please provide a concrete and plausible scenario?)
Both of these examples are focused on one error type: the agent does not receive a reward in a situation which we like. That error type is, in general, not very dangerous.
The error type which is dangerous is for an agent to receive a reward in a situation which we don’t like. For instance, receiving reward in a situation involving a convincing-looking fake diamond. And then a shard which hooks up its behavior to things-which-look-like-diamonds (which is probably at least as natural an abstraction as diamond) gets more weight relative to the diamond-shard, and so when those two shards disagree later the things-which-look-like-diamonds shard wins.
Note that it would not be at all surprising for the AI to have a prior concept of real-diamonds-or-fake-diamonds-which-are-good-enough-to-fool-most-humans, because that is a cluster of stuff which behaves similarly in many places in the real world—e.g. they’re both used for similar jewelry.
And sure, you try to kinda patch that by including some correctly-labelled things-which-look-like-diamonds in training, but that only works insofar as they’re sufficiently-obviously-not-diamond that the human labeller can tell (and depends on the ratio of correct to incorrect labels, etc).
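The asymmetry between the two error types can be sketched with a toy calculation. It assumes two hypothetical shards, one that approaches only real diamonds and one that approaches anything that looks like a diamond, and compares their expected reward per episode. All probabilities and names are made up for illustration, and how a reward gap translates into shard weight depends on training dynamics this sketch does not model.

```python
def expected_reward(approaches_fake, p_reward_real, p_reward_fake, frac_fake):
    """Expected reward per episode for a shard that always approaches real
    diamonds and approaches fakes only if `approaches_fake` is True."""
    reward_from_real = (1 - frac_fake) * p_reward_real
    reward_from_fake = frac_fake * p_reward_fake if approaches_fake else 0.0
    return reward_from_real + reward_from_fake

def lookalike_advantage(p_reward_real, p_reward_fake, frac_fake=0.1):
    """Extra expected reward of the looks-like-diamond shard over the diamond shard."""
    diamond = expected_reward(False, p_reward_real, p_reward_fake, frac_fake)
    lookalike = expected_reward(True, p_reward_real, p_reward_fake, frac_fake)
    return lookalike - diamond

# Missed rewards only: real diamonds are rewarded 90% of the time, fakes never.
missed_only = lookalike_advantage(p_reward_real=0.9, p_reward_fake=0.0)
# Spurious rewards: the labeller is fooled by half of the convincing fakes.
spurious = lookalike_advantage(p_reward_real=1.0, p_reward_fake=0.5)
print(missed_only, spurious)
```

In this toy setup, missed rewards lower both shards' reward equally and so give the lookalike shard no edge, while spurious rewards on fakes give it a strictly positive one.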
(Also, some moderately uncharitable psychologizing, and I apologize if it’s wrong: I find it suspicious that the examples of label errors you generated are both of the non-dangerous type. This is a place where I’d expect you to already have some intuition for what kind of errors are the dangerous ones, especially when you put on e.g. your Eliezer hat. That smells like a motivated search, or at least a failure to actually try to look for the problems with your argument.)
I want to talk about several points related to this topic. I don’t mean to claim that you were making points directly related to all of the below bullet points. This just seems like a good time to look back and assess and see what’s going on for me internally, here. This seems like the obvious spot to leave the analysis.
At the time of writing, I wasn’t particularly worried about the errors you brought up.
I am a little more worried now in expectation, both under the currently low-credence worlds where I end up agreeing with your exponential argument, and in the ~linear hypothesis worlds, since I think I can still search harder for worrying examples which IMO neither of us have yet proposed. Therefore I’ll just get a little more pessimistic immediately, in the latter case.
If I had been way more worried about “reward behavior we should have penalized”, I would have indeed just been less likely to raise the more worrying failure points, but not super less likely. I do assess myself as flawed, here, but not as that flawed.
I think the typical outcome would be something like “TurnTrout starts typing a list full of weak flaws, notices a twinge of motivated reasoning, has half a minute of internal struggle and then types out the more worrisome errors, and, after a little more internal conflict, says that John has a good point and that he wants to think about it more.”
I could definitely buy that I wouldn’t be that virtuous, though, and that I would need a bit of external nudging to consider the errors, or else a few more days on my own for the issue to get raised to cognitive-housekeeping. After that happened a few times, I’d notice the overall problem and come up with a plan to fix it.
Obviously, I have at this point noticed (at least) my counterfactual mistake in the nearby world where I already agreed with you, and therefore have a plan to fix and remove that flaw.
I think you are right in guessing that I could use more outer/inner heuristics to my advantage, that I am missing a few tools on my belt. Thanks for pointing that out.
I don’t think that motivated cognition has caused me to catastrophically miss key considerations from e.g. “standard arguments” in a way which has predictably doomed key parts of my reasoning.
Why I think this: I’ve spent a little while thinking about what the catastrophic error would be, conditional on it existing, and nothing’s coming up for the moment.
I’d more expect there to be some sequence of slight ways I ignored important clues that other people gave, and where I motivatedly underupdated. But also this is a pretty general failure mode, and I think it’d be pretty silly to call a halt without any positive internal evidence that I actually have done this. (EDIT: In a specific situation which I remember and can correct, as opposed to having a vague sense that yeah I’ve probably done this several times in the last few months. I’ll just keep an eye out.)
Rather, I think that if I spend three or so days typing up a document, and someone like John Wentworth thinks carefully about it, then that person will surface at least a few considerations I’d missed, more probably using tools not native to my current frame.
I think a lot of the “Why didn’t you realize the ‘reward for proxy, get an agent which cares about the proxy’?” part is just that John and I just seem to have very different models of SGD dynamics, and that if I had his model, the reasoning which produced the post would have also produced the failure modes John has hypothesized.
This feels “fine” in that that’s part of the point of sharing my ideas with other people—that smart people will surface new considerations or arguments. This feels “not fine” in the sense that I’d like to not miss considerations, of course.
This also feels “fine” in that, yes, I wanted to get this essay out before never arrives, and usually I take too long to hit “publish”, and I’m still very happy with the essay overall. I’m fine with other people finding new considerations (e.g. the direct reward for diamond synthesis, or zooming in on how much perfect labelling is required).
If it turns out there was some crucial existing argument which I did miss, I think I’ll go “huh” but not really be like “wow that hovered at the edge of my cognition but I denied it for motivated reasons.”
I am way more worried about how much of my daily cognition is still socially motivated, and I do consider that to be a “stop drop and roll”-level fuckup on my part.
I think there’s not just now-obvious things here like “I get very defensive in public settings in specific situations”, but a range of situations in which I subconsciously aim to persuade or justify my positions, instead of just explaining what I think and why, what I disagree with and why; that some subconscious parts of me look for ways to look good or win an argument; that I have rather low trust in certain ways and that makes it hard for me sometimes; etc.
I think that I am above-average here, but I have very high standards for myself and consider my current skill in this area to be very inadequate.
For the record: I welcome well-meaning private feedback on what I might be biased about or messing up. On the other hand, having the feedback be public just pushes some of my buttons in a way which makes the situation hard for me to handle. I aspire for this not to be the case about me. That aspiration is not yet realized.
I’ve worked hard to make this analysis honest and not optimized to make me look good or less silly. Probably I’ve still failed at least a little. Possibly I’ve missed something important. But this is what I’ve got.
Kudos for writing all that out. Part of the reason I left that comment in the first place was because I thought “it’s Turner, if he’s actually motivatedly cognitating here he’ll notice once it’s pointed out”. (And, corollary: since you have the skill to notice when you are motivatedly cognitating, I believe you if you say you aren’t. For most people, I do not consider their claims about motivatedness of their own cognition to be much evidence one way or the other.) I do have a fairly high opinion of your skills in that department.
Fair point, that part of my comment probably should have been private. Mea culpa for that.
This doesn’t seem dangerous to me. So the agent values both, and there was an event which differentially strengthened the looks-like-diamond shard (assuming the agent could tell the difference at a visual remove, during training), but there are lots of other reward events, many of which won’t really involve that shard (like video games where the agent collects diamonds, or text rpgs where the agent quests for lots of diamonds). (I’m not adding these now, I was imagining this kind of curriculum before, to be clear—see the “game” shard.)
So maybe there’s a shard with predicates like “would be sensory-perceived by naive people to be a diamond” that gets hit by all of these, but I expect that shard to be relatively gradient starved and relatively complex in the requisite way → not a very substantial update. Not sure why that’s a big problem.
But I’ll think more and see if I can’t salvage your argument in some form.
I found this annoying.
Not the OP but this jumped out at me:
This failure mode seems plausible to me, but I can think of a few different plausible sequences of events that might occur, which would lead to different outcomes, at least in the shard lens.
Sequence 1:
The agent develops diamond-shard
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent exploits the gaps between the diamond-concept and the label-process-concept, which reinforces the label-process-shard within it
The label-process-shard drives the agent to continue exploiting the above gap, eventually (and maybe rapidly) overtaking the diamond-shard
So the agent’s values drift away from what we intended.
Sequence 2:
The agent develops diamond-shard
The diamond-shard becomes part of the agent’s endorsed preferences (the goal-content it foresightedly plans to preserve)
The agent develops an internal representation of the training process it is embedded in, including how labels are imperfectly assigned
The agent understands that if it exploited the gaps between the diamond-concept and the label-process-concept, it would be reinforced into developing a label-process-shard that would go against its endorsed preference for diamonds (i.e. its diamond-shard), so it chooses not to exploit that gap, in order to avoid value drift.
So the agent continues to value diamonds in spite of the imperfect labeling process.
These different sequences of events would seem to lead to different conclusions about whether imperfections in the labeling process are fatal.
Yup, that’s a valid argument. Though I’d expect that gradient hacking to the point of controlling the reinforcement on one’s own shards is a very advanced capability with very weak reinforcement, and would therefore come much later in training than picking up on the actual labelling process (which seems simpler and has much more direct and strong reinforcement).
I expect some form of gradient hacking to be convergently learned much earlier than the details of the labeling process. Online SSL incentivizes the agent to model its own shard activations (so it can better predict future data) and the concept of human value drift (“addiction”) is likely accessible from pretraining in the same way “diamond” is.
On the other hand, the agent has little information about the labeling process, I expect it to be more complicated, and not have the convergent benefits of predicting future behavior that reflectivity has.
(You could even argue human error is good here, if it correlates more strongly with the human “diamond” abstraction the agent has from pretraining. This probably doesn’t extend to the “human values” case we care about, but I thought I’d mention it as an interesting thought.)
(agreed, for the record. I do think the agent can gradient starve the label-shard in story 2, though, without fancy reflective capability.)
Possibly. Though I think it is extremely easy in a context like this. Keeping the diamond-shard in the driver’s seat mostly requires the agent to keep doing the things it was already doing (pursuing diamonds because it wants diamonds), rather than making radical changes to its policy.