“When LessWrong was ~dead”
Which year are you referring to here?
A lot of people in AI Alignment I’ve talked to have found it pretty hard to have clear thoughts in the current social environment, and many of them have reported that getting out of Berkeley, or getting social distance from the core of the community has made them produce better thoughts.
What do you think is the mechanism behind this?
There is a general phenomenon where:
Person A has mental model X and tries to explain X with explanation Q
Person B doesn’t get model X from Q, thinks a bit, and then writes explanation P, reads P and thinks: P is how it should have been explained all along, and Q didn’t actually contain the insights, but P does.
Person C doesn’t get model X from P, thinks a bit, and then writes explanation R, reads R and thinks: …
It seems to me quite likely that you are person B, thinking you explained something because YOU think your explanation is very good and contains all the insights that the previous ones didn’t. Some of the evidence for this is in fact contained in your very comment:
“1. Pointing out the “reward chisels computation” point. 2. Having some people tell me it’s obvious, or already known, or that they already invented it. 3. Seeing some of the same people continue making similar mistakes (according to me)”
So point 3 all but proves that your mental model was not conveyed to those people by your post, does it not? I think a similar thing happened where that mental model was not conveyed to you from RFLO (Risks from Learned Optimization), even though we tried to convey it. (btw not saying the models that RFLO tried to explain are the same as those in this post, but the basic idea of this post definitely is a part of RFLO).
BTW, it could in fact be that person B’s explanation is clearer. (otoh, I think some things are less clear, e.g. you talk about “the” optimization target, which I would say refers to that of the mesa-optimizer, without clearly assuming there is a mesa-optimizer. We introduced the terms mesa-optimizer and base-optimizer precisely to make that distinction clear. There are a bunch of other things that I think are just imprecise, but let’s not get into it).
“Continuing (AFAICT) to correct people on (what I claim to be) mistakes around reward and optimization targets, and (for a while) was ~the only one doing so.”
I have been correcting people for a while on stuff like that (though not on LW, I’m not often on LW), such as that in the generic case we shouldn’t expect wireheading from RL agents unless the option of wireheading is in the training environment, for basically these reasons. I would also have expected people to just get this after reading RFLO, but many didn’t (others did), so your points 1/2/3 also apply to me.
“I do totally buy that you all had good implicit models of the reward-chiseling point”. I don’t think we modeled it merely “implicitly”: we understood it very explicitly, and it ran through our whole thinking about the topic. Again, explaining stuff is hard though; I’m not claiming we conveyed everything well to everyone (clearly you haven’t either).
Very late reply, sorry.
“even though reward is not a kind of objective”, this is a terminological issue. In my view, calling an “antecedent-computation reinforcement criterion” an “objective” matches my definition of “objective”. The term “objective” is ill-defined enough that “reward is not a kind of objective” is a claim about terminology, not a claim about math or the world.
The idea that RL agents “reinforce antecedent computations” is completely core to our story of deception. You could not make sense of our argument for deception if you didn’t look at RL systems in this way. Viewing the base optimizer as “trying” to achieve an “objective” but “failing” because it is being “deceived” by the mesa optimizer is purely a metaphorical/terminological choice. It doesn’t negate the fact that we all understood that the base optimizer is just reinforcing “antecedent computations”. How else could you make sense of the story of deception, where an existing model, which represents the mesa optimizer, is being reinforced by the base optimizer because that existing model understands the base optimizer’s optimization process?
I am not claiming that the RFLO communicated this point well, just that it was understood and absolutely was core to the paper, and large parts of the paper wouldn’t even make sense if you didn’t have this insight. (Certainly the fact that we called it an objective doesn’t communicate the point, and it isn’t meant to).
The core point in this post is obviously correct, and yes, people’s thinking is muddled if they don’t take this into account. This point is core to the Risks from Learned Optimization paper (so it’s not exactly new, but it’s good if it’s explained in different/better ways).
Maybe you have made a gestalt-switch I haven’t made yet, or maybe yours is a better way to communicate the same thing, but: the way I think of it is that the reward function is just a function from states to numbers, and the way the information contained in the reward function affects the model parameters is via reinforcement of pre-existing computations.
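To make that mechanistic picture concrete, here is a minimal sketch, assuming a vanilla REINFORCE-style setup (my own illustration, not anyone’s actual training code): the reward only ever enters as a scalar that scales the gradient of the log-probability of actions the policy already took, i.e. it reinforces the antecedent computations that produced them.

```python
import torch

def reinforce_update(policy, optimizer, states, actions, rewards):
    """One REINFORCE-style update. `policy(state)` is assumed to return a
    torch.distributions.Distribution over actions; `rewards` is a 1-D tensor."""
    # Reward-to-go at each timestep: the only place the reward function's
    # output ever enters the computation.
    returns = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])
    loss = 0.0
    for state, action, ret in zip(states, actions, returns):
        log_prob = policy(state).log_prob(action)
        # `ret` is just a number here: it sets how strongly the computation
        # that already produced `action` gets reinforced. Nothing in the
        # update "looks at" the reward function as a goal to be pursued.
        loss = loss - ret.detach() * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

On this picture, the two statements below differ only in what label we attach to that scalar.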
Is there a difference between saying:
A reward function is an objective function, but the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
A reward function is not an objective function, and the only way that it affects behaviour is via reinforcement of pre-existing computations in the model, and it doesn’t actually encode in any way the “goal” of the model itself.
It seems to me that once you acknowledge the point about reinforcement, the additional statement that reward is not an objective doesn’t actually imply anything further about the mechanistic properties of deep reinforcement learners? It is just a way to put a high-level conceptual story on top of it, and in this sense it seems to me that this point is already known (and in particular, contained within RFLO), even though we still talked of the base objective as an “objective”.
However, it might be that while RFLO pointed out the same mechanistic understanding that you have in mind, calling it an objective tends in practice not to fully communicate that mechanistic understanding.
Or it might be that I am really not yet understanding that there is an actual difference in mechanistic understanding, or that my intuitions are still being misled by the wrong high-level concept even if I have the lower-level mechanistic understanding right.
(On the other hand, one reason to still call it an objective is that we really can think of the selection process, i.e. evolution/the learning algorithm of an RL agent, as having an objective but making imperfect choices, or we can think of the training objective as encoding a task that humans have in mind).
It seems to me that the basic conceptual point made in this post is entirely contained in our Risks from Learned Optimization paper. I might just be missing a point. You’ve certainly phrased things differently and made some specific points that we didn’t, but am I just misunderstanding something if I think the basic conceptual claims of this post (which seem to be presented as new) are implied by RFLO? If not, could you state briefly what is different?
(Note I am still surprised sometimes that people think certain wireheading scenarios make sense despite having read RFLO, so it’s plausible to me that we really didn’t communicate everything that’s in my head about this).
I agree this is a good distinction.
“I think in the defense-offense case the actions available to both sides are approximately the same”
If the attacker has the action “cause a 100% lethal global pandemic” and the defender has the task “prevent a 100% lethal global pandemic”, then clearly these are different problems, and it is a thesis, a thing to be argued for, that the latter requires largely the same skills/tech as the former (which is what this offense-defense symmetry thesis states).
If you build an OS that you’re trying to make safe against attacks, you might do e.g. what the seL4 microkernel team did and formally verify the OS to rule out large classes of attacks, and this is an entirely different kind of action than “find a vulnerability in the OS and develop an exploit to take control over it”.
“I wouldn’t say the strategy-stealing assumption is about a symmetric game”
Just to point out that the original strategy-stealing argument assumes literal symmetry. I think the argument only works insofar as generalizing away from literal symmetry (to e.g. something more like linearity of the benefit of initial resources) doesn’t break it. I think you actually need something like symmetry in both the instrumental goals and the “initial-resources-to-output map”.
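To gesture at the kind of condition I have in mind, here is a toy formalization of my own (not something from the original argument): suppose each party starts with some resources, both have access to the same strategies, and long-run influence is given by a common initial-resources-to-output map $f$.

```latex
% Toy formalization (my own): party $i$ starts with resources $r_i$ and ends
% with influence $u_i = f(r_i)$ for a common map $f$. If $f$ is
% (approximately) linear, a party holding a fraction $\alpha$ of total
% resources that copies the other party's strategy keeps roughly that
% fraction of total influence:
\[
  \frac{u_H}{u_H + u_A}
  = \frac{f(\alpha R)}{f(\alpha R) + f\big((1-\alpha)R\big)}
  \approx \alpha,
  \qquad R = r_H + r_A,\quad \alpha = \frac{r_H}{R}.
\]
% If $f$ is strongly convex (e.g. winner-take-all dynamics), copying the
% strategy no longer preserves the resource fraction, and the argument breaks.
```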
The strategy-stealing argument as applied to defense-offense would say something like “whatever offense does to increase its resources / power is something that defense could also do to increase resources / power”.
Yes, but this is almost the opposite of what the offense-defense symmetry thesis is saying, because it can simultaneously be true that 1. the defender can steal the attacker’s strategies, AND 2. the defender alternatively has a bunch of much easier strategies available, by which it can defend against the attacker and keep all the resources.
This offense-defense symmetry (OD-symmetry) thesis says that 2 is NOT true, because all such strategies in fact also require the same kind of skills. The point of the OD-symmetry thesis is to make more explicit the argument that humans cannot defend against misaligned AI without their own aligned AI.
“This isn’t the same as your thesis.”
Ok I only read this after writing all of the above, so I thought you were implying they were the same (and was confused as to why you would imply this), and I’m guessing you actually just meant to say “these things are sort of vaguely related”.
Anyway, if I wanted to state what I think the relation is in a simple way I’d say that they give lower and upper bounds respectively on the capabilities needed from AI systems:
OD-symmetry thesis: We need our defensive AI to be at least as capable as any misaligned AI.
Strategy-stealing assumption: We don’t need our defensive AI to be any more capable than that.
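Compressed into symbols, as a toy formalization of my own (treating “capability” as a single scalar, which is of course a big simplification):

```latex
% Toy formalization (my own): let $c^{*}$ be the minimum capability a
% defensive (aligned) AI needs, and $c_{\mathrm{att}}$ the capability of the
% most capable misaligned AI it has to defend against.
\[
  \text{OD-symmetry thesis (lower bound):} \qquad c^{*} \ge c_{\mathrm{att}},
\]
\[
  \text{strategy-stealing assumption (upper bound):} \qquad c^{*} \le c_{\mathrm{att}}.
\]
```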
I think probably both are not entirely right.
Kind of a delayed response, but: could you clarify what you think the relation is between that post and mine? I think they are somehow sort of related, but I’m not sure what you think the relation is. Are you just trying to say “this is sort of related”, or are you trying to say “the strategy-stealing assumption and this defense-offense symmetry thesis are the same thing”?
In the latter case: I think they are not the same thing, neither in terms of their actual meaning nor their intended purpose:
Strategy-stealing assumption (in the context of AI alignment): for any strategy that a misaligned AI can use to obtain influence/power/resources, humans can employ a similar strategy to obtain a similar amount of influence/power/resources.
This defense-offense symmetry thesis: In certain domains, in order to defend against an attacker, the defender needs the same cognitive skills (knowledge, understanding, models, …) as the attacker (and possibly more).
These seem sort of related, but they are just very different claims, even depending on different ontologies/concepts. One particularly simple-to-state difference is that the strategy-stealing argument is explicitly about symmetric games, whereas the defense-offense symmetry is about a (specific kind of) asymmetric game, where there is a defender who first has some time to build defenses, and then an attacker who can respond to that and exploit any weaknesses. (And the strategy-stealing argument as applied to AI alignment is not literally symmetric, but semi-symmetric in the sense that the relation between initial resources and resulting influence is kind of “linear”.)
So yeah given this, could you say what you think the relation is?
I just had a very quick look at that site, and it seems to be a collection of various chip models with pictures of them? Is there actual information on quantities sold, etc? I couldn’t find it immediately.
Yeah, I know they don’t understand them comprehensively. Is this the point though? I mean, they understand them at the level of abstraction necessary to do what they need, and the claim is that they have basically the same kind of knowledge of computers. Hmm, I guess that isn’t really communicated by my phrasing though, so maybe I should edit that.
I think I communicated unclearly and it’s my fault, sorry for that: I shouldn’t have used the phrase “any easily specifiable task” for what I meant, because I didn’t mean it to include “optimize the entire human lightcone w.r.t. human values”. In fact, I was being vague and probably there isn’t really a sensible notion that I was trying to point to. However, to clarify what I really was trying to say: what I mean by the “hard problem of alignment” is: “develop an AI system that keeps humanity permanently safe from misaligned AI (and maybe other x-risks), and otherwise leaves humanity to figure out what it wants and do what it wants without restricting it in much of any way except some relatively small volume of behaviour around ‘things that cause existential catastrophe’” (maybe this ends up being to develop a second-version AI that then gets free rein to optimize the universe w.r.t. human values, but I’m a bit skeptical). I agree that “solve all of human psychology and moral …” is significantly harder than that (as a technical problem). (Maybe I’d call this the “even harder problem”.)
Ehh, maybe I am changing my mind and also agree that even what I’m calling the hard problem is significantly more difficult than the pivotal act you’re describing, if you can really do it without modelling humans, by going to Mars and doing WBE (whole-brain emulation). But then the whole thing would still have to rely on the WBE, and I find it implausible to do it without that (currently, but you’ve been updating me about the lack of need for human modelling, so maybe I’ll update here too). Basically, the pivotal act is very badly described as merely “melt the GPUs”, and is much more crazy than what I thought it was meant to refer to.
Regarding “rogue”: I just looked up the meaning and I thought it meant “independent from established authority”, but it seems to mean “cheating/dishonest/mischievous”, so I take back that statement about rogueness.
I’ll respond to the “public opinion” thing later.
I’m surprised if I haven’t made this clear yet, but the thing that (from my perspective) seems different between my view and yours is not that Step 1 seems easier to me than it seems to you, but that the “melt the GPUs” strategy (and possibly other pivotal acts one might come up with) seems way harder to me than it seems to you. You don’t have to convince me that “‘any easily human-specifiable task’ is asking for a really mature alignment”, because in my model this is basically equivalent to fully solving the hard problem of AI alignment.
Some reasons:
I don’t see how you can do “melt the GPUs” without having an AI that models humans. What if a government decides to send a black-ops team to kill this new terrorist organization (your alignment research team), or to send a bunch of ICBMs at your research lab, or to do any of a handful of other violent things? Surely the AI needs to understand humans to a significant degree? Maybe you think we can intentionally restrict the AI’s model of humans to precisely those abstractions that this alignment team considers safe and that cover all the human-generated threat models, such as “a black-ops team comes to kill your alignment team” (e.g. the abstraction of a human as a soldier with a gun).
What if global public opinion among scientists turns against you and all ideas about “AI alignment” are from now on considered to be megalomaniacal crackpottery? Maybe part of your alignment team even has this reaction after the event, so now you’re working with a small handful of people on alignment while the world is against you, and you’ve semi-permanently destroyed any opportunity for outside researchers to collaborate effectively on alignment research. Probably your team will fail to solve alignment by themselves. It seems to me this effect alone could be enough to make the whole plan predictably backfire. You must have thought of this effect before, so maybe you consider it unlikely enough to take the risk, or maybe you think it doesn’t matter somehow? To me it seems almost inevitable, and it could only be prevented with a level of secrecy and propaganda that would require your AI to model humans anyway.
These two things alone make me think that this plan doesn’t work in practice in the real world, unless you have basically solved Step 1 already. Although I must say the point I just speculated you might hold, that we could somehow restrict the AI’s model of humans to particular abstractions, gives me some pause, and maybe I end up being wrong via something like that. This doesn’t affect the second bullet point though.
Reminder to the reader: This whole discussion is about a thought experiment that neither party actually seriously proposed as a realistic option. I want to mention this because lines might be taken out of context to give the impression that we are actually discussing whether to do this, which we aren’t.
“you” obviously is whoever would be building the AI system that ended up burning all the GPUs (and ensuring no future GPUs are created). I don’t know such a sequence of events, just as I don’t know the sequence of events for building the “burn all GPUs” system, except at the level of granularity of “Step 1. build a superintelligent AI system that can perform basically any easily human-specifiable task without destroying the world. Step 2. make that system burn all GPUs indefinitely / build security services that prevent misaligned AI from destroying the world”.
I basically meant to say that I don’t know that “burn all the GPUs” isn’t already as difficult as building the security services, because they both require Step 1, which is basically all of the problem (with the caveat that I’m not sure, and I made an edit stating a reason why that might be far from true). I don’t see how you execute the “burn all the GPUs” strategy without solving almost the entire problem.
Linking to my post about Dutch TV: https://www.lesswrong.com/posts/TMXEDZy2FNr5neP4L/datapoint-median-10-ai-x-risk-mentioned-on-dutch-public-tv