Jeremy Gillen comments on Another argument against maximizer-centric alignment paradigms

Jeremy Gillen 25 Sep 2024 11:55 UTC
Yeah I’m on board with deontological-injunction shaped constraints. See here for example.
Perhaps instead of “attempting to achieve the goal at any cost” it would be better to say “being willing to disregard conventions and costs imposed on uninvolved parties, if considering those things would get in the way of the pursuit of the goal”.
Nah, I still disagree. I think part of why I’m interpreting the words differently is that I’ve seen them used in a bunch of places, e.g. the Lightcone handbook, to describe the Lightcone team, and to describe the culture of some startups (in a positively valenced way).
Being willing to be creative and unconventional: sure, but this is just part of being capable and solving previously unsolved problems. But disregarding conventions that are important for the cooperation you need to achieve your goals? That’s ridiculous.
Being willing to impose costs on uninvolved parties can’t be what is implied by ‘going hard’ because that depends on the goals. An agent that cares a lot about uninvolved parties can still go hard at achieving its goals.
I suspect we may be talking past each other here.
Unfortunately we are not. I appreciate the effort you put into writing that out, but that is the pattern I understood you to be talking about; I just didn’t have time to write out why I disagreed.
I expect this to continue to be true in the future
This is the main point where I disagree. The reason I don’t buy the extrapolation is that there are some (imo fairly obvious) differences between current tech and human-level researcher intelligence, and those differences look like they should strongly interfere with naive extrapolation from current tech. Tbh I thought things like o1 or AlphaProof might cause the people who naively extrapolate from LLMs to notice some of these differences, because I thought they were simply overanchoring on current SoTA, and since the SoTA has changed I thought they would update fast. But it doesn’t seem to have happened much yet. I am a little confused by this.
What observations lead you to suspect that this is a likely failure mode?
I didn’t say likely; it’s more an example of an issue that has come up so far when I try to design ways to solve other problems. Maybe see here for instabilities in trained systems, or here for more about that particular problem.
I’m going to drop out of this conversation now, but it’s been good, thanks! I think there are answers to a bunch of your claims in my misalignment and catastrophe post.