Error

LW server reports: not allowed.

This probably means the post has been deleted or moved back to the author's drafts.

Seth Herd Jul 4, 2024, 10:11 PM
6 points
2

Since you requested feedback and critique:
I think you’re saying that perfect alignment is impossible, and there are nontrivial difficulties to achieving good-enough alignment. I think everyone working on alignment would agree with those things.
To make a contribution to either an informed or novice audience, I think you’ve got to go into more detail on specific points, and reference the existing arguments and counterarguments on each point. More detail point-by-point at the end.
Let me say that your graphics are very impressive, and the piece is well-written. I’d love to see you adapt the piece to a more balanced discussion of alignment—because your arguments don’t come close to establishing that good-enough alignment is possible.
This piece doesn’t really add to the state of knowledge in the field; I’m not sure if it’s intended to, or to be aimed purely at people that aren’t familiar with existing in-depth discussions of alignment. It’s clear that you’re familiar with some work on alignment, but the current discussion goes deeper than your summary here. For instance, perfect alignment (following all of the values of every human simultaneously) is not what anyone on any side of the alignment discussion means by alignment. It’s obviously impossible, as you point out, so it’s not under discussion.
I’d define good-enough alignment as something like: superintelligent AGI helps create a future that most current and future humans think is pretty good, and ideally better than humans would’ve produced without AGI/ASI. There are nontrivial challenges.
I note that the piece doesn’t come with a conclusion. I assume the intended conclusions is: don’t build AGI. But I think publishing this piece is unlikely to really shift public opinion in that direction. It will convince some undecided people that we shouldn’t build AGI. But it will convince some other people that we should build AGI faster. I think it will probably help create polarization in viewpoints, not the nuanced, sensible discussion we need.
That’s because the piece is written to convince, more than to inform. LessWrong asks us to write to inform, not to convince, and there’s a good reason. Writing to convince creates arguments. It forces the audience to say “wait now what are the counterarguments” instead of giving the author’s knowledge of both sides of the topic. It makes the writing sound un-trustworthy, because it is: it’s deliberately not telling the whole truth. There are counterarguments to each of the points you raise. Specifics at the end.
So I’m not sure this article will find an audience on LessWrong as it stands, since it’s written to convince rather than to inform. There are some statements of the problems that might add to the discussion.
To briefly respond point by point: I wish I had a ref handy for each, but this is taking too long already:
To respond in order:
1) societal and engineering progress has never depended on precise definitions.
a) That’s right, and that’s a problem. But we’re almost certainly going to push ahead with just strong-enough arguments for alignment- or even without them. Aligned AGI (good-enough alignment, which is what everyone means by alignment) would be the best thing to happen to humanity, so at least a lot of people will be willing to take some risks when shooting for it.
b) we’ll continue coming up with criteria and rules to test against. Current evals are one example; they’ll need to be expanded to address real AGI.

1. ### Alignment lacks a falsifiable definition

a. There is no way to prove we have completed the goal

b. There is no criteria or rules to test against
1) societal and engineering progress has never depended on precise definitions.
a) That’s right, and that’s a problem. But we’re almost certainly going to push ahead with just strong-enough arguments for alignment- or even without them. Aligned AGI (good-enough alignment, which is what everyone means by alignment) would be the best thing to happen to humanity, so at least a lot of people will be willing to take some risks when shooting for it.
b) we’ll continue coming up with criteria and rules to test against. Current evals are one example; they’ll need to be expanded to address real AGI.

2. ### Proposed methods and ideals contain irresolvable logical contradictions

a. Must align to humanity’s values

i. _Our values are amorphous, unaligned, and interdependent_
Yes, perfect alignment with all humans is impossible. That’s why nobody is proposing it.

b. Must be predictable

i. _The nature of intelligence is unpredictable_
It only needs to be as predictable as intelligence, for a good reason: intelligence puruses goals. If those goals are aligned with human goals (even decently), the exact outcomes will be unpredictable, but we know we’ll like them.

c. Must be solved now before AGI

i. _We race toward AGI as fast as possible_
Yes, this is a big problem. I suggest you help solve them instead of trying to convince people they’re unsolvable. They’re not in theory; whether we solve them in practice depends on how many of us work on it, and how efficiently and cooperatively we work.

d. Must be corrigible to fix behavior

i. _Must not be corrigible to prevent tampering or incorrect modifications_
I’ve never heard anyone say they mustn’t be corrigible. Having a small group of relatively well-ntentioned humans with “veto power” is not that easy to arrange, but it seems like a purely good thing, and possible to achieve

e. Must benefit everyone

i. _Benefits are decided by those who build it_
It can benefit both. If the builders are decent people, they’ll share some of the benefits. Ensuring that decent people build and are in charge of AGI alignment/corrigibility is probably super important for this reason.

f. AI will be utilized to solve alignment

i. _No method to verify alignment is solved or AI is capable of such_
I agree that this is a big outstanding problem. It’s like a plan to make a plan. Which isn’t guaranteed to fail, it’s just not a plan yet. But there are real plans. For some of my favorites, see my We have promising alignment plans with low taxes. For the type of imperfect alignment I think we’ll try for in any plan (because it’s easier), see Instruction-following AGI is easier and more likely than value aligned AGI.
g. Alignment must be perfect

i. _We must prove we have thought of everything, before turning on the machine built to think of everything we cannot._
We neednt’ and won’t. Never have. Maybe we “should” in some sense, if you’re a longtermist and utilitarian, but most peole aren’t. They’re going to take good-enough roughly estimated odds of success, just like humans always have.
Most of these aren’t discussed that much, because they’re not the real problems with aligning AGI. For some of the more commonly discussed poitns, see my brief summary of cruxes of disagreement on alignment difficulty. Those seem to be the most common and most discussed.