Productive Mistakes, Not Perfect Answers
This post is part of the work done at Conjecture.
I wouldn’t bet on any current alignment proposal. Yet I think that the field is making progress and abounds with interesting opportunities to do even more, giving us a shot. Isn’t there a contradiction?
No, because research progress so rarely looks like having a clearly correct insight that clarifies everything; instead it often looks like building on apparently unpromising ideas, or studying the structure of the problem. Copernican heliocentrism didn’t initially predict observations as well as Ptolemaic astronomy; both ionic theory and the determination of basic molecular formulas came from combining multiple approaches in chemistry, each getting some bits right without capturing the whole picture; Computer Science emerged from the arid debate over the foundations of mathematics; and Computational Complexity Theory has made more progress by looking at why some of its problems are hard than by waiting for clean solutions.
In the end you do want to solve the problem, obviously. But the road from here to there goes through many seemingly weird and insufficient ideas that are corrected, adapted, refined, often discarded except for a small bit. Alignment is no different, including “strong” alignment.
Research advances through productive mistakes, not perfect answers.
I’m taking this terminology from Goro Shimura’s characterization of his friend Yutaka Taniyama, with whom he formulated the Taniyama-Shimura Conjecture that Andrew Wiles proved in order to prove Fermat’s last theorem.
(Yutaka Taniyama and his time. Very personal recollections, Goro Shimura, 1989)
Though he was by no means a sloppy type, he was gifted with the special capability of making many mistakes, mostly in the right direction. I envied him for this, and tried in vain to imitate him, but found it quite difficult to make good mistakes.
So much of scientific progress takes the form of many people proposing different ideas that end up being partially right, where we can look back later and be like “damn, that was capturing a chunk of the solution.” It’s very rare that people arrive at the solution of any big scientific problem in one nice sweep of a clearly adequate idea. Even when it looks like it (Einstein is an example people like to bring up), they so often build on many of the weird and contradictory takes that came before, as well as the understanding of how the problem works at all (in Einstein’s case, this includes the many, many unconvincing attempts to unify mechanics and electromagnetism, the shape of Maxwell’s equations, the ether drag hypothesis, and Galileo’s relativity principle; he also made a lot of productive mistakes of his own).
Paul Graham actually says the same thing about startups that end up becoming great successes.
(What Microsoft Is This the Altair Basic Of?, Paul Graham, 2015)
One of the most valuable exercises you can try if you want to understand startups is to look at the most successful companies and explain why they were not as lame as they seemed when they first launched. Because they practically all seemed lame at first. Not just small, lame. Not just the first step up a big mountain. More like the first step into a swamp.
Graham proposes a change of polarity in considering lame ideas: instead of looking for flaws, he encourages us to steelman not the idea itself, but how it could lead to greatness.
(What Microsoft Is This the Altair Basic Of?, Paul Graham, 2015)
Most people’s first impulse when they hear about a lame-sounding new startup idea is to make fun of it. Even a lot of people who should know better.
When I encounter a startup with a lame-sounding idea, I ask “What Microsoft is this the Altair Basic of?” Now it’s a puzzle, and the burden is on me to solve it. Sometimes I can’t think of an answer, especially when the idea is a made-up one. But it’s remarkable how often there does turn out to be an answer. Often it’s one the founders themselves hadn’t seen yet.
It’s this mindset that makes me excited about ongoing conceptual alignment research.
I look at ARC’s ELK, and I have disagreements about the constraints, the way of stating the problem, and each proposed solution; but I also see how much productive discussion ELK has generated by pushing people to either solve it, articulate why it’s impossible, or explain why it falls short of capturing the key problems we want to solve.
I look at Steve’s Brain-like AGI Alignment work, and I’m not convinced that we will build brain-like AGI before ML-based AGI or automated economies; but I also see that Steve has been pushing the thinking around value learning and its subtleties, and has found a robust way of transferring results and models from neuroscience to alignment.
I look at John’s Natural Abstraction work, and I’m still unsure whether the natural abstraction hypothesis is correct, or whether it could lead to tractable extraction and analysis of the abstractions used in prediction; but I also see how it reframes the thinking and ideas around the fragility of value, and provides ideas for forcing an ontological lock (if the natural abstraction hypothesis doesn’t hold by default).
I look at Evan’s training stories, and I’m unclear whether this is the right frame for arguing for alignment guarantees, or whether it has blind spots; but I also see how it clarifies misunderstandings around inner alignment, and provides a first step toward a common language for discussing failure modes in prosaic alignment.
I look at Alex’s power-seeking theorems, and I wonder whether they are missing a crucial component about how power is spent, and whether the set of permutations considered fits with how goals are selected in real life; but I also realize that the formalization made these subtleties of instrumental convergence more salient, and provided some intuitions about ways of sampling goals that might reduce power-seeking incentives.
I look at Vanessa’s Infra-Bayesianism work, and I worry that it’s not tackling the crucial part of inferring and capturing human values, and that it goes for too much generality at the cost of shareability; but I also see that it looks particularly good for tackling questions of realizability and breaking self-reference, while yielding powerful enough math that I expect progress on the agenda.
I look at Critch’s RAAPs work, and I don’t know if competitive pressure is a strong enough mechanism to cause that kind of problem, nor am I so sure that the agentic and structural effects can be disentangled; but I also appreciate the attempt to bring more structural-type thinking into alignment, and how this addresses a historical gap in how to think about AI risk and alignment strategies.
And so on for many other works on the AF.[1]
It’s also clear when reading these works and interacting with these researchers that they all get that alignment is about dealing with unbounded optimization, and that they understand the fundamental problems and ideas related to instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis…
None of these approaches looks good enough on its own, and I expect many to shift, get redirected, or even be abandoned in favor of a new iteration. I also expect to criticize their development and to disagree with the researchers involved. Yet I still see the benefits and insights they might deliver, and I want more work to be put into them for that reason.
But isn’t all that avoiding the real problem of finding a solution to the alignment problem right now? No, because each of these approaches gives us better tools, ideas, and handles for tackling the problem, and none of our current proposals work anyway.
That doesn’t look fast, you might answer. And I agree that fundamental science and solving new and complex problems have historically taken way too long for the short timelines we seem to be on. But that’s not a reason to refuse to do the necessary work, or to despair; it’s a reason to find ways of accelerating science! For example, by looking at what historically hampered progress and removing it as much as possible, or by looking at how hidden bits of evidence were revealed and leveraging that to explore the space of ideas and approaches faster.
Okay, but shouldn’t we focus all our efforts on finding smarter and smarter people to work on this problem instead of pushing for the small progress we’re able to make now? I think this misses the point: we don’t want smartness per se, we want the ability to reveal hidden bits of evidence. That’s certainly correlated with smartness, but with one big difference: there are often diminishing returns to the bits of evidence you can get from one angle, which leads to wanting a more diverse portfolio of researchers who are good at harnessing and revealing different streams of evidence. That’s one thing the common question “Which alignment researcher would you want to have 10 copies of?” misses: we want variety, because no one is that good at revealing bits from all the relevant streams of evidence.
To go back to the Einstein example, he was clearly less of a math genius than most of his predecessors who attempted to unify mechanics and electromagnetism, like Poincaré. But that didn’t matter, because what Einstein had was a knack for revealing the hidden bits of evidence in what we already knew about physics and the shape of our laws of physics. And after he did that, many mathematicians and physicists with better math chops pushed his theory and ideas further and revealed incredibly rich models, insights, and predictions.
How do we get more streams of evidence? By making productive mistakes. By attempting to leverage weird analogies and connections, and iterating on them. We should obviously recognize that most of this will be garbage, but you’ll be surprised how many brilliant ideas in the history of science first looked like, or were, garbage.
So if you’re worried about AI risk, and want to know if there’s anything that can be done, the answer is a resounding yes. There are so many ways of improving our understanding and thus our chances: participating in current research programs and agendas, coming up with new weird takes and approaches, exploring the mechanism, history, and philosophy of science to accelerate the process as much as we can…[2]
I don’t know if we’ll make it in time. 5 to 15 years[3] is a tight deadline indeed, and the strong alignment problem is incredibly complex and daunting. But I know this: if we solve the problem and get out of this alive, this will not be by waiting for an obviously convincing approach; it will come instead from making as many productive mistakes as we can, and learning from them as fast as we can.
[1] I’m not discussing applied alignment research here, like the work of Redwood, but I also find that part crucial and productive. It’s just that such work is less about “formulating a solution” and more about “exploring the models and the problems experimentally”, which fits well with the model I’m drawing here.
[2] I’m currently finishing a sequence arguing for more pluralism in alignment and providing an abstraction of the alignment problem that I find particularly good for generating new approaches and understanding how all the different takes and perspectives relate.
[3] The range where many short timelines put the bulk of their probability mass.
Mostly I’d agree with this, but I think there needs to be a bit of caution and balance around the call for variety:
Do we want variety? Absolutely: worlds where things work out well likely correlate strongly with finding a variety of approaches.
However, there’s some risk in Do(increase variety). The ideal is that we get many researchers thinking about the problem in a principled way, and variety happens. If we intentionally push too much for variety, we may end up with a lot of wacky approaches that abandoned too much principled thinking too early. (I think I’ve been guilty of this at times)
That said, I fully agree with the goal of finding a variety of approaches. It’s just rather less clear to me how much an individual researcher should be thinking in terms of boosting variety. (it’s very clear that there should be spaces that provide support for finding different approaches, so I’m entirely behind that; currently it’s much more straightforward to work on existing ideas than to work on genuinely new ones)
Certainly many great ideas initially looked like garbage—but I’ll wager a lot of garbage initially looked like garbage too. I’d be interested in knowing more about the hidden-greatness-garbage: did it tend to have any common recognisable qualities at the time? Did it tend to emerge from processes with common recognisable qualities? In environments with shared qualities?...
I bet Adam will argue that this (or something similar) is the minimum we want for a research idea, because I agree with your idea that we shouldn’t expect a solution to alignment to fall out of the marketing program for Oreos. We want to constrain it to at least “has a plausible story on reducing x-risk”, and maybe what’s mentioned in the quote as well.
For sure I agree that the researcher knowing these things is a good start, so getting as many potential researchers as possible to grok them is important.
My question is about which ideas researchers should focus on generating/elaborating given that they understand these things. We presumably don’t want to restrict thinking to ideas that may overcome all these issues—since we want to use ideas that fail in some respects, but have some aspect that turns out to be useful.
Generating a broad variety of new ideas is great, and we don’t want to be too quick in throwing out those that miss the target. The thing I’m unclear about is something like:
What target(s) do I aim for if I want to generate the set of ideas with greatest value?
I don’t think that “Aim for full alignment solution” is the right target here.
I also don’t think that “Aim for wacky long-shots” is the right target—and of course I realize that Adam isn’t suggesting this.
(we might find ideas that look like wacky long-shots from outside, but we shouldn’t be aiming for wacky long-shots)
But I don’t have a clear sense of what target I would aim for (or what process I’d use, what environment I’d set up, what kind of people I’d involve...), if my goal were specifically to generate promising ideas (rather than to work on them long-term, or to generate ideas that I could productively work on).
Another disanalogy with previous research/invention… is that we need to solve this particular problem. So in some sense a history of:
[initially garbage-looking-idea] ---> [important research problem solved] may not be relevant.
What we need is: [initially garbage-looking-idea generated as attempt to solve x] ---> [x was solved]
It’s not good enough to find ideas that are useful for something; they need to be useful for this.
I expect the kinds of processes that work well to look different from those used where there’s no fixed problem.
There is an implicit assumption here that does not cover all the possible outcomes of research progress.
As we made progress on understanding some open problems in mathematics and computer science, they turned out to be unsolvable. That is a valuable, decision-relevant conclusion: it means it is better to do something else than to keep hacking away at solving that maths problem.
E.g.
Showing that formal mathematical systems can be both consistent and complete (see https://youtu.be/HeQX2HjkcNo)
Showing that a distributed data store can simultaneously guarantee consistency, availability, and partition tolerance (see https://en.m.wikipedia.org/wiki/CAP_theorem)
Solving degree-5 polynomials by radicals (see http://www.scientificlib.com/en/Mathematics/LX/AbelRuffiniTheorem.html).
We cannot just rely on a can-do attitude, as we can with starting a start-up (where even if there’s something fundamentally wrong about the idea, and it fails, only a few people’s lives are impacted hard).
With ‘solving for’ the alignment of generally intelligent and scalable/replicable machine algorithms, it is different.
This is the extinction of human society and all biological life we are talking about. We need to think this through rationally, and consider all outcomes of our research impartially.
I appreciate the emphasis on diverse conceptual approaches. Please be careful about what you are looking for.
I’m confused about what your point here even is. For the first part, if you’re trying to say [...], then that makes sense. But the post didn’t mention anything about that?
You said [...], which I feel is satirizing the post. I read the post to say [...]
We don’t have any proofs that the approaches of the referenced researchers are doomed to fail, the way we have proofs for the impossibility results you linked (or strong evidence in the case of P!=NP). I would predict that Adam does think that approaches running counter to “instrumental convergence, the security mindset, the fragility of value, the orthogonality thesis” are doomed to fail.
Besides looking for different angles or ways to solve alignment, or even for strong arguments/proofs for why a particular technique will not solve alignment, it seems prudent to also look at whether you can prove embedded misalignment by contradiction (in terms of the inconsistency of the inherent logical relations between the essential properties that would need to be defined as part of the concept of embedded/implemented/computed alignment).
This is analogous to the situation Hilbert and his contemporaries found themselves in when trying to ‘solve for’ formal mathematical systems being (presumably) both complete and consistent. Gödel, who was a semi-outsider, instead took the inverse route of proving by contradiction that a sufficiently expressive formal system cannot be simultaneously complete and consistent.
If you have an entire community operating under the assumption that a problem is solvable or at least resolving to solve the problem in the hope that it is solvable, it seems epistemically advisable to have at least a few oddballs attempting to prove that the problem is unsolvable.
Otherwise you end up skewing your entire ‘portfolio allocation’ of epistemic bets.
I understand your point now, thanks. It’s [...] or something of the sort.
Yeah, that points well to what I meant. I appreciate your generous intellectual effort here to paraphrase back!
Sorry about my initially vague and disagreeable comment (aimed at Adam, who I chat with sometimes as a colleague). I was worried about what looks like a default tendency in the AI existential safety community to start from the assumption that problems in alignment are solvable.
Adam has since clarified with me that although he had not written about it in the post, he is very much open to exploring impossibility arguments (and sent me a classic paper on impossibility proofs in distributed computing).
… making your community and (in this case) the wider world fragile to reality proving you wrong.
We don’t know the status or evolution of internal MIRI research, or of independent/individual safety alignment research on LW.
But it seems that AGI has a (much?) higher probability of being invented elsewhere.
So the problem is not only to discover how to safely align AGI but also to invent AGI.
Inventing AGI seems to be a step that comes before discovering how to safely align it, right?
How probable is it estimated to be that the first AGI will be the Singularity? Isn’t it a spectrum? The answer probably lies in the take-off speed and acceleration.
If anyone could provide resources on this it would be much appreciated.
Thousands of highly competent people are working on projects aimed at increasing AI capabilities; there is already a vast financial incentive there. We don’t need to help with that, and we shouldn’t.
If we only figure out alignment after the intelligence explosion, it will be too late. We might get a chance to course correct in a slow take-off, but we definitely can’t count on it.
As for resources, Rob Miles has many excellent introductory videos on AI alignment.