The best example I have right now is this thread with Liron, since it demonstrates the errors most cleanly.
Warning: this is a long comment, since I need to characterize the thread fully in order to explain why it demonstrates Liron's terrible epistemics, why safety research is often confused, and more. I will also add my own case for alignment optimism here.
Anyways, let’s get right into the action.
Liron opens by arguing that decentralized AI amongst billions of humans would violate basic constraints, analogous to a perpetual motion machine, yet he doesn't even try to state what those constraints are until later, and when he finally does, they don't hold up.
https://twitter.com/liron/status/1703283147474145297
His scenario is a kind of perpetual motion machine, violating basic constraints that he won’t acknowledge are constraints.
Quintin Pope recognizes that comparing the level of evidence for thermodynamics with the speculation LWers have done about AI alignment is massively unfair: the thermodynamics example is far more solid than virtually everything LW has said about AI. (BTW, this is also why I dislike climate-AI analogies on the evidence front, since the evidence for climate change is likewise far better than anything AI discussion has ever achieved.) Quintin Pope notes that Liron is massively overconfident here.
https://twitter.com/QuintinPope5/status/1703569557053644819
Equating a bunch of speculation about instrumental convergence, consequentialism, the NN prior, orthogonality, etc., with the overwhelming evidence for thermodynamic laws, is completely ridiculous.
Seeing this sort of massive overconfidence on the part of pessimists is part of why I’ve become more confident in my own inside-view beliefs that there’s not much to worry about.
Liron claims that instrumental convergence and the orthogonality thesis are simple deductions, and criticizes Quintin Pope for seemingly having an epistemology that is wildly empiricist.
https://twitter.com/liron/status/1703577632833761583
Instrumental convergence and orthogonality are extremely simple logical deductions. If the only way to convince you about that is to have an AI kill you, that’s not gonna be the best epistemology to have.
Quintin Pope points out that once we try to make these deductions have any implications for AI, things get vastly more complicated, and he uses an example to show how even a well-constructed argument analogous to the AI doom argument fails completely, for predictable reasons:
https://twitter.com/QuintinPope5/status/1703595450404942233
Instrumental convergence and orthogonality are extremely simple logical deductions.
They’re only simple if you ignore the vast complexity that would be required to make the arguments actually mean anything. E.g., orthogonality:
What does it mathematically mean for “intelligence levels” and “goals” to be “orthogonal”?
What does it mean for a given “intelligence level” to be “equally good” at pursuing two different “goals”?
What do any of the quoted things above actually mean?
Making a precise version of the orthogonality argument which actually makes concrete claims about how the structure of “intelligent algorithms space” relates to the structure of “goal encodings space”, would be one of the most amazing feats of formalisation and mathematical argumentation ever.
Suppose you successfully argued that some specific property held between all pairs of “goals” and “intelligence levels”. So what? How does this general argument translate into actual predictions about the real world process of building AI systems?
To show how arguments about the general structure of mathematical objects can fail to translate into the “expected” real world consequences, let’s look at thermodynamics of gas particles. Consider the following argument for why we will all surely die of overpressure injuries, regardless of the shape of the rooms we’re in:
Gas particles in a room are equally likely to be in any possible configuration.
This property is “orthogonal” to room shape, in the specific mechanistic sense that room shape doesn’t change the relative probabilities of any of the allowed particle configurations, merely renders some of them impossible (due to no particles being allowed outside the room).
Therefore, any room shape is consistent with any possible level of pressure being exerted against any of its surfaces (within some broad limitations due to the discrete nature of gas particles).
The range of gas pressures which are consistent with human survival is tiny compared to the range of possible gas pressures.
Therefore, we are near-certain to be subjected to completely unsurvivable pressures, and there’s no possible room shape that will save us from this grim fate.
This argument makes specific, true statements about how the configuration space of possible rooms interacts with the configuration spaces of possible particle positions. But it still fails to be at all relevant to the real world because it doesn’t account for the specifics of how statements about those spaces map into predictions for the real world (in contrast, the orthogonality thesis doesn’t even rigorously define the spaces about which it’s trying to make claims, never mind make precise claims about the relationship between those spaces, and completely forget about showing such a relationship has any real-world consequences).
The specific issue with the above argument is that the “parameter-function map” between possible particle configurations and the resulting pressures on surfaces concentrates an extremely wide range of possible particle configurations into a tiny range of possible pressures, so that the vast majority of the possible pressures just end up being ~uniform on all surfaces of the room. In other words, it applies the “counting possible outcomes and see how bad they are” step to the space of possible pressures, rather than the space of possible particle positions.
The classical learning theory objections to deep learning made the same basic mistake when they said that the space of possible functions that interpolate a fixed number of points is enormous, so using overparameterized models is far more likely to get a random function from that space, rather than a “nice” interpolation.
They were doing the “counting possible outcomes and seeing how bad they are” step to the space of possible interpolating functions, when they should have been doing so in the space of possible parameter settings that produce a valid interpolating function. This matters for deep learning because deep learning models are specifically structured to have parameter-function maps that concentrate enormous swathes of parameter space to a narrow range of simple functions (https://arxiv.org/abs/1805.08522, ignore everything they say about Solomonoff induction).
I think a lot of pessimism about the ability of deep learning training to specify the goals on an NN is based on a similar mistake, where people are doing the “count possible outcomes and see how bad they are” step to the space of possible goals consistent with doing well on the training data, when it should be applied to the space of possible parameter settings consistent with doing well on the training data, with the expectation that the parameter-function map of the DL system will do as it’s been designed to, and concentrate an enormous swathe of possible parameter space into a very narrow region of possible goals space.
If the only way to convince you about that is to have an AI kill you, that’s not gonna be the best epistemology to have. (Liron’s quote)
I’m not asking for empirical demonstrations of an AI destroying the world. I’m asking for empirical evidence (or even just semi-reasonable theoretical arguments) for the foundational assumptions that you’re using to argue AIs are likely to destroy the world. There’s this giant gap between the rigor of the arguments I see pessimists using, versus the scale and confidence of the conclusions they draw from those arguments.
There are components to building a potentially-correct argument with real-world implications, that I’ve spent the entire previous section of this post trying to illustrate. There exist ways in which a theoretical framework can predictably fail to have real-world implications, which do not amount to “I have not personally seen this framework’s most extreme predictions play out before me.”
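To make the “counting in the wrong space” point more concrete, here is a rough toy sketch of my own (it's not from the thread, and the tiny tanh network and sampling setup are just illustrative assumptions): randomly sample parameters of a small MLP and tally which Boolean functions those parameters realize. If the parameter-function map is as compressive as the simplicity-bias paper Quintin links suggests, a handful of simple truth tables should soak up most of parameter space.

```python
# Toy sketch (mine, not Quintin's): sample random parameters of a 3 -> 16 -> 1
# tanh MLP and record which Boolean function on 3-bit inputs each sample
# realizes. The interesting question is how unevenly the 256 possible functions
# get hit.
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
inputs = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)], dtype=float)

def sampled_boolean_function():
    w1 = rng.normal(size=(3, 16)); b1 = rng.normal(size=16)
    w2 = rng.normal(size=(16, 1)); b2 = rng.normal(size=1)
    hidden = np.tanh(inputs @ w1 + b1)
    out = (hidden @ w2 + b2).ravel() > 0
    return tuple(out.astype(int))  # the realized function, as its truth table

counts = Counter(sampled_boolean_function() for _ in range(20_000))
print(f"distinct functions hit: {len(counts)} out of {2 ** 8} possible")
for truth_table, n in counts.most_common(5):
    print(f"{truth_table}: {100 * n / 20_000:.1f}% of parameter samples")
```

The expectation (hedged, since this is only a toy) is that the frequencies concentrate heavily on a few simple truth tables such as the near-constant ones, rather than spreading evenly over all 256 possibilities, which is the “counting in parameter space, not function space” picture Quintin is pointing at.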
Quintin also has other good side tweets off the main thread about the orthogonality thesis and why it either doesn't matter or is outright false for our situation, which you should check out:
https://twitter.com/QuintinPope5/status/1706849035850813656
https://twitter.com/CodyMiner_/status/1706161818358444238
https://twitter.com/QuintinPope5/status/1706849785125519704
“Orthogonality” simply means that a priori intelligence doesn’t necessarily correlate with values. (Simone Sturniolo’s question.)
Correlation doesn’t make sense except in reference to some joint distribution. Under what distribution are you claiming they do not correlate? E.g., if the distribution is, say, the empirical results of training an AI on some values-related data, then your description of orthogonality is a massive, non-”baseline” claim about how values relate to training process. (Quintin Pope’s response.)
What does it mean for a given “intelligence level” to be “equally good” at pursuing two different “goals”?
You’re misstating OT. It doesn’t claim a given intelligence will be equally good at pursuing any 2 goals, just that any goal can be pursued by any intelligence. (Cody Miner’s tweet.)
In order for OT to have nontrivial implications about the space of accessible goals / intelligence tuples, it needs some sort of “equally good” (or at least, “similarly good”) component, otherwise there are degenerate solutions where some goals could be functionally unpersuable, but OT still “holds” because they are not literally 100% unpersuable. (Quintin Pope’s response.)
Anyways, back to the main thread at hand.
Liron argues that the mechanistic knowledge we have about Earth’s pressure is critical to our safety:
https://twitter.com/liron/status/1703603479074456012
The gas pressure analogy to orthogonality is structurally valid.
The fact that Earth’s atmospheric pressure is safe, and that we mechanistically know that nothing we do short of a nuke is going to modify that property out of safe range, are critical to the pressure safety claim.
Quintin counters that the gas pressure argument he constructed, even though it is executed far better than all AI safety arguments to date, still predictably fails to have any real-world consequences:
https://twitter.com/QuintinPope5/status/1703878630445895830
The point of the analogy was not “here is a structurally similar argument to the orthogonality thesis where things turn out fine, so orthogonality’s pessimistic conclusion is probably false.”
The point of my post was that the orthogonality argument isn’t the sort of thing that can possibly have non-trivial implications for the real world. This is because orthogonality:
1: doesn’t define the things it’s trying to make a statement about.
2: doesn’t define the statement it’s trying to make
3: doesn’t correctly argue for that statement.
4: doesn’t connect that statement to any real-world implications.
The point of the analogy to gas pressure is to give you a concrete example of an argument where parts 1-3 are solid, but the argument still completely fails because it didn’t handle part 4 correctly.
Once again, my argument is not “gas pressure doesn’t kill us, so AI probably won’t either”. It’s “here’s an argument which is better-executed than orthogonality across many dimensions, but still worthless because it lacks a key piece that orthogonality also lacks”.
This whole exchange illustrates one of the things I find most frustrating about so many arguments for pessimism: they operate on the level of allegories, not mechanism.
My response to @liron was not about trying to counter his vibes of pessimism with my vibes of optimism. I wasn’t telling an optimistic story of how “deep learning is actually safe if you understand blah blah blah simplicity of the parameter-function map blah blah”.
I was pointing out several gaps in the logical structure of the orthogonality-based argument for AI doom (points 1-4 above), and then I was narrowing in on one specific gap (point 4, the question of how statements about the properties of a space translate into real-world outcomes) and showing a few examples of different arguments that fail because they have structurally equivalent gaps.
Saying that we only know people are safe from overpressure because of x, y, or z, is in no way a response to the argument I was actually making, because the point of the gas pressure example was to show how even one of the gaps in the orthogonality argument is enough to doom an argument that is structurally equivalent to the orthogonality argument.
Liron argues that the gas pressure argument does connect to the real world:
https://twitter.com/liron/status/1703883262450655610
But the gas pressure argument does connect to the real world. It just happens to be demonstrably false rather than true. Your analogy is mine now to prove my point.
Quintin counters that the gas pressure argument doesn't connect to the real world, since it never correctly translates from the math to reality, and this critique generalizes to a lot of AI discourse:
https://twitter.com/QuintinPope5/status/1703889281927032924
But the gas pressure argument does connect to the real world. It just happens to be demonstrably false rather than true.
It doesn’t “just happen” to be false. There’s a specific reason why this argument is (predictably) false: it doesn’t correctly handle the “how does the mathematical property connect to reality?” portion of the argument.
There’s an alternative version of the argument which does correctly handle that step. It would calculate surface pressure as a function of gas particle configuration, and then integrate over all possible gas particle configurations, using the previously established fact that all configurations are equally likely. This corrected argument would actually produce the correct answer, that uniform, constant pressure over all surfaces is by far the most likely outcome.
Even if you had precisely defined the orthogonality thesis, and had successfully argued for it being true, there would still be this additional step where you had to figure out what implications it being true would have for the real world. Arguments lacking this step (predictably) cannot be expected to have any real-world implications.
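As a crude toy check of this “corrected” version of the pressure argument, here's a sketch of my own (not Quintin's, and not real kinetic theory: I'm using particle counts near each wall, per unit area, as a stand-in for momentum flux): sample equally likely particle configurations in an asymmetric box and look at the induced per-wall load.

```python
# Toy sketch (mine): map equally likely particle configurations to a crude
# per-wall "load" (particles within a thin shell of the wall, per unit area)
# and look at how that load is distributed across walls and across samples.
import numpy as np

rng = np.random.default_rng(0)
n_particles, n_samples, shell = 100_000, 200, 0.01
box = np.array([2.0, 1.0, 1.0])  # deliberately asymmetric room, 2 x 1 x 1

wall_loads = []
for _ in range(n_samples):
    pos = rng.uniform(0.0, box, size=(n_particles, 3))
    loads = []
    for axis in range(3):
        wall_area = box.prod() / box[axis]
        loads.append((pos[:, axis] < shell).sum() / wall_area)               # "low" wall
        loads.append((pos[:, axis] > box[axis] - shell).sum() / wall_area)   # "high" wall
    wall_loads.append(loads)

wall_loads = np.array(wall_loads)  # shape: (n_samples, 6 walls)
print("mean load per wall:      ", wall_loads.mean(axis=0).round(1))
print("relative spread per wall:", (wall_loads.std(axis=0) / wall_loads.mean(axis=0)).round(3))
```

Despite the asymmetric room, the per-area loads on all six walls come out essentially equal, with small fluctuations, which matches Quintin's point that integrating over configurations, rather than counting over possible pressures, gives the boring uniform answer.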
Liron then, seemingly without noticing it, substantially weakens his claim: having dropped the idea that AI safety is essentially difficult or impossible, he now makes the vastly weaker claim that AI can be misaligned/unsafe. This is a substantial update that isn't signaled to the reader at all, and the weakened claim is nearly vacuous, since a bare “can” is compatible with virtually anything, including the negation of AI misalignment or of the need for AI governance.
https://twitter.com/liron/status/1704126007652073539
Quintin Pope then re-enters the conversation, since he believed that Liron had conceded, and asks what Liron intended to do here:
https://twitter.com/QuintinPope5/status/1706855532085313554
@liron I previously disengaged from this conversation because I believed you had conceded the main point of contention, and agreed that the orthogonality argument provides no evidence for high probabilities of value misalignment.
Liron motte-and-baileys back to the very strong claim that optimization theory gives us reason to believe aligned AI is extraordinarily improbable (short answer: it doesn't, and it can't support any such claim).
https://twitter.com/liron/status/1706869351348125847
Analogy to physical-law violation: While the basic principles of “optimization theory” don’t quite say aligned AI is impossible (like perpetual motion would be), they say it’s extremely improbable without having a reason to expect many bits of goal engineering to locate aligned behavior in goal-space (and we know we currently don’t understand major parts of the goal engineering that would be needed).
E.g. the current trajectory of just scaling capabilities and doing something like RLHF (or just using a convincing-sounding RLHF’d AI to suggest a strategy for “Superalignment”) has a very low a-priori probability of overcoming that improbability barrier.
Btw I appreciate that you’ve raised some thought-provoking objections to my worldview on LessWrong. I’m interested to chat more if you are, but can we do it as like a 45-minute podcast? IMO it’d be a good convo and help get clarity on our cruxes of disagreement.
Quintin suggests a crux here: his optimization theory, insofar as it can be called a theory, implies that alignment could be relatively easy. I don't buy all of his optimization theory, but I have other sources of evidence for alignment being easy, and his theory is still far better than anything LW ever came up with.
https://twitter.com/QuintinPope5/status/1707916607543284042
I’d be fine with doing a podcast. I think the crux of our disagreement is pretty clear, though. You seem to think there are ‘basic principles of “optimization theory”’ that let you confidently conclude that alignment is very difficult. I think such laws, insofar as we know enough to guess at them, imply alignment somewhere between “somewhat tricky” and “very easy”, with current empirical evidence suggesting we’re more towards the “very easy” side of the spectrum.
Personally, I have no problem with pointing to a few candidate ‘basic principles of “optimization theory”’ that I think support my position. In roughly increasing order of speculativeness:
1: The geometry of the parameter-function map is most of what determines the “prior” of an optimization process over a parameter space, with the relative importance of the map increasing as the complexity of the optimization criterion increases.
2: Optimization processes tend to settle into regions of parameter space with flat (or more accurately, degenerate / singular) parameter-function maps, since those regions tend to map a high volume of parameter space to their associated, optimization criterion-satisfying, functional behavior (though it’s actually the RLCT from singular learning theory that determines the “prior/complexity” of these regions, not their volume).
3: Symmetries in the parameter-function map are most important for determining the relative volumes/degeneracy of different solution classes, with many of those symmetries being entangled with the optimization criterion.
4: Different optimizers primarily differ from each other via their respective distributions of gradient noise across iterations, with the zeroth-order effect of higher noise being to induce a bias towards flat regions of the loss landscape. (somewhat speculative)
5: The Eigenfunctions of the parameter-function map’s local linear approximation form a “basis” translating local movement in parameter space to the corresponding changes in functional behaviors, and the spectrum of the Eigenfunctions determines the relative learnability of different functional behaviors at that specific point in parameter space.
6: Eigenfunctions of the local linearized parameter-function map tend to align with the target function associated with the optimization criterion, and this alignment increases as the optimization process proceeds. (somewhat speculative)
How each of these points suggest alignment is tractable:
Points 1 and 2 largely counter concerns about impossible to overcome under-specification that you reference when you say alignment is “extremely improbable without having a reason to expect many bits of goal engineering to locate aligned behavior in goal-space”. Specifically, deep learning is not actually searching over “goal-space”. It’s searching over parameter space, and the mapping from parameter space to goal space is extremely compressive, such that there aren’t actually that many goals consistent with a given set of training data. Again, this is basically why deep learning works at all, and why overparameterized models don’t just pick a random perfect loss function which fails to generalize outside the training data.
Point 3 suggests that NNs strongly prefer short, parallel ensembles of many shallower algorithms, over a small number of “deep” algorithms (since parallel algorithms have an entire permutation group associated with their relative ordering in the forwards pass, whereas each component of a single deep circuit has to be in the correct relative order). This basically introduces a “speed prior” into the “simplicity prior” of deep nets, and makes deceptive alignment less likely, IMO.
Points 4 and 6 suggest that different optimizers don’t behave that differently from each other, especially when there’s more data / longer training runs. This would mean that we’re less likely to have problems due to fundamental differences in how SGD works as compared to the brain’s pseudo-Hebbian / whatever local update rule it really uses to minimize predictive loss and maximize reward.
Point 5 suggests a lens from which we can examine the learning trajectories of deep networks and quantify how different updates change their functional behaviors over time.
Given this illustration of what I think may count as ‘basic principles of “optimization theory”‘, and a quick explanation of how I think they suggest alignment is tractable, I would like to ask you: what exactly are your ‘basic principles of “optimization theory”’, and how do these principles imply aligned AI is “extremely improbable without having a reason to expect many bits of goal engineering to locate aligned behavior in goal-space”?
Further, I’d like to ask: how do your principles not also say the same thing about, e.g., training grammatically fluent language models of English, or any of the numerous other artifacts we successfully use ML to create? What’s different about human values, and how does that difference interact with your ‘basic principles of “optimization theory”’ to imply that “behaving in accordance with human values” is such a relatively more difficult data distribution to learn, as compared with all the other distributions that deep learning demonstrably does learn?
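As an aside from me: my loose reading of point 5 above is roughly the empirical-NTK picture, where the Jacobian of the network's outputs with respect to its parameters is the local linearization of the parameter-function map, and its singular value spectrum says which functional directions are easy or hard to move in from that point. A minimal numerical sketch of that reading (mine, not Quintin's; the tiny architecture and finite-difference Jacobian are purely illustrative):

```python
# Sketch (mine): at one parameter setting of a tiny 4 -> 8 -> 1 tanh network,
# compute the Jacobian of the batch outputs w.r.t. the parameters and inspect
# its singular value spectrum. Left singular vectors are directions in function
# space (restricted to this batch); singular values are how strongly each
# direction responds to parameter movement, i.e. how locally "learnable" it is.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))            # a fixed batch of inputs
theta = rng.normal(size=(4 * 8 + 8,))   # flattened parameters (w1 then w2)

def forward(theta, X):
    w1 = theta[:32].reshape(4, 8)
    w2 = theta[32:]
    return np.tanh(X @ w1) @ w2         # outputs on the batch, shape (64,)

eps = 1e-5
base = forward(theta, X)
J = np.stack(
    [(forward(theta + eps * np.eye(len(theta))[i], X) - base) / eps
     for i in range(len(theta))],
    axis=1,
)  # shape (64 outputs, 40 parameters)

s = np.linalg.svd(J, compute_uv=False)
print("top singular values:", s[:5].round(2))
print("share of squared spectrum in top 5 directions:",
      round(float((s[:5] ** 2).sum() / (s ** 2).sum()), 3))
```

A steep spectrum here would mean that, locally, a few functional directions dominate what training can easily change, which is the kind of claim points 5 and 6 are gesturing at.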
Liron responds that his optimizer theory says natural goal-optimizer architectures can learn a vast variety of goals, which, combined with the presumption that most goals would be disastrous for humans, makes him worried about AI safety. IMO, his theory is notably worse than Quintin Pope's: it is far more special-cased, and it describes only end results.
https://twitter.com/liron/status/1707950230266909116
There exist natural goal optimizer architectures (analogous to our experience with the existence of natural Turing-complete computing architectures) such that minor modular modifications to its codebase can cause it to optimize any goal in a very large goal-space.
Optimizing the vast majority of goals in this goal-space would be disastrous to humans.
A system with superhuman optimization power tends to foom to far superhuman level and thus become unstoppable by humans.
AI doom hypothesis: In order to survive, we need a precise combination of building something other than the default natural outcome of a rogue superhuman AI optimizing a human-incompatible objective, but we’re not on track to set up the narrow/precise initial conditions to achieve that.
Quintin Pope points out the flaws in Liron’s optimization theory. In particular, they’re outcomes that are relabeled as laws:
https://twitter.com/QuintinPope5/status/1708575273304899643
None of these are actual “laws/theory of optimization”. They are all specific assertions about particular situations, relabeled as laws. They’re the kind of thing you’re supposed to conclude from careful analysis using the laws as a starting point.
Analogously, there is no law of physics which literally says “nuclear weapons are possible”. Rather, there is the standard model of particle physics, which says stuff about the binding energies and interaction dynamics of various elementary particle configurations. From the standard model, one can derive the fact that nuclear weapons must be possible, by analyzing the standard model’s implications in the case that a free neutron impacts a plutonium nucleus.
Laws / theories are supposed to be widely applicable descriptions of a domain’s general dynamics, able to make falsifiable predictions across many different contexts for the domain in question. This is why laws / theories have their special epistemic status. Because they’re so applicable to so many contexts, and make specific predictions for those contexts, each of those contexts acts as experimental validation for the laws / theories.
In contrast, a statement like “A system with superhuman optimization power tends to foom to far superhuman level and thus become unstoppable by humans.” is specific to a single (not observed) context, and so it cannot possibly have the epistemic status of an actual law / theory, not unless it’s very clearly implied by an actual law / theory.
Of course, none of my proposed laws have the epistemic backing of the laws of physics. The science of deep learning isn’t nearly advanced enough for that. But they do have this “character” of physical laws, where they’re applicable to a wide variety of contexts (and can thus be falsified / validated in a wide variety of contexts). Then, I argue from the proposed laws to the various alignment-relevant conclusions I think they support. I don’t list out the conclusion that I think support optimism, then call them laws / theory.
I previously objected to your alluding to thermodynamic laws in regards to the epistemic status of your assertions (https://x.com/QuintinPope5/status/1703569557053644819?s=20). I did so because I was quite confident that there do not exist any such laws of optimization. I am still confident in that position.
Overall, I see pretty large issues with Liron's side of the conversation: he moves back and forth between two different claims, one that is defensible but has ~no implications, and one that has implications but needs much, much more work before it holds up.
Liron is also massively overconfident in his theories here, which is further bad news.
Some additions to the AI alignment optimism case are presented below, to show that the case for optimism is reasonably robust.
For more on why RLHF is actually extraordinarily general for AI alignment, Quintin Pope’s comment on LW basically explains it better than I can:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/?commentId=Lj3gJmjMMSS24bbMm
For the more general AI alignment optimism case, Nora Belrose has a section of her post dedicated to the point that AIs are white boxes, not black boxes. It definitely overestimates how easy this makes things: I do not believe we can analyze or manipulate today's ANNs at essentially zero cost, and Steven Byrnes in the comments is right to point out a worrisome motte-and-bailey that Nora Belrose makes. Even so, it is drastically easier to analyze ANNs than brains today:
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#Alignment_optimism__AIs_are_white_boxes
For an untested but highly promising solution to the AI shutdown problem, these three posts are necessary reading: Elliott Thornley found a way to usefully weaken expected utility maximization so that it retains most of its desirable properties without making the AI unshutdownable or prone to other bad behaviors. This might be implemented using John Wentworth's idea of subagents.
Sami Petersen’s post on Invulnerable Incomplete Preferences: https://www.lesswrong.com/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1
Elliott Thornley’s submission for the AI contest: https://s3.amazonaws.com/pf-user-files-01/u-242443/uploads/2023-05-02/m343uwh/The Shutdown Problem- Two Theorems%2C Incomplete Preferences as a Solution.pdf
John Wentworth’s post on subagents, for how this might work in practice: https://www.lesswrong.com/posts/3xF66BNSC5caZuKyC/why-subagents
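To give a flavor of the incomplete preferences idea, here is a cartoon of my own, which is emphatically not Thornley's actual formalism: if the agent only has preferences between trajectories that share the same shutdown time, then by its own lights it never gains anything from paying a cost to shift probability between shutdown times.

```python
# Cartoon only (mine, not Thornley's formalism): preferences are defined only
# between trajectories of the same length (same shutdown time); trajectories of
# different lengths are simply incomparable.
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    length: int    # steps before shutdown
    reward: float  # how well the agent's goal was pursued while running

def prefers(a: Trajectory, b: Trajectory) -> bool:
    # No preference at all across different shutdown times.
    return a.length == b.length and a.reward > b.reward

short = Trajectory(length=3, reward=5.0)
long_run = Trajectory(length=10, reward=9.0)
long_run_after_paying = Trajectory(length=10, reward=8.0)  # paid to disable the button

print(prefers(long_run, long_run_after_paying))            # True: within a length, paying a cost is dispreferred
print(prefers(long_run, short), prefers(short, long_run))  # False False: no urge to resist shutdown
```

The intended moral (again, only a cartoon): paying to manipulate the shutdown button loses some within-length comparison and never wins any across-length comparison, so the agent has no incentive to do it.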
Damn, this was a long comment for me to make, since I needed it to be a reference for the future when people ask me about my optimism on AI safety, and the problems with AI epistemics, and I want it to be both self-contained and dense.
Appreciate the detailed analysis.

I don’t think this was a good debate, but I felt I was in a position where I would have had to invest a lot of time to do better by the other side’s standards.
Quintin and I have agreed to do an X Space debate, and I’m optimistic that format can be more productive. While I don’t necessarily expect to update my view much, I am interested to at least understand what the crux is, which I’m not super clear on atm.
Here’s a meta-level opinion:
I don’t think it was the best choice of Quintin to keep writing replies that were disproportionally long compared to mine.
There’s such a thing as zooming claims and arguments out. When I write short tweets, that’s what I’m doing. If he wants to zoom in on something, I think it would be a better conversation if he made an effort to do it less at a time, or do it for fewer parts at a time, for a more productive back & forth.
I don’t think it was the best choice of Quintin to keep writing replies that were disproportionally long compared to mine.
I understand why you feel this way, but I do think that it was sort of necessary to respond like this, primarily because I see a worrisome asymmetry between the arguments for AI doom and AI being safe by default.
AI doom arguments are more intuitive than AI safety by default arguments, making AI doom arguments requires less technical knowledge than AI safety by default arguments, and critically the AI doom arguments are basically entirely wrong, and the AI safety by default arguments are mostly correct.
Thus, Quintin Pope has to respond at length, since refuting bullshit or wrong theories takes much longer than making intuitive but wrong arguments for AI doom.
Quintin and I have agreed to do an X Space debate, and I’m optimistic that format can be more productive.
Alright, that might work. I’m interested to see whether you will write up a transcript, or whether I will be able to join the X space debate.
“AI doom arguments are more intuitive than AI safety by default arguments, making AI doom arguments requires less technical knowledge than AI safety by default arguments, and critically the AI doom arguments are basically entirely wrong, and the AI safety by default arguments are mostly correct.”
I really don’t like that you make repeated assertions like this. Simply claiming that your side is right doesn’t add anything to the discussion and easily becomes obnoxious.
I really don’t like that you make repeated assertions like this. Simply claiming that your side is right doesn’t add anything to the discussion and easily becomes obnoxious.
Yes, I was trying to be short rather than write the long comment or post justifying this claim, because I had to write at least two long comments on this issue.
But thank you for pointing this out. I definitely agree that it was wrong of me to simply claim I was right without trying to show why, especially without explaining things.
Now I'm thinking that text-based interaction is actually bad here, since we can't communicate a lot of information through it.
Seems fair to tag @Liron here.
How did you manage to tag Liron, exactly? But yes, I will be waiting for Liron, as well as other interested parties, to respond.
Simply type the at-symbol to tag people. I don’t know when LW added this, but I’m glad we have it.