I’m going to spend most of this comment responding to your concrete remarks about ELK, but I wanted to start with some meta-level discussion because it seems to cut closer to the heart of the issue and might be more generally applicable.
I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand then try to make progress on them. This involves acknowledging that we can’t anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise. It means not becoming too pessimistic about a direction until we see fairly concretely where it’s stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.
My sense is that you have more faith in a rough intuitive sense you’ve developed of what the “hard part” of alignment is, and so you’d primarily recommend thinking about that until we feel less confused. I disagree in large part because I feel like your broad intuitive sense has not yet had much opportunity to make contact with either reality or with formal reasoning, and I’d guess it’s not precise enough to be a useful guide to research prioritization.
More concretely, you talk about novel mechanisms by which AI systems gain capabilities, but I think you haven’t said much concrete about why existing alignment work couldn’t address these mechanisms. This looks to me like a pretty unproductive stance; I suspect you are wrong about the shape of the problem, but if you are right then I think your main realistic path to impact involves saying something more concrete about why you think this.
I think you don’t see the situation the same way, probably because you feel like you have said plenty concrete. Perhaps this is the most serious disagreement of all. I don’t think saying there is a “capabilities well” is helpfully concrete until you say something about what it looks like, why it poses alignment problems different from SGD and why particular approaches don’t generalize, etc.
In ARC’s day-to-day work we write down particular models of capabilities that would generalize far outside of training (e.g. what about a causal model of the world that holds robustly? what about logical deduction from valid premises with longer chains of reasoning? what about continuing to learn by trial and error when deployed in a novel environment?), and ask whether a given alignment solution would generalize along with them. If we can find any gap, then it goes on the list of problems. We focus on the gaps that seem least likely to be addressable by using known techniques, and try to develop new techniques or to identify general reasons why the gap is unresolvable.
My guess is that you are playing a roughly similar game much more informally, and that you are just making a mistake because reasoning about this stuff is in fact hard. But I can’t really tell, since your thinking is happening in private and we are seeing the vague intuitions that result. (I’ve been hanging around MIRI for a long time, and I suspect I have a better model of your and Eliezer’s position than virtually anyone else outside of MIRI, yet this is still where I’m at.)
Anyway, now turning to your discussion of ELK in particular.
Your first problem is that the recent capabilities gains made by the AGI might not have come from gradient descent (much like how humans’ sudden explosion of technological knowledge accumulated in our culture rather than our genes, once we turned the corner). You might not be able to just “expose the bad behavior” to gradients that you can hit to correct the thing, at least not easily and quickly.
I often think and write about other places where capabilities may come from that could challenge our basic alignment plan. Four particularly salient examples:
Your AI might perform search internally, e.g. looking for hypotheses that match the data or for policies that work well.
Natural selection may occur internally, e.g. cognitive patterns that acquire power might tend to dominate the behavior of your AI (despite the AI having no explicit prediction that they would work well).
Your AI might reason about how to think better, e.g. select cognitive actions based on anticipated consequences of those cognitive actions.
Our AI might deploy new algorithms that pose their own alignment risk for different (potentially unanticipated) reasons.
Some of these represent real problems, but none of them seem to fundamentally change the game or be deal-breakers:
Aligning the internal search seems very similar to aligning SGD on the outside. We could distinguish two additional difficulties in this case:
Because the search is on the inside, we can’t directly apply our alignment insights to align it. Instead we need to ensure that SGD learns to align the search. This itself poses two difficulties: (a) the outer gradient needs to incentivize doing this, (b) we need to argue that it’s nearly as easy for SGD to learn the aligned search as to learn the unaligned search (or build scaffolding such that it becomes similarly easy). This is what we’re talking about in this appendix, and it’s part of why we are skeptical about approaches to ELK based on simple regularizers. But we don’t see a reason that either (a) or (b) would be a dealbreaker, and we tentatively think our current approaches to ontology identification would at least solve (a) if they were successful at all. It’s pretty hard to talk about (b) without having more clarity about what the alignment scheme actually looks like, but we don’t see an in-principle reason it’s hard.
The internal search algorithm may not be SGD, and perhaps our alignment strategy was specific to some detail of SGD. But SGD appears to be amongst the hardest search algorithms to align, and ARC tries to pursue approaches that work for other algorithms rather than leveraging anything about SGD in particular. We’re definitely in the market for other search algorithms that cause trouble but don’t yet know of any.
Natural selection on the inside is similar but potentially more tricky, because the optimizer has more limited control over how this search works. This is like the analog of memetic selection being smarter than humans and eventually overpowering or hijacking human consequentialism. Another extreme example is that it seems like a large enough neural network may be catastrophically misaligned at initialization simply because of selection amongst activation patterns within a single forward pass. Ultimately we’d like to handle this in exactly the same way that we handle the last point, by some combination of (a) we can just directly apply the same hope from the previous section even to natural selection, (b) we can run explicit searches that are more powerful than implicit search by natural selection within our model, which requires ensuring that our explicit learned search captures whatever is good about natural selection (this seems tough but not at all obviously impossible to me). It’s hard to talk about option (a) without seeing if/how we solve the problem from the last point. We could definitely work on option (b) now, and a large enough ARC would be working on it, but it seems like a relatively low priority since it’s both very remote from existing systems and seems relatively unlikely-to-me to be the simplest place where we get stuck.
If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we’ve been succeeding at alignment so far then the model will be trying to stay aligned. By analogy, if humans cared about the amount of human DNA in the universe, then to the extent that cultural evolution was guided by human consequentialism (rather than e.g. being memetic selection), we would be trying to develop cultural machinery that was helpful for maximizing the amount of human DNA in the universe.
One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). I think this is a real problem, but there are a lot of reasons I don’t consider it an existential challenge for our approach:
If you’ve succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, and so on, and our role is to set them up for success just like we are trying to set up future humans for success. AI compresses the timescale both for “new AI algorithms with new alignment problems” and for all of the solutions to those problems, so I don’t think it changes the game from future humans. And so I’d focus on prosaic AI alignment for exactly the same reasons I focus on prosaic AI alignment when trying to help future humans succeed at alignment.
I think that we should be considering the particular algorithms that might pose a new alignment problem, and trying to solve alignment for each of them. If we have some general reason to think that new algorithms will be much harder than old algorithms, or that lessons won’t transfer, then we can discuss those and whether they should affect research prioritization. So far I don’t think we have such arguments, and so I think we should just be looking for algorithms that might pose problems. (I don’t actually think that’s the highest priority, because prosaic ML so obviously poses problems, and the other problems we see seem so closely analogous to the ones posed by prosaic ML. But I’m certainly in the market for other problems and think that a large enough research community should already be actively looking for them.)
Possible disagreements between us: (i) you think that at least one of these examples looks really bad for our approach, (ii) you have other examples in mind, (iii) you don’t think we can write down a concrete example that looks bad, but we have reason to expect other kinds of capability gains that will be bad, (iv) nothing looks like a dealbreaker in particular, but it’s just contributing to a long list of problems you’d have to solve and that’s either a lot of work or something probably won’t work out.
For me, the upshot of all of this is that SGD poses some obvious problems, that those problems are the most likely to actually occur, that they seem similar to (and at least subproblems of) the other alignment problems we may face, and that there are neither super compelling alternatives to aligning SGD nor particular arguments that the rest of the problem is harder than this step.
Your second problem is that the AGI’s concepts might rapidly get totally uninterpretable to your ELK head. Like, you could imagine doing neuroimaging on your mammals all the way through the evolution process. They’ve got some hunger instincts in there, but it’s not like they’re smart enough yet to represent the concept of “inclusive genetic fitness” correctly, so you figure you’ll just fix it when they get capable enough to understand the alternative (of eating because it’s instrumentally useful for procreation). And so far you’re doing great: you’ve basically decoded the visual cortex, and have a pretty decent understanding of what it’s visualizing.
Our goal is to learn a reporter that describes the latent knowledge of the model, and to keep this up to date as the model changes under SGD. If thinking about SGD, we usually think concretely about a single step of SGD, and how you could find a good reporter at the end of that gradient descent step assuming you had one at the beginning.
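To make the "one SGD step at a time" framing concrete, here is a toy sketch in Python. It is not ARC's method: the scalar predictor, the linear reporter, and every function name are hypothetical, and it assumes we have ground-truth labels to fit the reporter against, which is exactly what real ELK does not give us. It only illustrates the bookkeeping: after each predictor step, warm-start the reporter from the previous one and refit it, then compare against a reporter that is never updated.

```python
# Toy sketch only: all names are hypothetical and not taken from the ELK report.
# Assumption: ground-truth labels x are available to train the reporter,
# which is precisely what real ELK lacks.

XS = [-2.0, -1.0, 1.0, 2.0]   # toy "world states"
TARGET_SCALE = 3.0            # the predictor is trained to output 3 * x


def predictor_sgd_step(w, lr=0.05):
    """One gradient step on the predictor, whose latent is h = w * x."""
    grad = sum(2.0 * (w * x - TARGET_SCALE * x) * x for x in XS) / len(XS)
    return w - lr * grad


def reporter_error(r, w):
    """Mean absolute error of the reporter's answer r * h against the true x."""
    return sum(abs(r * (w * x) - x) for x in XS) / len(XS)


def update_reporter(r, w, steps=50, lr=0.02):
    """Warm-start from the previous reporter r and refit it to the new latent."""
    for _ in range(steps):
        grad = sum(2.0 * (r * w * x - x) * (w * x) for x in XS) / len(XS)
        r -= lr * grad
    return r


def run(num_steps=20):
    w = 0.5
    r_tracked = 1.0 / w   # inductive hypothesis: a good reporter at the start
    r_stale = r_tracked   # baseline reporter that is never updated
    worst_tracked = 0.0
    for _ in range(num_steps):
        w = predictor_sgd_step(w)
        r_tracked = update_reporter(r_tracked, w)
        worst_tracked = max(worst_tracked, reporter_error(r_tracked, w))
    return worst_tracked, reporter_error(r_stale, w)


if __name__ == "__main__":
    worst_tracked, stale = run()
    print(f"worst error of tracked reporter: {worst_tracked:.4f}")
    print(f"final error of stale reporter:   {stale:.4f}")
```

Under these assumptions the tracked reporter's error stays small after every step, while the stale reporter drifts badly as the latent changes. The hard part of ELK is performing the update step without the labels this toy leans on.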
It feels to me like what you are saying here is just “you might not be able to solve ELK.” Or else maybe restating the previous point, that the model builds latent knowledge by mechanisms other than SGD and therefore you need to learn a reporter that can also follow along with those other mechanisms.
In either case, I can’t speak to whether it’s helpful for the audience understanding why ELK is hard, but it is certainly not helping me understand why you think ELK is hard. I think this discussion is just too vague to be helpful.
I think it’s not crazy for you to say “ARC’s hopes about how to solve ELK are too vague to seem worth engaging with” (this is pretty similar to me saying “Nate’s arguments about why alignment is hard are too vague to seem worth engaging with”).
Analogously, your ELK head’s abilities are liable to fall off a cliff right as the AGI’s capabilities start generalizing way outside of its training distribution.
But can you say something concrete about why? What I’d like to do is talk about what the AGI is actually thinking, the particular computation it’s running, so that we can talk about why that computation keeps being correlated with reality off distribution and then ask whether the reporter remains correlated with reality. When I go through this exercise I don’t see big dealbreakers, and I can’t tell if you disagree with that diagnosis, or if you are noticing other things that might be going on inside the AI, or if the difference is that I think “this looks like it might work in all the concrete cases we can see” is a relevant signal and you think “nah the cases we can’t see are way worse than those we can see.”
And if they don’t, then this ELK head is (in this hypothetical) able to decode and understand the workings of an alien mind. Likely a kludgey behemoth of an alien mind. This itself is liable to require quite a lot of capability, quite plausibly of the sort that humanity gets first from the systems that took sharp left-turns, rather than systems that ground along today’s scaling curves until they scaled that far.
Again, this seems too vague to be helpful, or perhaps just mistaken. The reporter is not some other AI looking at your predictor and trying to “decode its workings,” or maybe it is but if so it’s just because those English words are vague and broad. Can we talk about the particular kinds of cognition that your AI might be performing, such that you don’t think this works? (Or which would require the reporter to itself be using magic-mystery-juice-of-intelligence?)
That’s really the central theme of my response, so it’s worth restating: ARC loves examples of ways an AI might be thinking such that ELK is difficult. But your description of the sharp left turn is too vague to be helpful for this purpose, and so I’d either like to turn this into more concrete discussion of the internals of the algorithm, or else some significantly more precise argument about why we expect the unknown possible internals to be so much less favorable for ELK than any of the concrete examples we can write down.[1]
[1] I’d like to head off a possible response you might make that I disagree with: “Sure your algorithm works for any example you can write down, but the whole point is that you need it to work for alien cognition, where humans don’t understand why it works. So of course it works on concrete examples but not in the unknown real world.” I’m putting this in a footnote because it seems like a digression and I have no idea if this is your view.
My main response is that we can in fact talk about concrete examples where “why your AI system’s cognition works” isn’t accessible to humans in the relevant ways:
We can consider tricky facts we understand about how to reason, for which our discovery of those facts is empirically contingent (and where discovering those facts is harder than discovering the reasoning itself). Then we can consider whether our AI alignment strategies would work even if humans hadn’t figured out the relevant facts about reasoning.
We can consider AI cognition which is contingent on hypothesized unknown-to-human facts, e.g. about the causal structure of reality, or about key facts about mathematics, or whatever else.
Most of our ELK approaches don’t make no-holds-barred use of “can a human come up with some story about why this AI cognition may work,” and so this just isn’t a particularly salient threshold anyway. As a silly example, if you were solving this problem with a speed prior (or indeed with any of the approaches in the regularization section of the ELK document) you wouldn’t expect a particular key threshold at the space of strategies that a human understands.
I think a productive way forward (when working on alignment or on other research problems) is to try to identify the hardest concrete difficulties we can understand then try to make progress on them. This involves acknowledging that we can’t anticipate all possible problems, but expecting that solving the concrete problems is a useful way to make steps forward and learn general lessons. It involves solving individual challenges, even if none of them will address the whole problem, and even if we have a vague sense that further difficulties will arise.
I would uncharitably summarize this as “let’s just assume that finding a faithful concrete operationalization of the problem is not itself the hard part”. And then, any time finding a faithful concrete operationalization of the problem is itself the hard part, you basically just automatically fail.
Is that… wrong? Am I missing something here? Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself “the hard part”? (I mean, just intuitively, I’d expect hacking away at the legible parts to induce some progress on the illegible, but it sounds extremely slow, to the point where it would very plausibly just not converge to solving the illegible parts at all.)
If I had to guess at your model here, I’d guess your intuition is something like “well, trying to make progress without concrete operationalizations is just really hard, it’s too easy to become decoupled from mathematical/physical reality”. To which my response would be “just because it’s hard does not mean we can ignore it and still expect to solve the problem, especially in a reasonable timeframe”. Yes, staying grounded is hard when finding faithful concrete operationalizations is itself the hard part of the problem, but we can’t actually avoid that.
It means not becoming too pessimistic about a direction until we see fairly concretely where it’s stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.
This is great early on in the process when we don’t yet know what the hard parts are. But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again. Just ignoring those bottlenecks because we don’t know how to operationalize them yet does not sound like an optimal search-strategy. What we want to do is focus on those intuitions, and figure out generalizable operationalizations of the bottlenecks. That itself is often where the hard work is (especially in alignment). Getting hyper-focused on a single concrete failure mode with a single strategy just results in an operationalization which is too narrow and potentially not relevant to most other strategies; a better approach is to look at intuitively-similar failure modes in a bunch of strategies and try to find an operationalization which unifies them and captures the intuitive pattern.
Similarly, once we have some intuition for where the bottlenecks are, it does seem completely correct to mostly dismiss strategies which are not obviously tackling them in some way, even before the bottlenecks are fully formalized. I mean, maybe spot-check one once in a while, but mostly just ignore such strategies. Otherwise, we just waste a ton of time on strategies which are in fact very likely hopeless.
Uncharitably summarizing again (and hopefully you will correct me if this is inaccurate): it sounds like you want to just not update very much on evidence which we don’t know how to formalize yet. And I’d say this is basically the same mistake as e.g. someone who says we have no idea whether an updated version of a covid vaccine works until there’s been a phase-3 clinical trial with a statistically significant result.
It means not becoming too pessimistic about a direction until we see fairly concretely where it’s stuck, partially because we hope that zooming in on a very concrete case where you get stuck is the main way to eventually make progress.
Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It’s like the exact opposite of security mindset.
I don’t think those are great summaries. I think this is probably some misunderstanding about what ARC is trying to do and about what I mean by “concrete.” In particular, “concrete” doesn’t mean “formalized,” it means more like: you are able to discuss a bunch of concrete examples of the difficulty and why they lead to failure of particular concrete approaches; you are able to point out where the problem will appear in a particular decomposition of the problem, and would revise your picture if that turned out to be wrong; etc.
You write:
But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again.
I don’t yet have this sense about a “sharp left turn” bottleneck.
I think I would agree with you if we’d looked at a bunch of plausible approaches, and then convinced ourselves that they would fail. And then we tried to introduce the sharp left turn to capture the unifying theme of those failures and to start exploring what’s really going on. At a high level that’s very similar to what ARC is doing day to day, looking at a bunch of approaches to a problem, seeing why they fail, and then trying to understand the nature of the problem so that we can succeed.
But for the sharp left turn I think we basically don’t have examples. Existing alignment strategies fail in much more basic ways, which I’d call “concrete.” We don’t have examples of strategies that don’t run into concrete difficulties, but they fail for a vague and hard-to-understand reason that we’d summarize as a “sharp left turn.” So I don’t really believe that this difficulty is being abstracted from a pattern of failures.
There can be other ways to learn about problems, and I didn’t think Nate was even saying that this problem is derived from examples of obstructions to potential alignment approaches. I think Nate’s perspective is that he has some pretty good arguments and intuitions about why a sharp left turn will cause novel problems. And so a lot of what I’m saying is that I’m not yet buying it, that I think Nate’s argument has fatal holes in it which are hidden by its vagueness, and that if the arguments are really very important then we should be trying hard to make them more concrete and to address those holes.
Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself “the hard part”?
ARC does theoretical work guided by concrete stories about how a proposed AI system could fail; we are focused on the “legible part” insofar as we try to fix failures for which we can tell concrete stories. I’m not quite sure what you mean by “illegible” and so this might just be a miscommunication, but I think this is the relevant sense of “illegible” so I’ll respond briefly to it.
I think we can tell concrete stories about deceptive alignment; about ontology mismatches making it hard or meaningless to “elicit latent knowledge;” about exploitability of humans making debate impossible; and so on. And I think those stories we can tell seem to do a great job of capturing the reasons why we would expect existing alignment approaches to fail. So if we addressed these concrete stories I would feel like we’ve made real progress. That’s a huge part of my optimism about concrete stories.
It feels to me like either we are miscommunicating about what ARC is doing, or you are saying that those concrete difficulties aren’t the really important failures. That even if an alignment approach addressed all of them, it still wouldn’t represent meaningful progress because the true risk is the risk that cannot be named.
One thing you might mean is that “these concrete difficulties are just shadows of a deeper core.” But I think that’s not actually a challenge to ARC’s approach at all, and it’s not that different from my own view. I think that if you have an intuitive sense of a deep problem, then it’s really great to attack specific instantiations of the problem as a way to learn about the deep core. I feel pretty good about this approach, and I think it’s pretty standard in most disciplines that face problems like this (e.g. if you are deeply confused about physics, it’s good to think a lot about the simplest concrete confusing phenomenon and understand it well; if you are confused about how to design algorithms that overcome a conceptual barrier, it’s good to think about the simplest concrete task that requires crossing that barrier; etc.).
Another thing you might mean is that “these concrete difficulties are distractions from a bigger difficulty that emerges at a later step of the plan.” It’s worth noting that ARC really does try to look at the whole plan and pick the step that is most likely to fail. But I do think it would be a problem for our methodology if there is a good argument about why plans will fail, which won’t let us tell a concrete story about what the failure looks like. My position right now is that I don’t see such an argument; I think we have some vague intuitions, and we have a bunch of examples which do correspond to concrete failure stories. I don’t think there are any examples from which to infer the existence of a difficulty that can’t be captured in concrete stories, and I’m not yet aware of arguments that I find persuasive without any examples. But I’m really quite strongly in the market for such arguments.
Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It’s like the exact opposite of security mindset.
Here’s how the situation feels to me. I know this isn’t remotely fair as a summary of your view, it’s just intended to illustrate where ARC is coming from. (It’s also possible this is a research methodology disagreement, in which case I do just disagree strongly.)
Cryptographer: It seems like our existing proposals for “secure” communication are still vulnerable to man in the middle attacks. Better infrastructure for key distribution is one way to overcome this particular attack, so let’s try to improve that. We can also see how this might fit in with the rest of our security infrastructure to help build to a secure internet, though no doubt the details will change.
Cryptography skeptic: The real difficulty isn’t man in the middle attacks, it’s that security is really hard. By focusing on concrete stuff like man-in-the-middle you are overlooking the real nature of the problem, focusing on the known risks rather than the unknown unknowns. Someone with a true security mindset wouldn’t be fiddling around the edges like this.
I’m not saying that infrastructure for key distribution solves security (and indeed we have huge security problems). I’m saying that working on concrete problems is the right way to make progress in situations like this. I don’t think this is in tension with security mindset. In fact I think effective people with security mindset spend most of their time thinking about concrete risks and how to address them.
It’s great to generalize once you have a bunch of concrete risks and you think there is a deeper underlying pattern. But I think you basically need the examples to learn from, and if there is a real pattern then you should be able to instantiate it in any particular case rather than making reference to the pattern.
I understand the security mindset (from the ordinary paranoia post) as: “What are the unexamined assumptions of your security systems which merely stem from investing or adapting a given model?”. The vulnerability comes from the model. The problem is the “unknowable unknowns”. In addition to the Cryptographer and the Cryptography skeptic, I would add the NSA Quantum computing engineer. Concretisation and operationalisation of these problems may have implicit assumptions that could be system wide catastrophic.
I don’t have clear ways of better articulating this back from analogy to Paul’s concretisations of a proposed AI system. I’m not sure there’s no disanalogy here. However it could be something like “We have this effective model of a proposed AI system. What are useful concretisations in which the AI system would fail?”. The security mindset question would be something like “What representations in the ‘UV-complete’ theory of this AI system would lead to catastrophic failure modes?”
This comment made me notice a kind of duality:
- Paul wants to focus on finding concrete problems, and claims that Nate/Eliezer aren’t being very concrete with their proposed problems.
- Nate/Eliezer want to focus on finding concrete solutions, and claim that Paul/other alignment researchers aren’t being very concrete with their proposed solutions.
It seems like “how well do we understand the problem” is a crux here. I disagree with John’s comment because it feels like he’s assuming too much about our understanding of the problem. If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn’t exist.
I don’t feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so appreciated the comment).
ARC is spending its time right now (i) trying to write down concrete algorithms that solve ELK using heuristic arguments, and then trying to produce concrete examples in which they do the wrong thing, (ii) trying to write down concrete formalizations of heuristic arguments that have the desiderata needed for those algorithms to work, and trying to identify cases in which our algorithms don’t yet meet those desiderata or they may be unachievable. The output is just actual code which is purported to solve major difficulties in alignment.
And on the flip side, I spend a significant amount of my time looking at the algorithms we are proposing (and the bigger plans into which they would fit if successful) and trying to find the best arguments I can that these plans will fail.
I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.
Put differently: I’m not saying that Nate and Eliezer are vague about problems but concrete about solutions, I’m saying they are vague about everything. And I don’t think they are saying that I’m concrete about problems but vague about solutions, they would say that I’m concrete about parts of the solution/problem that don’t matter while systematically pushing all the difficulty into the parts I’m still vague about.
I do think “how well do we understand the problem” seems like a pretty big crux; that leads Nate and Eliezer to think that I’m avoiding the predictably-important difficulty, and it leads me to think that Nate and Eliezer need to get more concrete in order to have an accurate picture of what’s going on.
Yeah, my comment was sloppily phrased; I agree with “I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.”
If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn’t exist.
I don’t think that’s how this works? The strategy I’m recommending explicitly contains two parts where we gain evidence about whether a part of the problem actually exists:
- noticing an intuitive pattern in the failure-modes of some strategies
- attempting to formalize (which presumably includes backpropagating our mathematics into our intuitions)
… so if a part of the problem doesn’t exist, then (a) we probably don’t notice a pattern in the first place, but even if our notoriously unreliable human pattern-matchers over-match, then (b) while we’re attempting to formalize we have plenty of opportunity to notice that maybe the pattern doesn’t actually exist the way we thought it did.
It feels like you’re looking for a duality which does not exist. I mean, the duality between “look for concrete solutions” and “look for concrete problems” I buy (and that would indeed cause one side to be over-optimistic and the other over-pessimistic in exactly the pattern we actually see between Paul and Nate/Eliezer). But it feels like you’re also looking for a duality between how-Paul’s-recommended-search-order-just-fails and how-mine-just-fails. And the reason that duality does not exist is that my recommended search order is using strictly more evidence; Paul is basically advocating ignoring a whole class of very useful evidence, and that makes his strategy straightforwardly suboptimal. If we were both picking different points on a Pareto frontier, then yeah, there’d be a trade-off. But Paul just isn’t on the Pareto frontier.
I feel confused about the difference between your “attempt to formalize” step and Paul’s “attempt to concretize” step. It feels like you can view either as a step towards the other: if you successfully formalize, then presumably you’ll be able to concretize; but also one valuable step towards formalizing is finding concrete examples and then generalizing from them. I think everyone agrees that it’d be great to end up with a formalism for the problem, and then disagrees on how much that process should involve “finding concrete examples of the problem”. My own view is that since it’s so incredibly easy for people to get lost in abstractions, people should try to concretize much more when talking about highly abstract domains. (Even when people are confident that they’re not lost in abstractions, like Eliezer and Nate are, that’s still really useful for conveying ideas to other people.)
Maybe you’ve already got this idea on your list of concrete examples, but while I was reading the original post I had an idea for a concrete example of how a parameter-frozen model which was above some critical intelligence threshold could potentially gain dangerous amounts of intelligence / power if given compute, access to data, and data storage.
I’m not saying that this is particularly probable, just that it seems like a path worth adding to a concrete list of ‘ways things could go tricky’.
Imagine you have a model-system which has a large parameter-frozen language model as its primary component, and some sort of simple RL wrapper giving it goals. It could write and read text from its data storage, writing about hypotheses that it then checks in its data or in its training environment, and recording the results. In theory it seems like the system of self-edited data and frozen model could bootstrap its way to being a smarter overall entity, and then design prompts for its language model component in a directed way to further accelerate this process, and go completely off-the-rails without even a single additional SGD update. Thus, it could ‘leave behind’ the similarly parameter-frozen auditor model.
[Edit: This writing could include the sort of notes one might imagine a human writing for themself, perhaps encoded in a self-created cypher. Or it could be like antagonist samples of tokens in weird out-of-distribution combinations, a sort of self-programming. Or some combination of the two. For an intuition pump for the human note style, imagine an immortal human scientist with a poor memory trapped in a very hard escape room, determined to escape. They’d keep notes on their ideas both for how to solve the huge number of intentional puzzles of the room, and also on exploits they discovered, and also on the processes they used to discover these things. If they thought the maker of the room would spy on their notes and adjust the difficulty of the room to keep them forever trapped, then they’d encrypt their notes and devote part of their limited memory to constantly re-memorizing the encryption key.]
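To make the shape of this loop concrete, here is a minimal toy sketch, assuming a deliberately trivial setting (guessing a hidden number). Every name in it (`FrozenModel`, `environment`, `run_system`) is hypothetical and invented for illustration; it is not a claim about how a real language model or RL wrapper works. The only point it shows is that a component whose parameters never change can still get better at a task over time when it is allowed to write results to external storage and read them back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenModel:
    """Hypothetical stand-in for a parameter-frozen model: its 'weights' never change."""
    weights: tuple = (0.5,)

    def propose(self, notes, low, high):
        # All adaptation comes from reading the notes, not from changing weights.
        for guess, feedback in notes:
            if feedback == "too_low":
                low = max(low, guess + 1)
            elif feedback == "too_high":
                high = min(high, guess - 1)
        return low + int((high - low) * self.weights[0])


def environment(secret, guess):
    """Toy environment that the system can test hypotheses against."""
    if guess < secret:
        return "too_low"
    if guess > secret:
        return "too_high"
    return "correct"


def run_system(model, secret, low=0, high=1023, max_steps=20):
    notes = []  # external, writable data storage (the only thing that changes)
    for step in range(1, max_steps + 1):
        guess = model.propose(notes, low, high)
        feedback = environment(secret, guess)
        notes.append((guess, feedback))  # record the result for later steps
        if feedback == "correct":
            return step, notes
    return None, notes


model = FrozenModel()
steps, notes = run_system(model, secret=700)
print(steps, len(notes), model.weights)
```

Called with empty notes, the same frozen model makes the same first guess every time; the improvement across steps lives entirely in the stored notes, which is the part of the system an auditor looking only at the frozen parameters would not see.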
In ARC’s day to day work we write down particular models of capabilities that would generalize far outside of training (e.g.: what about a causal model of the world that holds robustly? what about logical deduction from valid premises with longer chains of reasoning? what about continuing to learn by trial and error when deployed in a novel environment?), and ask about whether a given alignment solution would generalize along with them. If we can find any gap, then it goes on the list of problems. We focus on the gaps that seem least likely to be addressable by using known techniques, and try to develop new techniques or to identify general reasons why the gap is unresolvable.
My guess is that you are playing a roughly similar game much more informally, and that you are just making a mistake because reasoning about this stuff is in fact hard. But I can’t really tell, since your thinking is happening in private and we are seeing the vague intuitions that result. (I’ve been hanging around MIRI for a long time, and I suspect I have a better model of your and Eliezer’s position than virtually anyone else outside of MIRI, yet this is still where I’m at.)
Anyway, now turning to your discussion of ELK in particular.
I often think and write about other places where capabilities may come from that could challenge our basic alignment plan. Four particularly salient examples:
- Your AI might perform search internally, e.g. looking for hypotheses that match the data or for policies that work well.
- Natural selection may occur internally, e.g. cognitive patterns that acquire power might tend to dominate the behavior of your AI (despite the AI having no explicit prediction that they would work well).
- Your AI might reason about how to think better, e.g. select cognitive actions based on anticipated consequences of those cognitive actions.
- Your AI might deploy new algorithms that pose their own alignment risk for different (potentially unanticipated) reasons.
Some of these represent real problems, but none of them seem to fundamentally change the game or be deal-breakers:
Aligning the internal search seems very similar to aligning SGD on the outside. We could distinguish two additional difficulties in this case:
Because the search is on the inside, we can’t directly apply our alignment insights to align it. Instead we need to ensure that SGD learns to align the search. This itself poses two difficulties: (a) the outer gradient needs to incentivize doing this, (b) we need to argue that it’s nearly as easy for SGD to learn the aligned search as to learn the unaligned search (or build scaffolding such that it becomes similarly easy). This is what we’re talking about in this appendix, and it’s part of why we are skeptical about approaches to ELK based on simple regularizers. But we don’t see a reason that either (a) or (b) would be a dealbreaker, and we tentatively think our current approaches to ontology identification would at least solve (a) if they were successful at all. It’s pretty hard to talk about (b) without having more clarity about what the alignment scheme actually looks like, but we don’t see an in-principle reason it’s hard.
The internal search algorithm may not be SGD, and perhaps our alignment strategy was specific to some detail of SGD. But SGD appears to be amongst the hardest search algorithms, and ARC tries to pursue approaches that work for other algorithms rather than leveraging anything about SGD in particular. We’re definitely in the market for other search algorithms that cause trouble but don’t yet know of any.
Natural selection on the inside is similar but potentially more tricky, because the optimizer has more limited control over how this search works. This is like the analog of memetic selection being smarter than humans and eventually overpowering or hijacking human consequentialism. Another extreme example is that it seems like a large enough neural network may be catastrophically misaligned at initialization simply because of selection amongst activation patterns within a single forward pass. Ultimately we’d like to handle this in exactly the same way that we handle the last point, by some combination of (a) we can just directly apply the same hope from the previous section even to natural selection, (b) we can run explicit searches that are more powerful than implicit search by natural selection within our model, which requires ensuring that our explicit learned search captures whatever is good about natural selection (this seems tough but not at all obviously impossible to me). It’s hard to talk about option (a) without seeing if/how we solve the problem from the last point. We could definitely work on option (b) now, and a large enough ARC would be working on it, but it seems like a relatively low priority since it’s both very remote from existing systems and seems relatively unlikely-to-me to be the simplest place where we get stuck.
If our model is selecting cognitive actions, or designing new algorithms, then our core hope is that an aligned model will try to think in an aligned way. So if we’ve been succeeding at alignment so far then the model will be trying to stay aligned. By analogy, if humans cared about the amount of human DNA in the universe, then to the extent that cultural evolution was guided by human consequentialism (rather than e.g. being memetic selection), we would be trying to develop cultural machinery that was helpful for maximizing the amount of human DNA in the universe.
One way this can go wrong is if our model wants to stay aligned but fails, e.g. because it identifies new techniques for thinking that themselves pose new alignment difficulties (just as we desire human flourishing but may instead implement AI systems that want paperclips). I think this is a real problem, but there are a lot of reasons I don’t consider it an existential challenge for our approach:
If you’ve succeeded at alignment so far, then your AI will also consider this a problem and will be trying to solve it. I think we should relate to our AI, discovering new ways to think that might pose new alignment difficulties, in the same way that we relate to future humans who may encounter alignment difficulties. The AI may solve the problem, or may implement policy solutions, or etc., and our role is to set them up for success just like we are trying to set up future humans for success. AI compresses the timescale both for “new AI algorithms with new alignment problems” but also for all of the solutions to those problems, so I don’t think it changes the game from future humans. And so I’d focus on prosaic AI alignment for exactly the same reasons I focus on prosaic AI alignment when trying to help future humans succeed at alignment.
I think that we should be considering the particular algorithms that might pose a new alignment problem, and trying to solve alignment for each of them. If we have some general reason to think that new algorithms will be much harder than old algorithms, or that lessons won’t transfer, then we can discuss those and whether they should affect research prioritization. So far I don’t think we have such arguments, and so I think we should just be looking for algorithms that might pose problems. (I don’t actually think that’s the highest priority, because prosaic ML so obviously poses problems, and the other problems we see seem so closely analogous to the ones posed by prosaic ML. But I’m certainly in the market for other problems and think that a large enough research community should already be actively looking for them.)
Possible disagreements between us: (i) you think that at least one of these examples looks really bad for our approach, (ii) you have other examples in mind, (iii) you don’t think we can write down a concrete example that looks bad, but we have reason to expect other kinds of capability gains that will be bad, (iv) nothing looks like a dealbreaker in particular, but it’s just contributing to a long list of problems you’d have to solve and that’s either a lot of work or something probably won’t work out.
For me, the upshot of all of this is that SGD poses some obvious problems, that those problems are the most likely to actually occur, that they seem similar to (and at least subproblems of) the other alignment problems we may face, and that there are neither super compelling alternatives to aligning SGD nor particular arguments that the rest of the problem is harder than this step.
Our goal is to learn a reporter that describes the latent knowledge of the model, and to keep this up to date as the model changes under SGD. If thinking about SGD, we usually think concretely about a single step of SGD, and how you could find a good reporter at the end of that gradient descent step assuming you had one at the beginning.
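The inductive framing above (start with a good reporter, let the predictor take one SGD step, then recover a good reporter) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the scalar "predictor" whose latent is `w * s`, the scalar "reporter" `r`, and all function names are hypothetical. Most importantly, the sketch sidesteps the real difficulty of ELK, because it refits the reporter against ground-truth states, which is exactly what is not available in the actual problem. It shows only the bookkeeping of keeping a reporter in sync step by step, not a proposed solution.

```python
def predictor_sgd_step(w, lr=0.05, target=3.0):
    """One SGD step on the toy predictor loss 0.5 * (w - target)**2."""
    return w - lr * (w - target)


def refresh_reporter(r, w, states, lr=0.05, steps=50):
    """Warm-start from the previous reporter r and take a few gradient steps on
    the mean of 0.5 * (r * w * s - s)**2, where w * s is the predictor's latent."""
    for _ in range(steps):
        grad = sum((r * w * s - s) * w * s for s in states) / len(states)
        r -= lr * grad
    return r


def reporter_error(r, w, states):
    """Worst-case gap between the reported state r * (w * s) and the true state s."""
    return max(abs(r * w * s - s) for s in states)


def track_reporter(n_steps=30):
    states = [-1.0, 0.5, 2.0]  # ASSUMPTION: ground-truth states are available
    w, r = 1.0, 1.0            # start with a correct reporter (r * w == 1)
    stale_r = r                # a reporter that is never refreshed
    worst = 0.0
    for _ in range(n_steps):
        w = predictor_sgd_step(w)           # the predictor changes under SGD
        r = refresh_reporter(r, w, states)  # inductive step: repair the reporter
        worst = max(worst, reporter_error(r, w, states))
    return worst, reporter_error(stale_r, w, states)


tracked, stale = track_reporter()
print(f"tracked reporter error: {tracked:.4f}, stale reporter error: {stale:.4f}")
```

The warm start matters in this toy: because each predictor step is small, the previous reporter is already close to correct, so a short local repair is enough, whereas the reporter that is never refreshed drifts arbitrarily far from the predictor's latent.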
It feels to me like what you are saying here is just “you might not be able to solve ELK.” Or else maybe restating the previous point, that the model builds latent knowledge by mechanisms other than SGD and therefore you need to learn a reporter that can also follow along with those other mechanisms.
In either case, I can’t speak to whether it’s helpful for the audience understanding why ELK is hard, but it is certainly not helping me understand why you think ELK is hard. I think this discussion is just too vague to be helpful.
I think it’s not crazy for you to say “ARC’s hopes about how to solve ELK are too vague to seem worth engaging with” (this is pretty similar to me saying “Nate’s arguments about why alignment is hard are too vague to seem worth engaging with”).
But can you say something concrete about why? What I’d like to do is talk about what the AGI is actually thinking, the particular computation it’s running, so that we can talk about why that computation keeps being correlated with reality off distribution and then ask whether the reporter remains correlated with reality. When I go through this exercise I don’t see big dealbreakers, and I can’t tell if you disagree with that diagnosis, or if you are noticing other things that might be going on inside the AI, or if the difference is that I think “this looks like it might work in all the concrete cases we can see” is a relevant signal and you think “nah the cases we can’t see are way worse than those we can see.”
Again, this seems too vague to be helpful, or perhaps just mistaken. The reporter is not some other AI looking at your predictor and trying to “decode its workings,” or maybe it is but if so it’s just because those english words are vague and broad. Can we talk about the particular kinds of cognition that your AI might be performing, such that you don’t think this works? (Or which would require the reporter to itself be using magic-mystery-juice-of-intelligence?)
That’s really the central theme of my response, so it’s worth restating: ARC loves examples of ways an AI might be thinking such that ELK is difficult. But your description of the sharp left turn is too vague to be helpful for this purpose, and so I’d either like to turn this into more concrete discussion of the internals of the algorithm, or else some significantly more precise argument about why we expect the unknown possible internals to be so much less favorable for ELK than any of the concrete examples we can write down.[1]
I’d like to head off a possible response you might make that I disagree with: “Sure your algorithm works for any example you can write down, but the whole point is that you need it to work for alien cognition, where humans don’t understand why it works. So of course it works on concrete examples but not in the unknown real world.” I’m putting this in a footnote because it seems like a digression and I have no idea if this is your view.
My main response is that we can in fact talk about concrete examples where “why your AI system’s cognition works” isn’t accessible to humans in the relevant ways:
We can consider tricky facts we understand about how to reason, for which our discovery of those facts is empirically contingent (and where discovering those facts is harder than discovering the reasoning itself). Then we can consider whether our AI alignment strategies would work even if humans hadn’t figured out the relevant facts about reasoning.
We can consider AI cognition which is contingent on hypothesized unknown-to-human facts, e.g. about the causal structure of reality, or about key facts about mathematics, or whatever else.
Most of our ELK approaches don’t make no-holds-barred use of “can a human come up with some story about why this AI cognition may work,” and so this just isn’t a particularly salient threshold anyway. As a silly example, if you were solving this problem with a speed prior (or indeed with any of the approaches in the regularization section of the ELK document) you wouldn’t expect a particular key threshold at the space of strategies that a human understands.
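As a rough illustration of the last point, here is a minimal sketch of selecting a reporter with a compute penalty. The `Candidate` type, the scoring function, the weight `lam`, and all of the numbers are hypothetical and made up for illustration; this is not the scheme from the ELK document, only a toy showing that a criterion of this kind scores candidates on data fit and running time alone, with no term that depends on whether a human can tell a story about why the cognition works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    train_loss: float   # fit to the labeled training questions
    compute_steps: int  # proxy for how long the reporter takes to run


def regularized_score(c, lam=1e-3):
    # Score = data fit + lam * compute. No term refers to human understanding.
    return c.train_loss + lam * c.compute_steps


def select_reporter(candidates, lam=1e-3):
    """Pick the candidate with the lowest regularized score."""
    return min(candidates, key=lambda c: regularized_score(c, lam))


# Made-up numbers, purely for illustration.
candidates = [
    Candidate("reporter_a", train_loss=0.0, compute_steps=400),
    Candidate("reporter_b", train_loss=0.0, compute_steps=150),
    Candidate("reporter_c", train_loss=0.3, compute_steps=20),
]
best = select_reporter(candidates)
print(best.name)
```

Two candidates that fit the training data equally well are separated only by their compute cost, so nothing in the selection changes at the boundary of what a human happens to understand.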
I would uncharitably summarize this as “let’s just assume that finding a faithful concrete operationalization of the problem is not itself the hard part”. And then, any time finding a faithful concrete operationalization of the problem is itself the hard part, you basically just automatically fail.
Is that… wrong? Am I missing something here? Is there some reason to expect that always working on the legible parts of a problem will somehow induce progress on the illegible parts, even when making-the-illegible-parts-legible is itself “the hard part”? (I mean, just intuitively, I’d expect hacking away at the legible parts to induce some progress on the illegible, but it sounds extremely slow, to the point where it would very plausibly just not converge to solving the illegible parts at all.)
If I had to guess at your model here, I’d guess your intuition is something like “well, trying to make progress without concrete operationalizations is just really hard, it’s too easy to become decoupled from mathematical/physical reality”. To which my response would be “just because it’s hard does not mean we can ignore it and still expect to solve the problem, especially in a reasonable timeframe”. Yes, staying grounded is hard when finding faithful concrete operationalizations is itself the hard part of the problem, but we can’t actually avoid that.
This is great early on in the process when we don’t yet know what the hard parts are. But pretty quickly, we usually see intuitively-similar bottlenecks coming up again and again. Just ignoring those bottlenecks because we don’t know how to operationalize them yet does not sound like an optimal search-strategy. What we want to do is focus on those intuitions, and figure out generalizable operationalizations of the bottlenecks. That itself is often where the hard work is (especially in alignment). Getting hyper-focused on a single concrete failure mode with a single strategy just results in an operationalization which is too narrow and potentially not relevant to most other strategies; a better approach is to look at intuitively-similar failure modes in a bunch of strategies and try to find an operationalization which unifies them and captures the intuitive pattern.
Similarly, once we have some intuition for where the bottlenecks are, it does seem completely correct to mostly dismiss strategies which are not obviously tackling them in some way, even before the bottlenecks are fully formalized. I mean, maybe spot-check one once in a while, but mostly just ignore such strategies. Otherwise, we just waste a ton of time on strategies which are in fact very likely hopeless.
Uncharitably summarizing again (and hopefully you will correct me if this is inaccurate): it sounds like you want to just not update very much on evidence which we don’t know how to formalize yet. And I’d say this is basically the same mistake as e.g. someone who says we have no idea whether an updated version of a covid vaccine works until there’s been a phase-3 clinical trial with a statistically significant result.
Also, a separate issue with this: it sounds like this will systematically generate strategies which ignore unknown unknowns. It’s like the exact opposite of security mindset.
I don’t think those are great summaries. I think this is probably some misunderstanding about what ARC is trying to do and about what I mean by “concrete.” In particular, “concrete” doesn’t mean “formalized,” it means more like: you are able to discuss a bunch of concrete examples of the difficulty and why they lead to failure of particular concrete approaches; you are able to point out where the problem will appear in a particular decomposition of the problem, and would revise your picture if that turned out to be wrong; etc.
You write:
I don’t yet have this sense about a “sharp left turn” bottleneck.
I think I would agree with you if we’d looked at a bunch of plausible approaches, and then convinced ourselves that they would fail. And then we tried to introduce the sharp left turn to capture the unifying theme of those failures and to start exploring what’s really going on. At a high level that’s very similar to what ARC is doing day to day, looking at a bunch of approaches to a problem, seeing why they fail, and then trying to understand the nature of the problem so that we can succeed.
But for the sharp left turn I think we basically don’t have examples. Existing alignment strategies fail in much more basic ways, which I’d call “concrete.” We don’t have examples of strategies that don’t run into concrete difficulties, but they fail for a vague and hard-to-understand reason that we’d summarize as a “sharp left turn.” So I don’t really believe that this difficulty is being abstracted from a pattern of failures.
There can be other ways to learn about problems, and I didn’t think Nate was even saying that this problem is derived from examples of obstructions to potential alignment approaches. I think Nate’s perspective is that he has some pretty good arguments and intuitions about why a sharp left turn will cause novel problems. And so a lot of what I’m saying is that I’m not yet buying it, that I think Nate’s argument has fatal holes in it which are hidden by its vagueness, and that if the arguments are really very important then we should be trying hard to make them more concrete and to address those holes.
ARC does theoretical work guided by concrete stories about how a proposed AI system could fail; we are focused on the “legible part” insofar as we try to fix failures for which we can tell concrete stories. I’m not quite sure what you mean by “illegible” and so this might just be a miscommunication, but I think this is the relevant sense of “illegible” so I’ll respond briefly to it.
I think we can tell concrete stories about deceptive alignment; about ontology mismatches making it hard or meaningless to “elicit latent knowledge;” about exploitability of humans making debate impossible; and so on. And I think those stories we can tell seem to do a great job of capturing the reasons why we would expect existing alignment approaches to fail. So if we addressed these concrete stories I would feel like we’ve made real progress. That’s a huge part of my optimism about concrete stories.
It feels to me like either we are miscommunicating about what ARC is doing, or you are saying that those concrete difficulties aren’t the really important failures. That even if an alignment approach addressed all of them, it still wouldn’t represent meaningful progress because the true risk is the risk that cannot be named.
One thing you might mean is that “these concrete difficulties are just shadows of a deeper core.” But I think that’s not actually a challenge to ARC’s approach at all, and it’s not that different from my own view. I think that if you have an intuitive sense of a deep problem, then it’s really great to attack specific instantiations of the problem as a way to learn about the deep core. I feel pretty good about this approach, and I think it’s pretty standard in most disciplines that face problems like this (e.g. if you are deeply confused about physics, it’s good to think a lot about the simplest concrete confusing phenomenon and understand it well; if you are confused about how to design algorithms that overcome a conceptual barrier, it’s good to think about the simplest concrete task that requires crossing that barrier; etc.).
Another thing you might mean is that “these concrete difficulties are distractions from a bigger difficulty that emerges at a later step of the plan.” It’s worth noting that ARC really does try to look at the whole plan and pick the step that is most likely to fail. But I do think it would be a problem for our methodology if there is a good argument about why plans will fail, which won’t let us tell a concrete story about what the failure looks like. My position right now is that I don’t see such an argument; I think we have some vague intuitions, and we have a bunch of examples which do correspond to concrete failure stories. I don’t think there are any examples from which to infer the existence of a difficulty that can’t be captured in concrete stories, and I’m not yet aware of arguments that I find persuasive without any examples. But I’m really quite strongly in the market for such arguments.
Here’s how the situation feels to me. I know this isn’t remotely fair as a summary of your view, it’s just intended to illustrate where ARC is coming from. (It’s also possible this is a research methodology disagreement, in which case I do just disagree strongly.)
I’m not saying that infrastructure for key distribution solves security (and indeed we have huge security problems). I’m saying that working on concrete problems is the right way to make progress in situations like this. I don’t think this is in tension with security mindset. In fact I think effective people with security mindset spend most of their time thinking about concrete risks and how to address them.
It’s great to generalize once you have a bunch of concrete risks and you think there is a deeper underlying pattern. But I think you basically need the examples to learn from, and if there is a real pattern then you should be able to instantiate it in any particular case rather than making reference to the pattern.
This was a good reply, I basically buy it. Thanks.
I understand the security mindset (from the ordinary paranoia post) as: “What are the unexamined assumptions of your security systems which merely stem from investing or adapting a given model?”. The vulnerability comes from the model. The problem is the “unknowable unknowns”. In addition to the Cryptographer and the Cryptography skeptic, I would add the NSA Quantum computing engineer. Concretisation and operationalisation of these problems may have implicit assumptions that could be system wide catastrophic.
I don’t have clear ways of better articulating this back from analogy to Paul’s concretisations of a proposed AI system. I’m not sure there’s no disanalogy here. However it could be something like “We have this effective model of a proposed AI system. What are useful concretisations in which the AI system would fail?”. The security mindset question would be something like “What representations in the ‘UV-complete’ theory of this AI system would lead to catastrophic failure modes?”
I’m probably missing something here though.
It seems like “how well do we understand the problem” is one a crux here. I disagree with John’s comment because it feels like he’s assuming too much about our understanding of the problem. If you follow his strategy, then you can spend arbitrarily long trying to find a faithful concrete operationalization of a part of the problem that doesn’t exist.
I don’t feel like this is right (though I think this duality feels like a real thing that is important sometimes and is interesting to think about, so appreciated the comment).
ARC is spending its time right now (i) trying to write down concrete algorithms that solve ELK using heuristic arguments, and then trying to produce concrete examples in which they do the wrong thing, (ii) trying to write down concrete formalizations of heuristic arguments that have the desiderata needed for those algorithms to work, and trying to identify cases in which our algorithms don’t yet meet those desiderata or they may be unachievable. The output is just actual code which is purported to solve major difficulties in alignment.
And on the flip side, I spend a significant amount of my time looking at the algorithms we are proposing (and the bigger plans into which they would fit if successful) and trying to find the best arguments I can that these plans will fail.
I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.
Put differently: I’m not saying that Nate and Eliezer are vague about problems but concrete about solutions, I’m saying they are vague about everything. And I don’t think they are saying that I’m concrete about problems but vague about solutions, they would say that I’m concrete about parts of the solution/problem that don’t matter while systematically pushing all the difficulty into the parts I’m still vague about.
I do think “how well do we understand the problem” seems like a pretty big crux; that leads Nate and Eliezer to think that I’m avoiding the predictably-important difficulty, and it leads me to think that Nate and Eliezer need to get more concrete in order to have an accurate picture of what’s going on.
Yeah, my comment was sloppily phrased; I agree with “I think that the disagreement is more about what kind of concreteness is possible or desirable in this domain.”
I don’t think that’s how this works? The strategy I’m recommending explicitly contains two parts where we gain evidence about whether a part of the problem actually exists:
- noticing an intuitive pattern in the failure-modes of some strategies
- attempting to formalize (which presumably includes backpropagating our mathematics into our intuitions)
… so if a part of the problem doesn’t exist, then (a) we probably don’t notice a pattern in the first place, but even if our notoriously unreliable human pattern-matchers over-match, then (b) while we’re attempting to formalize we have plenty of opportunity to notice that maybe the pattern doesn’t actually exist the way we thought it did.
It feels like you’re looking for a duality which does not exist. I mean, the duality between “look for concrete solutions” and “look for concrete problems” I buy (and that would indeed cause one side to be over-optimistic and the other over-pessimistic in exactly the pattern we actually see between Paul and Nate/Eliezer). But it feels like you’re also looking for a duality between how-Paul’s-recommended-search-order-just-fails and how-mine-just-fails. And the reason that duality does not exist is that my recommended search order uses strictly more evidence; Paul is basically advocating ignoring a whole class of very useful evidence, and that makes his strategy straightforwardly suboptimal. If we were both picking different points on a Pareto frontier, then yeah, there’d be a trade-off. But Paul just isn’t on the Pareto frontier.
I feel confused about the difference between your “attempt to formalize” step and Paul’s “attempt to concretize” step. It feels like you can view either as a step towards the other—if you successfully formalize, then presumably you’ll be able to concretize; but also one valuable step towards formalizing is by finding concrete examples and then generalizing from them. I think everyone agrees that it’d be great to end up with a formalism for the problem, and then disagrees on how much that process should involve “finding concrete examples of the problem”. My own view is that since it’s so incredibly easy for people to get lost in abstractions, people should try to concretize much more when talking about highly abstract domains. (Even when people are confident that they’re not lost in abstractions, like Eliezer and Nate are, that’s still really useful for conveying ideas to other people.)
Maybe you’ve already got this idea on your list of concrete examples, but while I was reading the original post I had an idea for a concrete example of how a parameter-frozen model which was above some critical intelligence threshold could potentially gain dangerous amounts of intelligence / power if given compute, access to data, and data storage.
I’m not saying that this is particularly probable, just that it seems like a path worth adding to a concrete list of ‘ways things could go tricky’.
Imagine you have a model-system which has a large parameter-frozen language model as its primary component, and some sort of simple RL wrapper giving it goals. It could write and read text from its data storage, writing about hypotheses that it then checks in its data or in its training environment, and recording the results. In theory it seems like the system of self-edited data and frozen model could bootstrap its way to being a smarter overall entity, and then design prompts for its language model component in a directed way to further accelerate this process, and go completely off-the-rails without even a single additional SGD update. Thus, it could ‘leave behind’ the similarly parameter-frozen auditor model.
[Edit: This writing could include the sort of notes one might imagine a human writing for themselves, perhaps encoded in a self-created cypher. Or it could be like antagonist samples of tokens in weird out-of-distribution combinations, a sort of self-programming. Or some combination of the two. For an intuition pump for the human note style, imagine an immortal human scientist with a poor memory trapped in a very hard escape room, determined to escape. They’d keep notes on their ideas both for how to solve the huge number of intentional puzzles of the room, and also on exploits they discovered, and also on the processes they used to discover these things. If they thought the maker of the room would spy on their notes and adjust the difficulty of the room to keep them forever trapped, then they’d encrypt their notes and devote part of their limited memory to constantly re-memorizing the encryption key.]
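To make the shape of this concrete, here is a deliberately tiny toy of the loop I have in mind. It is only a sketch under heavy assumptions, not a claim about how a real system would look: `frozen_model` is a hypothetical stand-in for the parameter-frozen language model (a fixed function that is never updated), `check` stands in for the data or training environment, and the `notes` list stands in for the data storage. The only thing that changes between steps is the notes, yet the system with notes solves a task that the same frozen function without notes never does.

```python
import re
from typing import Callable, List, Tuple


def frozen_model(prompt: str) -> int:
    """Hypothetical frozen model: a fixed function, never updated.

    It proposes the smallest non-negative integer that the prompt
    does not already record as tried.
    """
    tried = {int(m) for m in re.findall(r"tried (\d+)", prompt)}
    guess = 0
    while guess in tried:
        guess += 1
    return guess


def run(check: Callable[[int], bool], max_steps: int,
        use_notes: bool) -> Tuple[bool, List[str]]:
    """Wrapper loop: propose a hypothesis, check it, record the result."""
    notes: List[str] = []  # stands in for the system's data storage
    for _ in range(max_steps):
        # The model only ever sees text; with use_notes=False it sees nothing.
        prompt = "\n".join(notes) if use_notes else ""
        guess = frozen_model(prompt)
        ok = check(guess)  # stands in for the environment
        notes.append(f"tried {guess}: {'ok' if ok else 'fail'}")
        if ok:
            return True, notes
    return False, notes


def secret_is(n: int) -> Callable[[int], bool]:
    """Toy environment: the task is to find a hidden number."""
    return lambda g: g == n


solved_with_notes, notes = run(secret_is(7), max_steps=20, use_notes=True)
solved_without_notes, _ = run(secret_is(7), max_steps=20, use_notes=False)
```

Obviously nothing here is dangerous or smart; the point is just that "capability of the overall system" and "parameters of the model" can come apart, since all of the progress lives in the self-written notes.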