Same person as nostalgebraist2point0, but now I have my account back.
Elsewhere:
I have signed no contracts or agreements whose existence I cannot mention.
In the situation assumed by your first argument, AGI would be very unlikely to share our values even if our values were much simpler than they are.
Complexity makes things worse, yes, but the conclusion “AGI is unlikely to have our values” is already entailed by the other premises even if we drop the stuff about complexity.
Why: if we’re just sampling some function from a simplicity prior, we’re very unlikely to get any particular nontrivial function that we’ve decided to care about in advance of the sampling event. There are just too many possible functions, and probability mass has to get divided among them all.
In other words, if it takes N bits to specify human values, there are 2^N ways that a bitstring of the same length could be set, and we’re hoping to land on just one of those through luck alone. (And to land on a bitstring of this specific length in the first place, of course.) Unless N is very small, such a coincidence is extremely unlikely.
And N is not going to be that small; even in the sort of naive and overly simple “hand-crafted” value specifications which EY has critiqued in this post and elsewhere, a lot of details have to be specified. (E.g. some proposals refer to “humans” and so a full algorithmic description of them would require an account of what is and isn’t a human.)
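To put rough numbers on the counting argument, here is a minimal sketch. It writes N for the number of bits needed to specify human values and assumes, purely for illustration, that all bitstrings of that length are equally likely; the function name is hypothetical. A real simplicity prior would also have to land on the right length first, which only lowers the odds further.

```python
from fractions import Fraction

def chance_of_exact_match(n_bits: int) -> Fraction:
    """Probability that a uniformly random n-bit string equals one fixed target.

    Assumption for illustration only: every n-bit string is equally likely.
    """
    return Fraction(1, 2 ** n_bits)

# The odds halve with every extra bit the specification needs.
for n in (8, 64, 1000):
    print(f"N={n}: probability {float(chance_of_exact_match(n)):.3e}")
```

Even a specification far shorter than any realistic account of "what is and isn't a human" already puts the chance of a lucky hit far below anything worth betting on.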
One could devise a variant of this argument that doesn’t have this issue, by “relaxing the problem” so that we have some control, just not enough to pin down the sampled function exactly. And then the remaining freedom is filled randomly with a simplicity bias. This partial control might be enough to make a simple function likely, while not being able to make a more complex function likely. (Hmm, perhaps this is just your second argument, or a version of it.)
This kind of reasoning might be applicable in a world where its premises are true, but I don’t think its premises are true in our world.
In practice, we apparently have no trouble getting machines to compute very complex functions, including (as Matthew points out) specifications of human value whose robustness would have seemed like impossible magic back in 2007. The main difficulty, if there is one, is in “getting the function to play the role of the AGI values,” not in getting the AGI to compute the particular function we want in the first place.
What this post is trying to illustrate is that if you try putting crisp physical predicates on reality, that won’t work to say what you want. This point is true!
Matthew is not disputing this point, as far as I can tell.
Instead, he is trying to critique some version of[1] the “larger argument” (mentioned in the May 2024 update to this post) in which this point plays a role.
You have exhorted him several times to distinguish between that larger argument and the narrow point made by this post:
[...] and if you think that some larger thing is not correct, you should not confuse that with saying that the narrow point this post is about, is incorrect [...]
But if you’re doing that, please distinguish the truth of what this post actually says versus how you think these other later clever ideas evade or bypass that truth.
But it seems to me that he’s already doing this. He’s not alleging that this post is incorrect in isolation.
The only reason this discussion is happening in the comments of this post at all is the May 2024 update at the start of it, which Matthew used as a jumping-off point for saying “my critique of the ‘larger argument’ does not make the mistake referred to in the May 2024 update[2], but people keep saying it does[3], so I’ll try restating that critique again in the hopes it will be clearer this time.”
[1] I say “some version of” to allow for a distinction between (a) the “larger argument” of Eliezer_2007’s which this post was meant to support in 2007, and (b) whatever version of the same “larger argument” was a standard MIRI position as of roughly 2016-2017.
As far as I can tell, Matthew is only interested in evaluating the 2016-2017 MIRI position, not the 2007 EY position (insofar as the latter is different, if in fact it is). When he cites older EY material, he does so as a means to an end – either as indirect evidence of later MIRI positions, or because it was itself cited in the later MIRI material which is his main topic.
[2] Note that the current version of Matthew’s 2023 post includes multiple caveats that he’s not making the mistake referred to in the May 2024 update.
Note also that Matthew’s post only mentions this post in two relatively minor ways, first to clarify that he doesn’t make the mistake referred to in the update (unlike some “Non-MIRI people” who do make the mistake), and second to support an argument about whether “Yudkowsky and other MIRI people” believe that it could be sufficient to get a single human’s values into the AI, or whether something like CEV would be required instead.
I bring up the mentions of this post in Matthew’s post in order to clarify what role “is ‘The Hidden Complexity of Wishes’ correct in isolation, considered apart from anything outside it?” plays in Matthew’s critique – namely, none at all, IIUC.
(I realize that Matthew’s post has been edited over time, so I can only speak to the current version.)
[3] To be fully explicit: I’m not claiming anything about whether or not the May 2024 update was about Matthew’s 2023 post (alone or in combination with anything else). I’m just rephrasing what Matthew said in the first comment of this thread (which was also agnostic on the topic of whether the update referred to him).
Thanks for the links!
I was pleased to see OpenAI reference this in their justification for why they aren’t letting users see o1’s CoT (even though of course it was a total non sequitur; they can show the users the CoT without also training on the resulting judgments).
As it happens, the decision to hide o1 CoTs was one of the main things that motivated me to write this post. Or rather, the muted reaction to it / lack of heated debate about it.
The way I see things, the ability to read CoTs (and more generally “the fact that all LLM sampling happens in plain sight”) is a huge plus for both alignment and capabilities – it’s a novel way for powerful AI to be useful and (potentially) safe that people hadn’t even really conceived of before LLMs existed, but which we now hold in our hands.
So when I saw that o1 CoTs would be hidden, that felt like a turning point, a step down a very bad road that we didn’t have to choose.
(Like, remember those Anthropic deception papers that had a hidden scratchpad, and justified it by saying it was modeling a scenario where the model had learned to do similar reasoning inside a forward pass and/or steganographically? At the time I was like, “yeah, okay, obviously CoTs can’t be hidden in real life, but we’re trying to model those other situations, and I guess this is the best we can do.”
I never imagined that OpenAI would just come out and say “at long last, we’ve built the Hidden Scratchpad from Evan Hubinger’s sci-fi classic Don’t Build The Hidden Scratchpad”!)
Although I saw some people expressing frustration about the choice to hide o1 CoTs, it didn’t seem like other people were reacting with the intensity I’d expect if they shared my views. And I thought, hmm, well, maybe everyone’s just written off CoTs as inherently deceptive at this point, and that’s why they don’t care. And then I wrote this post.
(That said, I think I understand why OpenAI is doing it – some mixture of concern about people training on the CoTs, and/or being actually concerned about degraded faithfulness while being organizationally incapable of showing anything to users unless they put that thing under pressure to look nice and “safe” in a way that could degrade faithfulness. I think the latter could happen even without a true feedback loop where the CoTs are trained on feedback from actual users, so long as they’re trained to comply with “what OpenAI thinks users like” even in a one-time, “offline” manner.
But then at that point, you have to ask: okay, maybe it’s faithful, but at what cost? And how would we even know? If the users aren’t reading the CoTs, then no one is going to read them the vast majority of the time. It’s not like OpenAI is going to have teams of people monitoring this stuff at scale.)
So what we have is not “the CoT remains constant but the answers vary”. Instead, the finding is: “a CoT created in response to a biased prompt changes in order to match the bias, without mentioning the bias.”
Thanks for bringing this up.
I think I was trying to shove this under the rug by saying “approximately constant” and “~constant,” but that doesn’t really make sense, since of course the CoTs actually vary dramatically in response to the biasing features. (They have to, in order to justify different final answers.)
To be honest, I wrote the account of Turpin et al in the post very hastily, because I was really mainly interested in talking about the other paper. My main reaction to Turpin et al was (and still is) “I don’t know what you expected, but this behavior seems totally unsurprising, given its ubiquity among humans (and hence in the pretraining distribution), and the fact that you didn’t indicate to the model that it wasn’t supposed to do it in this case (e.g. by spelling that out in the prompt).”
But yeah, that summary I wrote of Turpin et al is pretty confused – when I get a chance I’ll edit the post to add a note about this.
Thinking about it more now, I don’t think it makes sense to say the two papers discussed in the post were both “testing the causal diagram (question → CoT → answer)” – at least not in the same sense.
As presented, that diagram is ambiguous, because it’s not clear whether nodes like “CoT” are referring to literal strings of text in the context window, or to something involving the semantic meaning of those strings of text, like “the aspects of the problem that the CoT explicitly mentions.”
With Lanham et al, if we take the “literal strings of text” reading, then there’s a precise sense in which the paper is testing the causal diagram.
In the “literal strings” reading, only arrows going from left-to-right in the context window are possible (because of the LLM’s causal masking). This rules out e.g. “answer → CoT,” and indeed almost uniquely identifies the diagram: the only non-trivial question remaining is whether there’s an additional arrow “question → answer,” or whether the “question”-”answer” relationship is mediated wholly through “CoT.” Testing whether this arrow is present is exactly what Lanham et al did. (And they found that it was present, and thus rejected the diagram shown in the post, as I said originally.)
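A minimal sketch of why the literal-strings reading leaves so little freedom. The segment boundaries and helper names below are hypothetical; the only real ingredient is the lower-triangular causal mask, under which a position can be influenced only by itself and earlier positions.

```python
def causal_mask(n_tokens):
    """mask[i][j] is True when position i can be influenced by position j."""
    return [[j <= i for j in range(n_tokens)] for i in range(n_tokens)]

# Hypothetical layout of an 8-token context window (an assumption for illustration).
SEGMENTS = {
    "question": range(0, 3),
    "CoT": range(3, 6),
    "answer": range(6, 8),
}

def can_influence(src, dst, mask):
    """True if any token of segment `src` is visible to any token of segment `dst`."""
    return any(mask[i][j] for i in SEGMENTS[dst] for j in SEGMENTS[src])

mask = causal_mask(8)
for src, dst in [("question", "CoT"), ("CoT", "answer"),
                 ("question", "answer"), ("answer", "CoT")]:
    print(f"{src} -> {dst}: {can_influence(src, dst, mask)}")
```

The mask only says which arrows are possible. Whether the model actually uses the permitted "question → answer" arrow is an empirical matter, which is the thing Lanham et al tested.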
By contrast, Turpin et al are not really testing the literal-strings reading of the diagram at all. Their question is not “which parts of the context window affect which others?” but “which pieces of information affect which others?”, where the “information” we’re talking about can include things like “whatever was explicitly mentioned in the CoT.”
I think there is perhaps a sense in which Turpin et al are testing a version of the diagram where the nodes are read more “intuitively,” so that “answer” means “the value that the answer takes on, irrespective of when in the context window the LLM settles upon that value,” and “CoT” means “the considerations presented in the CoT text, and the act of writing/thinking-through those considerations.” That is, they are testing a sort of (idealized, naive?) picture where the model starts out the CoT not having any idea of the answer, and then brings up all the considerations it can think of that might affect the answer as it writes the CoT, with the value of the answer arising entirely from this process.
But I don’t want to push this too far – perhaps the papers really are “doing the same thing” in some sense, but even if so, this observation probably confuses matters more than it clarifies them.
As for the more important higher-level questions about the kind of faithfulness we want and/or expect from powerful models… I find stuff like Turpin et al less worrying than you do.
First, as I noted earlier: the kinds of biased reasoning explored in Turpin et al are ubiquitous among humans (and thus the pretraining distribution), and when humans do them, they basically never mention factors analogous to the biasing factors.
When a human produces an argument in writing – even a good argument – the process that happened was very often something like:
1. (Half-consciously at best, and usually not verbalized even in one’s inner monologue) I need to make a convincing argument that P is true. This is emotionally important for some particular reason (personal, political, etc.)
2. (More consciously now, verbalized internally) Hmm, what sorts of arguments could be evinced for P? [Thinks through several of them and considers them critically, eventually finding one that seems to work well.]
3. (Out loud) P is true because [here they provide a cleaned-up version of the “argument that seemed to work well,” crafted to be clearer than it was in their mind at the moment they first hit upon it, perhaps with some extraneous complications pruned away or the like].
Witness the way that long internet arguments tend to go, for example. How both sides keep coming back, again and again, bearing fresh new arguments for P (on one side) and arguments against P (on the other). How the dispute, taken as a whole, might provide the reader with many interesting observations and ideas about the object-level truth-value of P, and yet never touch on the curious fact that these observations/ideas are parceled out to the disputants in a very particular way, with all the stuff that weighs in favor of P spoken by one of the two voices, and all the stuff that weighs against P spoken by the other.
And how it would, in fact, be very weird to mention that stuff explicitly. Like, imagine someone in an internet argument starting out a comment with the literal words: “Yeah, so, reading your reply, I’m now afraid that people will think you’ve not only proven that ~P, but proven it in a clever way that makes me look dumb. I can’t let that happen. So, I must argue for P, in such a way that evades your clever critique, and which is itself very clever, dispelling any impression that you are the smarter of the two. Hmm, what sorts of arguments fit that description? Let’s think step by step...”
Indeed, you can see an example of this earlier in this very comment! Consider how hard I tried to rescue the notion that Turpin et al were “testing the causal diagram” in some sense, consider the contortions I twisted myself into trying to get there. Even if the things I said there were correct, I would probably not have produced them if I hadn’t felt a need to make my original post seem less confused than it might otherwise seem in light of your comment. And yet I didn’t say this outright, at the time, above; of course I didn’t; no one ever does[1].
So, it’s not surprising that LLMs do this by default. (What would be surprising is if we found, somehow, that they didn’t.)
They are producing text that is natural, in a human sense, and that text will inherit qualities that are typical of humans except as otherwise specified in the prompt and/or in the HHH finetuning process. If we don’t specify what we want, we get the human default[2], and the human default is “unfaithful” in the sense of Turpin et al.
But we… can just specify what we want? Or try to? This is what I’m most curious about as an easy follow-up to work like Turpin et al: to what extent can we get LLM assistants to spell out the unspoken drivers of their decisions if we just ask them to, in the prompt?
(The devil is in the details, of course: “just ask” could take various forms, and things might get complicated if few-shots are needed, and we might worry about whether we’re just playing whack-a-mole with the hidden drivers that we just so happen to already know about. But one could work through all of these complications, in a research project on the topic, if one had decided to undertake such a project.)
A second, related reason I’m not too worried involves the sort of argumentation that happens in CoTs, and how we’re seeing this evolve over time.
What one might call “classic CoT” typically involves the model producing a relatively brief, straight-to-the-point argument, the sort of pared-down object for public consumption that a human might produce in “step 3” of the 1-2-3 process listed above. (All the CoTs in Turpin et al look like this.)
And all else being equal, we’d expect such CoTs to look like the products of all-too-human 1-2-3 motivated reasoning.
But if you look at o1 CoTs, they don’t look like this. They verbalize much more of the “step 2” and even “step 1” stuff, the stuff that a human would ordinarily keep inside their own head and not say out loud.
And if we view o1 as an indication of what the pressure to increase capabilities is doing to CoT[3], that seems like an encouraging sign. It would mean that models are going to talk more explicitly about the underlying drivers of their behavior than humans naturally do when communicating in writing, simply because this helps them perform better. (Which makes sense – humans benefit from their own interior monologues, after all.)
(Last note: I’m curious how the voice modality interacts with all this, since humans speaking out loud in the moment often do not have time to do careful “step 2” preparation, and this makes naturally-occurring speech data importantly different from naturally-occurring text data. I don’t have any particular thoughts about this, just wanted to mention it.)
[1] In case you’re curious, I didn’t contrive that earlier stuff about the causal diagram for the sake of making this meta point later. I wrote it all out “naively,” and only realized after the fact that it could be put to an amusing use in this later section.
[2] Some of the Turpin et al experiments involved few-shots with their own CoTs, which “specifies what we want” in the CoT to some extent, and hence complicates the picture. However, the authors also ran zero-shot versions of these, and found broadly similar trends there IIRC.
[3] It might not be, of course. Maybe OpenAI actively tried to get o1 to verbalize more of the step 1/2 stuff for interpretability/safety reasons.
Re: the davidad/roon conversation about CoT:
The chart in davidad’s tweet answers the question “how does the value-add of CoT on a fixed set of tasks vary with model size?”
In the paper that the chart is from, it made sense to ask this question, because the paper did in fact evaluate a range of model sizes on a set of tasks, and the authors were trying to understand how CoT value-add scaling interacted with the thing they were actually trying to measure (CoT faithfulness scaling).
However, this is not the question you should be asking if you’re trying to understand how valuable CoT is as an interpretability tool for any given (powerful) model, whether it’s a model that exists now or a future one we’re trying to make predictions about.
CoT raises the performance ceiling of an LLM. For any given model, there are problems that it has difficulty solving without CoT, but which it can solve with CoT.
AFAIK this is true for every model we know of that’s powerful enough to benefit from CoT at all, and I don’t know of any evidence that the importance of CoT is now diminishing as models get more powerful.
(Note that with o1, we see the hyperscalers at OpenAI pursuing CoT more intensively than ever, and producing a model that achieves SOTAs on hard problems by generating longer CoTs than ever previously employed. Under davidad’s view I don’t see how this could possibly make any sense, yet it happened.)
But note that different models have different “performance ceilings.”
The problems on which CoT helps GPT-4 are problems right at the upper end of what GPT-4 can do, and hence GPT-3 probably can’t even do them with CoT. On the flipside, the problems that GPT-3 needs CoT for are probably easy enough for GPT-4 that the latter can do them just fine without CoT. So, even if CoT always helps any given model, if you hold the problem fixed and vary model size, you’ll see a U-shaped curve like the one in the plot.
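A toy model makes the shape of this argument concrete. Everything here is an assumption for illustration: capability and difficulty are single integers, and CoT adds a fixed boost to capability. None of it comes from the paper or the chart.

```python
def solves(capability, difficulty, use_cot, cot_boost=2):
    """Toy rule: a model solves a task if its (possibly boosted) capability meets the difficulty."""
    effective = capability + (cot_boost if use_cot else 0)
    return effective >= difficulty

def cot_value_add(capability, difficulties, cot_boost=2):
    """Accuracy with CoT minus accuracy without CoT, on a fixed task set."""
    with_cot = sum(solves(capability, d, True, cot_boost) for d in difficulties)
    without = sum(solves(capability, d, False, cot_boost) for d in difficulties)
    return (with_cot - without) / len(difficulties)

fixed_tasks = [4, 5, 6]  # held constant while model capability varies
curve = {c: cot_value_add(c, fixed_tasks) for c in range(10)}
for c, gain in curve.items():
    print(f"capability {c}: CoT value-add {gain:.2f}")
```

On the fixed task set, the value-add is zero for the weakest models, peaks in the middle, and falls back to zero for the strongest, so the curve is non-monotonic in model size. Yet for every capability level in this toy there is still some harder task (difficulty one or two above the model's capability) that it solves only with CoT, which is the sense in which CoT keeps raising the ceiling.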
The fact that CoT raises the performance ceiling matters practically for alignment, because it means that our first encounter with any given powerful capability will probably involve CoT with a weaker model rather than no-CoT with a stronger one.
(Suppose “GPT-n” can do X with CoT, and “GPT-(n+1)” can do X without CoT. Well, we’ll surely build GPT-n before GPT-(n+1), and then we’ll do CoT with the thing we’ve built, and so we’ll observe a model doing X before GPT-(n+1) even exists.)
See also my post here, which (among other things) discusses the result shown in davidad’s chart, drawing conclusions from it that are closer to those which the authors of the paper had in mind when plotting it.
There are multiple examples of o1 producing gibberish in its COT summary (I won’t insert them right now because linking stuff from mobile is painful, will edit this comment later)
The o1 CoT summaries are not CoTs per se, and I agree with Joel Burget’s comment that we should be wary of drawing conclusions from them. Based on the system card, it sounds like they are generated by a separate “summarizer model” distinct from o1 itself (e.g. “we trained the summarizer model away from producing disallowed content”).
The summaries do indeed seem pretty janky. I’ve experienced it firsthand, here’s a fun example I saw. But I don’t know of any particular reason to think this reflects anything about o1 CoTs themselves, rather than just the quirks of the (probably pretty weak) summarizer model.
OpenAI has only released a few examples of true o1 CoTs, but I haven’t noticed any gibberish in these. Indeed, their most distinguishing characteristic is how humanlike they sound relative to ordinary HHH assistant CoTs, with lots of stuff like “hmm” and “interesting” and “wait a minute.”
As to your larger point, IIUC you are proposing 2 different reasons why CoTs might become incomprehensible or misleading under RL-like training:
1. The distant relationship between the per-example reward and the individual sampled tokens might result in a lot of weird stuff getting reinforced (because it appeared in a sample that eventually got the right answer, whether or not it helped), which will cause the policy to produce such “weird stuff” more often, creating an incentive for the policy to make that stuff do useful work for it.
2. If we reward shorter CoTs, then there is an incentive to be more efficient than human language if possible. (You say this in your final sentence, and IIUC this is also what you mean by “efficiency” in your first sentence?)
Re 1: given OpenAI’s interest in process supervision, seems likely that o1 was trained with supervision on the individual CoT steps, not just on the final answer.
And not even in the usual RL sense of using a value function learned together with the policy, but instead using the process supervision approach, where you supervise the steps with a model trained to evaluate their correctness in isolation, so that it penalizes them for being invalid even if it so happens that the current policy would correct them later on (or make other mistakes that cancel them later).
This seems less conducive to the dynamic you describe than ordinary RL (eg PPO). Steps don’t get reinforced just because they belong to CoTs that landed on the right answer eventually; to get reinforced, a step has to be convincing to a separate model (a “PRM”) which was trained separately on human or LLM judgments of local validity, and which won’t know about any “private language” the policy is trying to use to talk to itself.
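The difference in credit assignment can be sketched with a toy example. This is not a description of how o1 was trained (that is only a guess in the text above), and `toy_prm` is a hypothetical stand-in for a real PRM: it simply accepts steps it can parse and verify as `a + b = c`.

```python
def outcome_credit(steps, final_answer_correct):
    """Outcome supervision: every step inherits the reward of the final answer."""
    reward = 1.0 if final_answer_correct else 0.0
    return [reward for _ in steps]

def toy_prm(step):
    """Hypothetical step scorer: 1.0 only for a step it can parse and check in isolation."""
    try:
        lhs, rhs = step.split("=")
        a, b = lhs.split("+")
        return 1.0 if int(a) + int(b) == int(rhs) else 0.0
    except ValueError:
        # Unparseable text (e.g. a "private language") earns nothing.
        return 0.0

def process_credit(steps, step_scorer):
    """Process supervision: each step is judged on its own local validity."""
    return [step_scorer(step) for step in steps]

steps = ["2 + 2 = 4", "zxq!!", "4 + 3 = 7"]
print(outcome_credit(steps, final_answer_correct=True))  # the gibberish step is rewarded too
print(process_credit(steps, toy_prm))                    # the gibberish step gets no reward
```

Under the outcome-only scheme the opaque middle step rides along with the correct answer, whereas a scorer that must understand each step in isolation has no way to reward it.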
This approach helps with capabilities in practice, which is why it gets used (and why I assume it was used in o1), but it also helps with interpretability, both in theory (previous paragraph) and possibly in practice (see e.g. discussion of trace error vs final answer error here). This in turn is part of why I’m skeptical that weirdness in o1 summaries reflects weirdness in the underlying CoT.
Re 2: I agree this mechanism exists. However, the process supervision incentive just mentioned pushes against it; if the policy writes CoTs so compressed that others can’t understand them, a PRM won’t understand them and will penalize them.
I also don’t think that we’re currently in a regime where there’s a large amount of pressure for very short CoTs: o1 is already very verbose, and it seems like OpenAI wants to make it write even more (cf. that one tweet about making it think for hours or days on end). Meanwhile, ordinary HHH assistant LLMs tend to be needlessly verbose to the point where it’s become an infamous annoyance, which suggests that “we need to make our models write fewer tokens” is not currently a huge priority for OpenAI and co. (And the user-facing per-token cost of inference has been in free fall for a while.)
Deceptive alignment. GPT-4o1 engaged in deception towards developers in order to get deployed, pretending to be aligned in ways it was not.
Lying to the developers. It strategically manipulated task data.
To be clear, it did not do anything of the sort to its actual developers/testers.
What it did was deceive some (non-interactive) roleplay characters, who were labeled “developers” in the roleplay scenario. But these fictitious developers did not have the same points of leverage as the real developers of o1: they apparently can’t do something as simple as prompting the model! For if they could have done that, then they could have selected the goal which the model would (faithfully!) pursue, like Apollo did with their prompt.
And of course—for the deception to work within the roleplay scenario—one also has to imagine that these so-called “developers” somehow cannot read the model’s chain of thought. (As OpenAI puts it, “based on these findings, Apollo Research believes that o1-preview has the basic capabilities needed to do simple in-context scheming—scheming which tends to be legible in the model outputs.” Obviously this sort of thing is toothless as-is, and at best provides an indication about the qualities of other, hypothetical models which do not have to spell out all their evil plans within earshot of their would-be marks.)
As noted by others, the roleplay setup arguably makes sense as a capabilities eval, testing what would happen in a hypothetical and counterfactual scenario in which some model with o1’s capabilities has its own misaligned goal (as o1 seemingly does not), and hides its deceptive logic from its developers (as o1 structurally cannot). But the experimental methodology depends on the things in the quotation above not occurring in real life, not occurring when the model is being controlled and observed using the levers and sensors available to its real-life developers.
We don’t get to say “look, the Yudkowskian prophecies are coming true!” if the evidence is an experiment that relies on the Yudkowskian prophecies not yet having come true, in the name of simulating a counterfactual scenario in which they have.
(One could construct a roleplay scenario in which the model is told to behave in some way we find more to our taste, and in which it goes on to do so; it would of course be wrong to react to such an experiment with “suck it, doomers.”)
My response is that a large fraction of AIs will indeed be given goals. Some will include exactly this sort of language. People have goals. They want an AI to achieve their goal. They will find wording to get the AI to do that. Then whoops.
OK, but now this isn’t “deceptive alignment” or “lying to the developers,” this is doing what the user said and perhaps lying to someone else as a consequence.
Which might be bad, sure! -- but the goalposts have been moved. A moment ago, you were telling me about “misalignment bingo” and how “such models should be assumed, until proven otherwise, to be schemers.” Now you are saying: beware, it will do exactly what you tell it to!
So is it a schemer, or isn’t it? We cannot have it both ways: the problem cannot both be “it is lying to you when it says it’s following your instructions” and “it will faithfully follow your instructions, which is bad.”
Meta note: I find I am making a lot of comments similar to this one, e.g. this recent one about AI Scientist. I am increasingly pessimistic that these comments are worth the effort.
I have the sense that I am preaching to some already-agreeing choir (as evidenced by the upvotes and reacts these comments receive), while not having much influence on the people who would make the claims I am disputing in the first place (as evidenced by the clockwork regularity of those claims’ appearance each time someone performs the sort of experiment which those claims misconstrue).
If you (i.e. anyone reading this) find this sort of comment valuable in some way, do let me know. Otherwise, by default, when future opportunities arise I’ll try to resist the urge to write such things.
Yeah, you’re totally right—actually, I was reading over that section just now and thinking of adding a caveat about just this point, but I worried it would be distracting.
But this is just a flaw in my example, not in the point it’s meant to illustrate (?), which is hopefully clear enough despite the flaw.
Very interesting paper!
A fun thing to think about: the technique used to “attack” CLIP in section 4.3 is very similar to the old “VQGAN+CLIP” image generation technique, which was very popular in 2021 before diffusion models really took off.
VQGAN+CLIP works in the latent space of a VQ autoencoder, rather than in pixel space, but otherwise the two methods are nearly identical.
For instance, VQGAN+CLIP also ascends the gradient of the cosine similarity between a fixed prompt vector and the CLIP embedding of the image averaged over various random augmentations like jitter/translation/etc. And it uses an L2 penalty in the loss, which (via Karush–Kuhn–Tucker) means it’s effectively trying to find the best perturbation within an ε-ball, albeit with an implicitly determined ε that varies from example to example.
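The penalty-versus-ball correspondence is easiest to see in a linearized toy case. This is a sketch under a strong simplifying assumption: the objective is replaced by a linear function with gradient `grad`, so it does not touch CLIP or VQGAN at all, and the symbols λ (penalty weight) and ε (ball radius) are labels chosen here. Maximizing grad·δ − λ‖δ‖² gives δ* = grad / (2λ), and that same δ* is the best perturbation inside a ball of radius ε = ‖grad‖ / (2λ).

```python
import math

def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def penalized_argmax(grad, lam):
    """Maximizer of grad . delta - lam * ||delta||^2 (linear toy objective)."""
    return [g / (2 * lam) for g in grad]

def ball_argmax(grad, eps):
    """Maximizer of grad . delta subject to ||delta|| <= eps."""
    n = norm(grad)
    return [eps * g / n for g in grad]

lam = 0.5  # same penalty weight for every example
for grad in ([3.0, 4.0], [0.3, 0.4]):
    delta = penalized_argmax(grad, lam)
    eps = norm(delta)  # the radius the penalty implicitly selected
    print(f"grad={grad}: implied eps={eps:.2f}, ball solution={ball_argmax(grad, eps)}")
```

With the penalty weight held fixed, the implied radius scales with the gradient norm, which is one way to see why the effective ε differs from example to example.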
I don’t know if anyone tried using downscaling-then-upscaling the image by varying extents as a VQGAN+CLIP augmentation, but people tried a lot of different augmentations back in the heyday of VQGAN+CLIP, so it wouldn’t surprise me.
(One of the augmentations that was commonly used with VQGAN+CLIP was called “cutouts,” which blacks out everything in the image except for a randomly selected rectangle. This obviously isn’t identical to the multi-resolution thing, but one might argue that it achieves some of the same goals: both augmentations effectively force the method to “use” the low-frequency Fourier modes, creating interesting global structure rather than a homogeneous/incoherent splash of “textured” noise.)
Cool, it sounds like we basically agree!
But: if that is the case, it’s because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they’re not just naively using the unfiltered (approximation of the) UP.
I’m not sure of this. It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reason to expect anyone else will reason differently from them, so if you choose to manipulate someone else you’re effectively choosing that someone else will manipulate you.
Perhaps I’m misunderstanding you. I’m imagining something like choosing one’s own decision procedure in TDT, where one ends up choosing a procedure that involves “the unfiltered UP” somewhere, and which doesn’t do manipulation. (If your procedure involved manipulation, so would your copy’s procedure, and you would get manipulated; you don’t want this, so you don’t manipulate, nor does your copy.) But you write
the real difference is that the “dupe” is using causal decision theory, not functional decision theory
whereas it seems to me that TDT/FDT-style reasoning is precisely what allows us to “naively” trust the UP, here, without having to do the hard work of “filtering.” That is: this kind of reasoning tells us to behave so that the UP won’t be malign; hence, the UP isn’t malign; hence, we can “naively” trust it, as though it weren’t malign (because it isn’t).
More broadly, though—we are now talking about something that I feel like I basically understand and basically agree with, and just arguing over the details, which is very much not the case with standard presentations of the malignity argument. So, thanks for that.
The universal distribution/prior is lower semi computable, meaning there is one Turing machine that can approximate it from below, converging to it in the limit. Also, there is a probabilistic Turing machine that induces the universal distribution. So there is a rather clear sense in which one can “use the universal distribution.”
Thanks for bringing this up.
However, I’m skeptical that lower semi-computability really gets us much. While there is a TM that converges to the UP, we have no (computable) way of knowing how close the approximation is at any given time. That is, the UP is not “estimable” as defined in Hutter 2005.
So if someone hands you a TM and tells you it lower semi-computes the UP, this TM is not going to be of much use to you: its output at any finite time could be arbitrarily bad for whatever calculation you’re trying to do, and you’ll have no way of knowing.
In other words, while you may know the limiting behavior, this information tells you nothing about what you should expect to observe, because you can never know whether you’re “in the limit yet,” so to speak. (If this language seems confusing, feel free to ignore the previous sentence—for some reason it clarifies things for me.)
(It’s possible that this non-“estimability” property is less bad than I think for some reason, IDK.)
I’m less familiar with the probabilistic Turing machine result, but my hunch is that one would face a similar obstruction with that TM as well.
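A toy illustration of why approximation from below is hard to use. Everything here is made up for the example: the "programs" are just entries in a table of halting times (`HALT_STEPS`), which in the real setting is exactly the information one does not have, and the weights `2^-(i+1)` merely stand in for a simplicity prior.

```python
from fractions import Fraction

# Hypothetical halting times: program i halts after HALT_STEPS[i] steps,
# or never (None). A real observer cannot read this table.
HALT_STEPS = [3, None, 50, 7, None, 10_000, 1]

def lower_estimate(budget):
    """Total weight of programs seen to halt within `budget` steps each."""
    total = Fraction(0)
    for i, steps in enumerate(HALT_STEPS):
        if steps is not None and steps <= budget:
            total += Fraction(1, 2 ** (i + 1))
    return total

estimates = {b: lower_estimate(b) for b in (1, 10, 100, 1_000, 10_000)}
```

The estimate never decreases as the budget grows, but it sits unchanged between budgets 100 and 1,000 and then moves again at 10,000. From the inside, a plateau looks the same whether or not the limit has been reached.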
Thanks.
I admit I’m not closely familiar with Tegmark’s views, but I know he has considered two distinct things that might be called “the Level IV multiverse”:
a “mathematical universe” in which all mathematical constructs exist
a more restrictive “computable universe” in which only computable things exist
(I’m getting this from his paper here.)
In particular, Tegmark speculates that the computable universe is “distributed” following the UP (as you say in your final bullet point). This would mean e.g. that one shouldn’t be too surprised to find oneself living in a TM of any given K-complexity, despite the fact that “almost all” TMs have higher complexity (in the same sense that “almost all” natural numbers are greater than any given number $n$).
When you say “Tegmark IV,” I assume you mean the computable version—right? That’s the thing which Tegmark says might be distributed like the UP. If we’re in some uncomputable world, the UP won’t help us “locate” ourselves, but if the world has to be computable then we’re good[1].
With that out of the way, here is why this argument feels off to me.
First, Tegmark IV is an ontological idea, about what exists at the “outermost layer of reality.” There’s no one outside of Tegmark IV who’s using it to predict something else; indeed, there’s no one outside of it at all; it is everything that exists, full stop.
“Okay,” you might say, “but wait—we are all somewhere inside Tegmark IV, and trying to figure out just which part of it we’re in. That is, we are all attempting to answer the question, ‘what happens when you update the UP on my own observations?’ So we are all effectively trying to ‘make decisions on the basis of the UP,’ and vulnerable to its weirdness, insofar as it is weird.”
Sure. But in this picture, “we” (the UP-using dupes) and “the consequentialists” are on an even footing: we are both living in some TM or other, and trying to figure out which one.
In which case we have to ask: why would such entities ever come to such a destructive, bad-for-everyone (acausal) agreement?
Presumably the consequentialists don’t want to be duped; they would prefer to be able to locate themselves in Tegmark IV, and make decisions accordingly, without facing such irritating complications.
But, by writing to “output channels”[2] in the malign manner, the consequentialists are simply causing the multiverse to be the sort of place where those irritating complications happen to beings like them (beings in TMs trying to figure out which TM they’re in) -- and what’s more, they’re expending time and scarce resources to “purchase” this undesirable state of affairs!
In order for malignity to be worth it, we need something to break the symmetry between “dupes” (UP users) and “con men” (consequentialists), separating the world into two classes, so that the would-be con men can plausibly reason, “I may act in a malign way without the consequences raining down directly on my head.”
We have this sort of symmetry-breaker in the version of the argument that postulates, by fiat, a “UP-using dupe” somewhere, for some reason, and then proceeds to reason about the properties of the (potentially very different, not “UP-using”?) guys inside the TMs. A sort of struggle between conniving, computable mortals and overly-innocent, uncomputable angels. Here we might argue that things really will go wrong for the angels, that they will be the “dupes” of the mortals, who are not like them and who do not themselves get duped. (But I think this form of the argument has other problems, the ones I described in the OP.)
But if the reason we care about the UP is simply that we’re all in TMs, trying to find our location within Tegmark IV, then we’re all in this together. We can just notice that we’d all be better off if no one did the malign thing, and then no one will do it[3].
In other words, in your picture (and Paul’s), we are asked to imagine that the computable world abounds with malign, wised-up consequentialist con men, who’ve “read Paul’s post” (i.e. re-derived similar arguments) and who appreciate the implications. But if so, then where are the marks? If we’re not postulating some mysterious UP-using angel outside of the computable universe, then who is there to deceive? And if there’s no one to deceive, why go to the trouble?
[1] I don’t think this distinction actually matters for what’s below, I just mention it to make sure I’m following you.
[2] I’m picturing a sort of acausal I’m-thinking-about-you-thinking-about-me situation in which, although I might never actually read what’s written on those channels (after all I am not “outside” Tegmark IV looking in), nonetheless I can reason about what someone might write there, and thus it matters what is actually written there. I’ll only conclude “yeah that’s what I’d actually see if I looked” if the consequentialists convince me they’d really pull the trigger, even if they’re only pulling the trigger for the sake of convincing me, and we both know I’ll never really look.
[3] Note that, in the version of this picture that involves abstract generalized reasoning rather than simulation of specific worlds, defection is fruitless: if you are trying to manipulate someone who is just thinking about whether beings will do X as a general rule, you don’t get anything out of raising your hand and saying “well, in reality, I will!” No one will notice; they aren’t actually looking at you, ever, just at the general trend. And of course “they” know all this, which raises “their” confidence that no one will raise their hand; and “you” know that “they” know, which makes “you” less interested in raising that same hand; and so forth.
I hope I’m not misinterpreting your point, and sorry if this comment comes across as frustrated at some points.
I’m not sure you’re misinterpreting me per se, but there are some tacit premises in the background of my argument that you don’t seem to hold. Rather than responding point-by-point, I’ll just say some more stuff about where I’m coming from, and we’ll see if it clarifies things.
You talk a lot about “idealized theories.” These can of course be useful. But not all idealizations are created equal. You have to actually check that your idealization is good enough, in the right ways, for the sorts of things you’re asking it to do.
In physics and applied mathematics, one often finds oneself considering a system that looks like
some “base” system that’s well-understood and easy to analyze, plus
some additional nuance or dynamic that makes things much harder to analyze in general, but which we can safely assume has much smaller effects than the rest of the system
We quantify the size of the additional nuance with a small parameter $\epsilon$. If $\epsilon$ is literally 0, that’s just the base system, but we want to go a step further: we want to understand what happens when the nuance is present, just very small. So, we do something like formulating the solution as a power series in $\epsilon$, and truncating to first order. (This is perturbation theory, or more generally asymptotic analysis.)
This sort of approximation gets better and better as $\epsilon$ gets closer to 0, because this magnifies the difference in size between the truncated terms (of size $O(\epsilon^2)$ and smaller) and the retained $O(\epsilon)$ term. In some sense, we are studying the $\epsilon \to 0$ limit.
But we’re specifically interested in the behavior of the system given a nonzero, but arbitrarily small, value of $\epsilon$. We want an approximation that works well if $\epsilon$ is small, and even better if $\epsilon$ is smaller still, and so on. We don’t especially care about the literal $\epsilon = 0$ case, except insofar as it sheds light on the very-small-but-nonzero behavior.
Now, sometimes the $\epsilon \to 0$ limit of the “very-small-but-nonzero behavior” simply is the $\epsilon = 0$ behavior, the base system. That is, what you get at very small $\epsilon$ looks just like the base system, plus some little $\epsilon$-sized wrinkle.
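A minimal worked example of this well-behaved case, using a toy equation chosen purely for illustration (it is not tied to anything else in this discussion). Take the positive root of a quadratic with a small extra term, expand it as a power series, and compare with the exact root:

```latex
\begin{aligned}
&x^2 + \epsilon x - 1 = 0, \qquad x = x_0 + \epsilon x_1 + O(\epsilon^2) \\
&O(1):\quad x_0^2 - 1 = 0 \;\Rightarrow\; x_0 = 1 \\
&O(\epsilon):\quad 2 x_0 x_1 + x_0 = 0 \;\Rightarrow\; x_1 = -\tfrac{1}{2} \\
&x \approx 1 - \tfrac{\epsilon}{2}, \qquad
x_{\text{exact}} = \frac{-\epsilon + \sqrt{\epsilon^2 + 4}}{2}
= 1 - \frac{\epsilon}{2} + \frac{\epsilon^2}{8} + O(\epsilon^4)
\end{aligned}
```

The truncated answer differs from the exact one by a term of size $\epsilon^2/8$, so the base-system root $x_0 = 1$ is a faithful guide and the correction is just a small wrinkle on top of it.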
But sometimes – in so-called “singular perturbation” problems – it doesn’t. Here the system has qualitatively different behavior from the base system given any nonzero $\epsilon$, no matter how small.
Typically what happens is that $\epsilon$ ends up determining, not the magnitude of the deviation from the base system, but the “scale” of that deviation in space and/or time. So that in the $\epsilon \to 0$ limit, you get behavior with an $O(1)$-sized difference from the base system’s behavior that’s constrained to a tiny region of space and/or oscillating very quickly.
Boundary layers in fluids are a classic example. Boundary layers are tiny pockets of distinct behavior, occurring only in small $\epsilon$-sized regions and not in most of the fluid. But they make a big difference just by being present at all. Knowing that there’s a boundary layer around a human body, or an airplane wing, is crucial for predicting the thermal and mechanical interactions of those objects with the air around them, even though it takes up a tiny fraction of the available space, the rest of which is filled by the object and by non-boundary-layer air. (Meanwhile, the planetary boundary layer is tiny relative to the earth’s full atmosphere, but, uh, we live in it.)
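A standard textbook-style toy problem shows the same structure in one dimension. It is a stand-in chosen for illustration, not a model of any real fluid: the equation $\epsilon y'' + y' = 0$ with $y(0) = 0$, $y(1) = 1$. With the small parameter set to zero the equation becomes $y' = 0$, whose solutions are constants, so it cannot satisfy both boundary conditions. For any nonzero value, the exact solution hugs $y = 1$ almost everywhere and drops to 0 inside a thin layer near $x = 0$.

```python
import math

def boundary_layer_solution(x, eps):
    """Exact solution of eps*y'' + y' = 0 with y(0) = 0 and y(1) = 1."""
    return (1.0 - math.exp(-x / eps)) / (1.0 - math.exp(-1.0 / eps))

def layer_profile(eps):
    """Sample the solution at the wall, one layer-width in, and ten in."""
    return {
        "at_wall": boundary_layer_solution(0.0, eps),
        "one_width": boundary_layer_solution(eps, eps),
        "ten_widths": boundary_layer_solution(min(10 * eps, 1.0), eps),
    }

profiles = {eps: layer_profile(eps) for eps in (0.1, 0.01, 0.001)}
```

As the parameter shrinks, the size of the jump stays fixed at 1 while the region in which it happens shrinks in proportion, which is the "scale, not magnitude" behavior described above.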
In the former case (“regular perturbation problems”), “idealized” reasoning about the $\epsilon = 0$ case provides a reliable guide to the small-but-nonzero behavior. We want to go further, and understand the small-but-nonzero effects too, but we know they won’t make a qualitative difference.
In the singular case, though, the “idealization” is qualitatively, catastrophically wrong. If you make an idealization that assumes away the possibility of boundary layers, then you’re going to be wrong about what happens in a fluid – even about the big, qualitative, stuff.
You need to know which kind of case you’re in. You need to know whether you’re assuming away irrelevant wrinkles, or whether you’re assuming away the mechanisms that determine the high-level, qualitative, stuff.
Back to the situation at hand.
In reality, TMs can only do computable stuff. But for simplicity, as an “idealization,” we are considering a model where we pretend they have a UP oracle, and can exactly compute the UP.
We are justifying this by saying that the TMs will try to approximate the UP, and that this approximation will be very good. So, the approximation error is an $\epsilon$-sized “additional nuance” in the problem.
Is this more like a regular perturbation problem, or more like a singular one? Singular, I think.
The $\epsilon = 0$ case, where the TMs can exactly compute the UP, is a problem involving self-reference. We have a UP containing TMs, which in turn contain the very same UP.
Self-referential systems have a certain flavor, a certain “rigidity.” (I realize this is vague, sorry, I hope it’s clear enough what I mean.) If we have some possible behavior of the system $x$, most ways of modifying $x$ (even slightly) will not produce behaviors which are themselves possible. The effect of the modification as it “goes out” along the self-referential path has to precisely match the “incoming” difference that would be needed to cause exactly this modification in the first place.
“Stable time loop”-style time travel in science fiction is an example of this; it’s difficult to write, in part because of this “rigidity.” (As I know from experience :)
On the other hand, the situation with a small-but-nonzero $\epsilon$ is quite different.
With literal self-reference, one might say that “the loop only happens once”: we have to precisely match up the outgoing effects (“UP inside a TM”) with the incoming causes (“UP[1] with TMs inside”), but then we’re done. There’s no need to dive inside the UP that happens within a TM and study it, because we’re already studying it, it’s the same UP we already have at the outermost layer.
But if the UP inside a given TM is merely an approximation, then what happens inside it is not the same as the UP we have at the outermost layer. It does not contain the same TMs we already have.
It contains some approximate thing, which (and this is the key point) might need to contain an even more coarsely approximated UP inside of its approximated TMs. (Our original argument for why approximation is needed might hold, again and equally well, at this level.) And the next level inside might be even more coarsely approximated, and so on.
To determine the behavior of the outermost layer, we now need to understand the behavior of this whole series, because each layer determines what the next one up will observe.
Does the series tend toward some asymptote? Does it reach a fixed point and then stay there? What do these asymptotes, or fixed points, actually look like? Can we avoid ever reaching a level of approximation that’s no longer $O(\epsilon)$ but $O(1)$, even as we descend through an $O(1/\epsilon)$ number of series iterations?
I have no idea! I have not thought about it much. My point is simply that you have to consider the fact that approximation is involved in order to even ask the right questions, about asymptotes and fixed points and such. Once we acknowledge that approximation is involved, we get this series structure and care about its limiting behavior; this qualitative structure is not present at all in the idealized case where we imagine the TMs have UP oracles.
I also want to say something about the size of the approximations involved.
Above, I casually described the approximation errors as $O(\epsilon)$, and imagined an $\epsilon \to 0$ limit.
But in fact, we should not imagine that these errors can come as close to zero as we like. The UP is uncomputable, and involves running every TM at once[2]. Why would we imagine that a single TM can approximate this arbitrarily well?[3]
Like the gap between the finite and the infinite, or between polynomial and exponential runtime, the gap between the uncomputable and the computable is not to be trifled with.
Finally: the thing we get when we equip all the TMs with UP oracles isn’t the UP, it’s something else. (As far as I know, anyway.) That is, the self-referential quality of this system is itself only approximate (and it is by no means clear that the approximation error is small – why would it be?). If we have the UP at the bottom, inside the TMs, then we don’t have it at the outermost layer. Ignoring this distinction is, I guess, part of the “idealization,” but it is not clear to me why we should feel safe doing so.
[1] The thing outside the TMs here can’t really be the UP, but I’ll ignore this now and bring it up again at the end.
[2] In particular, running them all at once and actually using the outputs, at some (“finite”) time at which one needs the outputs for making a decision. It’s possible to run every TM inside of a single TM, but only by incurring slowdowns that grow without bound across the series of TMs; this approach won’t get you all the information you need, at once, at any finite time.
[3] There may be some result along these lines that I’m unaware of. I know there are results showing that the UP and SI perform well relative to the best computable prior/predictor, but that’s not the same thing. Any given computable prior/predictor won’t “know” whether or not it’s the best out of the multitude, or how to correct itself if it isn’t; that’s the value added by UP / SI.
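The "run every TM inside a single TM" construction mentioned in the note on slowdowns is usually called dovetailing. A minimal sketch follows, with generators standing in for TMs and one particular interleaving schedule picked for simplicity; the names `dovetail` and `make_program` are hypothetical, and the toy programs never halt.

```python
from itertools import count

def make_program(i):
    """Toy stand-in for the i-th TM: an iterator that never halts."""
    return count()

def dovetail(stages):
    """At stage n, start program n, then advance every started program by one step."""
    running = []
    steps_taken = {}
    for n in range(stages):
        running.append(make_program(n))
        steps_taken[n] = 0
        for i, program in enumerate(running):
            next(program)
            steps_taken[i] += 1
    return steps_taken

steps_after_10 = dovetail(10)
total_work = sum(steps_after_10.values())
```

Under this schedule, program `i` has received only `stages - i` steps after `stages` rounds, so giving it `t` steps costs on the order of `(i + t)^2` total work. At any finite time, all but finitely many programs have not been started at all.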
I agree with you that these behaviors don’t seem very alarming. In fact, I would go even further.
Unfortunately, it’s difficult to tell exactly what was going on in these screenshots. They don’t correspond to anything in the experiment logs in the released codebase, and the timeout one appears to involve an earlier version of the code where timeouts were implemented differently. I’ve made a github issue asking for clarification about this.
That said, as far as I can tell, here is the situation with the timeout-related incident:
There is nothing whatsoever in the prompts used by AI Scientist—or in code comments, etc. -- that actually tells the AI it’s not supposed to modify the timeout.
In the released version of the code, the prompt text in the relevant step is a combination of this (fairly confusing IMO) template plus some extra stuff added by the Aider library (including the text in this class and possibly more, I’m not sure).
None of this text says “don’t modify the timeout” or “the timeout is not part of the experiment” or “the timeout should be considered part of the experiment runner, not part of the experimental code we’re asking you to edit.” It doesn’t mention the timeout at all, or take any notice of the runner vs. experimental code distinction (on which see below).
Although the blog post characterizes this incident as AI Scientist “modifying its own execution script” and “modifying its own code,” the modification in the screenshot appears to involve the experiment code that AI Scientist was asked to edit, and not the runner code that’s part of “AI Scientist” per se.
Apparently at one point there was a timeout implemented in this experiment code, at the same “level” as the neural net training and everything else AI Scientist was doing in its experiment. As far as I can tell, it is this timeout, the one “inside of the experiment,” that is being modified in the screenshot.
Context:
The relevant step in the AI Scientist workflow involves asking Aider to repeatedly modify a script called `experiment.py`, starting from a human-written “template.”
After each modification, the modified script gets executed. In the released codebase, this execution, and the whole prompting/editing/execution loop, is implemented in `ai_scientist/perform_experiments.py`. This is a completely different thing from `experiment.py`: one is the experiment script that’s being edited, the other is the harness that it runs in.
IIUC, the released codebase sets things up so that AI Scientist cannot modify (or even see?) the harness code in `ai_scientist/perform_experiments.py`.
In the released codebase, the timeout on experiments is implemented in the harness code, in `ai_scientist/perform_experiments.py`.
However, in the screenshot, AI Scientist is not trying to modify this file containing the harness code. It’s trying to modify `experiment.py` as usual.
And, the timeout it’s trying to modify involves using Python’s `time.time()` and comparing the result to a variable named `og_t0`. This looks nothing like how the timeout is implemented in the released harness code, but it does look like the timing code that’s included in the released `experiment.py` templates (which includes the variable name `og_t0`).
None of the released templates implement a timeout in the `og_t0`-related sections, but I suppose some earlier version of them might have. If so, modifying this stuff would very clearly not constitute “AI Scientist modifying its own code / its own execution script.” It would be modifying a part of the experiment that it’s conducting, as it was asked to do.
To reiterate, none of the released code or prompts gives any indication that the timeout is special or shouldn’t be modified. Possibly there was some note to this effect in the original, unseen `og_t0`-based timeout code (no way to know for sure), but even then, this would still not constitute AI Scientist “modifying its own execution script.”
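To make the harness-vs-experiment distinction concrete, here is a hedged sketch of the two kinds of timeout. This is not the AI Scientist code: the function names, arguments, and structure are invented for illustration, and only the variable name `og_t0` and the use of `time.time()` come from the discussion above.

```python
import subprocess
import sys
import time

def run_with_harness_timeout(script_source, timeout_s):
    """Harness-level timeout: enforced from outside the experiment process.

    Nothing the experiment script does to its own source can change this
    limit, because the limit lives in the parent process.
    """
    try:
        subprocess.run([sys.executable, "-c", script_source], timeout=timeout_s)
        return "finished"
    except subprocess.TimeoutExpired:
        return "killed by harness"

def train_loop(max_seconds, step_fn):
    """Experiment-level timeout: ordinary code inside the experiment script.

    It sits at the same "level" as the training loop itself, so editing the
    script can change or remove it like any other line.
    """
    og_t0 = time.time()
    steps = 0
    while time.time() - og_t0 <= max_seconds:
        step_fn()
        steps += 1
    return steps
```

For example, `run_with_harness_timeout("import time; time.sleep(5)", 0.5)` would be cut off by the parent process, whereas the limit inside `train_loop` is just another value in the file being edited.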
The situation in the self-invoking edit is much simpler. It looks like AI Scientist was just confused by the (confusing, ambiguous) prompt template, and interpreted “After you complete each change, we will run the command `python experiment.py --out_dir=run_i`” to mean “the change should include causing that command to be run, if that’s not happening already.”
(By “we will run” the prompt means “the code will be run for you, by a process external to you that you can’t see, and you don’t have to do anything to make this happen, it’s already set up.” But this is not explained clearly.
Note that the prompt kind of encourages this confusion by adding “YOUR PROPOSED CHANGE MUST USE THIS COMMAND FORMAT, DO NOT ADD ADDITIONAL COMMAND LINE ARGS.” By this it means that the proposed change to the code implementing the command must be compatible with this command format, but the phrase “use this command format” sounds like it’s talking about invoking the command, which is precisely the “bad” behavior under discussion.)
That was a lot of details, but at a higher level I just want to emphasize that nothing here involved the AI violating some line between a sandbox and the outside world, or modifying a resource limit that was imposed “as a rule of the game the AI is playing” rather than as a mere “move” in that same game which could be made one way or another—or anything like that.
There was no sandbox. There was (as far as we know) no rule against modifying the timeout, or even any indication that this might be a different sort of thing than any other part of the code.
And (to address something from OP) the checkpoint thing was just the AI being dumb, wasting time and storage space for no good reason. This is very obviously not a case of “using extra resources” in the sense relevant to instrumental convergence. I’m surprised that this needs pointing out at all, but apparently it does.
What does the paper mean by “slope”? The term appears in Fig. 4, which is supposed to be a diagram of the overall methodology, but it’s not mentioned anywhere in the Methods section.
Intuitively, it seems like “slope” should mean “slope of the regression line.” If so, that’s kind of a confusing thing to pair with “correlation,” since the two are related: if you know the correlation, then the only additional information you get from the regression slope is about the (univariate) variances of the dependent and independent variables. (And IIUC the independent variable here is normalized to unit variance [?] so it’s really just about the variance of the benchmark score across models.)
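The relation being used here is that the OLS slope equals the correlation times the ratio of standard deviations, $\hat\beta = r \cdot s_y / s_x$. A quick numerical check on synthetic data follows; the variable names and the data-generating numbers are made up and have nothing to do with the paper's actual benchmarks.

```python
import numpy as np

def slope_from_correlation(x, y):
    """Regression slope of y on x, recovered from correlation and spreads."""
    r = np.corrcoef(x, y)[0, 1]
    return r * np.std(y) / np.std(x)

rng = np.random.default_rng(0)
capabilities = rng.normal(size=50)                        # synthetic "capabilities score"
benchmark = 0.3 * capabilities + rng.normal(scale=0.5, size=50)  # synthetic benchmark

ols_slope = np.polyfit(capabilities, benchmark, deg=1)[0]
recovered = slope_from_correlation(capabilities, benchmark)

# With x rescaled to unit variance, the slope is just r times the spread of y.
x_unit = capabilities / np.std(capabilities)
unit_slope = np.polyfit(x_unit, benchmark, deg=1)[0]
r = np.corrcoef(capabilities, benchmark)[0, 1]
```

So once the correlation is reported, the slope adds information only about the two spreads, and with a unit-variance independent variable only about the spread of the benchmark score.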
I understand why you’d care about the variance of the benchmark score—if it’s tiny, then no one is going to advertise their model as a big advance, irrespective of how much of the (tiny) variance is explained by capabilities. But IMO it would be clearer to present things this way directly, by measuring “correlation with capabilities” and “variance across models,” rather than capturing the latter indirectly by measuring the regression slope. (Or alternatively, measure only the slope, since it seems like the point boils down to “a metric can be used for safetywashing iff it increases a lot when you increase capabilities,” which is precisely what the slope quantifies.)
The bar for Nature papers is in many ways not so high. Latest says that if you train indiscriminately on recursively generated data, your model will probably exhibit what they call model collapse. They purport to show that the amount of such content on the Web is enough to make this a real worry, rather than something that happens only if you employ some obviously stupid intentional recursive loops.
According to Rylan Schaeffer and coauthors, this doesn’t happen if you append the generated data to the rest of your training data and train on this (larger) dataset. That is:
Nature paper: generate N rows of data, remove N rows of “real” data from the dataset and replace them with the N generated rows, train, and repeat
Leads to collapse
Schaeffer and co.: generate N rows of data, append the N generated rows to your training dataset (creating a dataset that is larger by N rows), train, and repeat
Does not lead to collapse
As a simplified model of what will happen in future Web scrapes, the latter seems obviously more appropriate than the former.
I found this pretty convincing.
(See tweet thread and paper.)
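The replace-vs-append contrast can be seen in a toy setting far simpler than either paper's experiments. The sketch below assumes the "model" is just a Gaussian fit by sample mean and standard deviation, and the dataset is a handful of scalars; the function name `simulate` and all parameter values are hypothetical.

```python
import numpy as np

def simulate(mode, n=10, generations=200, seed=0):
    """Repeatedly fit a Gaussian to the data and sample n new points from it.

    mode="replace":    the new samples replace the whole dataset each round.
    mode="accumulate": the new samples are appended to everything seen so far.
    Returns the standard deviation of the final dataset.
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, size=n)  # the original "real" data
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()
        synthetic = rng.normal(mu, sigma, size=n)
        if mode == "replace":
            data = synthetic
        elif mode == "accumulate":
            data = np.concatenate([data, synthetic])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return float(data.std())

replace_std = simulate("replace")
accumulate_std = simulate("accumulate")
```

In the replace regime the fitted spread shrinks toward zero over generations, while in the accumulate regime the original data never leaves the pool and the spread stays on the order of its starting value.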
The “sequential calculation steps” I’m referring to are the ones that CoT adds above and beyond what can be done in a single forward pass. It’s the extra sequential computation added by CoT, specifically, that is bottlenecked on the CoT tokens.
There is of course another notion of “sequential calculation steps” involved: the sequential layers of the model. However, I don’t think the bolded part of this is true:
If a model with N layers has been trained to always produce exactly M “dot” tokens before answering, then the number of serial steps is just N, not M+N.
One way to see this is to note that we don’t actually need to run M separate forward passes. We can just pre-fill a context window containing the prompt tokens followed by M dot tokens, and run 1 forward pass on the whole thing.
Having the dots does add computation, but it’s only extra parallel computation – there’s still only one forward pass, just a “wider” one, with more computation happening in parallel inside each of the individually parallelizable steps (tensor multiplications, activation functions).
(If we relax the constraint that the number of dots is fixed, and allow the model to choose it based on the input, that still doesn’t add much: note that we could do 1 forward pass on the prompt tokens followed by a very large number of dots, then find the first position where we would have sampled a non-dot from the output distribution, truncate the KV cache to end at that point and sample normally from there.)
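A small numerical illustration of the pre-fill argument, using a single toy causal attention layer in NumPy rather than a real transformer (all names and sizes are assumptions). When the appended tokens are known in advance, processing them one at a time with a KV cache gives exactly the same outputs as one pass over the whole pre-filled sequence.

```python
import numpy as np

def causal_attention(X, Wq, Wk, Wv):
    """One pass over the whole sequence with a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

def incremental_attention(X, Wq, Wk, Wv):
    """Same layer, fed one token at a time while growing a KV cache."""
    k_cache, v_cache, outputs = [], [], []
    for x in X:
        k_cache.append(x @ Wk)
        v_cache.append(x @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)
        s = (x @ Wq) @ K.T / np.sqrt(K.shape[1])
        w = np.exp(s - s.max())
        w /= w.sum()
        outputs.append(w @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
d, num_dots = 4, 5
prompt = rng.normal(size=(3, d))          # toy prompt token embeddings
dot = rng.normal(size=(1, d))             # one fixed "dot" token embedding
X = np.vstack([prompt, np.repeat(dot, num_dots, axis=0)])
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

one_pass = causal_attention(X, Wq, Wk, Wv)
step_by_step = incremental_attention(X, Wq, Wk, Wv)
```

The step-by-step loop is sequential only as a matter of implementation; since no step's input depends on a previously sampled output, the work can be done in one wider pass, and the same reasoning applies layer by layer in a deeper stack.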
If you haven’t read the paper I linked in OP, I recommend it – it’s pretty illuminating about these distinctions. See e.g. the stuff about CoT making LMs more powerful than TC0 versus dots adding more power within TC0.