LessWrong dev & admin as of July 5th, 2022.
RobertM
Anthropic did not publish a “risk discussion” of Mythos when required by their RSP
We have to do something a little more annoying than that, since we don’t have unlimited (and un-rate-limited) Claude Code usage, but something like that is happening.
Do not be surprised if LessWrong gets hacked
For the sake of the epistemic commons, I want to ask: had you seen previous discourse on the name before noticing that connection yourself?
With a further claim about the origin of the “Sam had been lying to us all the time” bit: https://x.com/paulg/status/2041459514634060093
I’ve now done ~15 minutes of research on Ronan Farrow’s previous reporting to check how much bounded distrust to be exercising, and have updated to “slightly more”.
My modal belief is still both that Sam Altman’s parting from YC was causally downstream[1] of other people involved reacting to various forms of deceptive behavior, and that the “YC Chairman” shenanigans were in fact shenanigans.
[1] In non-trivial part, not necessarily in total or even mostly.
Baking tips
I think the most interesting part of the piece is the secondhand confirmation that Altman was pushed out of YC for (basically) dishonesty:
By 2018, several Y.C. partners were so frustrated with Altman’s behavior that they approached Graham to complain. Graham and Jessica Livingston, his wife and a Y.C. founder, apparently had a frank conversation with Altman. Afterward, Graham started telling people that although Altman had agreed to leave the company, he was resisting in practice. Altman told some Y.C. partners that he would resign as president but become chairman instead. In May, 2019, a blog post announcing that Y.C. had a new president came with an asterisk: “Sam is transitioning to Chairman of YC.” A few months later, the post was edited to read “Sam Altman stepped away from any formal position at YC”; after that, the phrase was removed entirely. Nevertheless, as recently as 2021, a Securities and Exchange Commission filing listed Altman as the chairman of Y Combinator. (Altman says that he wasn’t aware of this until much later.)
Altman has maintained over the years, both in public and in recent depositions, that he was never fired from Y.C., and he told us that he did not resist leaving. Graham has tweeted that “we didn’t want him to leave, just to choose” between Y.C. and OpenAI. In a statement, Graham told us, “We didn’t have the legal power to fire anyone. All we could do was apply moral pressure.” In private, though, he has been unambiguous that Altman was removed because of Y.C. partners’ mistrust. This account of Altman’s time at Y Combinator is based on discussions with several Y.C. founders and partners, in addition to contemporaneous materials, all of which indicate that the parting was not entirely mutual. On one occasion, Graham told Y.C. colleagues that, prior to his removal, “Sam had been lying to us all the time.”
Certainly you can do a lot with selective quotations, such that “Sam had been lying to us all the time” might not mean exactly what it sounds like. But given Paul Graham’s caginess in previous statements on the subject, as well as all of the other available information, I think the correct update to make here is to reduce your trust in Paul Graham as someone who reliably communicates in a way that leads you to believe true things. It seems like the only sense in which they (maybe) did not fire Sam is that they couldn’t legally fire him (somehow); they merely employed all other means at their disposal to get him to leave.
I agree there are two separate arguments there, but I have seen arguments for both.
I think the argument for value preservation post-unaligned-ASI is about as weak as (or maybe even weaker than?) the case for the bits necessary to reconstruct your identity getting preserved. “There are no software engineering agents or shopping assistants where we’re going!” I buy the case for this before that point, though—I try to nod at this in “The upside of your writing being used in pretraining”, but maybe it wasn’t super clear.
Don’t write for LLMs, just record everything
You should be able to continue iterating on a design by copying the HTML from the iframe in your browser when you have your design selected (even if it’s visually broken) and dropping it into a new chat, but I might be misunderstanding what you mean by “prevents the page from loading”.
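If it helps, something along these lines in the devtools console should grab that HTML for you (a rough sketch, not a feature we ship: it assumes the design lives in a same-origin iframe and that it’s the first iframe on the page, so adjust the selector as needed):

```ts
// Run in the browser devtools console on the page where your design is selected.
// Assumes the design is rendered in a same-origin iframe; a cross-origin iframe
// will throw a SecurityError when contentDocument is accessed.
const frame = document.querySelector('iframe');
if (frame?.contentDocument) {
  const html = frame.contentDocument.documentElement.outerHTML;
  // Copy to the clipboard so it can be pasted into a new chat;
  // fall back to logging it if clipboard access is blocked.
  navigator.clipboard.writeText(html)
    .then(() => console.log('Copied design HTML to clipboard'))
    .catch(() => console.log(html));
} else {
  console.warn('No accessible iframe found on this page.');
}
```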
I’d recommend simply making a new post; we currently don’t have infrastructure set up for automatically re-evaluating posts that have previously been rejected.
I pretty strongly disagree with you if you think that every post on a topic like this has to be really doomy and apologetic about focusing on anything other than misalignment work or pausing AI.
This isn’t exactly what I was trying to say. It’s a little difficult for me to try to tell you what the post “ought” to be doing, since I don’t really understand who its intended audience is, what simulacrum level it’s operating on, etc. But it feels like the analysis in the post is just incomplete, given your beliefs about (unconditional?) misalignment risk, and the level at which the post seems like it’s trying to operate[1].
[1] Providing a high-level argument for why people should be thinking/working more on this particular subject, on the current margin.
Intentionally, yes.
The optimal amount of misfired curations is probably not zero, etc.
With that said, it’s not obvious what policy change this implies, except:
- More carefully considering what domains might contain difficult-to-verify errors,
- More carefully disclaiming my epistemic status in such cases (see Raemon’s latest curation notice, which was prepended to the post that was emailed to everyone).
For instance, it seems like they waited until the peak of the news cycle about their conflict with the US government to release this update, and I suspect that was intentional, and also that this worked.
It happened a few days before the “peak”—it was at a point where barely anyone was paying attention to that particular conflict. Would bet against this being strategically timed[1] at 5:1, if we could reasonably operationalize it and figure out a sufficient resolution criterion. (My current model is that Holden is the owner of that project, and would have been responsible for pressing the button on the announcement, and I don’t believe Holden would do that, or knowingly take instruction to do that. Most of my probability mass on what you said being the case lies in worlds where someone else was responsible for the timing of the release, somehow.)
[1] Would not bet against a claim of the form “they didn’t change the timing of the announcement the way they would have done if the announcement had been about something they wanted to see discussion of, like a new model release”.
Zooming out from all of the above, I get the vibe from your comments that maybe you just don’t think people should work on anything other than misalignment, given how high your probability of AI takeover is. Is that right? To me it seems very reasonable for people to work on AI character under the assumption that alignment will be not too difficult, because I think that’s a real possibility that we should put significant credence in.
I think my main objection is that this post seems to have a missing mood. Yes, I think alignment is quite hard (though in fact I think it’s probably better to work on buying us more time, right now, than to directly try to solve alignment, given how little time it looks like we have, and how hard I think alignment & the necessary philosophy are). But this post reads to me as one that’s occupying a mindset where we can just do AI character training and pretty confidently expect it to have something close enough to the desired effect, with respect to the models’ propensities in out-of-distribution contexts, such that the analysis doesn’t even bother to factor in worlds where this doesn’t actually end up being true. Surely there’s something interesting to say about those worlds, and how people should relate to the importance of AI character training given their likelihood[1]?
On one hand, I’m wary of demands for long lists of disclaimers, bracketing, conditioning, etc. On the other hand, this level of confidence would be pretty controversial even among the AI lab employees working on these techniques.
[1] Whatever you think that likelihood is! I don’t know, because it’s not in the post.
Hmm, there might be a miscommunication here. I agree that most of the expected impact of AI character work flows through the transition to ASI. The claim we’re making in the sentences you quote is (roughly) that it’s hard to directly affect the character of superintelligence itself, as our work will be washed out by work done by slightly superhuman AIs. So we think most impact here flows through influencing the character of slightly superhuman AI, but this itself has impact via influencing the transition to ASI. So what we’re saying is compatible with thinking that everything that matters has impact by influencing the transition to superintelligence.
Does that clarify?
This sounds more reasonable to me, but I don’t quite understand how to square this with the content of the post.
Section 1 includes many examples of ways in which AI character might matter. Many of them are low-stakes and none of them draw any connection to the transition to ASI w.r.t. their impact story. The same is true for section 1.1 (reinforced by “So far, the argument has concerned worlds where AI does not take over.” at the start of section 1.2).
Section 1.3 says the opposite:
The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is. But it’s possible that AI character work, today, could even have a path-dependent effect on the nature of superintelligence, affecting the nature of the post-superintelligence world.
Most of the expected impact is from the effect of AI character before superintelligence, in contrast to its expected impact from having a “path-dependent effect on the nature of superintelligence, affecting the nature of the post-superintelligence world”.
It really seems to me like the post is pretty explicitly communicating the opposite of “AI character work matters mostly because it will affect how well we manage the transition to a post-ASI regime (where it will probably cease to matter in an object-level sense, though we think there’s a small chance that it’ll still matter even then)”—that AI character will matter mostly for its first-order effects on the world:
The impact could come from rare but high-stakes situations, like an attempted coup, or from lower-stakes but common situations, like a user asking how to vote or whether the AI itself is conscious. Even when the effect of any individual interaction is modest, the total impact across hundreds of millions of interactions could be enormous.
I expect that most readers would understand the post the way I understood it—might be worth editing if you meant it the other way, or something in between. (This is true for Claude, n=1, no custom instructions/memory enabled.)
Don’t agree on this. Yes, you won’t be interested in this work if you think alignment is so hard that the intended alignment target has no predictable effect on AI character. But that’s a very extreme and pessimistic view on alignment. Maybe that’s your view?
Approximately my view in the long run, though I’ll grant that AI character work seems to matter for current models being more or less annoying to interact with.[1] I do have some hope for affecting the propensities of not-wildly-superhuman models in favorable-to-us directions, though I don’t have much hope for approaches like whatever Anthropic is doing with their Constitution—I find Claude much nicer to talk to than GPT-5.4, but in a coding context it doesn’t seem meaningfully “better behaved”. The level of optimization pressure applied in post-training for those two things just does not seem comparable.
[1] The direction of this effect seems to vary per person, though...
Several features were dropped with the post editor redesign, that one among them. Most of the removals were deliberate, but it’s good to ask in case we accidentally dropped something we didn’t intend to.
In addition to the concerns that J Bostock brings up (primarily that the choice of the term “character” seems confusing/unmotivated), I’m also confused by this:
The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is.
As I mentioned a few months ago in response to Effective altruism in the age of AGI, believing in ASI is kind of a “totalizing” belief. If you take ASI seriously, then “most of the expected impact” of X, where X is anything that might affect how the transition to ASI goes, is in that effect. Many of the “Pathways to impact” are the kinds of things that might affect whether our transition to a post-ASI future goes well or not. But the way you phrased the first sentence of section 1.2 suggests that you’re thinking of those pathways to impact as mattering mostly in worlds where we don’t transition to a post-ASI future?[1] If so, this seems backwards to me.
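To make the shape of that “totalizing” claim explicit (this decomposition is my own illustration, not something from the post, and the two-outcome split is obviously a simplification):

```latex
% Law-of-total-expectation decomposition of the expected impact of some intervention X.
% "ASI transition" = reaching a post-ASI world; the first term dominates if the value
% at stake in how that transition goes is astronomically larger than anything X does
% in worlds where the transition never happens.
\[
\mathbb{E}[\mathrm{impact}(X)]
  = P(\mathrm{ASI})\,\mathbb{E}[\mathrm{impact}(X)\mid \mathrm{ASI\ transition}]
  + P(\mathrm{no\ ASI})\,\mathbb{E}[\mathrm{impact}(X)\mid \mathrm{no\ ASI}]
\]
```

On that framing, if you assign non-trivial probability to ASI and think the first term carries most of the value, then evaluating any given X mostly reduces to asking how it affects the transition.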
I have a few other concerns with this post, which seem less tractable to resolve here:
- It seems to mostly take for granted the answer to a question that is the subject of much dispute—whether we will be able to meaningfully solve whatever parts of the alignment problem need to be solved to robustly instill specific values into these AIs.
- Section 1.2 doesn’t actually argue for why we should expect character training to help us end up in any of those worlds, nor why ending up in those worlds actually leads to the desirable properties. There’s a lot of existing literature on these questions.
  - “Might be easier to hit as an alignment target”—see Nearest Unblocked Strategy, Deep Deceptiveness, etc. (If it was the “not-holding-power preference” that was load-bearing there, corrigibility seems difficult / anti-natural.)
  - “Might yield safe AI even if only partially hit”—myopia seems both technically difficult and actively counter to the economic incentives of the AI labs, value is fragile, etc.
  - “Might produce AI that cooperates even if misaligned”—I’m not sure what this even means if we’re conditioning on worlds where the AI meaningfully has the option (sufficient capabilities) to take over. If you can successfully make an AI irrationally risk-averse in only the very out-of-distribution situation where it’s considering whether it’s worth taking over (by any means—see again Nearest Unblocked Strategy), and not in many other unrelated situations (which would substantially hamper its capabilities in a way which might destroy its economic value to the lab building it), then it sounds like you really have quite a lot of steering power over that AI’s preferences and you can just do something better than that.
Those seem like total defeaters to me. Do you have any references to arguments for how to get around those problems, or why we might realistically expect to not run into them, rather than merely imagining the possibilities, and leaving aside the question of how likely those possibilities are?
[1] I’m not sure this is what you meant, but I don’t see how else to read the structure of the post, given the lack of text suggesting that those pathways to impact matter for improving our odds of avoiding various catastrophic post-ASI outcomes like extinction, bad value lock-in, etc.
I’d noticed the same issue, but seeing this comment nudged me to write this post.