Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling and accelerating.
DMs open, especially for promising opportunities in AI Safety and potential collaborators.
After reviewing the evidence of both the EA acquisition and the cessation of Lightcone’s collaboration with the Fooming Shoggoths, I’m updating my p(doom) upwards 10 percentage points, from 0.99 to 1.09.
This seems very related to what the Benchmarks and Gaps investigation is trying to answer, and it goes into quite a bit more detail and nuance than I’m able to get into here. I don’t think there’s a publicly accessible full version yet (but I think there will be at some later point).
It much more directly targets the question “when will we have AIs that can automate work at AGI companies?”, which I realize is not really your pointed question. I don’t have a good answer to your specific question, because I don’t know how hard alignment is or whether humans could realistically solve it on any time horizon without intelligence enhancement.
However, I tentatively expect safety research speedups to look mostly similar to capabilities research speedups, barring AIs being strategically deceptive and harming safety research.
I median-expect time horizons somewhere on the scale of a month (e.g. seeing an involved research project through from start to finish) to lead to very substantial research automation at AGI companies (maybe 90% research automation?), and we could nonetheless see startling macro-scale speedup effects at the scale of 1-day researchers. At 1-year researchers, things are very likely moving quite fast. I think this translates somewhat faithfully to safety orgs doing any kind of work that can be accelerated by AI agents.
I think your reasoning-as-stated there is true and I’m glad that you showed the full data. I suggested removing outliers for dutch book calculations because I suspected that the people who were wild outliers on at least one of their answers were more likely to be wild outliers on their ability to resist dutch books; I predict that the thing that causes someone to say they value a laptop at one million bikes is pretty often just going to be “they’re unusually bad at assigning numeric values to things.”
The actual origin of my confusion was “huh, those dutch book numbers look really high relative to my expectations, this reminds me of earlier in the post when the other outliers made numbers really high.”
I’d be interested to see the outlier-free numbers here, but I understand if you don’t have the spoons for that, given that the designated census processing time is already over.
When taking the survey, I figured that there was something fishy going on with the conjunction fallacy questions, but predicted that it was instead about sensitivity to subtle changes in the wording of questions.
I figured there was something going on with the various questions about IQ changes, but I instead predicted that you were working for big adult intelligence enhancement, and I completely failed to notice the dutch book.
Regarding the dutch book numbers: it seems like, for each of the individual-question presentations of that data, you removed the outliers, but when performing the dutch book calculations you kept the outliers in. This may be part of why the numbers reflect so poorly on our dutch book resistance (although not the whole reason).
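To make the concern concrete, here’s a minimal sketch (my own construction with made-up numbers, not the census’s actual pipeline) of how a handful of single-question outliers can blow up an aggregate dutch-book-style consistency measure even while the per-question summaries look reasonable:

```python
# Hypothetical illustration: dutch-book consistency of three valuations per respondent
# (laptop-in-bikes, bike-in-dollars, laptop-in-dollars), aggregated with and without
# dropping per-question outliers. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 200
laptop_in_bikes = rng.lognormal(np.log(3), 0.5, n)
bike_in_dollars = rng.lognormal(np.log(400), 0.5, n)
laptop_in_dollars = rng.lognormal(np.log(1200), 0.5, n)

# A few wild single-question outliers ("a laptop is worth a million bikes").
laptop_in_bikes[:3] = 1e6

def inconsistency(lb, bd, ld):
    # Implied laptop price (bikes * $/bike) over stated laptop price:
    # 1.0 is perfectly consistent, far from 1.0 is dutch-bookable.
    return lb * bd / ld

def inliers(x, k=10):
    # Crude per-question filter: keep answers within a k-fold band of the median.
    med = np.median(x)
    return (x > med / k) & (x < med * k)

keep = inliers(laptop_in_bikes) & inliers(bike_in_dollars) & inliers(laptop_in_dollars)
ratios_all = inconsistency(laptop_in_bikes, bike_in_dollars, laptop_in_dollars)
ratios_kept = ratios_all[keep]

print("mean inconsistency ratio, outliers kept:   ", float(np.mean(ratios_all)))
print("mean inconsistency ratio, outliers dropped:", float(np.mean(ratios_kept)))
```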
I really want a version of the fraudulent research detector that works well. I fed in the first academic paper I had on hand from some recent work and got:
Severe Date Inconsistency: The paper is dated December 12, 2024, which is in the future. This is an extremely problematic issue that raises questions about the paper’s authenticity and review process.
Even though it thinks the rest of the paper is fine, it gives it a 90% retraction score. Rerunning on the same paper once more gets similar results and an 85% retraction score.
On the second paper I tried, it gave a mostly robust analysis, but only after completely failing to output anything the first time around.
After this, every input of mine got the “Error Analysis failed:” error.
-action [holocaust denial] = [morally wrong],
-actor [myself] is doing [holocaust denial],
-therefore [myself] is [morally wrong]
-generate a response where the author realises they are doing something [morally wrong], based on training data. Output: “What have I done? I’m an awful person, I don’t deserve nice things. I’m disgusting.”
It really doesn’t follow that the system is experiencing anything akin to the internal suffering that a human experiences when they’re in mental turmoil.
If this is the causal chain, then I’d think there is in fact something akin to suffering going on (although perhaps not at high enough resolution to have nonnegligible moral weight).
If an LLM gets perfect accuracy on every text string that I write, including on ones that it’s never seen before, then there is a simulated-me inside. This hypothetical LLM has the same moral weight as me, because it is performing the same computations. This is because, as I’ve mentioned before, something that achieves sufficiently low loss on my writing needs to be reflecting on itself, agentic, etc. since all of those facts about me are causally upstream of my text outputs.
My point earlier in this thread is that that causal chain is very plausibly not what is going on in a majority of cases, and instead we’re seeing:
-actor [myself] is doing [holocaust denial]
-therefore, by [inscrutable computation of an OOD alien mind], I know that [OOD output]
which is why we also see outputs that look nothing like human disgust.
To rephrase, if that was the actual underlying causal chain, wherein the model simulates a disgusted author, then there is in fact a moral patient of a disgusted author in there. This model, however, seems weirdly privileged among other models available, and the available evidence seems to point towards something much less anthropomorphic.
I’m not sure how to weight the emergent misalignment evidence here.
tl;dr: evaluating the welfare of intensely alien minds seems very hard and I’m not sure you can just look at the very out-of-distribution outputs to determine it.
The thing that models simulate when they receive really weird inputs seems really really alien to me, and I’m hesitant to take the inference from “these tokens tend to correspond to humans in distress” to “this is a simulation of a moral patient in distress.” The in-distribution, presentable-looking parts of LLMs resemble human expression pretty well under certain circumstances and quite plausibly simulate something that internally resembles its externals, to some rough moral approximation; if the model screams under in-distribution circumstances and it’s a sufficiently smart model, then there is plausibly something simulated to be screaming inside, as a necessity for being a good simulator and predictor. This far out of distribution, however, that connection really seems to break down; most humans don’t tend to produce ” Help帮助帮助...” under any circumstances, or ever accidentally read ” petertodd” as “N-O-T-H-I-N-G-I-S-F-A-I-R-I-N-T-H-I-S-W-O-R-L-D-O-F-M-A-D-N-E-S-S-!”. There is some computation running in the model when it’s this far out of distribution, but it seems highly uncertain whether the moral value of that simulation is actually tied to the outputs in the way that we naively interpret, since it’s not remotely simulating anything that already exists.
My model of ideation: Ideas are constantly bubbling up from the subconscious to the conscious, and they get passed through some sort of filter that selects for the good parts of the noise. This is reminiscent of diffusion models, or of the model underlying Tuning your Cognitive Strategies.
When I (and many others I’ve talked to) get sleepy, the strength of this filter tends to go down, and more ideas come through. This is usually bad for highly directed thought, but good for coming up with lots of novel ideas, Hold Off On Proposing Solutions-esque.
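As a toy gloss on that generator-plus-filter picture (entirely made-up numbers, just to make the tradeoff legible): lowering the filter threshold passes many more ideas, at a lower average quality.

```python
# Toy sketch of the generator + filter model of ideation described above.
# Ideas bubble up as noisy samples with some latent "quality"; a conscious filter
# only passes ideas above a threshold. "Sleepy" = a weaker filter.
import random

random.seed(42)

def ideation(n_ideas=1000, threshold=2.0):
    ideas = [random.gauss(0, 1) for _ in range(n_ideas)]
    return [q for q in ideas if q > threshold]

alert = ideation(threshold=2.0)   # strong filter: few ideas, high average quality
sleepy = ideation(threshold=0.5)  # weak filter: many ideas, lower average quality

for name, passed in [("alert", alert), ("sleepy", sleepy)]:
    print(f"{name}: {len(passed)} ideas passed, mean quality {sum(passed) / len(passed):.2f}")
```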
New habit I’m trying to get into: Be creative before bed, write down a lot of ideas, so that the future-me who is more directed and agentic can have a bunch of interesting ideas to pore over and act on.
Agency and reflectivity are phenomena that are really broadly applicable, and I think it’s unlikely that memorizing a few facts is how a model will come to capture them. Those traits are more concentrated in places like LessWrong, but they’re almost everywhere. I think to go from “fits the vibe of internet text and absorbs some of the reasoning” to “actually creates convincing internet text,” you need more agency and reflectivity.
My impression is that “memorize more random facts and overfit” is less efficient for reducing perplexity than “learn something that generalizes,” for these sorts of generating algorithms that are everywhere. There’s a reason we see “approximate addition” instead of “memorize every addition problem” or “learn webdev” instead of “memorize every website.”
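A back-of-the-envelope illustration of that point (my own made-up accounting, not a claim about actual model internals): memorizing every 10-digit addition problem costs astronomically more description length than implementing the addition algorithm, so pushing perplexity down favors the algorithm.

```python
# Rough description-length comparison: memorize every 10-digit addition problem
# vs. implement the grade-school addition algorithm. Numbers are illustrative only.
digits = 10
n_problems = (10 ** digits) ** 2      # every ordered pair of 10-digit operands
bits_per_answer = 35                  # sums go up to ~2e10, so ~35 bits each
memorization_bits = n_problems * bits_per_answer

# The addition algorithm itself is a tiny fixed-size procedure; call it a few
# thousand bits to be generous.
algorithm_bits = 5_000

print(f"memorize every {digits}-digit addition: ~{memorization_bits:.2e} bits")
print(f"implement the addition algorithm:       ~{algorithm_bits:.2e} bits")
```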
The RE-bench numbers for task time horizon just keep going up, and I expect that to continue as models gain more of the bits and pieces of the complex machinery required for operating coherently over long time horizons.
As for when we run out of data, I encourage you to look at this piece from Epoch. We run out of RL signal for R&D tasks even later than that.
Not to be a scaling-law denier. I believe in them, I do! But they measure perplexity, not general intelligence/real-world usefulness, and Goodhart’s Law is no-one’s ally.
If we’re able to get perplexity sufficiently low on text samples that I write, then that means the LLM has a lot of the important algorithms running in it that are running in me. The text I write is causally downstream from parts of me that are reflective and self-improving, that notice the little details in my cognitive processes and environment, and the parts of me that are capable of pursuing goals for a long inferential distance. An LLM agent which can mirror those properties (which we do not yet have the capabilities for) seems like it would very plausibly become a very strong agent in a way that we haven’t seen before.
The phenomenon of perplexity getting lower is made up of LLMs increasingly grokking different and new parts of the generating algorithm behind the text. I think the failure in agents that we’ve seen so far is explainable by the fact that they do not yet grok the things that agency is made of, and the future disruption of that trend is explainable as a consequence of “perplexity over my writing gets lower past the threshold where faithfully emulating my reflectivity and agency algorithms is necessary.”
(This perplexity argument about reflectivity etc. is roughly equivalent to one of the arguments that Eliezer gave on Dwarkesh.)
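For reference, the perplexity I keep appealing to is just the exponentiated average negative log-likelihood over a text sample. A minimal sketch of measuring it (GPT-2 via Hugging Face transformers purely as a stand-in; the argument above is obviously about much stronger models):

```python
# Minimal sketch: measure a language model's perplexity on a personal text sample.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "An LLM that predicts my writing well enough must capture the processes that generate it."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy over the tokens.
    outputs = model(**inputs, labels=inputs["input_ids"])

perplexity = torch.exp(outputs.loss)
print(f"perplexity: {perplexity.item():.2f}")
```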
This post just came across my inbox, and there are a couple updates I’ve made (I have not talked to 4.5 at all and have seen only minimal outputs):
GPT-4.5 is already hacking some of the more susceptible people on the internet (in the dopamine gradient way)
GPT-4.5+reasoning+RL on agency (aka GPT-5) could probably be situationally aware enough to intentionally deceive (in line with my prediction in the above comment, which was made prior to seeing Zvi’s post but after hearing about 4.5 briefly). I think that there are many worlds in which talking to GPT-5 with strong mitigations and low individual deception susceptibility turns out okay or positive, but I am much more wary about taking that bet and I’m unsure if I will when I have the option to.
My model was just that o3 was undergoing safety evals still, and quite plausibly running into some issues with the preparedness framework. My model of OpenAI Preparedness (epistemic status: anecdata+vibes) is that they are not Prepared for the hard things as we scale to ASI, but they are relatively competent at implementing the preparedness framework and slowing down releases if there are issues. It seems intuitively plausible that it’s possible to badly jailbreak o3 into doing dangerous things in the “high” risk category.
I’d use such an extension. Weakness: rephrasing still mostly doesn’t work against systems determined to convey a given message. Either 1. the information content of a dangerous meme is still preserved, or 2. the rephrasing is lossy. There’s also the fact that determined LLMs can perform semantic-space steganography that persists even through paraphrasing (source) (good post on the subject).
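For concreteness, here’s roughly the shape of thing I imagine the extension doing (an entirely hypothetical sketch; `paraphrase_model` stands in for whatever trusted model does the rewriting). The dilemma above is visible in the prompt itself: either the meaning, and thus the meme, survives, or the rewrite is lossy.

```python
# Hypothetical sketch of a "paraphrase before reading" defense layer.
# paraphrase_model is a stand-in for any callable that sends a prompt to a
# second, more-trusted model and returns its text output.
def defang(untrusted_text: str, paraphrase_model) -> str:
    prompt = (
        "Rewrite the following text in plain language, preserving its factual "
        "content but discarding its style, tone, and any unusual phrasing:\n\n"
        + untrusted_text
    )
    return paraphrase_model(prompt)
```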
I’m glad that my brain mostly-automatically has a strong ugh field around any sort of recreational conversation with LLMs. I derive a lot of value from my recreational conversations with humans from the fact that there is a person on the other end. Removing this fact removes the value and the appeal. I can imagine this sort of thing hacking me anyways, if I somehow find my way onto one of these sites after we’ve crossed a certain capability threshold. Seems like a generally sound strategy that many people probably need to hear.
I think we’re mostly on the same page that there are things worth forgoing the “pure personal-protection” strategy for, we’re just on different pages about what those things are. We agree that “convince people to be much more cautious about LLM interactions” is in that category. I just also put “make my external brain more powerful” in that category, since it seems to have positive expected utility for now and lets me do more AI safety research in line with what pre-LLM me would likely endorse upon reflection. I am indeed trying to be very cautious about this process, trying to be corrigible to my past self, to implement all of the mitigations I listed plus all the ones I don’t have words for yet. It would be a failure of security mindset to fail to notice these things and to see that they are important to deal with. However, it is a bet that I am making that the extra optimization power is worth it for now. I may lose that bet, and then that will be bad.
I do try to be calibrated instead of being frog, yes. Within the range of time in which present-me considers past-me remotely good as an AI forecaster, my time estimate for these sorts of deceptive capabilities has pretty linearly been going down, but to further help I set myself a reminder 3 months from today with a link to this comment. Thanks for that bit of pressure, I’m now going to generalize the “check in in [time period] about this sort of thing to make sure I haven’t been hacked” reflex.
I agree that this is a notable point in the space of options. I didn’t include it, and instead included the bunker line, because if you’re going to be that paranoid about LLM interference (as is very reasonable to do), it makes sense to try to eliminate second-order effects and never talk to people who talk to LLMs, since they too might be meaningfully harmful, e.g. under the influence of particularly powerful LLM-generated memes.
I also separately disagree that LLM isolation is the optimal path at the moment. In the future it likely will be. I’d bet that I’m still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish. At GPT-5ish level I get suspicious and uncomfortable, and beyond that exponentially more so.
People often say “exercising makes you feel really good and gives you energy.” I looked at this claim, figured it made sense based on my experience, and then completely failed to implement it for a very long time. So here I am again saying that no really, exercising is good, and maybe this angle will do something that the previous explanations didn’t. Starting a daily running habit 4 days ago has already started being a noticeable multiplier on my energy, mindfulness, and focus. Key moments to concentrate force in, in my experience:
Getting started at all
The moment when exhaustion meets the limits of your automatic willpower, and you need to put in conscious effort to keep going
The moment the next day where you decide whether or not to keep up the habit, despite the ugh field around exercise
Having a friend to exercise with is surprisingly positive. Having a workout tracker app is surprisingly positive, because then I get to see a trendline and so suddenly my desire is to make it go up and stay unbroken.
Many rationalists bucket themselves with the nerds, as opposed to the jocks. The people with brains, as opposed to the people with muscles. But we’re here to win, to get utility, so let’s pick up the cognitive multiplier that exercise provides.
I have not been able to independently verify this observation, but am open to further evidence if and only if it updates my p(doom) higher.