Yeah I would call that “gaslighting”. It looks like my initial interpretation of what you meant by it is closer than Zack’s. I think Scott isn’t doing that. I’m inclined to believe you when you say other people have behaved this way.
I don’t follow, can you restate the argument?
A good person with roughly utilitarian ethics would not agree to, for example, killing a large number of their peers to protect one human (and even less to follow random bureaucratic orders).
Is the claim that 2 or 3 implies that Claude would do that?
I think this is false. I expect Scott’s post to be heavily upvoted, to have an enormously positive agree/disagree ratio if it were posted to the EA Forum, and in general for people to believe something pretty close to it.
The EA Forum is a trash fire, so who knows what would happen if this was published there.
My read of the social dynamics is that in places where people are inclined to defer to me or people like me, they might initially approve of the Scott thing for bad tribal reasons, but change their mind when they read criticism of it from me or someone like me (which is ofc part of why I sometimes bother commenting on things like this).
My guess is it’s also pretty close to what leadership at cG, CEA and Anthropic believe, plus it would poll pretty well at a thing like SES.
I think that Scott’s post would not overall be received positively by those people. Maybe you’re saying that one of the directions argued for by Scott’s post is approved of by those people? I agree with that more.
what are these people even arguing about?
Among other things, the fact that one of the leading ASI labs is substantially downstream of us. Separately, a lot of real actual politics that tends to happen in the community around prestige and money and talent allocation and respect, which needs to get litigated somehow (and abuse of power and legitimacy is common, and if you can’t talk about it you can’t have norms about it).
Kolmogorov complexity of the human brain at one instant:
- ~10^14 synapses (compare to ~10^11 neurons), 10 to 1000 bits per synapse for weights
- Total: ~10^15 to 10^17 bits
- Probably not significantly compressible, considering that e.g. Claude Opus is significantly smarter than Claude Haiku
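For concreteness, that arithmetic as a quick sketch; the synapse count is the standard ~10^14 ballpark, so treat the output as order-of-magnitude only:

```python
# Order-of-magnitude estimate: information held in synaptic weights at one instant.
synapses = 1e14                # standard ballpark (vs. ~1e11 neurons)
bits_per_synapse = (10, 1000)  # assumed precision range for each weight

low = synapses * bits_per_synapse[0]
high = synapses * bits_per_synapse[1]
print(f"{low:.0e} to {high:.0e} bits")  # 1e+15 to 1e+17 bits
```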
Kolmogorov complexity of “100 years of subjective experience that thinks he is [puffymist], a particular human who lived on Earth at [a particular time]”?
- Temporal resolution of perception (“frame rate”): 10 to 30 frames per second
  - excludes audio, which has high sample rate but low bitrate
- Uncompressed information per subjective-moment “frame”: 10^? to 10^? bits per frame
  - Empirically: conscious processing is ~40 bits/second, or about 1 bit per “frame”
  - Let’s say there are 10^? to 10^? bits of felt-sense “richness” per bit of conscious processing
- Compression: call it a factor of 1/1000 (99.9% compression) to 1/10 (90% compression)
  - Low-level redundancy: video compression-like between-frame redundancy
  - High-level redundancy: routines, mental “well-worn grooves”, repetitive daily / yearly patterns
  - Semantic description: think of image / video generation from prompts of 10 to 100 words
- Putting it all together:
  - Low end: 10^? bits
  - High end: 10^? bits
This is the consciously accessed data stream only, which is why it is much smaller than the full human brain.
“But the full latent input-output capabilities of human brain can be obtained by training the brain on its experience!” Yes, and that training makes use of data not consciously accessed, which I believe is much bigger than the consciously-accessed data stream.
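The subjective-experience estimate can be run the same way; note that the felt-sense richness multiplier below is a placeholder chosen for illustration, not a number from the estimate above:

```python
# Order-of-magnitude sketch of the "100 years of subjective experience" estimate.
SECONDS_PER_YEAR = 3.15e7
years = 100

fps = (10, 30)                 # subjective "frame rate", frames per second
conscious_bits_per_frame = 1   # ~40 bits/s of conscious processing, roughly 1 bit per frame
richness = (10, 1000)          # bits of felt-sense richness per conscious bit (placeholder)
compression = (1e-3, 1e-1)     # 99.9% to 90% compression

low = years * SECONDS_PER_YEAR * fps[0] * conscious_bits_per_frame * richness[0] * compression[0]
high = years * SECONDS_PER_YEAR * fps[1] * conscious_bits_per_frame * richness[1] * compression[1]
print(f"{low:.0e} to {high:.0e} bits")  # roughly 3e+08 to 9e+12 with these placeholder numbers
```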
Kolmogorov complexity of a human baby’s brain
A baby hasn’t begun learning, so I’ll assume that the human genome is a sufficient description of a baby’s brain.
Kolmogorov complexity: at most the information in the genome, ~6 × 10^9 bits (3 × 10^9 base pairs at 2 bits each)
Kolmogorov complexity of any generic human-level observer
I really have no idea. The space of mind designs is huge; there are likely some very compact designs.
10^? to 10^? bits, maybe?
It’s a really useful pointer towards a tactic that is relatively widespread and has no better word. I am personally happy to use other words, but I have the sense that sentences like “I am so very very tired of the ambiguous but ultimately strategic enough attempts at undermining my ability to orient in this situation by denying pretty clearly true parts of reality combined with intense implicit threats of consequences if I indicate I believe the wrong thing that might or might not be conscious optimizations happening in my interlocutors but have enough long-term coherence to be extremely unlikely to be the cause of random misunderstandings” wouldn’t work that well.
One potential benchmark I’ve yet to see utilized for multimodal agents is getting Steam (or other platform) achievements on randomly chosen games. Should be fairly easy to implement, as far as I can tell.
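If someone wanted to wire this up, the scoring half is indeed straightforward: Steam’s public Web API exposes per-game achievement state, so a harness only needs to poll it while the agent plays. A rough sketch (API key, Steam ID, and app ID are placeholders, and the actual agent loop is the hard part left out here):

```python
# Minimal scorer sketch for a "get Steam achievements" benchmark.
import time
import requests

API = "https://api.steampowered.com/ISteamUserStats/GetPlayerAchievements/v0001/"

def unlocked_achievements(api_key: str, steam_id: str, app_id: int) -> set:
    """Return the set of achievement names currently unlocked for one game."""
    resp = requests.get(API, params={"key": api_key, "steamid": steam_id, "appid": app_id})
    resp.raise_for_status()
    stats = resp.json().get("playerstats", {})
    return {a["apiname"] for a in stats.get("achievements", []) if a.get("achieved")}

def score_run(api_key, steam_id, app_id, time_limit_s=3600, poll_s=60):
    """Poll during the agent's play session; report achievements earned beyond baseline."""
    baseline = unlocked_achievements(api_key, steam_id, app_id)
    deadline = time.time() + time_limit_s
    earned = set()
    while time.time() < deadline:
        time.sleep(poll_s)
        earned = unlocked_achievements(api_key, steam_id, app_id) - baseline
        if earned:
            break
    return earned
```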
Um, I think that long, detailed, audited arguments are how a substantial amount of social capital and resource allocation gets done around these parts.
And also, um, it is better than most alternative ways of doing it (e.g. networking, politicking).
I stopped drinking coffee regularly because I would always get a headache when I forgot to drink coffee, e.g. on weekends. Do you have experience with withdrawal symptoms? Are they better or worse with paraxanthine? If you don’t, do you have a guess?
Don’t overupdate on insider gossip
Anthropic employees seem to be taking the Mythos results pretty seriously! I know people who work at Anthropic who are talking about buying shacks in the woods, or are spending their weekends setting up 2FA and closing down old internet accounts. I think there’s similar hullabaloo on twitter. These actions may well be high EV! But I think people tend to overupdate from all of this lab-employee seriousness.
People at a lab are unusually likely to think that that lab’s work is a big deal. There’s both a selection effect and an intervention effect: you’re more likely to choose to work there if you expect it to be impactful, and then you’re spending all day with people who also expect that.
I imagine most people at Anthropic haven’t seen good evidence about how Mythos actually performs. They’re mostly going off the internal vibe, which is particularly seeded by the people who worked on Mythos the most. Those people have the best information, but they’re also the ones most likely to think that Mythos is a big deal that matters even more than Anthropic’s work in general.
A friend pointed out that Anthropic does have a bunch of smart, disagreeable people working there. I think disagreeableness does defend you against groupthink, but it’s much more effective when you start out disagreeing about whether an effect is real than about how large it is. I think disagreeable people are often pretty good at saying “no, fuck you, I don’t think that’s true at all”, but they might get dragged along with the crowd once they agree that something is some amount true.
This isn’t to say that we should completely discount insider gossip. And I’m definitely not saying anything in particular about Mythos’ impact. I’d have to look much more into the model card and the patches and stuff if I wanted to form an opinion about that! I’m just saying, I’m less swayed by the miasma of panic rolling out of Howard St than many of my friends seem to be.
I went and looked at a bunch of the commits in March to popular/widely-installed open source repositories by Anthropic people. The fixes seem to mostly resolve things like buffer overflows and use-after-free bugs. These are the sort of bugs that (relatively) unskilled humans can find by grinding for long enough—but the supply of humans willing and able to do that grinding has previously been sharply limited, especially considering that actually getting value out of finding a vuln has historically been pretty hard.
If my guess that these commits are Mythos-generated is correct, and if these are representative, I think a good mental model may be “Mythos trivializes finding the vulns that security researchers have been yelling into the void about for decades (similar to what fuzzers did to the landscape, but more so, or perhaps if 2010!metasploit were dropped fully-formed into 2003)” rather than “Mythos trivializes finding new and exciting types of vulns that we didn’t even know were possible and which were not previously part of our threat model (like rowhammer)”. Basically a “quantity has a quality all of its own” style of thing.
While that may be logically true in some sense of those words, I’m not sure that even very advanced AIs will reason like that, because (a) humans do not reason like that and AIs “reason” at least partly like humans, and (b) all the ambiguity of those words can lead to non-intuitive interactions of the logical claims.
I haven’t read the dialogues completely, so the answer may be obvious to others who have, but I’m wondering something now. Was Socrates’ line of questioning to Euthyphro sincere? I always read it as a takedown of a pseudo-intellectual by an actual intellect. But man, does it hit differently if he was asking in hopes of getting an answer! That is such a different mindset. Both more admirable and more tragic.
“Gaslighting” should probably be avoided because it is anywhere between meaningless and a fighting word depending on who says it and how.
The g-word is a very nasty accusation. It gets thrown around and means a bunch of stuff down to just “saying stuff I disagree with”, but it shouldn’t.
It originally meant a conscious, malicious attempt to drive someone insane by strategically lying to them.
On the substance, people are honest but wrong an awful lot, and honest but massively overstating their case even more often. Assuming your rivals are malicious or dishonest when they’re just wrong or overstating is a huge source of conflict and thereby confusion.
So I agree surely o3 can learn to reason in this kind of inhuman way; but so long as this is less efficient than writing in English, we should expect both the economic incentives at companies and the mechanical incentives of good RL procedures to lean against it, and for this kind of inhuman reasoning to decrease. (As in fact has happened: o3 is an older model, and more recent reasoning models seem to do this kind of thing less [Edit: or not at all].)
Drift (neither more nor less expressive) is a distinct threat model. I don’t think it’s likely, mainly because it disagrees with the IID way models are pretrained, with how I expect midtraining to work, and with my beliefs about neural network plasticity, but I agree I haven’t argued against it here.
Ok, as someone who is neither American nor Chinese, I notice that these last few years Americans very, very often say or imply that China is aggressive towards other countries, that China is in a technological war with the US, and that it will be very bad for the US if China builds advanced AI first. Can anyone explain to me why this is the case?
I usually use RepEng as a baseline at every layer from 30% to 80% of depth. This is because the layers near the output seem to align to surface style, and papers like “Do Llamas Work in English?” support this. But in the above link, I replicate a paper, and that paper uses every layer and CHaRS-style cosine gating too.
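For concreteness, here is a rough sketch of what steering only the 30%–80% depth range might look like with plain PyTorch forward hooks; the names (`layers`, `steer_vec`) are illustrative placeholders, not the RepEng API:

```python
import torch

def steer_layer_range(layers, steer_vec, alpha=1.0, lo=0.3, hi=0.8):
    """Attach an additive steering hook to every block in the [lo, hi] depth range.

    `layers` is the model's list of transformer blocks and `steer_vec` the steering
    direction; both are placeholders for whatever your setup provides.
    """
    n = len(layers)
    handles = []
    for i in range(int(lo * n), int(hi * n)):
        def hook(module, inputs, output, _v=alpha * steer_vec):
            # Block outputs are often tuples (hidden_states, ...): nudge the
            # hidden states and pass everything else through unchanged.
            if isinstance(output, tuple):
                return (output[0] + _v,) + output[1:]
            return output + _v
        handles.append(layers[i].register_forward_hook(hook))
    return handles  # call h.remove() on each handle to undo the intervention
```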
Actually, one other thing to consider is calibration: ideally we get the maximum steering effect we can within a given performance-degradation budget. But how do you know two steering methods are equally well calibrated? So maybe my S-space steering was just better calibrated on this model / setting.
Hawthorne-gap-style setups definitely are a good approach; that could be a nice follow-up here, comparing behaviour on more and less realistic versions of the same eval.
Yeah, it would be useful to know; right now we have to guess, and they might be complementary.
If you want to collaborate on follow-ups, I’d be keen.
By the way, I wrote up why I think the singular value space is a better target for steering here.
When we steer pretrained transformers, we modify how a layer behaves. The most commonly used type of steering is activation steering, which adds a constant bias to the activations, changing the input to the layer. S-space steering instead modifies the transformation the layer applies to its inputs by reweighting the learned singular values of the weight matrix. This is a different kind of intervention: it changes how the layer processes inputs rather than directly nudging the activation output.
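A minimal PyTorch sketch of the contrast; the function names and plumbing here are illustrative, not taken from any particular steering codebase:

```python
import torch

# Activation steering: add a fixed direction v to a layer's output.
def make_activation_hook(v, alpha=1.0):
    def hook(module, inputs, output):
        # Real transformer blocks often return tuples; a plain tensor is assumed here.
        return output + alpha * v
    return hook
# handle = model.layers[k].register_forward_hook(make_activation_hook(v))

# S-space steering: reweight the singular values of the layer's weight matrix,
# changing the transformation itself rather than nudging its output.
def reweight_singular_values(weight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    return U @ torch.diag(S * scale) @ Vh

# with torch.no_grad():
#     layer.weight.copy_(reweight_singular_values(layer.weight, scale))
```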
What kind of steering do you think Anthropic is using? I’m assuming it’s just activation addition, the same as most use?
Yes, especially their visual understanding benchmarks are very impressive, sometimes significantly ahead of the competition. Unfortunately the model is really unknown outside of China. For foreigners, the website (https://www.doubao.com/chat/) redirects to a different chatbot called “Dola”. I’m not sure whether this is essentially the same model behind the scenes, just with different censorship perhaps.
Last we spoke you were talking about API or command line integration which would in principle allow a very wide range of editing/importing workflows, at least for power users.
That is now there! It’s what powers our LLM integrations:
Well, I mean, that is a hard conditional to falsify, since if people were not to change their mind, this would largely invalidate the premise that they are inclined to defer to you. Unfortunately, I both think the vast majority of places in EA do not defer to you or people like you, and also think you are pretty importantly wrong in your criticisms, so I don’t quite know how to feel about this.
I do think it helps and am marginally happy about your cultural influence here (though it’s tricky; I also think a bunch of your takes here are quite dumb). I think the vast majority of the cultural influence here is downstream of not quite anyone in particular, but more Anthropic than anywhere else, and neither you nor I can change that very much.
Yeah, I expect it to be straightforwardly positively received. I think people will be like “some parts of this seem dumb, the Ilya thing in-particular, but yeah, fuck those rationalists and MIRI people, I am with Scott on that”.
To be clear, I am not expecting consensus here. I think this will be what 75% of people who have any opinion at all on anything adjacent to this believe, but I expect people would broadly think it’s a good contribution that properly establishes norms and reflects how they think about things.
I also think it’s plausible people would be like “wow, what an uncouth way for both of these people to be interfacing with each other, please get away from each other, children”, but then actually if you talked to them afterwards, they would be like “yeah, I mean, that was a bit of a shitshow, but I do think Scott was basically right here (minus 1-2 minor things)”.
I am not enormously confident on this, but it matches my experiences of the space.