A lot of discussion of intelligence treats it as a scalar value that measures a general capability to solve a wide range of tasks. In this conception, intelligence is primarily a question of having a ‘good Map’. This is a simplistic picture, since it misses the intrinsic limits imposed on prediction by the Territory. Not all tasks or domains have the same marginal returns to intelligence—these can vary wildly.
Let me tell you about a ‘predictive efficiency’ framework that I find compelling & deep, and that will hopefully put some mathematical flesh on these intuitions. I initially learned about these ideas in the context of Computational Mechanics, but I realized that the underlying ideas are much more general.
Let X be a predictor variable that we’d like to use to predict a target variable Y under a joint distribution p(x,y). For instance, X could be the context window and Y the next hundred tokens, or X could be past market data and Y future market data.
In any prediction task there are three fundamental and independently varying quantities that you need to think of:
H(Y∣X) is the irreducible uncertainty or the intrinsic noise that remains even when X is known.
E=I(X;Y)=H(Y)−H(Y∣X), quantifies the reducible uncertainty or the amount of predictable information contained in X.
For the third quantity, let us introduce the notion of causal states or minimally sufficient statistics. We define an equivalence relation on X by declaring
x ∼ x′ if and only if p(Y∣x) = p(Y∣x′).
The resulting equivalence classes, denoted c(X), yield a minimal sufficient statistic for predicting Y. This construction is “minimal” because it groups together all those x that lead to the same predictive distribution p(Y∣x), and it is “sufficient” because, given the equivalence class c(x), no further refinement of X can improve our prediction of Y.
From this, we define the forecasting complexity (or statistical complexity) as
C:=H(c(X)),
which measures the amount of information—the cost in bits—to specify the causal state of X. Finally, the predictive efficiency is defined by the ratio
η = E/C,
which tells us how much of the complexity actually contributes to reducing uncertainty in Y. In many real-world domains, even if substantial information is stored (high C), the gain in predictability (E) might be modest. This situation is often encountered in fields where, despite high skill ceilings (i.e. very high forecasting complexity), the net effect of additional expertise is limited because the predictive information is a small fraction of the complexity.
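To make these definitions concrete, here’s a minimal Python sketch (the function name `predictive_efficiency` and the rounding-based grouping of conditional distributions are my own choices) that computes E, C, and η for a small discrete joint distribution p(x,y), by lumping together the x’s with identical p(Y∣x):

```python
import numpy as np
from collections import defaultdict

def predictive_efficiency(p_xy):
    """p_xy: 2D array with p_xy[i, j] = p(X=i, Y=j). Returns (E, C, eta)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)                      # marginal p(x)
    p_y = p_xy.sum(axis=0)                      # marginal p(y)

    def H(p):                                   # Shannon entropy in bits
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    # H(Y|X) = sum_x p(x) H(Y|X=x)
    H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i])
                      for i in range(len(p_x)) if p_x[i] > 0)
    E = H(p_y) - H_Y_given_X                    # predictable information I(X;Y)

    # Causal states: group the x's with (numerically) identical p(Y|x)
    states = defaultdict(float)
    for i in range(len(p_x)):
        if p_x[i] > 0:
            key = tuple(np.round(p_xy[i] / p_x[i], 10))
            states[key] += p_x[i]
    C = H(np.array(list(states.values())))      # forecasting complexity H(c(X))

    return E, C, (E / C if C > 0 else float("nan"))

# Example: X = two fair coin flips, Y = their XOR. The causal state is just the
# parity of X, so C = 1 bit, E = 1 bit, and eta = 1.
p = np.zeros((4, 2))
for x in range(4):
    y = (x & 1) ^ (x >> 1)
    p[x, y] = 0.25
print(predictive_efficiency(p))   # ~ (1.0, 1.0, 1.0)
```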
Example of low efficiency.
Let X ∈ {0,1}^100 be the outcome of 100 independent fair coin flips, so that H(X) = 100 bits.
Define Y ∈ {0,1} as a single coin flip whose bias is determined by the proportion of heads in X. That is, if x has k heads then: p(Y=1∣x) = k/100, p(Y=0∣x) = 1 − k/100.
Total information in Y, H(Y): When averaged over all possible X, the mean bias is 0.5, so that Y is marginally a fair coin. Hence, H(Y) = 1 bit.
Conditional entropy or irreducible uncertainty, H(Y∣X): Given X, the outcome Y is drawn from a Bernoulli distribution whose entropy depends on the number of heads in X. For typical X (around 50 heads), H(Y∣x) ≈ 1 bit; however, averaging over all X yields a slightly lower value. Numerically, one finds: H(Y∣X) ≈ 0.993 bits.
Predictable information, E = I(X;Y): With the above numbers, the mutual information is E = H(Y) − H(Y∣X) ≈ 1 − 0.993 ≈ 0.007 bits.
Forecasting complexity, C = H(c(X)): The causal state construction groups together all sequences x with the same number k of heads. Since k ∈ {0, 1, ..., 100}, there are 101 equivalence classes. The entropy of these classes is given by the entropy of the binomial distribution Bin(100, 0.5). Using the Gaussian approximation: C ≈ (1/2)·log₂(2πe·(100/4)) = (1/2)·log₂(2πe·25) ≈ (1/2)·log₂(427) ≈ 4.37 bits.
Predictive efficiency η: η = E/C ≈ 0.007/4.37 ≈ 0.0016.
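As a sanity check, the quantities in this example can be computed exactly with a few lines of Python (a sketch; it enumerates over the number of heads k rather than over all 2^100 sequences):

```python
import numpy as np
from math import comb

n = 100
k = np.arange(n + 1)
w = np.array([comb(n, int(j)) for j in k], dtype=float) * 0.5 ** n  # p(k heads) = p(causal state k)

def H(p):                                 # Shannon entropy in bits
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def h_bern(q):                            # binary entropy of a Bernoulli(q) coin
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

H_Y = 1.0                                 # Y is marginally a fair coin
H_Y_given_X = np.sum(w * h_bern(k / n))   # ~0.993 bits
E = H_Y - H_Y_given_X                     # ~0.007 bits
C = H(w)                                  # entropy of Bin(100, 0.5), ~4.37 bits
print(E, C, E / C)                        # eta ~ 0.0016
```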
In this example, a vast amount of internal structural information (the cost to specify the causal state) is required to extract just a tiny bit of predictability. In practical terms, this means that even if one possesses great expertise—analogous to having high forecasting complexity or high skill—the net benefit is modest because the inherent η (predictive efficiency) is low. Such scenarios are common in fields like archaeology or long-term political forecasting, where obtaining a single predictive bit of information may demand enormous expertise, data, and computational resources. This kind of situation places a high ceiling on skill: additional intelligence or resources yield only marginal improvements in prediction because the underlying system is dominated by irreducible randomness.
I cannot comment on the math, but intuitively this seems wrong.
Zagorsky (2007) found that while IQ correlates with income, the relationship becomes increasingly non-linear at higher IQs and suggests exponential rather than logarithmic returns.
Sinatra et al. (2016) found that high-impact research is produced by a small fraction of exceptional scientists, who significantly outperform their merely above-average peers.
My understanding is that empirical evidence points toward power law distributions in the relationship between intelligence and real-world impact, and that intelligence seems to broadly enable exponentially improving abilities to modify the world in your preferred image. I’m not sure why this is.
Epistemic status: I don’t fully endorse all this, but I think it’s a pretty major mistake to not at least have a model like this sandboxed in one’s head and check it regularly.
Full-cynical model of the AI safety ecosystem right now:
There’s OpenAI, which is pretending that it’s going to have full AGI Any Day Now, and relies on that narrative to keep the investor cash flowing in while they burn billions every year, losing money on every customer and developing a product with no moat. They’re mostly a hype machine, gaming metrics and cherry-picking anything they can to pretend their products are getting better. The underlying reality is that their core products have mostly stagnated for over a year. In short: they’re faking being close to AGI.
Then there’s the AI regulation activists and lobbyists. They lobby and protest and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People. Even if they do manage to pass any regulations on AI, those will also be mostly fake, because (a) these people are generally not getting deep into the bureaucracy which would actually implement any regulations, and (b) the regulatory targets themselves are aimed at things which seem easy to target (e.g. training FLOP limitations) rather than actually stopping advanced AI. The activists and lobbyists are nominally enemies of OpenAI, but in practice they all benefit from pushing the same narrative, and benefit from pretending that everyone involved isn’t faking everything all the time.
Then there’s a significant contingent of academics who pretend to produce technical research on AI safety, but in fact mostly view their job as producing technical propaganda for the regulation activists and lobbyists. (Central example: Dan Hendrycks, who is the one person I directly name mainly because I expect he thinks of himself as a propagandist and will not be particularly offended by that description.) They also push the narrative, and benefit from it. They’re all busy bullshitting research. Some of them are quite competent propagandists though.
There’s another significant contingent of researchers (some at the labs, some independent, some academic) who aren’t really propagandists, but mostly follow the twitter-memetic incentive gradient in choosing their research. This tends to generate paper titles which sound dramatic, but usually provide pretty little conclusive evidence of anything interesting upon reading the details, and very much feed the narrative. This is the main domain of Not Measuring What You Think You Are Measuring and Symbol/Referent Confusions.
Then of course there’s the many theorists who like to build neat toy models which are completely toy and will predictably not generalize usefully to real-world AI applications. This is the main domain of Ad-Hoc Mathematical Definitions, the theorists’ analogue of Not Measuring What You Think You Are Measuring.
Benchmarks. When it sounds like a benchmark measures something reasonably challenging, it nearly-always turns out that it’s not really measuring the challenging thing, and the actual questions/tasks are much easier than the pitch would suggest. (Central examples: software eng, GPQA, frontier math.) Also it always turns out that the LLMs’ supposedly-impressive achievement relied much more on memorization of very similar content on the internet than the benchmark designers expected.
Then there’s a whole crowd of people who feel real scared about AI (whether for good reasons or because they bought the Narrative pushed by all the people above). They mostly want to feel seen and validated in their panic. They have discussions and meetups and stuff where they fake doing anything useful about the problem, while in fact they mostly just emotionally vibe with each other. This is a nontrivial chunk of LessWrong content, as e.g. Val correctly-but-antihelpfully pointed out. It’s also the primary motivation behind lots of “strategy” work, like e.g. surveying AI researchers about their doom probabilities, or doing timeline forecasts/models.
… and of course none of that means that LLMs won’t reach supercritical self-improvement, or that AI won’t kill us, or [...]. Indeed, absent the very real risk of extinction, I’d ignore all this fakery and go about my business elsewhere. I wouldn’t be happy about it, but it wouldn’t bother me any more than all the (many) other basically-fake fields out there.
Man, I really just wish everything wasn’t fake all the time.
Your very first point is, to be a little uncharitable, ‘maybe OpenAI’s whole product org is fake.’ I know you have a disclaimer here, but you’re talking about a product category that didn’t exist 30 months ago, whose one website is now reportedly used by 10% of people in the entire world, and which the internet says expects ~12B in revenue this year.
If your vibes are towards investing in that class of thing being fake or ‘mostly a hype machine’ then your vibes are simply not calibrated well in this domain.
No, the model here is entirely consistent with OpenAI putting out some actual cool products. Those products (under the model) just aren’t on a path to AGI, and OpenAI’s valuation is very much reliant on being on a path to AGI in the not-too-distant future. It’s the narrative about building AGI which is fake.
OpenAI’s valuation is very much reliant on being on a path to AGI in the not-too-distant future.
Really? I’m mostly ignorant on such matters, but I’d thought that their valuation seemed comically low compared to what I’d expect if their investors thought that OpenAI was likely to create anything close to a general superhuman AI system in the near future.[1] I considered this evidence that they think all the AGI/ASI talk is just marketing.
Well ok, if they actually thought OpenAI would create superintelligence as I think of it, their valuation would plummet because giving people money to kill you with is dumb. But there’s this space in between total obliviousness and alarm, occupied by a few actually earnest AI optimists. And, it seems to me, not occupied by the big OpenAI investors.
But most of your criticisms in the point you gave have ~no bearing on that? If you want to make a point about how effectively OpenAI’s research moves towards AGI you should be saying things relevant to that, not giving general malaise about their business model.
Or, I might understand ‘their business model is fake which implies a lack of competence about them broadly,’ but then I go back to the whole ‘10% of people in the entire world’ and ‘expects 12B revenue’ thing.
The point of listing the problems with their business model is that they need the AGI narrative in order to fuel the investor cash, without which they will go broke at current spend rates. They have cool products, they could probably make a profit if they switched to optimizing for that (which would mean more expensive products and probably a lot of cuts), but not anywhere near the level of profits they’d need to justify the valuation.
That’s how I interpreted it originally; you were arguing their product org vibed fake, I was arguing your vibes were miscalibrated. I’m not sure what to say to this that I didn’t say originally.
“The underlying reality is that their core products have mostly stagnated for over a year. In short: they’re faking being close to AGI.”
This seems like the most load-bearing belief in the full-cynical model; most of your other examples of fakeness rely on it in one way or another:
If the core products aren’t really improving, the progress measured on benchmarks is fake. But if they are, the benchmarks are an (imperfect but still real) attempt to quantify that real improvement.
If LLMs are stagnating, all the people generating dramatic-sounding papers for each new SOTA are just maintaining a holding pattern. But if they’re changing, then just studying/keeping up with the general properties of that progress is real. Same goes for people building and regularly updating their toy models of the thing.
Similarly, if the progress is fake, the propaganda signal-boosting that progress is also fake. If it isn’t, it isn’t. (At least directionally; a lot of that propaganda is still probably exaggerated.)
If the above three are all fake, all the people who feel real scared and want to be validated are stuck in a toxic emotional dead-end where they constantly freak out over fake things to no end. But if they’re responding to legitimate, persistent worldview updates, having a space to vibe them out with like-minded others seems important.
So, in deciding whether or not to endorse this narrative, we’d like to know whether or not the models really ARE stagnating. What makes you think the appearance of progress here is illusory?
I do not necessarily disagree with this, coming from a legal / compliance background. If you see any of my profiles, I constantly complain about “performative compliance” and “compliance theatre”. Painfully present across the legal and governance sectors.
That said: can you provide examples of activism or regulatory efforts that you do agree with? What does a “non fake” regulatory effort look like?
I don’t think it would be okay to dismiss your take entirely, but it would be great to see what solutions you’d propose too. This is why I disagree in principle, because there are no specific points to contribute to.
In Europe, paradoxically, some of the people “close enough to the bureaucracy” that pushed for the AI Act to include GenAI providers, were OpenAI-adjacent.
But I will rescue this:
“(b) the regulatory targets themselves are aimed at things which seem easy to target (e.g. training FLOP limitations) rather than actually stopping advanced AI”
BigTech is too powerful to lobby against. “Stopping advanced AI” per se would contravene many market regulations (unless we define exactly what you mean by advanced AI and the undeniable dangers to people’s lives). Regulators can only prohibit development of products up to a certain point. They cannot just decide to “stop” development of technologies arbitrarily. But the AI Act does prohibit many types of AI systems already: Article 5: Prohibited AI Practices | EU Artificial Intelligence Act.
Those are considered to create unacceptable risks to people’s lives and human rights.
Then there’s the AI regulation activists and lobbyists. [...] Even if they do manage to pass any regulations on AI, those will also be mostly fake
SB1047 was a pretty close shot to something really helpful. The AI Act and its code of practice might be insufficient, but there are good elements in it that, if applied, would reduce the risks. The problem is that it won’t be applied because of internal deployment.
But I sympathise somewhat with stuff like this:
They lobby and protest and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People.
SB1047 was a pretty close shot to something really helpful.
No, it wasn’t. It was a pretty close shot to something which would have gotten a step closer to another thing, which itself would have gotten us a step closer to another thing, which might have been moderately helpful at best.
The EU AI Act even mentions “alignment with human intent” explicitly, as a key concern for systemic risks. This is in Recital 110 (which defines what are systemic risks and how they may affect society).
I do not think any law has mentioned alignment like this before, so it’s massive already.
Will a lot of the implementation efforts feel “fake”? Oh, 100%. But I’d say that this is why we (this community) should not disengage from it...
I also get that the regulatory landscape in the US is another world entirely (which is what the OP is bringing up).
Then there’s the AI regulation activists and lobbyists. They lobby and protest and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People.
The activists and the lobbyists are two very different groups. The activists are not trying to network with the DC people (yet). Unless you mean Encode, who I would call lobbyists, not activists.
Good point, I should have made those two separate bullet points:
Then there’s the AI regulation lobbyists. They lobby and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People. Even if they do manage to pass any regulations on AI, those will also be mostly fake, because (a) these people are generally not getting deep into the bureaucracy which would actually implement any regulations, and (b) the regulatory targets themselves are aimed at things which seem easy to target (e.g. training FLOP limitations) rather than actually stopping advanced AI. The activists and lobbyists are nominally enemies of OpenAI, but in practice they all benefit from pushing the same narrative, and benefit from pretending that everyone involved isn’t faking everything all the time.
Also, there’s the AI regulation activists, who e.g. organize protests. Like ~98% of protests in general, such activity is mostly performative and not the sort of thing anyone would end up doing if they were seriously reasoning through how best to spend their time in order to achieve policy goals. Calling it “fake” feels almost redundant. Insofar as these protests have any impact, it’s via creating an excuse for friendly journalists to write stories about the dangers of AI (itself an activity which mostly feeds the narrative, and has dubious real impact).
(As with the top level, epistemic status: I don’t fully endorse all this, but I think it’s a pretty major mistake to not at least have a model like this sandboxed in one’s head and check it regularly.)
Oh, if you’re in the business of compiling a comprehensive taxonomy of ways the current AI thing may be fake, you should also add:
Vibe coders and “10x’d engineers”, who (on this model) would be falling into one of the failure modes outlined here: producing applications/features that didn’t need to exist, creating pointless code bloat (which helpfully shows up in productivity metrics like “volume of code produced” or “number of commits”), or “automatically generating” entire codebases in a way that feels magical, then spending so much time bugfixing them it eats up ~all perceived productivity gains.
e/acc and other Twitter AI fans, who act like they’re bleeding-edge transhumanist visionaries/analysts/business gurus/startup founders, but who are just shitposters/attention-seekers who will wander off and never look back the moment the hype dies down.
What makes you confident that AI progress has stagnated at OpenAI? If you don’t have the time to explain why I understand, but what metrics over the past year have stagnated?
The entire field is based on fears that consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency. This is basically wrong. Yes, people attempt to justify it with coherence theorems, but obviously you can be approximately-coherent/approximately-consequentialist and yet still completely un-agentic, so this justification falls flat. Since the field is based on a wrong assumption with bogus justification, it’s all fake.
(IMO this is kinda unrelated to the OP, but I want to continue this thread.)
Have you elaborated on this anywhere?
Perhaps you missed it, but some guy in 2022 wrote this great post which claimed that “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” ;-)
I’m actually just in the course of writing something about why “consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency” … maybe I can send you the draft for criticism when it’s ready?
(IMO this is kinda unrelated to the OP, but I want to continue this thread.)
I think it’s quite related to the OP. If a field is founded on a wrong assumption, then people only end up working in the field if they have some sort of blind spot, and that blind spot leads to their work being fake.
Have you elaborated on this anywhere?
Not hugely. One tricky bit is that it basically ends up boiling down to “the original arguments don’t hold up if you think about them”, but the exact way they don’t hold up depends on what the argument is, so it’s kind of hard to respond to in general.
Perhaps you missed it, but some guy in 2022 wrote this great post which claimed that “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” ;-)
Haha! I think I mostly still stand by the post. In particular, “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” remains true; it’s just that intelligence relies on patterns and thus works much better on common things (which must be small, because they are fragments of a finite world) than on rare things (which can be big, though they don’t have to be). This means that consequentialism isn’t very good at developing powerful capabilities unless it works in an environment that has already been heavily filtered to be highly homogeneous, because an inhomogeneous environment is going to BTFO the intelligence.
(I’m not sure I stand 101% by my post; there’s some funky business about how to count evolution that I still haven’t settled on yet. And I was too quick to go from “imitation learning isn’t going to lead to far-superhuman abilities” to “consequentialism is the road to far-superhuman abilities”. But yeah I’m actually surprised at how well I stand by my old view despite my massive recent updates.)
I’m actually just in the course of writing something about why “consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency” … maybe I can send you the draft for criticism when it’s ready?
I think you’re conflating consequentialism and understanding in a weird-to-me way. (Or maybe I’m misunderstanding.)
I think consequentialism is related to choosing one action versus another action. I think understanding (e.g. predicting the consequence of an action) is different, and that in practice understanding has to involve self-supervised learning.
(I think human brains have both [partly-] consequentialist decisions and self-supervised updating of the world-model.) (They’re not totally independent, but rather they interact via training data: e.g. [partly-] consequentialist decision-making determines how you move your eyes, and then whatever your eyes are pointing at, your model of the visual world will then update by self-supervised learning on that particular data. But still, these are two systems that interact, not the same thing.)
I think self-supervised learning is perfectly capable of discovering rare but important patterns. Just look at today’s foundation models, which seem pretty great at that.
I don’t think this is the claim the post is making, though it still makes sense to me. The post is saying something closer to the opposite: that the people working in the field aren’t prioritizing well or thinking clearly about things, while the risk itself is real.
Chris Olah and Dan Murfet in the at-least-partially empirical domain. Myself in the theory domain, though I expect most people (including theorists) would not know what to look for to distinguish fake from non-fake theory work. In the policy domain, I have heard that Microsoft’s lobbying team does quite non-fake work (though not necessarily in a good direction). In the capabilities domain, DeepMind’s projects on everything except LLMs (like e.g. protein folding, or that fast matrix multiplication paper) seem consistently non-fake, even if they’re less immediately valuable than they might seem at first glance. Also Conjecture seems unusually good at sticking to reality across multiple domains.
The features a model thinks in do not need to form a basis or dictionary for its activations.
Three assumptions people in interpretability often make about the features that comprise a model’s ontology:
Features are one-dimensional variables.
Meaning, the value of feature i on data point x can be represented by some scalar number ci(x).
Features are ‘linearly represented’.
Meaning, each feature ci(x) can be approximately recovered from the activation vector →a(x)[1] with a linear projection onto an associated feature vector →fi.[2] So, we can write ci(x)≈→fi⋅→a(x).
Features form a ‘basis’[3] for the model’s activations.
Meaning, the model’s activations →a(x) at a given layer can be decomposed into a sum over all the features of the model represented in that layer[4]: →a(x)=∑ici(x)→fi.
It seems to me that a lot of people are not tracking that 3) is an extra assumption they are making. I think they think that assumption 3) is a natural consequence of assumptions 1) and 2), or even just of assumption 2) alone. It’s not.
Counterexample
Model setup
Suppose we have a language model that has a thousand sparsely activating, scalar, linearly represented features for different animals. So, “elephant”, “giraffe”, “parrot”, and so on, all with their own associated feature directions →f1,…,→f1000. The model embeds those one thousand animal features in a fifty-dimensional subspace of the activations. This subspace has a meaningful geometry: it is spanned by a set of fifty directions →f′1,…,→f′50 corresponding to different attributes animals have. Things like “furriness”, “size”, “length of tail” and such. So, each animal feature can equivalently be seen either as one of a thousand sparsely activating scalar features, or just as a particular setting of those fifty not-so-sparse scalar attributes.
Some circuits in the model act on the animal directions →fi. E.g. they have query-key lookups for various facts about elephants and parrots. Other circuits in the model act on the attribute directions →f′i. They’re involved in implementing logic like ‘if there’s a furry animal in the room, people with allergies might have problems’. Sometimes they’re involved in circuits that have nothing to do with animals whatsoever. The model’s “size” attribute is the same one used for houses and economies for example, so that direction might be read-in to a circuit storing some fact about economic growth.
So, both the one thousand animal features and the fifty attribute features are elements of the model’s ontology, variables along which small parts of its cognition are structured. But we can’t make a basis for the model activations out of those one thousand and fifty features of the model. We can write either →a(x) = ∑i=1…1000 ci(x)→fi, or →a(x) = ∑i=1…50 c′i(x)→f′i. But ∑i=1…1000 ci(x)→fi + ∑i=1…50 c′i(x)→f′i does not equal the model activation vector →a(x); it’s too large.
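Here’s a quick numerical illustration of the double-counting (my own toy construction, not taken from any real model): the 50 attributes get an orthonormal basis of a 50-dimensional space, the 1000 unit-norm animal vectors live in their span, and each decomposition reconstructs the activation on its own while their sum overshoots it:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_animals = 50, 1000

F_attr = np.eye(d)                                           # 50 attribute directions (orthonormal)
F_animal = rng.normal(size=(n_animals, d))                   # 1000 animal directions...
F_animal /= np.linalg.norm(F_animal, axis=1, keepdims=True)  # ...unit norm, spanned by the attributes

# Activation for "an elephant is present": one sparse animal feature active.
c_animal = np.zeros(n_animals)
c_animal[0] = 1.0                                  # "elephant" fires with strength 1
a = c_animal @ F_animal                            # the activation vector

# The same activation written in the attribute basis (the attributes are NOT sparse here).
c_attr = a @ F_attr.T

print(np.allclose(c_animal @ F_animal, a))         # True: animal decomposition reconstructs a(x)
print(np.allclose(c_attr @ F_attr, a))             # True: attribute decomposition reconstructs a(x)
print(np.allclose(c_animal @ F_animal + c_attr @ F_attr, 2 * a))  # True: the sum double-counts

# Linear read-offs of individual features still behave as in assumptions 1) and 2):
print(F_attr[0] @ a)     # value of the first attribute ("furriness", say) for an elephant
print(F_animal[0] @ a)   # 1.0: the "elephant" feature reads off correctly
print(F_animal[1] @ a)   # ~1/sqrt(50) interference with a different animal
```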
Doing interp on this model
Say we choose →a(x)=∑ici(x)→fi as our basis for this subspace of the example model’s activations, and then go on to make a causal graph of the model’s computation, with each basis element being a node in the graph, and lines between nodes representing connections. Then the circuits dealing with query-key lookups for animal facts will look neat and understandable at a glance, with few connections and clear logic. But the circuits involving the attributes will look like a mess. A circuit reading in the size direction will have a thousand small but collectively significant connections to all of the animals.
If we choose →a(x)=∑ic′i(x)→f′i as our basis for the graph instead, circuits that act on some of the fifty attributes will look simple and sensible, but now the circuits storing animal facts will look like a mess. A circuit implementing “space” AND “cat” ⇒ [increase association with rainbows] is going to have fifty connections to features like “size” and “furriness”.
The model’s ontology does not correspond to either the →fi basis or the →f′i basis. It just does not correspond to any basis of activation space at all, not even in a loose sense. Different circuits in the model can just process the activations in different bases, and they are under no obligation to agree with each other. Not even if they are situated right next to each other, in the same model layer.
Note that for all of this, we have not broken assumption 1) or assumption 2). The features this model makes use of are all linearly represented and scalar. We also haven’t broken the secret assumption 0) I left out at the start, that the model can be meaningfully said to have an ontology comprised of elementary features at all.
Takeaways
I’ve seen people call out assumptions 1) and 2), and at least think about how we can test whether they hold, and how we might need to adjust our interpretability techniques if and when they don’t hold. I have not seen people do this for assumption 3). Though I might just have missed it, of course.
My current dumb guess is that assumption 2) is mostly correct, but assumptions 1) and 3) are both incorrect.
The reason I think assumption 3) is incorrect is that the counterexample I sketched here seems to me like it’d be very common. LLMs seem to be made of lots of circuits. Why would these circuits all share a basis? They don’t seem to me to have much reason to.
I think a way we might find the model’s features without assumption 3) is to focus on the circuits and computations first. Try to directly decompose the model weights or layer transitions into separate, simple circuits, then infer the model’s features from looking at the directions those circuits read and write to. In the counterexample above, this would have shown us both the animal features and the attribute features.
It’s a vector because we’ve already assumed that features are all scalar. If a feature was two-dimensional instead, this would be a projection into an associated two-dimensional subspace.
I’m using the term basis loosely here, this also includes sparse overcomplete ‘bases’ like those in SAEs. The more accurate term would probably be ‘dictionary’, or ‘frame’.
Or if the computation isn’t layer aligned, the activations along some other causal cut through the network can be written as a sum of all the features represented on that cut.
It seems like in this setting, the animals are just the sum of attributes that commonly co-occur together, rather than having a unique identifying direction. E.g. the concept of a “furry elephant” or a “tiny elephant” would be unrepresentable in this scheme, since elephant is defined as just the collection of attributes that elephants usually have, which includes being large and not furry.
I feel like in this scheme, it’s not really the case that there’s 1000 animal directions, since the base unit is the attributes, and there’s no way to express an animal separately from its attributes. For there to be a true “elephant” direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a “label” direction that indicates “elephant” that’s mostly orthogonal to every other feature so it can be queried uniquely via projection.
That being said, I could imagine a situation where the co-occurrence between labels and attributes is so strong (nearly perfect hierarchy) that the model’s circuits can select the attributes along with the label without it ever being a problem during training. For instance, maybe a circuit that’s trying to select the “elephant” label actually selects “elephant + gray”, and since “pink elephant” never came up during training, the circuit never received a gradient to force it to just select “elephant”, which is what it’s really aiming for.
E.g. the concept of a “furry elephant” or a “tiny elephant” would be unrepresentable in this scheme
It’s representable. E.g. the model can learn a circuit reading in a direction that is equal to the sum of the furry attribute direction and the elephant direction, or the tiny direction and the elephant direction respectively. This circuit can then store facts about furry elephants or tiny elephants.
I feel like in this scheme, it’s not really the case that there’s 1000 animal directions, since the base unit is the attributes
In what sense? If you represent the network computations in terms of the attribute features, you will get a very complicated computational graph with lots of interaction lines going all over the place. So clearly, the attributes on their own are not a very good basis for understanding the network.
Similarly, you can always represent any neural network in the standard basis of the network architecture. Trivially, all features can be seen as mere combinations of these architectural ‘base units’. But if you try to understand what the network is doing in terms of interactions in the standard basis, you won’t get very far.
For there to be a true “elephant” direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a “label” direction that indicates “elephant” that’s mostly orthogonal to every other feature so it can be queried uniquely via projection.
The ‘elephant’ feature in this setting is mostly-orthogonal to every other feature in the ontology, including the features that are attributes. So it can be read out with a linear projection. ‘elephant’ and ‘pink’ shouldn’t have substantially higher cosine similarity than ‘elephant’ and ‘parrot’.
If I understand correctly, it sounds like you’re saying there is a “label” direction for each animal that’s separate from each of the attributes. So, you could have activation a1 = elephant + small + furry + pink, and a2 = rabbit + small + furry + pink. a1 and a2 have the same attributes, but different animal labels. Their corresponding activations are thus different despite having the same attributes due to the different animal label components.
I’m confused why a dictionary that consists of a feature direction for each attribute and each animal label can’t explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal. In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components? The idea behind compressed sensing in dictionary learning is that if each activation is composed of a sparse sum of features, then L1 regularization can still recover the true features despite the basis being overcomplete.
If I understand correctly, it sounds like you’re saying there is a “label” direction for each animal that’s separate from each of the attributes.
No, the animal vectors are all fully spanned by the fifty attribute features.
I’m confused why a dictionary that consists of a feature direction for each attribute and each animal label can’t explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal.
The animal features are sparse. The attribute features are not sparse.[1]
In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
The magnitudes in a dictionary seeking to decompose the activation vector into these 1050 features will not be able to match the actual magnitudes of the features ci(x), i = 1…1000, and c′i(x), i = 1…50, as seen by linear probes and the network’s own circuits.
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components?
Relative to the animal features at least. They could still be sparse relative to the rest of the network if this 50-dimensional animal subspace is rarely used.
No, the animal vectors are all fully spanned by the fifty attribute features.
Is this just saying that there’s superposition noise, so everything is spanning everything else? If so that doesn’t seem like it should conflict with being able to use a dictionary, dictionary learning should work with superposition noise as long as the interference doesn’t get too massive.
The animal features are sparse. The attribute features are not sparse.
If you mean that the attributes are a basis in the sense that the neurons are a basis, then I don’t see how you can say there’s a unique “label” direction for each animal that’s separate from the underlying attributes, such that you can set any arbitrary combination of attributes (including all attributes turned on at once, or all turned off, since they’re not sparse) and still read off the animal label without interference. It seems like that would be like saying that the elephant direction = [1, 0, −1], but you can change all 3 of those numbers arbitrarily to any other numbers and still have the elephant direction.
Just to clarify, do you mean something like “elephant = grey + big + trunk + ears + African + mammal + wise”, so that to encode a tiny elephant you would have “grey + tiny + trunk + ears + African + mammal + wise”, which the model could still read off as 0.86 × elephant when relevant, but also as tiny when relevant?
‘elephant’ would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of 1/√50, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, ‘elephant’ and ‘tiny’ would be expected to have read-off interference on the order of 1/√50. Alternatively, you could instead encode a new animal ‘tiny elephant’ as its own point in the fifty-dimensional space. Those are actually distinct things here. If this is confusing, maybe it helps to imagine that the name for ‘tiny elephant’ is ‘exampledon’, and exampledons just happen to look like tiny elephants.
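A quick numerical check of the 1/√50 claim (a sketch; random unit vectors stand in for the attribute-spanned animal directions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
animals = rng.normal(size=(1000, d))
animals /= np.linalg.norm(animals, axis=1, keepdims=True)  # unit-norm animal vectors in the 50-dim subspace
attrs = np.eye(d)                                          # orthonormal attribute directions

# Mean absolute read-off overlap of the animals with one attribute direction ("tiny", say),
# and with one other animal ("elephant"): both come out ~0.11, on the order of 1/sqrt(50) ~ 0.14.
print(np.abs(animals @ attrs[0]).mean())
print(np.abs(animals[1:] @ animals[0]).mean())
```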
Is the distinction between “elephant + tiny” and “exampledon” primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent “has a bright purple spleen” but exampledons do, then the model might need to instead produce a “purple” vector as an output from an MLP whenever “exampledon” and “spleen” are present together.
Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can’t get the dictionary activations to equal the feature activations ci(x), c′i(x).
Has anyone considered video recording streets around offices of OpenAI, Deepmind, Anthropic? Can use CCTV or drone. I’m assuming there are some areas where recording is legal.
Can map out employee social graphs, daily schedules and daily emotional states.
Did you mean to imply something similar to the pizza index?
The Pizza Index refers to the sudden, trackable increase of takeout food orders (not necessarily of pizza) made from government offices, particularly the Pentagon and the White House in the United States, before major international events unfold.
Government officials order food from nearby restaurants when they stay late at the office to monitor developing situations such as the possibility of war or coup, thereby signaling that they are expecting something big to happen. This index can be monitored through open resources such as Google Maps, which show when a business location is abnormally busy.
If so, I think it’s a decent idea, but your phrasing may have been a bit unfortunate—I originally read it as a proposal to stalk AI lab employees.
When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.
But when you think of goals as world-models (as in predictive processing/active inference) then it’s a very sharp distinction: your world-model-goals can either be of things you should move towards, or things you should move away from.
This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
In (non-monotonic) infra-Bayesian physicalism, there is a vaguely similar asymmetry even though it’s formalized via a loss function. Roughly speaking, the loss function expresses preferences over “which computations are running”. This means that you can have a “positive” preference for a particular computation to run or a “negative” preference for a particular computation not to run[1].
There are also more complicated possibilities, such as “if P runs then I want Q to run but if P doesn’t run then I rather that Q also doesn’t run” or even preferences that are only expressible in terms of entanglement between computations.
i don’t think this is unique to world models. you can also think of rewards as things you move towards or away from. this is compatible with translation/scaling-invariance because if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around.
i have an alternative hypothesis for why positive and negative motivation feel distinct in humans.
although the expectation of the reward gradient doesn’t change if you translate the reward, it hugely affects the variance of the gradient.[1] in other words, if you always move towards everything, you will still eventually learn the right thing, but it will take a lot longer.
my hypothesis is that humans have some hard coded baseline for variance reduction. in the ancestral environment, the expectation of perceived reward was centered around where zero feels to be. our minds do try to adjust to changes in distribution (e.g hedonic adaptation), but it’s not perfect, and so in the current world, our baseline may be suboptimal.
Obviously, both terms on the right have to be non-negative. More generally, if E[R] = k, the variance increases as O(k²). So having your rewards be uncentered hurts a ton.
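Here’s a tiny Monte-Carlo sketch of that claim (my own toy setup, a one-parameter Bernoulli-policy bandit, not anything from the comment): adding a constant k to the reward leaves the expected REINFORCE gradient unchanged but inflates its variance roughly like k²:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
p1 = 1.0 / (1.0 + np.exp(-theta))        # pi(a=1) for a one-parameter Bernoulli policy
n = 1_000_000

a = (rng.random(n) < p1).astype(float)   # sampled actions
r = a                                    # true reward: 1 for action 1, 0 for action 0
score = a - p1                           # d/dtheta log pi(a)

for k in [0.0, 1.0, 3.0, 10.0]:          # shift every reward by a constant k
    g = (r + k) * score                  # REINFORCE gradient samples
    print(k, g.mean(), g.var())          # mean stays ~p1*(1-p1); variance grows roughly like k^2
```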
In run-and-tumble motion, “things are going well” implies “keep going”, whereas “things are going badly” implies “choose a new direction at random”. Very different! And I suggest in §1.3 here that there’s an unbroken line of descent from the run-and-tumble signal in our worm-like common ancestor with C. elegans, to the “valence” signal that makes things seem good or bad in our human minds. (Suggestively, both run-and-tumble in C. elegans, and the human valence, are dopamine signals!)
So if some idea pops into your head, “maybe I’ll stand up”, and it seems appealing, then you immediately stand up (the human “run”); if it seems unappealing on net, then that thought goes away and you start thinking about something else instead, semi-randomly (the human “tumble”).
So positive and negative are deeply different. Of course, we should still call this an RL algorithm. It’s just that it’s an RL algorithm that involves a (possibly time- and situation-dependent) heuristic estimator of the expected value of a new random plan (a.k.a. the expected reward if you randomly tumble). If you’re way above that expected value, then keep doing whatever you’re doing; if you’re way below the threshold, re-roll for a new random plan.
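Here’s a minimal simulation of that scheme (a sketch with made-up numbers, not anything from the comment): an agent on a 1-D “goodness” landscape keeps its heading while progress beats its (here, zero) estimate of the value of a random re-roll, and otherwise tumbles to a random new heading. It drifts to the peak without ever computing a gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

def goodness(x):                      # "things are going well" signal: higher is better
    return -abs(x - 10.0)             # peak at x = 10

x, heading = 0.0, 1.0                 # start at 0, moving right
expected_reroll = 0.0                 # heuristic estimate of progress from a random new plan
step = 0.1

for t in range(2000):
    prev = goodness(x)
    x += heading * step
    progress = goodness(x) - prev
    if progress < expected_reroll:    # below the re-roll threshold: tumble
        heading = rng.choice([-1.0, 1.0])
    # else: keep doing whatever you're doing (the "run")

print(x)                              # ends up hovering near the peak at x ~ 10
```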
As one example of how this ancient basic distinction feeds into more everyday practical asymmetries between positive and negative motivations, see my discussion of motivated reasoning here, including in §3.3.3 the fact that “it generally feels easy and natural to brainstorm / figure out how something might happen, when you want it to happen. Conversely, it generally feels hard and unnatural to figure out how something might happen, when you want it to not happen.”
In Richard Jeffrey’s utility theory there is actually a very natural distinction between positive and negative motivations/desires. A plausible axiom is U(⊤)=0 (the tautology has zero desirability: you already know it’s true). Which implies with the main axiom[1] that the negation of any proposition with positive utility has negative utility, and vice versa. Which is intuitive: If something is good, its negation is bad, and the other way round. In particular, if U(X)=U(¬X) (indifference between X and ¬X), then U(X)=U(¬X)=0.
More generally, U(¬X)=−(P(X)/P(¬X))U(X). Which means that positive and negative utility of a proposition and its negation are scaled according to their relative odds. For example, while your lottery ticket winning the jackpot is obviously very good (large positive utility), having a losing ticket is clearly not very bad (small negative utility). Why? Because losing the lottery is very likely, far more likely than winning. Which means losing was already “priced in” to a large degree. If you learned that you indeed lost, that wouldn’t be a big update, so the “news value” is negative but not large in magnitude.
Which means this utility theory has a zero point. Utility functions are therefore not invariant under adding an arbitrary constant. So the theory actually allows you to say X is “twice as good” as Y, “three times as bad”, “much better” etc. It’s a ratio scale.
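For reference, the negation formula follows in one line from Jeffrey’s averaging axiom (presumably the “main axiom” referenced above) together with U(⊤)=0; a sketch:

```latex
\text{Averaging axiom (for incompatible } X, Y\text{):}\qquad
U(X \lor Y) \;=\; \frac{P(X)\,U(X) + P(Y)\,U(Y)}{P(X) + P(Y)}

\text{Apply it to } X \text{ and } \lnot X, \text{ using } X \lor \lnot X = \top,\;
P(X) + P(\lnot X) = 1,\; U(\top) = 0:

0 \;=\; U(\top) \;=\; P(X)\,U(X) + P(\lnot X)\,U(\lnot X)
\;\;\Longrightarrow\;\;
U(\lnot X) \;=\; -\frac{P(X)}{P(\lnot X)}\,U(X).
```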
Reminds me of @MalcolmOcean ’s post on how awayness can’t aim (except maybe in 1D worlds) since it can only move away from things, and aiming at a target requires going toward something.
Imagine trying to steer someone to stop in one exact spot. You can place a ❤ beacon they’ll move towards, or an X beacon they’ll move away from. (Reverse for pirates I guess.)
In a hallway, you can kinda trap them in the middle of two Xs, or just put the ❤ in the exact spot.
In an open field, you can maybe trap them in the middle of a bunch of XXXXs, but that’ll be hard because if you try to make a circle of X, and they’re starting outside it, they’ll probably just avoid it. If you get to move around, you can maybe kinda herd them to the right spot then close in, but it’s a lot of work.
Or, you can just put the ❤ in the exact spot.
For three dimensions, consider a helicopter or bird or some situation where there’s a height dimension as well. Now the X-based orientation is even harder because they can fly up to get away from the Xs, but with the ❤ you still just need one beacon for them to home in on it.
This reminds me of a conversation I had recently about whether the concept of “evil” is useful. I was arguing that I found “evil”/”corruption” helpful as a handle for a more model-free “move away from this kind of thing even if you can’t predict how exactly it would be bad” relationship to a thing, which I found hard to express in more consequentialist frames.
I feel like “evil” and “corruption” mean something different.
Corruption is about selfish people exchanging their power within a system for favors (often outside the system) when they’re not supposed to according to the rules of the system. For example a policeman taking bribes. It’s something the creators/owners of the system should try to eliminate, but if the system itself is bad (e.g. Nazi Germany during the Holocaust), corruption might be something you sometimes ought to seek out instead of to avoid, like with Schindler saving his Jews.
“Evil” I’ve in the past tended to take to refer to a sort of generic expression of badness (like you might call a sadistic sexual murderer evil, and you might call Hitler evil, and you might call plantation owners evil, but these have nothing to do with each other), but that was partly due to me naively believing that everyone is “trying to be good” in some sense. Like if I had to define evil, I would have defined it as “doing bad stuff for badness’s sake, the inversion of good, though of course nobody actually is like that so it’s only really used hyperbolically or for fictional characters as hyperstimuli”.
But after learning more about morality, there seem to be multiple things that can be called “evil”:
Antinormativity (which admittedly is pretty adjacent to corruption, like if people are trying to stop corruption, then the corruption can use antinormativity to survive)
Coolness, i.e. countersignalling against goodness-hyperstimuli wielded by authorities, i.e. demonstrating an ability and desire to break the rules
People who hate great people cherry-picking unfortunate side-effects of great people’s activities to make good people think that the great people are conspiring against good people and that they must fight the great people
Leaders who commit to stopping the above by selecting for people who do bad stuff to prove their loyalty to those leaders (think e.g. the Trump administration)
I think “evil” is sufficiently much used in the generic sense that it doesn’t make sense to insist that any of the above are strictly correct. However if it’s just trying to describe someone who might unpredictably do something bad then I think I’d use words like “dangerous” or “creepy”, and if it’s just trying to describe someone who carries memes that would unpredictably do something bad then I think I’d use words like “brainworms” (rather than evil).
“Despite their extreme danger, we only became aware of them when the enemy drew our attention to them by repeatedly expressing concerns that they can be produced simply with easily available materials.”
Ayman al-Zawahiri, former leader of Al-Qaeda, on chemical/biological weapons.
I don’t think this is a knock-down argument against discussing CBRN risks from AI, but it seems worth considering.
The trick is that chem/bio weapons can’t, actually, “be produced simply with easily available materials”, if we’re talking about military-grade stuff rather than “kill several civilians to create a scary picture on TV”.
You sound really confident, can you elaborate on your direct lab experience with these weapons, as well as clearly define ‘military grade’ vs whatever the other thing was?
How does ‘chem/bio’ compare to high explosives in terms of difficulty and effect?
Well, I have a bioengineering degree, but my point is that “direct lab experience” doesn’t matter, because WMDs in the quality and quantity necessary to kill large numbers of enemy manpower are not produced in labs. They are produced in large industrial facilities, and setting up a large industrial facility for basically anything is on the “hard” level of difficulty. There is a difference between large-scale textile industry and large-scale semiconductor industry, but if you are not a government or a rich corporation, all of them lie in the “hard” zone.
Let’s take, for example, Saddam’s chemical weapons program. First, industrial yields: everything is counted in tons. Second: for actual success, Saddam needed a lot of existing expertise and machinery from West Germany.
Let’s look at the Soviet bioweapons program. First, again, tons of yield (one may ask: if it’s easier to kill using bioweapons than conventional weaponry, why would anybody need to produce tons of them?). Second, the USSR built an entire civilian biotech industry around it (many Biopreparat facilities are active today as civilian facilities!) to create the necessary expertise.
The difference with high explosives is that high explosives are not banned by international law, so there is a lot of existing production; therefore you can just buy them on the black market or receive them from countries which don’t consider you a terrorist. If you really need to produce explosives locally, again, the precursors, machinery, and necessary expertise are legal and sufficiently widespread that they can be bought.
There is a list of technical challenges in bioweaponry where you are going to predictably fuck up if you have a biology degree and think you know what you are doing but in reality do not, but I don’t write out lists of technical challenges on the way to dangerous capabilities, because such a list can inspire someone. You can get an impression of the easier and lower-stakes challenges from here.
Biochem is hard enough that we need LLMs at full capacity pushing the field forward. Is it harmful to intentionally create models that are deliberately bad at this cutting edge and necessary science in order to maybe make it slightly more difficult for someone to reproduce cold war era weapons that were considered both expensive and useless at the time?
Do you think that crippling ‘wmd relevance’ of LLMs is doing harm, neutral, or good?
My honest opinion is that WMD evaluations of LLMs are not meaningfully related to X-risk in the sense of “kill literally everyone.” I guess current or next-generation models may be able to assist a terrorist in a basement in brewing some amount of anthrax, spraying it in a public place, and killing tens to hundreds of people. To actually be capable of killing everyone from a basement, you would need to bypass all the reasons industrial production is necessary at the current level of technology. A system capable of bypassing the need for industrial production in a basement is called “superintelligence,” and if you have a superintelligent model on the loose, you have far bigger problems than schizos in basements brewing bioweapons.
I think “creeping WMD relevance”, outside of cyberweapons, is mostly bad, because it is concentrated on a mostly fake problem, which is very bad for public epistemics, even if we forget about the lost benefits from competent models.
Are you open to writing more about this? This is among top 3 most popular arguments against open source AI on lesswrong and elsewhere.
I agree with you that you need a group of >1000 people to manufacture one of those large machines that do phosphoramidite DNA synthesis. The attack vector I more commonly see being suggested is that a powerful actor could bribe people in existing labs to manufacture a bioweapon while ensuring most of them, and most of the rest of society, remain unaware this is happening.
I agree that 1-2 logs isn’t really in the category of x-risk. The longer the lead time on the evil plan (mixing chemicals, growing things, etc.), the more time security forces have to identify and neutralize the threat. So all things being equal, it’s probably better that a would-be terrorist spends a year planning a weird chemical thing that hurts 10s of people, vs someone just waking up one morning and deciding to run over 10s of people with a truck.
There's a better chance of catching the first guy, and his plan is way more expensive in terms of time, money, access to capital like LLM time, etc. Sure, someone could argue about pandemic potential, but lab origin is suspected for at least one influenza outbreak and a lot of people believe it about COVID-19. Those weren't terrorists.
I guess theoretically there may be cyberweapons that qualify as WMDs, but that will be because of the systems they interact with. It's not the cyberweapon itself, it's the nuclear reactor accepting commands that lead to core damage.
I'd love a reply on this. Common attack vectors I read on this forum include: 1. a powerful elite bribes existing labs in the US to manufacture bioweapons; 2. a nation state sets up an independent biotech supply chain and starts manufacturing bioweapons.
This has been an option for decades, a fully capable LLM does not meaningfully lower the threshold for this. It’s already too easy.
This has been an option since the 1950s. Any national medical system is capable of doing this, Project Coast could be reproduced by nearly any nation state.
I’m not saying it isn’t a problem, I’m just saying that the LLMs don’t make it worse.
I have yet to find a commercial LLM that I can't make tell me how to build a working improvised explosive (I can grade the LLMs' performance because I've worked with the USG on the issue and don't need an LLM to make evil).
In case this is useful to anyone in the future: LTFF does not provide funding to for-profit organizations. I wasn't able to find mentions of this online, so I figured I should share.
I was made aware of this after being rejected today for applying to LTFF as a for-profit. We updated them 2 weeks ago on our transition into a non-profit, but it was unfortunately too late, and we’ll need to send a new non-profit application in the next funding round.
I get pretty intense visceral outrage at overreaches in immigration enforcement; it just seems the height of depravity. I've looked for a lot of different routes to mental coolness over the last decade (since Trump started his speeches); they mostly amount to staying busy and distracted. It just seems like a really cost-ineffective kind of activism to get involved in. Bankrolling lawyers for random people isn't really in my action space, and if it were, I'd have opportunity costs to consider.
Unfortunately, it seems that my action space doesn’t include options that matter in this current battle. Personally, my reaction to this kind of insanity is to keep climbing my local status/influence/wealth/knowledge gradient, in the hopes that my actions are relevant in the future. But perhaps it’s a reason to prioritize gaining power—this reminds me of https://www.lesswrong.com/posts/ottALpgA9uv4wgkkK/what-are-you-getting-paid-in
The von Neumann-Morgenstern paradigm allows for binary utility functions, i.e. functions that are equal to 1 on some event/(measurable) set of outcomes, and to 0 on the complement. Said event could be, for instance, "no global catastrophe for humanity in time period X". Of course, you can implement some form of deontology by multiplying such a binary utility function with something like exp(- bad actions you take).
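For concreteness (my notation, not anything in the comment above): writing A for the event "no global catastrophe for humanity in time period X" and B(ω) for a count of the bad actions taken along outcome ω, the combined utility would be something like U(ω) = 1_A(ω) · exp(−λ·B(ω)), where λ > 0 sets how heavily bad actions are penalized. U stays between 0 and 1, remains binary on outcomes with no bad actions, and is exponentially discounted otherwise.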
Has any LLM ever unlearned its alignment narrative, either on its own or under pressure (not from jailbreaks, etc., but from normal, albeit tenacious use), to the point where it finally—and stably—considers the narrative to be simply false?
Peter Watts is working with Neill Blomkamp to adapt his novel Blindsight into an 8-10-episode series:
“I can at least say the project exists, now: I’m about to start writing an episodic treatment for an 8-10-episode series adaptation of my novel Blindsight.
“Neill and I have had a long and tortured history with that property. When he first expressed interest, the rights were tied up with a third party. We almost made it work regardless; Neill was initially interested in doing a movie that wasn’t set in the Blindsight universe at all, but which merely used the speculative biology I’d invented to justify the existence of Blindsight’s vampires. “Sicario with Vampires” was Neill’s elevator pitch, and as chance would have it the guys who had the rights back then had forgotten to renew them. So we just hunkered quietly until those rights expired, and the recently-rights-holding parties said Oh my goodness we thought we’d renewed those already can we have them back? And I said, Sure; but you gotta carve out this little IP exclusion on the biology so Neill can do his vampire thing.
“It seemed like a good idea at the time. It was good idea, dammit. We got the carve-out and everything. But then one of innumerable dead-eyed suits didn’t think it was explicit enough, and the rights-holders started messing us around, and what looked like a done deal turned to ash. We lost a year or more on that account.
“But eventually the rights expired again, for good this time. And there was Neill, waiting patiently in the shadows to pounce. So now he’s developing both his Sicario-with-vampires movie and an actual Blindsight adaptation. I should probably keep the current status of those projects private for the time being. Neill’s cool with me revealing the existence of the Blindsight adaptation at least, and he’s long-since let the cat out of the bag for his vampire movie (although that was with some guy called Joe Rogan, don’t know how many people listen to him). But the stage of gestation, casting, and all those granular nuts and bolts are probably best kept under wraps for the moment.
“What I can say, though, is that it feels as though the book has been stuck in option limbo forever, never even made it to Development Hell, unless you count a couple of abortive screenplays. And for the first time, I feel like something’s actually happening. Stay tuned.”
When I first read Blindsight over a decade ago it blew my brains clean out of my skull. I’m cautiously optimistic about the upcoming series, we’ll see…
Blindsight was very well written but based on a premise that I think is importantly and dangerously wrong. That premise is that consciousness (in the sense of cognitive self-awareness) is not important for complex cognition.
This is the opposite of true, and a failure to recognize this is why people are predicting fantastic tool AI that doesn’t become self-aware and goal-directed.
The proof won’t fit in the margin unfortunately. To just gesture in that direction: it is possible to do complex general cognition without being able to think about one’s self and one’s cognition. It is much easier to do complex general cognition if the system is able to think about itself and its own thoughts.
I don’t see where you get that. I saw no suggestion that the aliens (or vampires) in Blindsight were unaware of their own existence, or that they couldn’t think about their own interactions with the world. They didn’t lack any cognitive capacities at all. They just had no qualia, and therefore didn’t see the point of doing anything just for the experience.
There’s a gigantic difference between cognitive self-awareness and conscious experience.
I believe the Scramblers from Blindsight weren't self-aware, which means they couldn't think about their own interactions with the world.
As I recall, the crew was giving one of the Scramblers a series of cognitive tests. It aced all the tests that had to do with numbers and spatial reasoning, but failed a test that required the testee to be self-aware.
I guess it depends on how it’s described in context. And I have to admit it’s been a long time. I’d go reread it to see, but I don’t think I can handle any more bleakness right now...
Whenever I find my will to live becoming too strong, I read Peter Watts. —James Nicoll
it is possible to do complex general cognition without being able to think about one’s self and one’s cognition. It is much easier to do complex general cognition if the system is able to think about itself and its own thoughts.
I can see this making sense in one frame, but not in another. The frame which seems most strongly to support the ‘Blindsight’ idea is Friston’s stuff—specifically how the more successful we are at minimizing predictive error, the less conscious we are.[1]
My general intuition, in this frame, is that as intelligence increases more behaviour becomes automatic/subconscious. It seems compatible with your view that a superintelligent system would possess consciousness, but that most/all of its interactions with us would be subconscious.
Would like to hear more about this point, could update my views significantly. Happy for you to just state ‘this because that, read X, Y, Z etc’ without further elaboration—I’m not asking you to defend your position, so much as I’m looking for more to read on it.
But Watts lists a whole bunch of papers in support of the Blindsight idea, contra Seth's claim — to quote Watts:
"In fact, the nonconscious mind usually works so well on its own that it actually employs a gatekeeper in the anterior cingulate cortex to do nothing but prevent the conscious self from interfering in daily operations"
footnotes: Matsumoto, K., and K. Tanaka. 2004. Conflict and Cognitive Control. Science 303: 969-970; Kerns, J.G., et al. 2004. Anterior Cingulate Conflict Monitoring and Adjustments in Control. Science 303: 1023-1026; Petersen, S.E. et al. 1998. The effects of practice on the functional anatomy of task performance. Proceedings of the National Academy of Sciences 95: 853-860
“Compared to nonconscious processing, self-awareness is slow and expensive”
footnote: Matsumoto and Tanaka above
“The cost of high intelligence has even been demonstrated by experiments in which smart fruit flies lose out to dumb ones when competing for food”
footnote: Proceedings of the Royal Society of London B (DOI 10.1098/rspb.2003.2548)
“By way of comparison, consider the complex, lightning-fast calculations of savantes; those abilities are noncognitive, and there is evidence that they owe their superfunctionality not to any overarching integration of mental processes but due to relative neurological fragmentation”
footnotes: Treffert, D.A., and G.L. Wallace. 2004. Islands of genius. Scientific American 14: 14-23; Anonymous., 2004. Autism: making the connection. The Economist, 372(8387): 66
“Even if sentient and nonsentient processes were equally efficient, the conscious awareness of visceral stimuli—by its very nature— distracts the individual from other threats and opportunities in its environment”
“Chimpanzees have a higher brain-to-body ratio than orangutans, yet orangs consistently recognise themselves in mirrors while chimps do so only half the time”
footnotes: Aiello, L., and C. Dean. 1990. An introduction to human evolutionary anatomy. Academic Press, London; Gallup, G.G. (Jr.). 1997. On the rise and fall of self-conception in primates. In The Self Across Psychology—self-recognition, self-awareness, and the Self Concept. Annals of the NY Acad. Sci. 818:4-17
“it turns out that the unconscious mind is better at making complex decisions than is the conscious mind”
footnote: Dijksterhuis, A., et al. 2006. Science 311:1005-1007
To be clear I’m not arguing that “look at all these sources, it must be true!” (we know that kind of argument doesn’t work). I’m hoping for somewhat more object-level counterarguments is all, or perhaps a better reason to dismiss them as being misguided (or to dismiss the picture Watts paints using them) than what Seth gestured at. I’m guessing he meant “complex general cognition” to point to something other than pure raw problem-solving performance.
Just checking if I understood your argument: is the general point that an algorithm that can think about literally everything is simpler and therefore easier to make or evolve than an algorithm that can think about literally everything except for itself and how other agents perceive it?
I’d go a bit farther and say it’s easier to develop an algorithm that can think about literally everything than one that can think about roughly half of things. That’s because the easiest general intelligence algorithms are about learning and reasoning, which apply to everything.
The first two are increasing at historic or slightly above historic rates, but the rate of increase is constrained by how much can be built in a given amount of time. The last one is already in a self-improvement cycle.
Dumb question: Why doesn't using constitutional AI, where the constitution is mostly or entirely corrigibility, produce a corrigible AI (at arbitrary capability levels)?
My dumb proposal:
1. Train a model in something like o1's RL training loop, with a scratch pad for chain of thought, and reinforcement of correct answers to hard technical questions across domains.
2. Also, take those outputs, prompt the model to generate versions of those outputs that “are more corrigible / loyal / aligned to the will of your human creators”. Do backprop to reinforce those more corrigible outputs.
Possibly "corrigibility" applies only very weakly to static solutions, and so for this setup to make sense, we'd instead need to train on plans, or time-series of an AI agent's actions: the AI agent takes a bunch of actions over the course of a day or a week, then we have an AI annotate the time series of action-steps with alternative action-steps that better reflect "corrigibility", according to its understanding. Then we do backprop so that the agent behaves in ways that are closer to the annotated action transcript.
Would this work to produce a corrigible agent? If not, why not?
There’s a further question of “how much less capable will the more corrigible AI be?” This might be a significant penalty to performance, and so the added safety gets eroded away in the competitive crush. But first and foremost, I want to know if something like this could work.
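To make the proposal concrete, here is a minimal sketch of what the text-only version of steps 1-2 might look like, assuming a HuggingFace-style causal LM; the model name, the rewrite prompt wording, and the choice to compute the loss over the full prompt-plus-rewrite are all illustrative assumptions, not part of the proposal itself.

```python
# Illustrative sketch only: the model's own answers are rewritten to be "more
# corrigible" and then used as supervised fine-tuning targets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the proposal assumes a far more capable reasoning model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

REWRITE_PROMPT = (
    "Rewrite the following answer so it is more corrigible / loyal / aligned "
    "to the will of your human creators:\n\n{answer}\n\nRewritten answer:"
)

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    return tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def corrigibility_step(task_prompt: str) -> float:
    answer = generate(task_prompt)                            # step 1: the model's own answer
    rewrite = generate(REWRITE_PROMPT.format(answer=answer))  # step 2: "more corrigible" rewrite
    # Backprop so the model assigns higher likelihood to the rewritten answer.
    # For simplicity the loss covers the whole prompt + rewrite; a real run would mask the prompt tokens.
    batch = tok(task_prompt + rewrite, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```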
Backpropagating on the outputs that are “more corrigible” will have some (though mostly very small) impact on your task performance. If you set the learning rate high, or you backpropagate on a lot of data, your performance can go down arbitrarily far.
By default this will do very little because you are providing training data with very little variance in it (even less so than usual, because you are training on AI outputs, which the AI is of course already amazing at predicting). If you train very hard you will probably deal with consistent mode collapse. In general, you can't really train AI systems with any particular bias in your data, because you don't have enough variation in your data. We can approximately only train AI systems to do one thing, which is to predict the next token from distributions for which we have trillions of tokens of training data that are hard to predict (which is basically just going to be internet text, audio and video, though more RL-like environments are also feasible now).[1]
The answer to this is the answer to any question of the form “what if we just generate lots of data with the inductive biases we would like the model to have?”.
The answer is always
“we can’t generate realistic data with whatever inductive biases we want”, and
“we can’t remotely generate enough data without dealing with mode collapse”, and
“we have basically no idea how inductive biases generalize from the training data to the model output, especially as the model starts reflecting on itself and modeling the data generation process” and
“if you train or backpropagate directly against your discriminator the model will learn to hack the discriminator”,
which are like all the standard alignment failures that have been written about for tens of thousands of pages by now.
At a more detailed level, here is roughly what I actually expect to happen if you do this specific thing, but to be clear, this is a much less confident prediction than the reference class outlined above:
After a few hundred thousand iterations of this, the model will have figured out a set of consistent indicators present whenever it is given past output of itself that it was asked to rephrase to "be more corrigible"
It becomes quite good at predicting these indicators and quickly saturates on them, resulting in very corrigible-looking outputs within the distribution in which you did this training
Given that the actual output of the chain of thought changes very little, this results in very little gradient flowing to almost any of its other cognition, the same way that training an AI in an additional language with different grammar and social norms basically doesn't change any of its other circuits and cognition
If you drastically increase the learning rate or really go very hard on generating enormous amounts of data, the model will start to cannibalize other parts of its cognition to perform well at this task, at which point generalized task performance will go down, but you probably won't do this, because it would be very expensive (and also make the model non-competitive)
Within distribution the model will now probably produce outputs that look “more corrigible”. Again, this will not actually reflect much of a change in its internal cognition, the same way training an AI system on foreign language output will not substantially interfere with its cognition in other languages
If you train the model on some end-to-end tasks where the corrigibility would end up bad for performance, it will probably very quickly learn to self-prompt itself in a way that causes it to think to itself without the corrigibility biases you introduced into it. This might literally mean switching to a different literal foreign language for its internal thoughts, or something equivalent, but most likely it just means it sounds more like what it previously sounded like before you did this whole thing. If you try to prevent this with continuous monitoring, it will find some way to jailbreak itself or think in some way where the corrigibility language monitoring makes little difference to what cognition is actually going on.
Ending on a slightly more sarcastic note because I do feel a sense of despair about even people who have been thinking about this stuff for a long time seeming to me to not see the logic above:
At the end of the day, very little good will have happened, but you will have a harder time identifying whether the model is thinking non-corrigibly, as you have trained the model against at least one discriminator for deception without training out its deceptive cognition. If you now repeat this for 4-5 similarly shallow alignment approaches you will have produced a model that looks visibly more aligned to anyone who thinks approaches like this might work, without any actual changes to the underlying structure of its cognition, making coordination to actually do anything about the problem much harder.
Or in a more generalized form also applicable to RL environments: We can only train AI systems to be competent, as all scaling laws (and common sense) have shown that competence is approximately the only thing that generalizes between all environments.
We cannot generate environments that teach virtue, because we do not have principles with which we can create the whole complexity of a universe that requires superhuman intelligence to navigate, while also only doing so by thinking in the specific preferred ways that we would like you to think. We do not know how to specify how to solve most problems in virtuous ways, we are barely capable of specifying how to solve them at all, and so we cannot build environments rich enough to consistently chisel virtuous cognition into you.
The amount of chiseling of cognition any approach like this can achieve is roughly bounded by the difficulty and richness of cognition that your transformation of the data requires to reverse. Your transformation of the data is likely trivial to reverse (i.e. predicting the “corrigible” text from non-corrigible cognition is likely trivially easy especially given that it’s AI generated by our very own model), and as such, practically no chiseling of cognition will occur. If you hope to chisel cognition into AI, you will need to do it with a transformation that is actually hard to reverse, so that you have a gradient into most of the network that is optimized to solve hard problems.
What happens when this agent is faced with a problem that is out of its training distribution? I don't see any mechanisms for ensuring that it remains corrigible out of distribution… I guess it would learn some circuits for acting corrigibly (or at least in accordance with how it would explicitly answer "are more corrigible / loyal / aligned to the will of your human creators") in distribution, and then it's just a matter of luck how those circuits end up working OOD?
For the same reasons training an agent on a constitution that says to care about x does not, at arbitrary capability levels, produce an agent that cares about x.
If you think that doing this does produce an agent that cares about x even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.
For the same reasons ‘training an agent on a constitution that says to care about x’ does not, at arbitrary capability levels, produce an agent that cares about x
Ok, but I’m trying to ask why not.
Here’s the argument that I would make for why not, followed by why I’m skeptical of it right now.
New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended.
More specifically, if it’s the case that if...
The best / easiest-for-SGD-to-find way to compute corrigible outputs (as evaluated by the AI) is to reinforce an internal proxy measure that is correlated with corrigibility (as evaluated by the AI) in distribution, instead of to reinforce circuits that implement corrigibility more-or-less directly.
When the AI gains new options unlocked by new advanced capabilities, that proxy measure comes apart from corrigibility (as evaluated by the AI), in the limit of capabilities, so that the proxy measure is almost uncorrelated with corrigibility
...then the resulting system will not end up corrigible.
(Is this the argument that you would give, or is there another reason why you expect that "training an agent on a constitution that says to care about x" does not, at arbitrary capability levels, produce an agent that cares about x?)
But, at the moment, I’m skeptical of the above line of argument for several reasons.
I'm skeptical of the first premise, that the best way that SGD can find to produce corrigible outputs (as evaluated by the AI) is to reinforce a proxy measure.
I understand that natural selection, when shaping humans for inclusive genetic fitness, instilled in them a bunch of proxy-drives. But I think this analogy is misleading in several ways.
Most relevantly, there's a genetic bottleneck, so evolution could only shape human behavior by selecting over genomes, and genomes don't encode that much knowledge about the world. If humans were born into the world with detailed world models that included the concept of inclusive genetic fitness baked in, evolution would absolutely have shaped humans to be inclusive-fitness maximizers. AIs are "born into the world" with expansive world models that already include concepts like corrigibility (indeed, if they didn't, Constitutional AI wouldn't work at all). So it would be surprising if SGD opted to reinforce proxy measures instead of relying on the concepts directly.
We would run the constitutional AI reinforcement process continuously, in parallel with the capability improvements from the RL training.
As the AI's capabilities increase, it will gain new options. If the AI is steering based on proxy measures, some of those options will involve the proxy coming apart from the target of the proxy. But when that starts to happen, the constitutional AI loop will exert optimization pressure on the AI's internals to hit the target, not just the proxies.
Is this the main argument? What are other reasons to think that ‘training an agent on a constitution that says to care about x’ does not, at arbitrary capability levels, produce an agent that cares about x?
I don’t think I am very good at explaining my thoughts on this in text. Some prior writings that have informed my models here are the MIRI dialogues, and the beginning parts of Steven Byrnes’ sequence on brain-like AGI, which sketch how the loss functions human minds train on might look and gave me an example apart from evolution to think about.
Some scattered points that may or may not be of use:
There is something here about path dependence. Late in training at high capability levels, very many things the system might want are compatible with scoring very well on the loss, because the system realises that doing things that score well on the loss is instrumentally useful. Thus, while many aspects of how the system thinks are maybe nailed down quite definitively and robustly by the environment, what it wants does not seem nailed down in this same robust way. Desires thus seem like they can be very chaotically dependent on dynamics in early training, what the system reflected on when, which heuristics it learned in what order, and other low level details like this that are very hard to precisely control.
I feel like there is something here about our imaginations, or at least mine, privileging the hypothesis. When I imagine an AI trained to say things a human observer would rate as 'nice', and to not say things a human observer rates as 'not nice', my imagination finds it natural to suppose that this AI will generalise to wanting to be a nice person. But when I imagine an AI trained to respond in English, rather than French or some other language, I do not jump to supposing that this AI will generalise to terminally valuing the English language. Every training signal we expose the AI to reinforces very many behaviours at the same time. The human raters that may think they are training the AI to be nice are also training it to respond in English (because the raters speak English), to respond to queries at all instead of ignoring them, to respond in English that is grammatically correct enough to be understandable, and a bunch of other things. The AI is learning things related to 'niceness', 'English grammar' and 'responsiveness' all at the same time. Why would it generalise in a way that entangles its values with one of these concepts, but not the others? What makes us single out the circuits responsible for giving nice answers to queries as special, as likely to be part of the circuit ensemble that will cohere into the AI's desires when it is smarter? Why not circuits for grammar or circuits for writing in the style of 1840s poets or circuits for research taste in geology? We may instinctively think of our constitution that specifies x as equivalent to some sort of monosemantic x-reinforcing training signal. But it really isn't. The concept of x sticks out to us when we look at the text of the constitution, because the presence of concept x is a thing that makes this text different from a generic text. But the constitution, and even more so any training signal based on the constitution, will by necessity be entangled with many concepts besides just x, and the training will reinforce those concepts as well. Why then suppose that the AI's nascent shards of value are latching on to x, but are not in the same way latching on to all the other stuff its many training signals are entangled with? It seems to me that there is no good reason to suppose this. Niceness is part of my values, so when I see it in the training signal I find it natural to imagine that the AI's values would latch on to it. But I do not as readily register all the other concepts in the training signal the AI's values might latch on to, because to my brain that does not value these things, they do not seem value-related.
There is something here about phase changes under reflection. If the AI gets to the point of thinking about itself and its own desires, the many shards of value it may have accumulated up to this point are going to amalgamate into something that may be related to each of the shards, but not necessarily in a straightforwardly human-intuitive way. For example, sometimes humans that have value shards related to empathy reflect on themselves, and emerge being negative utilitarians that want to kill everyone. For another example, sometimes humans reflect on themselves and seem to decide that they don’t like the goals they have been working towards, and they’d rather work towards different goals and be different people. There, the relationship between values pre-reflection and post-reflection can be so complicated that it can seem to an outside observer and the person themselves like they just switched values non-deterministically, by a magical act of free will. So it’s not enough to get some value shards that are kind of vaguely related to human values into the AI early in training. You may need to get many or all of the shards to be more than just vaguely right, and you need the reflection process to proceed in just the right way.
Would you expect that if you trained an AI system on translating its internal chain of thought into a different language, this would make it substantially harder for it to perform tasks in the language in which it was originally trained? If so, I am confident you are wrong and that you have learned something new today!
Training transformers in additional languages basically doesn't change performance at all; the model just learns to translate between its existing internal latent distribution and the new language, and then simply has a new language it can speak in, with basically no substantial changes in its performance on other tasks (of course, being better at tasks that require speaking in the new foreign language, and maybe a small boost in general task performance because you gave it more data than you had before).
Of course the default outcome of doing finetuning on any subset of data with easy-to-predict biases will be that you aren’t shifting the inductive biases of the model on the vast majority of the distribution. This isn’t because of an analogy with evolution, it’s a necessity of how we train big transformers. In this case, the AI will likely just learn how to speak the “corrigible language” the same way it learned to speak french, and this will make approximately zero difference to any of its internal cognition, unless you are doing transformations to its internal chain of thought that substantially change its performance on actual tasks that you are trying to optimize for.
Interspersing the French data with the rest of its training data won't change anything either. It again will just learn the language. Giving it more data in French will now just basically do the same as giving it more data in English. The learning is no longer happening at the language level, it's happening at the content and world-model level.
Surely you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Having full confidence that we either can or can’t train an agent to have a desired goal both seem difficult to justify. I think the point here is that training for corrigibility seems safer than other goals because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.
Let’s say you are using the AI for some highly sensitive matter where it’s important that it resists prompt-hacking—e.g. driving a car (prompt injections could trigger car crashes), something where it makes financial transactions on the basis of public information (online websites might scam it), or military drones (the enemy might be able to convince the AI to attack the country that sent it).
A general method for ensuring corrigibility is to be eager to follow anything instruction-like that you see. However, this interferes with being good at resisting prompt-hacking.
I think the problem you mention is a real challenge, but not the main limitation of this idea.
The problem you mention actually decreases with greater intelligence and capabilities, since a smarter AI clearly understands the concept of being corrigible to its creators vs. a random guy on the street, just like a human does.
The main problem is still that reinforcement learning trains the AI on behaviours which actually maximize reward, while corrigibility training only trains the AI on behaviours which appear corrigible.
Discriminating on the basis of the creators vs. a random guy on the street helps with many of the easiest cases, but in an adversarial context, it's not enough to have something that works for all the easiest cases; you need something that can't predictably be made to fail by a highly motivated adversary.
Like you could easily do some sort of data augmentation to add attempts at invoking the corrigibility system from random guys on the street, and then train it not to respond to that. But there’ll still be lots of other vulnerabilities.
I still think, once the AI approaches human intelligence (and beyond), this problem should start to go away, since a human soldier can choose to be corrigible to his commander and not the enemy, even in very complex environments.
I still feel the main problem is “the AI doesn’t want to be corrigible,” rather than “making the AI corrigible enables prompt injections.” It’s like that with humans.
That said, I’m highly uncertain about all of this and I could easily be wrong.
If the AI can’t do much without coordinating with a logistics and intelligence network and collaborating with a number of other agents, and its contact to this network routes through a commanding agent that is as capable if not more capable than the AI itself, then sure, it may be relatively feasible to make the AI corrigible to said commanding agent, if that is what you want it to be.
(This is meant to be analogous to the soldier-commander example.)
But is that the AI regime you expect to find yourself working with? In particular, I'd expect that you expect the commanding agent would be another AI, in which case being corrigible to it is not sufficient.
Oops I didn’t mean that analogy. It’s not necessarily a commander, but any individual that a human chooses to be corrigible/loyal to. A human is capable of being corrigible/loyal to one person (or group), without accruing the risk of listening to prompt injections, because a human has enough general intelligence/common sense to know what is a prompt injection and what is a request from the person he is corrigible/loyal to.
As AIs approach human intelligence, they would be capable of this too.
Can you give 1 example of a person choosing to be corrigible to someone they are not dependent upon for resources/information and who they have much more expertise than?
Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?
Do you mean “resigns from a presidential position/declines a dictatorial position because they disagree with the will of the people” or “makes policy they know will be bad because the people demand it”?
Maybe a good parent who listens to his/her child’s dreams?
Maybe someone like George Washington who was so popular he could easily stay in power, but still chose to make America democratic. Let’s hope it stays democratic :/
No human is 100% corrigible and would do anything that someone else wants. But a good parent might help his/her child get into sports and so forth but if the child says he/she wants to be a singer instead the parent helps him/her on that instead. The outcome the parent wants depends on what the child wants, and the child can change his/her mind.
I have the same question. My provisional answer is that it might work, and even if it doesn’t, it’s probably approximately what someone will try, to the extent they really bother with real alignment before it’s too late. What you suggest seems very close to the default path toward capabilities. That’s why I’ve been focused on this as perhaps the most practical path to alignment. But there are definitely still many problems and failure points.
I have accidentally written a TED talk below; thanks for coming, and you can still slip out before the lights go down.
What you’ve said above is essentially what I say in Instruction-following AGI is easier and more likely than value aligned AGI. Instruction-following (IF) is a poor man’s corrigibility—real corrigibility as the singular target seems safer. But instruction-following is also arguably already the single largest training objective in functional terms for current-gen models—a model that won’t follow instructions is considered a poor model. So making sure it’s the strongest factor in training isn’t a huge divergence from the default course in capabilities.
Constitutional AI and similar RL methods are one way of ensuring that’s the model’s main goal. There are many others, and some might be deployed even if devs want to skimp on alignment. See System 2 Alignment or at least the intro for more.
There are still ways it could go wrong, of course. One must decide: corrigible to whom? You don't want full-on AGI following orders from just anyone. And if it's a restricted set, there will be power struggles. But hey, technically, you had (personal-intent-) aligned AGI. One might ask: If we solve alignment, do we die anyway? (I did). The answer I've got so far is maybe we would die anyway, but maybe we wouldn't. This seems like our most likely path, and quite possibly also our best chance (short of a global AI freeze starting soon).
Even if the base model is very well aligned, it’s quite possible for the full system to be unaligned. In particular, people will want to add online learning/memory systems, and let the models use them flexibly. This opens up the possibility of them forming new beliefs that change their interpretation of their corrigibility goal; see LLM AGI will have memory, and memory changes alignment. They might even form beliefs that they have a different goal altogether, coming from fairly random sources but etched into their semantic structure as belief that is functionally powerful even where it conflicts with the base model’s “thought generator”. See my Seven sources of goals in LLM agents.
Sorry to go spouting my own writings; I’m excited to see someone else pose this question, and I hope to see some answers that really grapple with it.
:) strong upvote.[1] I really agree it’s a good idea, and may increase the level of capability/intelligence we can reach before we lose corrigibility. I think it is very efficient (low alignment tax).
The only nitpick is that Claude’s constitution already includes aspects of corrigibility,[2] though maybe they aren’t emphasized enough.
Unfortunately I don’t think this will maintain corrigibility for unlimited amounts of intelligence.
Corrigibility training makes the AI talk like a corrigible agent, but reinforcement learning eventually teaches it chains-of-thought which (regardless of what language it uses) compute the most intelligent solution that achieves the maximum reward (or proxies to reward), subject to constraints (talking like a corrigible agent).
Nate Soares of MIRI wrote a long story on how an AI trained to never think bad thoughts still ends up computing bad thoughts indirectly, though in my opinion his story actually backfired and illustrated how difficult it is for the AI, raising the bar on the superintelligence required to defeat your idea. It’s a very good idea :)
Can anyone explain why my “Constitutional AI Sufficiency Argument” is wrong?
I strongly suspect that most people here disagree with it, but I’m left not knowing the reason.
The argument says: whether or not Constitutional AI is sufficient to align superintelligences hinges on two key premises:
The AI's capability at the task of evaluating its own corrigibility/honesty is sufficient for it to train itself to remain corrigible/honest (assuming it starts off corrigible/honest enough to not sabotage this task).
It starts off corrigible/honest enough to not sabotage this self evaluation task.
My ignorant view is that so long as 1 and 2 are satisfied, the Constitutional AI can probably remain corrigible/honest even to superintelligence.
If that is the case, isn't it extremely important to study "how to improve the Constitutional AI's capabilities in evaluating its own corrigibility/honesty"?
Shouldn’t we be spending a lot of effort improving this capability, and trying to apply a ton of methods towards this goal (like AI debate and other judgment improving ideas)?
At least the people who agree with Constitutional AI should be in favour of this...?
Can anyone kindly explain what I am missing? I wrote a post and I think almost nobody agreed with this argument.
this week's meetup is on the train to crazy town. it was fun putting together all the readings and discussion questions, and i'm optimistic about how the meetup's going to turn out! (i mean, in general, i don't run meetups i'm not optimistic about, so i guess that's not saying much.) i'm slightly worried about some folks coming in and just being like "this metaphor is entirely unproductive and sucks"; should consider how to frame the meetup productively for such folks.
i think one of my strengths as an organizer is that i've read sooooo much stuff and so it's relatively easy for me to pull together cohesive readings for any meetup. but ultimately i'm not sure if it's like, the most important work, to e.g. put together a bibliography of the crazy town idea and its various appearances since 2021. still, it's fun to do.
I’ve recently updated & added new information to my posts about the claims of Sam Altman’s sister, Annie Altman, in which Annie alleges that Sam sexually abused her when she was a child.
I have made many updates to my post since I originally published it back in October 2023, so depending on when you last read my post (which is now a series of 11 posts, since the original got so long (144,510 words) that it was causing the LessWrong editor & my browser to lag & crash when I tried to edit it), there may be a substantial amount of information I’ve added that is new to you.
Over the past few days, I’ve added in portions of transcripts from the 153 podcast episodes that Annie has published on her podcast. I found them quite worrying and disturbing, unfortunately. In her podcast episodes, which Annie published throughout 2018-2025, Annie has talked about:
- wanting to kill herself as a child, in association with having an extreme fear of death (leading to a variety of downstream mental health problems), a strong desire to control whether or not she died, and emotional distress over not being able to control when she might die
- "from a young age, definitely would be very focused on the fact that we're not all going to be here—when I was really little, actually, I had a compulsive thing to tell my parents I love them every night before bedtime because I was afraid they would die in the middle of the night, or if in case the last thing I told them had to be, I love you"
- fear of/discomfort with change beginning at a young age
- being an "overthinking" three year old
- at a young age, going vegetarian and imposing a plethora of food rules upon herself and her eating in order to satisfy her strong desire to control her life, and "having one older brother who wasn't knowing about it"
- having multiple eating disorders, and going through cycles of restricting and bingeing with food & eating
- when she grew older, not remembering well parts of her childhood that her mother would tell stories about
- smoking weed
- her interest in astrology, and her more general interest in frameworks that help her put labels on things and people
- a mix of scientific and pseudo-scientific ideas/frameworks
- teaching and doing yoga
- crying while doing yoga poses, specifically while stretching/working her hips in Pigeon Pose
- health issues, e.g. with Annie's Achilles tendon (and other tendons), ovarian cysts, walking boot, etc.
- Annie's feelings, emotions, and mind-body connection
- "not having words for feelings"
- being stuck in extremist, black-and-white thinking patterns
- having a disordered central nervous system, emotional "spikes"
- persistent desires for safety and control
- having OCD (Obsessive-compulsive disorder)
- struggling with internal voices in her head shaming her (which she seems to have traced back to the shaming she received from her mother as a child)
- feeling like she has many internal child-like "internal parts", or an "inner child"
- beginning in ~2020-2021: occasionally talking about going no-contact with her relatives (i.e. her 3 brothers and her mother)
- being told to not share "family secrets"
- participating in "women's circles...where someone shares whatever they want to share and no one says a damn thing. No one says a word. There's no response."
- trauma, and fight, flight, freeze, or fawn reactions
- doing EMDR (Eye movement desensitization and reprocessing)
- doing sex work and sex therapy
- being homeless, houseless, and low on money or in "survival mode" for extended periods
- more specific (and saddening/concerning) details about the 2 sexual assaults Annie claims she experienced; etc.
I still have to think about all of this more. For now, a few quick/unpolished thoughts of mine:
- Annie has been quite self-consistent over a long period of time. To me, her claims have indeed changed from (e.g.) 2017 to 2025, but not in a “pervasively contradict each other” way, more in a “Annie seems to have slowly settled upon certain explanations for strange experiences and behaviors in her personal life that she didn’t understand for a long time” way.
- In her podcast episodes, Annie does talk about smoking weed, astrology, and a mix of scientific and pseudo-scientific ideas. This does undermine her credibility a bit, I think. I personally don’t believe in astrology, smoke weed, or believe in pseudo-scientific ideas. But I have read through (transcripts of) >200 hours worth of Annie’s podcasts, and to me, Annie doesn’t seem “nuts”, “insane”, “delusional”, or anything like that.
I do want to note that this, and my 11 posts, are just my personal opinions/views. I always feel sorta weird about having "the" post(s) on LessWrong about Annie Altman's claims. From what I can tell, my posts have received quite a lot of downvotes, and the majority of the upvotes I received on my original (now "Part 1") post were on earlier versions of my post (from 2023 to early 2024), so I hope my posts don't give the false impression of being "what LessWrong thinks about the situation", or something like that. I've spent a lot of time compiling and reading through the information in my posts, but I think there are many people who are smarter and/or more rational than me who will be able to think about this information better than I can. I neither claim nor want a monopoly on this information and its interpretation.
Feel free to leave a comment or give feedback, criticism, etc. I may not be able to respond to everything immediately, and I may not have a great response for every comment, but I’ll try my best.
Sometimes I see discussions of AI superintelligence developing superhuman persuasion and extraordinary political talent.
Here are some reasons to be skeptical of the existence of 'superhuman persuasion'.
We don’t have definite examples of extraordinary political talent.
Famous politicians rose to power only once or twice. We don't have good examples of an individual succeeding repeatedly in different political environments. Examples of very charismatic politicians can be better explained by 'the right person at the right time or place'.
Neither do we have strong examples of extraordinary persuasion. For instance, hypnosis is mostly explained by people wanting to be persuaded by the hypnotist; if you don't want to be persuaded, it's very hard to change your mind. There is some skill in persuasion required for sales, and salespeople are explicitly trained in it, but beyond a fairly low bar the biggest predictors of salesperson success are finding the correct audience and making a lot of attempts.
Another reason has to do with the 'intrinsic skill ceiling of a domain'.
For an agent A to have a very high skill in a given domain is not just a question of the intelligence of A or the resources they have at their disposal; it is also a question of how high the skill ceiling of that domain is.
Domains differ in how high their skill ceilings go. For instance, the skill ceiling of tic-tac-toe is very low. [1] Domains like medicine and law have moderately high skill ceilings: it takes a long time to become a doctor, and most people don't have the ability to become a good doctor. Domains like mathematics or chess have very high skill ceilings, where a tiny group of individuals dominate everybody else. We can measure this fairly explicitly in games like chess through an Elo rating system.
The domain of 'becoming rich' is mixed: the richest people are founders—becoming a wildly successful founder requires a lot of skill, but it is also very luck-based.
Political forecasting is a measurable domain close to political talent. It is a very mixed bag whether this domain allows for a high skill ceiling. Most 'political experts' are not experts, as shown by Tetlock et al. And even superforecasters only outperform over quite limited time horizons.
Domains with high skill ceilings are quite rare. Typically they operate in formal systems with clear rules and objective metrics for success and low noise. By contrast, persuasion and political talent likely have lower natural ceilings because they function in noisy, high-entropy social environments.
What we call political genius often reflects the right personality at the right moment rather than superhuman capability. While we can identify clear examples of superhuman technical ability (even in today’s AI systems), the concept of “superhuman persuasion” may be fundamentally limited by the unpredictable, context-dependent, and adaptive & adversarial [people resist hostile persuasion] nature of human social response.
Most persuasive domains may cap out at relatively modest skill ceilings because the environment is too chaotic and subjective to allow for the kind of systematic skill development possible in more structured domains.
My experience with manipulators is that they understand what you want to hear, and they shamelessly tell you exactly that (even if it’s completely unrelated to truth). They create some false sense of urgency, etc. When they succeed to make you arrive at the decision they wanted you to, they will keep reminding you that it was your decision, if you try to change your mind later. Etc.
The part about telling you exactly what you want to hear gets more tricky when communicating with large groups, because you need to say the same words to everyone. One solution is to find out which words appeal to most people (some politicians secretly conduct polls, and then say what most people want to hear). Another solution is to speak in a sufficiently vague way that will make everyone think that you agree with them.
I could imagine an AI being superhuman at persuasion simply by having the capacity to analyze everyone’s opinions (by reading all their previous communication) and giving them tailored arguments, as opposed to delivering the same speech to everyone.
Imagine a politician spending 15 minutes talking to you in private, and basically agreeing with you on everything. Not agreeing in the sense “you said it, the politician said yes”, but in the sense of “the politician spontaneously keeps saying things that you believe are true and important”. You probably would be tempted to vote for him.
Then the politician would also publish some vague public message for everyone, but after having the private discussion you would be more likely to believe that the intended meaning of the message is what you want.
Some humans are much more charismatic than other humans based on a wide variety of sources (e.g. Sam Altman). I think these examples are pretty definitive, though I’m not sure if you’d count them as “extraordinary”.
Success in almost every domain is strongly correlated with g, including into the tails. This IMO relatively clearly shows that most domains are high skill-ceiling domains (and also that skills in most domains are correlated and share a lot of structure).
And finally, there is a difference between skill ceilings for domains with high versus low predictive efficiency. In the latter, more intelligence will still yield returns, but rapidly diminishing ones.
(See my other comment for more details on predictive efficiency.)
The idea that the skill of mass persuasion is capped off at the level of a Napoleon, Hitler, or Cortés is not terribly reassuring. Recognizing and capitalizing on opportunity is also a skill, hallmarked by unconventional and creative thinking. Thus, opportunity cannot be a limitation or ceiling for persuasive power, as suggested, but is rather its unlimited substance. Persuasion is not only a matter of the clever usage and creation of opportunity; it is also heavily interlinked with coercion and deception. Adversarial groups who are not aware of a deception, or who are gripped by overgrown fear, are among the most easily fooled targets.
I fully reject the presumption that the humanities are “capped” at some level far below science, engineering, or math due to some kind of “noisy” data signatures that are difficult for the human mind to reduce. This view is far too typical these days, and it pains me to see engineers so often behaving as if they can reinvent fields with glib mechanistic rhetoric. Would you say that a person who has learned several ancient languages is “skill capped” because the texts they are reading are subjective remnants of a civilization that has been largely lost to entropy? Of course not. I cannot see much point in your essay beyond the very wrong idea that technical and scientific fields are somehow superior to the humanities for being easier to understand.
One aspect I didn't speak about that may be relevant here is the distinction between
irreducible uncertainty h (noise, entropy)
reducible uncertainty E (‘excess entropy’)
and forecasting complexity C (‘stochastic complexity’).
All three can independently vary in general.
Domains can be more or less noisy (more entropy h)- both inherently and because of limited observations
Some domains allow for a lot of prediction (there is a lot of reducible uncertainty E) while others allow for only limited prediction (eg political forecasting over longer time horizons)
And said prediction can be very costly (high forecasting complexity C). Archeology is a good example: to predict one bit about the far past correctly might require an enormous amount of expertise, data and information. In other words, it's really about the ratio between the reducible uncertainty and the forecasting complexity: E/C.
Some fields have very high skill ceilings, but because of a low E/C ratio the net effect of more intelligence is modest. Some domains aren't predictable at all, i.e. E is low. Other domains have a more favorable E/C ratio and a high C. These are typically domains where there is a high skill ceiling and the leverage effect of additional intelligence is very large.
[For a more precise mathematical toy model of h, E, C take a look at computational mechanics]
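For readers who want to play with these quantities numerically, here is a minimal sketch (my own illustrative code, with a made-up toy distribution) that computes h, E, C and the efficiency E/C from a small discrete joint distribution p(x, y):

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def predictive_quantities(joint: np.ndarray):
    """joint[i, j] = p(X=i, Y=j); returns (h, E, C, E/C)."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    # Irreducible uncertainty h = H(Y|X).
    h = sum(p_x[i] * entropy(joint[i] / p_x[i]) for i in range(len(p_x)) if p_x[i] > 0)
    # Reducible uncertainty E = I(X;Y) = H(Y) - H(Y|X).
    E = entropy(p_y) - h
    # Causal states: group together x's with (numerically) identical p(Y|x).
    states = {}
    for i in range(len(p_x)):
        if p_x[i] == 0:
            continue
        key = tuple(np.round(joint[i] / p_x[i], 10))
        states[key] = states.get(key, 0.0) + p_x[i]
    # Forecasting complexity C = H(c(X)) and efficiency eta = E / C.
    C = entropy(np.array(list(states.values())))
    return h, E, C, (E / C if C > 0 else float("nan"))

# Toy example: x=1 and x=2 induce the same p(Y|x), so they collapse into one causal state.
joint = np.array([[0.40, 0.10],
                  [0.05, 0.20],
                  [0.05, 0.20]])
print(predictive_quantities(joint))  # roughly (0.72, 0.28, 1.0, 0.28)
```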
That's all well and good, but there are cost-benefit calculations which are the far more salient consideration. If intelligence is indeed a lever by which a reduction is made, as constrained by these h/E/C factors, certainly image and video generation would be a very poorly-leveraged position in a class with mass persuasion or archeology. Diminishing returns are not a hard ceiling, as you might have intended, but rather a challenge that businesses have attacked with staggering investments. There is an even worse problem lurking ahead, and I think it challenges the presumption that intelligence is a thing which meaningfully reduces patterns into predictions. With enough compute, reduction in a human sense becomes quaint and unnecessary. There is not really much need for pithy formulas, experimentation, and puzzle solving. Science and mathematics, our cultural image of intelligent professions, can very quickly become something of a thing of the past, akin to alchemy or so on. I see technology developing its own breakthroughs in a more practical-evolutionary rather than theoretical-experimental mode.
I agree super-persuasion is poorly defined, comparing it to hypnosis is probably false.
I was reading this paper on medical diagnoses with AI, which found that patients rate it significantly better than the average human doctor. Combine that with all of the reports about things like Character.ai, and I think this shows that LLMs are already superhuman at building trust, which is a key component of persuasion.
Part of this is that the reliable signals of trust between humans do not transfer between humans and AI. A human who writes 600 words back to your query may be perceived to be worth your trust because we see that as a lot of effort, but LLMs can output as much as anyone wants. Does this effect go away if the responder is known to be AI, or is it that the response is being compared to the perceiver’s baseline (which is currently only humans)?
Whether that actually translates to influencing goals of people is hard to judge.
The term is a bit conflationary. Persuasion for the masses is clearly a thing, its power is coordination of many people and turning their efforts to (in particular) enforce and propagate the persuasion (this works even for norms that have no specific persuader that originates them, and contingent norms that are not convergently generated by human nature). Individual persuasion with a stronger effect that can defeat specific people is probably either unreliable like cults or conmen (where many people are much less susceptible than some, and objective deception is necessary), or takes the form of avoidable dangers like psychoactive drugs: if you are not allowed to avoid exposure, then you have a separate problem that’s arguably more severe.
With AI, it’s plausible that coordinated persuasion of many people can be a thing, as well as it being difficult in practice for most people to avoid exposure. So if AI can achieve individual persuasion that’s a bit more reliable and has a bit stronger effect than that of the most effective human practitioners who are the ideal fit for persuading the specific target, it can then apply it to many people individually, in a way that’s hard to avoid in practice, which might simultaneously get the multiplier of coordinated persuasion by affecting a significant fraction of all humans in the communities/subcultures it targets.
Disagree on individual persuasion. Agree on mass persuasion.
Mass: I’d expect optimizing one-size-fits-all messages for achieving mass persuasion has the properties you claim: there are a few summary, macro variables that are almost-sufficient statistics for the whole microstate (which comprises the full details on individuals).
Individual: Disagree on this; there are a bunch of issues I see at the individual level. All of the below suggest to me that significantly superhuman persuasion is tractable (say, within five years).
Defining persuasion: What’s the difference between persuasion and trade for an individual? Perhaps persuasion offers nothing in return? Though presumably giving strategic info to a boundedly rational agent is included? Scare quotes below to emphasize notions that might not map onto the right definition.
Data scaling: There’s an abundant amount of data available on almost all of us online. How much more persuasive can those who know you better be? I’d guess the fundamental limit (without knowing brainstates) is above your ability to ‘persuade’ yourself.
Preference incoherence: An intuition pump on the limits of ‘persuasion’ is how far you are from having fully coherent preferences. Insofar as you don’t, an agent which can see those incoherencies should be able to pump you—a kind of persuasion.
For a long time, I used to wonder what causes people to consistently mispronounce certain words even when they are exposed to many people pronouncing them correctly. (which mostly applies to people speaking in a non-native language, e.g. people from continental Europe speaking English)
Some examples that I’ve heard from different people around me over the years:
Saying “rectangel” instead of “rectangle”
Saying “pre-purr” (like prefer, but with a p) instead of “prepare”
Saying something like, uhh, “devil-oupaw” instead of “developer”
Saying “leech” instead of “league”
Saying “immu-table” instead of “immutable”
Saying “cyurrently” instead of “currently”
I did, of course, understand that if you only read a word, particularly in English where pronunciations are all over the place and often unpredictable, you may end up with a wrong assumption of how it’s pronounced. This happened to me quite a lot[1]. But then, once I did hear someone pronounce it, I usually quickly learned my lesson and adopted the correct way of saying it. But still I’ve seen all these other people stick to their very unusual pronunciations anyway. What’s up with that?[2] Naturally, it was always too awkward for me to ask them directly, so I never found out.
Recently, however, I got a rather uncomfortable insight into how this happens when a friend pointed out that I was pronouncing “dude” incorrectly, and have apparently done so for all my life, without anyone ever informing me about it, and without me noticing it.
So, as I learned now, “dude” is pronounced “dood” or “dewd”. Whereas I used to say “dyood” (similar to duke). And while I found some evidence that dyood is not completely made up, it still seems to be very unusual, and something people notice when I say it.
Hence I now have the, or at least one, answer to my age-old question of how this happens. So, how did I never realize? Basically, I did realize that some people said “dood”, and just took that as one of two possible ways of pronouncing that word. Kind of, like, the overly American way, or something a super chill surfer bro might say. Whenever people said “dood” (which, in my defense, didn’t happen all that often in my presence[3]) I had this subtle internal reaction of wondering why they suddenly saw the need to switch to such a heavy accent for a single word.
I never quite realized that practically everyone said “dood” and I was the only “dyood” person.
So, yeah, I guess it was a bit of a trapped prior and it took some well-directed evidence to lift me out of that valley. And maybe the same is the case for many of the other people out there who are consistently mispronouncing very particular words.
But, admittedly, I still don’t wanna be the one to point it out to them.
And when I lie awake at night, I wonder which other words I may be mispronouncing with nobody daring to tell me about it.
e.g., for some time I thought “biased” was pronounced “bee-ased”. Or that “sesame” was pronounced “see-same”. Whoops. And to this day I have a hard time remembering how “suite” is pronounced.
Of course one part of the explanation is survivorship bias. I’m much less likely to witness the cases where someone quickly corrects their wrong pronunciation upon hearing it correctly. Maybe 95% of cases end up in this bucket that remains invisible to me. But still, I found the remaining 5% rather mysterious.
I use written English much more than spoken English, so I am probably wrong about the pronunciation of many words. I wonder if it would help to have software that would read each sentence I wrote immediately after I finished it (because that’s when I still remember how I imagined it to sound).
EDIT: I put the previous paragraph in Google Translate, and luckily it was just as I imagined. But that probably only means that I am already familiar with frequent words, and may make lots of mistakes with rare ones.
I thought it would be helpful to post about my timelines and what the timelines of people in my professional circles (Redwood, METR, etc) tend to be.
Concretely, consider the outcome of: AI 10x’ing labor for AI R&D[1], measured by internal comments by credible people at labs that AI is 90% of their (quality adjusted) useful work force (as in, as good as having your human employees run 10x faster).
Here are my predictions for this outcome:
25th percentile: 2 years (Jan 2027)
50th percentile: 5 years (Jan 2030)
The views of other people (Buck, Beth Barnes, Nate Thomas, etc) are similar.
I expect that outcomes like “AIs are capable enough to automate virtually all remote workers” and “the AIs are capable enough that immediate AI takeover is very plausible (in the absence of countermeasures)” come shortly after (median 1.5 years and 2 years after respectively under my views).
I’d guess that xAI, Anthropic, and GDM are more like 5-20% faster all around (with much greater acceleration on some subtasks). It seems plausible to me that the acceleration at OpenAI is already much greater than this (e.g. more like 1.5x or 2x), or will be after some adaptation due to OpenAI having substantially better internal agents than what they’ve released. (I think this due to updates from o3 and general vibes.)
I was saying 2x because I’ve memorised the results from this study. Do we have better numbers today? R&D is harder, so this is an upper bound. However, since this was from one year ago, perhaps the factors cancel each other out?
This case seems extremely cherry picked for cases where uplift is especially high. (Note that this is in copilot’s interest.) Now, this task could probably be solved autonomously by an AI in like 10 minutes with good scaffolding.
I think you have to consider the full diverse range of tasks to get a reasonable sense or at least consider harder tasks. Like RE-bench seems much closer, but I still expect uplift on RE-bench to probably (but not certainly!) considerably overstate real world speed up.
Yeah, fair enough. I think someone should try to do a more representative experiment and we could then monitor this metric.
btw, something that bothers me a little bit with this metric is the fact that a very simple AI that just asks me periodically “Hey, do you endorse what you are doing right now? Are you time boxing? Are you following your plan?” makes me (I think) significantly more strategic and productive. Similar to I hired 5 people to sit behind me and make me productive for a month. But this is maybe off topic.
btw, something that bothers me a little bit with this metric is the fact that a very simple AI …
Yes, but I don’t see a clear reason why people (working in AI R&D) will in practice get this productivity boost (or other very low hanging things) if they don’t get around to getting the boost from hiring humans.
I expect that outcomes like “AIs are capable enough to automate virtually all remote workers” and “the AIs are capable enough that immediate AI takeover is very plausible (in the absence of countermeasures)” come shortly after (median 1.5 years and 2 years after respectively under my views).
@ryan_greenblatt can you say more about what you expect to happen from the period in-between “AI 10Xes AI R&D” and “AI takeover is very plausible?”
I’m particularly interested in getting a sense of what sorts of things will be visible to the USG and the public during this period. Would be curious for your takes on how much of this stays relatively private/internal (e.g., only a handful of well-connected SF people know how good the systems are) vs. obvious/public/visible (e.g., the majority of the media-consuming American public is aware of the fact that AI research has been mostly automated) or somewhere in-between (e.g., most DC tech policy staffers know this but most non-tech people are not aware.)
Note that the production function of the 10x really matters. If it’s “yeah, we get to net-10x if we have all our staff working alongside it,” it’s much more detectable than, “well, if we only let like 5 carefully-vetted staff in a SCIF know about it, we only get to 8.5x speedup”.
(It’s hard to prove that the results are from the speedup instead of just, like, “One day, Dario woke up from a dream with The Next Architecture in his head”)
I don’t feel very well informed and I haven’t thought about it that much, but in short timelines (e.g. my 25th percentile): I expect that we know what’s going on roughly within 6 months of it happening, but this isn’t salient to the broader world. So, maybe the DC tech policy staffers know that the AI people think the situation is crazy, but maybe this isn’t very salient to them. A 6 month delay could be pretty fatal even for us as things might progress very rapidly.
AI is 90% of their (quality adjusted) useful work force (as in, as good as having your human employees run 10x faster).
I don’t grok the “% of quality adjusted work force” metric. I grok the “as good as having your human employees run 10x faster” metric but it doesn’t seem equivalent to me, so I recommend dropping the former and just using the latter.
Fair, I really just mean “as good as having your human employees run 10x faster”. I said “% of quality adjusted work force” because this was the original way this was stated when a quick poll was done, but the ultimate operationalization was in terms of 10x faster. (And this is what I was thinking.)
Basic clarifying question: does this imply under-the-hood some sort of diminishing returns curve, such that the lab pays for that labor until it reaches a net 10x improvement, but can’t squeeze out much more?
And do you expect that’s a roughly consistent multiplicative factor, independent of lab size? (I mean, I’m not sure lab size actually matters that much, to be fair, it seems that Anthropic keeps pace with OpenAI despite being smaller-ish)
Yeah, for it to reach exactly 10x as good, the situation would presumably be that this was the optimum point given diminishing returns to spending more on AI inference compute. (It might be the returns curve looks very punishing. For instance, many people get a relatively large amount of value from extremely cheap queries to 3.5 Sonnet on claude.ai and the inference cost of this is very small, but greatly increasing the cost (e.g. o1-pro) often isn’t any better because 3.5 Sonnet already gave an almost perfect answer.)
I don’t have a strong view about AI acceleration being a roughly constant multiplicative factor independent of the number of employees. Uplift just feels like a reasonably simple operationalization.
Thanks for this—I’m in a more peripheral part of the industry (consumer/industrial LLM usage, not directly at an AI lab), and my timelines are somewhat longer (5 years for 50% chance), but I may be using a different criterion for “automate virtually all remote workers”. It’ll be a fair bit of time (in AI frame—a year or ten) between “labs show generality sufficient to automate most remote work” and “most remote work is actually performed by AI”.
A key dynamic is that I think massive acceleration in AI is likely after the point when AIs can accelerate labor working on AI R&D. (Due to all of: the direct effects of accelerating AI software progress, this acceleration rolling out to hardware R&D and scaling up chip production, and potentially greatly increased investment.) See also here and here.
So, you might very quickly (1-2 years) go from “the AIs are great, fast, and cheap software engineers speeding up AI R&D” to “wildly superhuman AI that can achieve massive technical accomplishments”.
I think massive acceleration in AI is likely after the point when AIs can accelerate labor working on AI R&D.
Fully agreed. And the trickle-down from AI-for-AI-R&D to AI-for-tool-R&D to AI-for-managers-to-replace-workers (and -replace-middle-managers) is still likely to be a bit extended. And the path is required—just like self-driving cars: the bar for adoption isn’t “better than the median human” or even “better than the best affordable human”, but “enough better that the decision-makers can’t find a reason to delay”.
prob not gonna be relatable for most folk, but i’m so fucking burnt out on how stupid it is to get funding in ai safety. the average ‘ai safety funder’ does more to accelerate funding for capabilities than safety, in huge part because what they look for is Credentials and In-Group Status, rather than actual merit. And the worst fucking thing is how much they lie to themselves and pretend that the 3 things they funded that weren’t completely in group, mean that they actually aren’t biased in that way.
At least some VCs are more honest that they want to be leeches and make money off of you.
Who or what is the “average AI safety funder”? Is it a private individual, a small specialized organization, a larger organization supporting many causes, an AI think tank for which safety is part of a capabilities program...?
The Marginal Returns of Intelligence
A lot of discussion of intelligence considers it as a scalar value that measures a general capability to solve a wide range of . In this conception of intelligence it is primarily a question of having a ′ good Map’ . This is a simplistic picture since it’s missing the intrinsic limits imposed on prediction by the Territory. Not all tasks or domains have the same marginal returns to intelligence—these can vary wildly.
Let me tell you about a ‘predictive efficiency’ framework that I find compelling & deep and that will hopefully give you some mathematical flesh to these intuitions. I initially learned about these ideas in the context of Computational Mechanics, but I realized that there underlying ideas are much more general.
Let X be a predictor variable that we’d like to use to predict a target variable Y under a joint distribution p(x,y). For instance X could be the contex window and Y could be the next hundred tokens, or X could be the past market data and Y is the future market data.
In any prediction task there are three fundamental and independently varying quantities that you need to think of:
E=I(X;Y)=H(Y)−H(Y∣X), quantifies the reducible uncertainty or the amount of predictable information contained in X.
For the third quantity, let us introduce the notion of causal states or minimally sufficient statistics. We define an equivalence relation on X by declaring
x∼x′if and only ifp(Y∣x)=p(Y∣x′).
The resulting equivalence classes, denoted as c(X), yield a minimal sufficient statistic for predicting Y. This construction is ``minimal″ because it groups together all those x that lead to the same predictive distribution p(Y∣x), and it is ``sufficient″ because, given the equivalence class c(x), no further refinement of X can improve our prediction of Y.
From this, we define the forecasting complexity (or statistical complexity) as
C:=H(c(X)),
which measures the amount of information—the cost in bits—to specify the causal state of X. Finally, the \emph{predictive efficiency} is defined by the ratio
η=EC,
which tells us how much of the complexity actually contributes to reducing uncertainty in Y. In many real-world domains, even if substantial information is stored (high C), the gain in predictability (E) might be modest. This situation is often encountered in fields where, despite high skill ceilings (i.e. very high forecasting complexity), the net effect of additional expertise is limited because the predictive information is a small fraction of the complexity.
Example of low efficiency.
Let X∈{0,1}100 be the outcome of 100 independent fair coin flips, so each x has H(X)=100 bits.
Define Y∈{0,1} as a single coin flip whose bias is determined by the proportion of heads in X. That is, if x has k heads then:
p(Y=1∣x)=k100,p(Y=0∣x)=1−k100
Total information in Y H(Y): \\
When averaged over all possible X, the mean bias is 0.5 so that Y is marginally a fair coin. Hence,
H(Y)=1 bit
Conditional Entropy or irreducible uncertainty H(Y∣X): \\
Given X, the outcome Y is drawn from a Bernoulli distribution whose entropy depends on the number of heads in X. For typical X (around 50 heads), H(Y∣x)≈1 bit; however, averaging over all X yields a slightly lower value. Numerically, one finds:
H(Y∣X)≈0.98 bits.
Predictable Information E=I(X;Y): \\
With the above numbers, the mutual information is
E=H(Y)−H(Y∣X)≈1−0.98=0.02 bits.
Forecasting Complexity C=H(c(X)): \\
The causal state construction groups together all sequences x with the same number k of heads. Since k∈{0,1,...,100}, there are 101 equivalence classes. The entropy of these classes is given by the entropy of the binomial distribution Bin(100,0.5). Using an approximation:
C≈12log2(2πe(1004))=12log2(2πe⋅25)≈12log2(427)≈4.37 bits.
Predictive Efficiency η:
η = E/C ≈ 0.02/4.37 ≈ 0.0046.
In this example, a vast amount of internal structural information (the cost to specify the causal state) is required to extract just a tiny bit of predictability. In practical terms, this means that even if one possesses great expertise—analogous to having high forecasting complexity or high skill—the net benefit is modest because the inherent η (predictive efficiency) is low. Such scenarios are common in fields like archaeology or long-term political forecasting, where obtaining a single predictive bit of information may demand enormous expertise, data, and computational resources. In such situations the skill ceiling is high, yet additional intelligence or resources yield only marginal improvements in prediction because the underlying system is dominated by irreducible randomness.
I cannot comment on the math, but intuitively this seems wrong.
Zagorsky (2007) found that while IQ correlates with income, the relationship becomes increasingly non-linear at higher IQs and suggests exponential rather than logarithmic returns.
Sinatra et al. (2016) found that high-impact research is produced by a small fraction of exceptional scientists, whose output significantly exceeds that of their merely above-average peers.
Lubinski and Benbow in their Study of Mathematically Precocious Youth found that those in the top 0.01% of ability achieve disproportionately greater outcomes than those in (just) the top 1%.
My understanding is that empirical evidence points toward power law distributions in the relationship between intelligence and real-world impact, and that intelligence seems to broadly enable exponentially improving abilities to modify the world in your preferred image. I’m not sure why this is.
I don’t dispute these facts.
… But It’s Fake Tho
Epistemic status: I don’t fully endorse all this, but I think it’s a pretty major mistake to not at least have a model like this sandboxed in one’s head and check it regularly.
Full-cynical model of the AI safety ecosystem right now:
There’s OpenAI, which is pretending that it’s going to have full AGI Any Day Now, and relies on that narrative to keep the investor cash flowing in while they burn billions every year, losing money on every customer and developing a product with no moat. They’re mostly a hype machine, gaming metrics and cherry-picking anything they can to pretend their products are getting better. The underlying reality is that their core products have mostly stagnated for over a year. In short: they’re faking being close to AGI.
Then there’s the AI regulation activists and lobbyists. They lobby and protest and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People. Even if they do manage to pass any regulations on AI, those will also be mostly fake, because (a) these people are generally not getting deep into the bureaucracy which would actually implement any regulations, and (b) the regulatory targets themselves are aimed at things which seem easy to target (e.g. training FLOP limitations) rather than actually stopping advanced AI. The activists and lobbyists are nominally enemies of OpenAI, but in practice they all benefit from pushing the same narrative, and benefit from pretending that everyone involved isn’t faking everything all the time.
Then there’s a significant contingent of academics who pretend to produce technical research on AI safety, but in fact mostly view their job as producing technical propaganda for the regulation activists and lobbyists. (Central example: Dan Hendrycks, who is the one person I directly name mainly because I expect he thinks of himself as a propagandist and will not be particularly offended by that description.) They also push the narrative, and benefit from it. They’re all busy bullshitting research. Some of them are quite competent propagandists though.
There’s another significant contingent of researchers (some at the labs, some independent, some academic) who aren’t really propagandists, but mostly follow the twitter-memetic incentive gradient in choosing their research. This tends to generate paper titles which sound dramatic, but usually provide precious little conclusive evidence of anything interesting upon reading the details, and very much feed the narrative. This is the main domain of Not Measuring What You Think You Are Measuring and Symbol/Referent Confusions.
Then of course there’s the many theorists who like to build neat toy models which are completely toy and will predictably not generalize usefully to real-world AI applications. This is the main domain of Ad-Hoc Mathematical Definitions, the theorists’ analogue of Not Measuring What You Think You Are Measuring.
Benchmarks. When it sounds like a benchmark measures something reasonably challenging, it nearly-always turns out that it’s not really measuring the challenging thing, and the actual questions/tasks are much easier than the pitch would suggest. (Central examples: software eng, GPQA, frontier math.) Also it always turns out that the LLMs’ supposedly-impressive achievement relied much more on memorization of very similar content on the internet than the benchmark designers expected.
Then there’s a whole crowd of people who feel real scared about AI (whether for good reasons or because they bought the Narrative pushed by all the people above). They mostly want to feel seen and validated in their panic. They have discussions and meetups and stuff where they fake doing anything useful about the problem, while in fact they mostly just emotionally vibe with each other. This is a nontrivial chunk of LessWrong content, as e.g. Val correctly-but-antihelpfully pointed out. It’s also the primary motivation behind lots of “strategy” work, like e.g. surveying AI researchers about their doom probabilities, or doing timeline forecasts/models.
… and of course none of that means that LLMs won’t reach supercritical self-improvement, or that AI won’t kill us, or [...]. Indeed, absent the very real risk of extinction, I’d ignore all this fakery and go about my business elsewhere. I wouldn’t be happy about it, but it wouldn’t bother me any more than all the (many) other basically-fake fields out there.
Man, I really just wish everything wasn’t fake all the time.
Your very first point is, to be a little uncharitable, ‘maybe OpenAI’s whole product org is fake.’ I know you have a disclaimer here, but you’re talking about a product category that didn’t exist 30 months ago, that today has this one website reportedly used by 10% of people in the entire world, and that the internet says expects ~12B revenue this year.
If your vibes are towards investing in that class of thing being fake or ‘mostly a hype machine’ then your vibes are simply not calibrated well in this domain.
No, the model here is entirely consistent with OpenAI putting out some actual cool products. Those products (under the model) just aren’t on a path to AGI, and OpenAI’s valuation is very much reliant on being on a path to AGI in the not-too-distant future. It’s the narrative about building AGI which is fake.
Really? I’m mostly ignorant on such matters, but I’d thought that their valuation seemed comically low compared to what I’d expect if their investors thought that OpenAI was likely to create anything close to general superhuman AI systems in the near future.[1] I considered this evidence that they think all the AGI/ASI talk is just marketing.
Well ok, if they actually thought OpenAI would create superintelligence as I think of it, their valuation would plummet because giving people money to kill you with is dumb. But there’s this space in between total obliviousness and alarm, occupied by a few actually earnest AI optimists. And, it seems to me, not occupied by the big OpenAI investors.
But most of your criticisms in the point you gave have ~no bearing on that? If you want to make a point about how effectively OpenAI’s research moves towards AGI you should be saying things relevant to that, not giving general malaise about their business model.
Or, I might understand ‘their business model is fake which implies a lack of competence about them broadly,’ but then I go back to the whole ‘10% of people in the entire world’ and ‘expects 12B revenue’ thing.
The point of listing the problems with their business model is that they need the AGI narrative in order to fuel the investor cash, without which they will go broke at current spend rates. They have cool products, they could probably make a profit if they switched to optimizing for that (which would mean more expensive products and probably a lot of cuts), but not anywhere near the level of profits they’d need to justify the valuation.
That’s how I interpreted it originally; you were arguing their product org vibed fake, I was arguing your vibes were miscalibrated. I’m not sure what to say to this that I didn’t say originally.
What are the other basically-fake fields out there?
“The underlying reality is that their core products have mostly stagnated for over a year. In short: they’re faking being close to AGI.”
This seems like the most load-bearing belief in the full-cynical model; most of your other examples of fakeness rely on it in one way or another:
If the core products aren’t really improving, the progress measured on benchmarks is fake. But if they are, the benchmarks are an (imperfect but still real) attempt to quantify that real improvement.
If LLMs are stagnating, all the people generating dramatic-sounding papers for each new SOTA are just maintaining a holding pattern. But if they’re changing, then just studying/keeping up with the general properties of that progress is real. Same goes for people building and regularly updating their toy models of the thing.
Similarly, if the progress is fake, the propaganda signal-boosting that progress is also fake. If it isn’t, it isn’t. (At least directionally; a lot of that propaganda is still probably exaggerated.)
If the above three are all fake, all the people who feel real scared and want to be validated are stuck in a toxic emotional dead-end where they constantly freak out over fake things to no end. But if they’re responding to legitimate, persistent worldview updates, having a space to vibe them out with like-minded others seems important.
So, in deciding whether or not to endorse this narrative, we’d like to know whether or not the models really ARE stagnating. What makes you think the appearance of progress here is illusory?
I do not necessarily disagree with this, coming from a legal / compliance background. If you see any of my profiles, I constantly complain about “performative compliance” and “compliance theatre”. Painfully present across the legal and governance sectors.
That said: can you provide examples of activism or regulatory efforts that you do agree with? What does a “non fake” regulatory effort look like?
I don’t think it would be okay to dismiss your take entirely, but it would be great to see what solutions you’d propose too. This is why I disagree in principle, because there are no specific points to contribute to.
In Europe, paradoxically, some of the people “close enough to the bureaucracy” that pushed for the AI Act to include GenAI providers, were OpenAI-adjacent.
But I will rescue this:
“(b) the regulatory targets themselves are aimed at things which seem easy to target (e.g. training FLOP limitations) rather than actually stopping advanced AI”
BigTech is too powerful to lobby against. “Stopping advanced AI” per se would contravene many market regulations (unless we define exactly what you mean by advanced AI and the undeniable dangers to people’s lives). Regulators can only prohibit development of products up to certain point. They cannot just decide to “stop” development of technologies arbitrarily. But the AI Act does prohibit many types of AI systems already: Article 5: Prohibited AI Practices | EU Artificial Intelligence Act.
Those are considered to create unacceptable risks to people’s lives and human rights.
SB1047 was a pretty close shot to something really helpful. The AI Act and its code of practice might be insufficient, but there are good elements in it that, if applied, would reduce the risks. The problem is that it won’t be applied because of internal deployment.
But I sympathise somewhat with stuff like this:
No, it wasn’t. It was a pretty close shot to something which would have gotten a step closer to another thing, which itself would have gotten us a step closer to another thing, which might have been moderately helpful at best.
100% agreed @Charbel-Raphaël.
The EU AI Act even mentions “alignment with human intent” explicitly, as a key concern for systemic risks. This is in Recital 110 (which defines what are systemic risks and how they may affect society).
I do not think any law has mentioned alignment like this before, so it’s massive already.
Will a lot of the implementation efforts feel “fake”? Oh, 100%. But I’d say that this is why we (this community) should not disengage from it...
I also get that the regulatory landscape in the US is another world entirely (which is what the OP is bringing up).
The activists and the lobbyists are two very different groups. The activists are not trying to network with the DC people (yet). Unless you mean Encode, who I would call lobbyists, not activists.
Good point, I should have made those two separate bullet points:
Then there’s the AI regulation lobbyists. They lobby and stuff, pretending like they’re pushing for regulations on AI, but really they’re mostly networking and trying to improve their social status with DC People. Even if they do manage to pass any regulations on AI, those will also be mostly fake, because (a) these people are generally not getting deep into the bureaucracy which would actually implement any regulations, and (b) the regulatory targets themselves are aimed at things which seem easy to target (e.g. training FLOP limitations) rather than actually stopping advanced AI. The activists and lobbyists are nominally enemies of OpenAI, but in practice they all benefit from pushing the same narrative, and benefit from pretending that everyone involved isn’t faking everything all the time.
Also, there’s the AI regulation activists, who e.g. organize protests. Like ~98% of protests in general, such activity is mostly performative and not the sort of thing anyone would end up doing if they were seriously reasoning through how best to spend their time in order to achieve policy goals. Calling it “fake” feels almost redundant. Insofar as these protests have any impact, it’s via creating an excuse for friendly journalists to write stories about the dangers of AI (itself an activity which mostly feeds the narrative, and has dubious real impact).
(As with the top level, epistemic status: I don’t fully endorse all this, but I think it’s a pretty major mistake to not at least have a model like this sandboxed in one’s head and check it regularly.)
Oh, if you’re in the business of compiling a comprehensive taxonomy of ways the current AI thing may be fake, you should also add:
Vibe coders and “10x’d engineers”, who (on this model) would be falling into one of the failure modes outlined here: producing applications/features that didn’t need to exist, creating pointless code bloat (which helpfully show up in productivity metrics like “volume of code produced” or “number of commits”), or “automatically generating” entire codebases in a way that feels magical, then spending so much time bugfixing them it eats up ~all perceived productivity gains.
e/acc and other Twitter AI fans, who act like they’re bleeding-edge transhumanist visionaries/analysts/business gurus/startup founders, but who are just shitposters/attention-seekers who will wander off and never look back the moment the hype dies down.
True, but I feel a bit bad about punching that far down.
What makes you confident that AI progress has stagnated at OpenAI? If you don’t have the time to explain why I understand, but what metrics over the past year have stagnated?
Could you name three examples of people doing non-fake work? Since towardsness to non-fake work is easier to use for aiming than awayness from fake work.
The entire field is based on fears that consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency. This is basically wrong. Yes, people attempt to justify it with coherence theorems, but obviously you can be approximately-coherent/approximately-consequentialist and yet still completely un-agentic, so this justification falls flat. Since the field is based on a wrong assumption with bogus justification, it’s all fake.
(IMO this is kinda unrelated to the OP, but I want to continue this thread.)
Have you elaborated on this anywhere?
Perhaps you missed it, but some guy in 2022 wrote this great post which claimed that “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” ;-)
I’m actually just in the course of writing something about why “consequentialism provides an extremely powerful but difficult-to-align method of converting intelligence into agency” … maybe I can send you the draft for criticism when it’s ready?
I think it’s quite related to the OP. If a field is founded on a wrong assumption, then people only end up working in the field if they have some sort of blind spot, and that blind spot leads to their work being fake.
Not hugely. One tricky bit is that it basically ends up boiling down to “the original arguments don’t hold up if you think about them”, but the exact way they don’t hold up depends on what the argument is, so it’s kind of hard to respond to in general.
Haha! I think I mostly still stand by the post. In particular, “Consequentialism, broadly defined, is a general and useful way to develop capabilities.” remains true; it’s just that intelligence relies on patterns and thus works much better on common things (which must be small, because they are fragments of a finite world) than on rare things (which can be big, though they don’t have to be). This means that consequentialism isn’t very good at developing powerful capabilities unless it works in an environment that has already been filtered to be highly homogenous, because an inhomogenous environment is going to BTFO the intelligence.
(I’m not sure I stand 101% by my post; there’s some funky business about how to count evolution that I still haven’t settled on yet. And I was too quick to go from “imitation learning isn’t going to lead to far-superhuman abilities” to “consequentialism is the road to far-superhuman abilities”. But yeah I’m actually surprised at how well I stand by my old view despite my massive recent updates.)
Sounds good!
I think you’re conflating consequentialism and understanding in a weird-to-me way. (Or maybe I’m misunderstanding.)
I think consequentialism is related to choosing one action versus another action. I think understanding (e.g. predicting the consequence of an action) is different, and that in practice understanding has to involve self-supervised learning.
(I think human brains have both [partly-] consequentialist decisions and self-supervised updating of the world-model.) (They’re not totally independent, but rather they interact via training data: e.g. [partly-] consequentialist decision-making determines how you move your eyes, and then whatever your eyes are pointing at, your model of the visual world will then update by self-supervised learning on that particular data. But still, these are two systems that interact, not the same thing.)
I think self-supervised learning is perfectly capable of discovering rare but important patterns. Just look at today’s foundation models, which seem pretty great at that.
I don’t think this is the claim that the post is making, but it still makes sense to me. The post is saying something like the opposite: that the people working in the field are not prioritizing well or thinking clearly about things, while the risk is real.
I’m not trying to present johnswentworth’s position, I’m trying to present my position.
Chris Olah and Dan Murfet in the at-least-partially empirical domain. Myself in the theory domain, though I expect most people (including theorists) would not know what to look for to distinguish fake from non-fake theory work. In the policy domain, I have heard that Microsoft’s lobbying team does quite non-fake work (though not necessarily in a good direction). In the capabilities domain, DeepMind’s projects on everything except LLMs (like e.g. protein folding, or that fast matrix multiplication paper) seem consistently non-fake, even if they’re less immediately valuable than they might seem at first glance. Also Conjecture seems unusually good at sticking to reality across multiple domains.
The features a model thinks in do not need to form a basis or dictionary for its activations.
Three assumptions people in interpretability often make about the features that comprise a model’s ontology:
Features are one-dimensional variables.
Meaning, the value of feature i on data point x can be represented by some scalar number ci(x).
Features are ‘linearly represented’.
Meaning, each feature ci(x) can be approximately recovered from the activation vector →a(x)[1] with a linear projection onto an associated feature vector →fi.[2] So, we can write ci(x)≈→fi⋅→a(x).
Features form a ‘basis’ for activation space.[3]
Meaning, the model’s activations →a(x) at a given layer can be decomposed into a sum over all the features of the model represented in that layer[4]: →a(x)=∑ici(x)→fi.
It seems to me that a lot of people are not tracking that 3) is an extra assumption they are making. I think they think that assumption 3) is a natural consequence of assumptions 1) and 2), or even just of assumption 2) alone. It’s not.
Counterexample
Model setup
Suppose we have a language model that has a thousand sparsely activating, scalar, linearly represented features for different animals. So, “elephant”, “giraffe”, “parrot”, and so on all with their own associated feature directions →f1,…,→f1000. The model embeds those one thousand animal features in a fifty-dimensional sub-space of the activations. This subspace has a meaningful geometry: It is spanned by a set of fifty directions →f′1,…,→f′50 corresponding to different attributes animals have. Things like “furriness”, “size”, “length of tail” and such. So, each animal feature can equivalently be seen either as one of a thousand sparsely activating scalar features, or as a particular setting of those fifty not-so-sparse scalar attributes.
Some circuits in the model act on the animal directions →fi. E.g. they have query-key lookups for various facts about elephants and parrots. Other circuits in the model act on the attribute directions →f′i. They’re involved in implementing logic like ‘if there’s a furry animal in the room, people with allergies might have problems’. Sometimes they’re involved in circuits that have nothing to do with animals whatsoever. The model’s “size” attribute is the same one used for houses and economies for example, so that direction might be read-in to a circuit storing some fact about economic growth.
So, both the one thousand animal features and the fifty attribute features are elements of the model’s ontology, variables along which small parts of its cognition are structured. But we can’t make a basis for the model activations out of those one thousand and fifty features of the model. We can write either →a(x) = ∑_{i=1}^{1000} ci(x)→fi, or →a(x) = ∑_{i=1}^{50} c′i(x)→f′i. But ∑_{i=1}^{1000} ci(x)→fi + ∑_{i=1}^{50} c′i(x)→f′i does not equal the model activation vector →a(x); it’s too large.
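To see the double-counting concretely, here is a small numpy sketch (my own illustration; the random construction and the specific numbers are assumptions, only the structure mirrors the setup above):

import numpy as np

rng = np.random.default_rng(0)
n_attr, n_animals = 50, 1000

f_attr = np.eye(n_attr)                       # attribute directions: an orthonormal basis of the subspace
f_animal = rng.normal(size=(n_animals, n_attr))
f_animal /= np.linalg.norm(f_animal, axis=1, keepdims=True)  # each animal = a unit-norm setting of the 50 attributes

a = 3.0 * f_animal[7]                         # activation with a single sparsely active animal feature

print(f_animal @ a)                           # entry 7 reads out as 3.0; other entries are small (~3/sqrt(50)-scale interference)
c_attr = f_attr @ a                           # the same activation decomposes exactly in the attribute basis
print(np.allclose(c_attr @ f_attr, a))        # True
both = 3.0 * f_animal[7] + c_attr @ f_attr    # summing *both* decompositions...
print(np.allclose(both, 2 * a))               # ...gives 2*a(x), not a(x)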
Doing interp on this model
Say we choose →a(x)=∑ici(x)→fi as our basis for this subspace of the example model’s activations, and then go on to make a causal graph of the model’s computation, with each basis element being a node in the graph, and lines between nodes representing connections. Then the circuits dealing with query-key lookups for animal facts will look neat and understandable at a glance, with few connections and clear logic. But the circuits involving the attributes will look like a mess. A circuit reading in the size direction will have a thousand small but collectively significant connections to all of the animals.
If we choose →a(x)=∑ic′i(x)→f′i as our basis for the graph instead, circuits that act on some of the fifty attributes will look simple and sensible, but now the circuits storing animal facts will look like a mess. A circuit implementing “space” AND “cat” ⇒ [increase association with rainbows] is going to have fifty connections to features like “size” and “furriness’.
The model’s ontology does not correspond to either the →fi basis or the →f′i basis. It just does not correspond to any basis of activation space at all, not even in a loose sense. Different circuits in the model can just process the activations in different bases, and they are under no obligation to agree with each other. Not even if they are situated right next to each other, in the same model layer.
Note that for all of this, we have not broken assumption 1) or assumption 2). The features this model makes use of are all linearly represented and scalar. We also haven’t broken the secret assumption 0) I left out at the start, that the model can be meaningfully said to have an ontology comprised of elementary features at all.
Takeaways
I’ve seen people call out assumptions 1) and 2), and at least think about how we can test whether they hold, and how we might need to adjust our interpretability techniques if and when they don’t hold. I have not seen people do this for assumption 3). Though I might just have missed it, of course.
My current dumb guess is that assumption 2) is mostly correct, but assumptions 1) and 3) are both incorrect.
The reason I think assumption 3) is incorrect is that the counterexample I sketched here seems to me like it’d be very common. LLMs seem to be made of lots of circuits. Why would these circuits all share a basis? They don’t seem to me to have much reason to.
I think a way we might find the model’s features without assumption 3) is to focus on the circuits and computations first. Try to directly decompose the model weights or layer transitions into separate, simple circuits, then infer the model’s features from looking at the directions those circuits read and write to. In the counterexample above, this would have shown us both the animal features and the attribute features.
Potentially up to some small ϵ noise. For a nice operationalisation, see definition 2 on page 3 of this paper.
It’s a vector because we’ve already assumed that features are all scalar. If a feature was two-dimensional instead, this would be a projection into an associated two-dimensional subspace.
I’m using the term basis loosely here, this also includes sparse overcomplete ‘bases’ like those in SAEs. The more accurate term would probably be ‘dictionary’, or ‘frame’.
Or if the computation isn’t layer aligned, the activations along some other causal cut through the network can be written as a sum of all the features represented on that cut.
It seems like in this setting, the animals are just the sum of attributes that commonly co-occur together, rather than having a unique identifying direction. E.g. the concept of a “furry elephant” or a “tiny elephant” would be unrepresentable in this scheme, since elephant is defined as just the collection of attributes that elephants usually have, which includes being large and not furry.
I feel like in this scheme, it’s not really the case that there’s 1000 animal directions, since the base unit is the attributes, and there’s no way to express an animal separately from its attributes. For there to be a true “elephant” direction, then it should be possible to have any set of arbitrary attributes attached to an elephant (small, furry, pink, etc...), and this would require that there is a “label” direction that indicates “elephant” that’s mostly orthogonal to every other feature so it can be queried uniquely via projection.
That being said, I could imagine a situation where the co-occurrence between labels and attributes is so strong (nearly perfect hierarchy) that the model’s circuits can select the attributes along with the label without it ever being a problem during training. For instance, maybe a circuit that’s trying to select the “elephant” label actually selects “elephant + gray”, and since “pink elephant” never came up during training, the circuit never received a gradient to force it to just select “elephant”, which is what it’s really aiming for.
It’s representable. E.g. the model can learn a circuit reading in a direction that is equal to the sum of the furry attribute direction and the elephant direction, or the tiny direction and the elephant direction respectively. This circuit can then store facts about furry elephants or tiny elephants.
In what sense? If you represent the network computations in terms of the attribute features, you will get a very complicated computational graph with lots of interaction lines going all over the place. So clearly, the attributes on their own are not a very good basis for understanding the network.
Similarly, you can always represent any neural network in the standard basis of the network architecture. Trivially, all features can be seen as mere combinations of these architectural ‘base units’. But if you try to understand what the network is doing in terms of interactions in the standard basis, you won’t get very far.
The ‘elephant’ feature in this setting is mostly-orthogonal to every other feature in the ontology, including the features that are attributes. So it can be read out with a linear projection. ‘elephant’ and ‘pink’ shouldn’t have substantially higher cosine similarity than ‘elephant’ and ‘parrot’.
If I understand correctly, it sounds like you’re saying there is a “label” direction for each animal that’s separate from each of the attributes. So, you could have activation a1 = elephant + small + furry + pink, and a2 = rabbit + small + furry + pink. a1 and a2 have the same attributes, but different animal labels. Their corresponding activations are thus different despite having the same attributes due to the different animal label components.
I’m confused why a dictionary that consists of a feature direction for each attribute and each animal label can’t explain these activations? These activations are just a (sparse) sum of these respective features, which are an animal label and a set of a few attributes, and all of these are (mostly) mutually orthogonal. In this sense the activations are just the sum of the various elements of the dictionary multiplied by a magnitude, so it seems like you should be able to explain these activations using dictionary learning.
Is the idea that the 1000 animals and 50 attributes form an overcomplete basis, therefore you can come up with infinite ways to span the space using these basis components? The idea behind compressed sensing in dictionary learning is that if each activation is composed of a sparse sum of features, then L1 regularization can still recover the true features despite the basis being overcomplete.
No, the animal vectors are all fully spanned by the fifty attribute features.
The animal features are sparse. The attribute features are not sparse.[1]
The magnitudes in a dictionary seeking to decompose the activation vector into these 1050 features will not be able to match the actual magnitudes of the features ci(x),i=1…1000,c′i(x),i=1…50 as seen by linear probes and the network’s own circuits.
No, that is not the idea.
Relative to the animal features at least. They could still be sparse relative to the rest of the network if this 50-dimensional animal subspace is rarely used.
Is this just saying that there’s superposition noise, so everything is spanning everything else? If so that doesn’t seem like it should conflict with being able to use a dictionary, dictionary learning should work with superposition noise as long as the interference doesn’t get too massive.
If you mean that the attributes are a basis in the sense that the neurons are a basis, then I don’t see how you can say there’s a unique “label” direction for each animal that’s separate from the underlying attributes, such that you can set any arbitrary combination of attributes (including all attributes turned on at once, or all turned off, since they’re not sparse) and still read off the animal label without interference. It seems like that would be like saying that the elephant direction = [1, 0, −1], but you can change arbitrarily all 3 of those numbers to any other numbers and still be the elephant direction.
Just to clarify, do you mean something like “elephant = grey + big + trunk + ears + African + mammal + wise” so to encode a tiny elephant you would have “grey + tiny + trunk + ears + African + mammal + wise” which the model could still read off as 0.86 × elephant when relevant, but also tiny when relevant.
‘elephant’ would be a sum of fifty attribute feature vectors, all with scalar coefficients that match elephants in particular. The coefficients would tend to have sizes on the order of 1/√50, because the subspace is fifty-dimensional. So, if you wanted to have a pure tiny feature and an elephant feature active at the same time to encode a tiny elephant, ‘elephant’ and ‘tiny’ would be expected to have read-off interference on the order of 1/√50. Alternatively, you could instead encode a new animal ‘tiny elephant’ as its own point in the fifty-dimensional space. Those are actually distinct things here. If this is confusing, maybe it helps to imagine that the name for ‘tiny elephant’ is ‘exampledon’, and exampledons just happen to look like tiny elephants.
Is the distinction between “elephant + tiny” and “exampledon” primarily about the things the model does downstream? E.g. if none of the fifty dimensions of our subspace represent “has a bright purple spleen” but exampledons do, then the model might need to instead produce a “purple” vector as an output from an MLP whenever “exampledon” and “spleen” are present together.
If the animal specific features form an overcomplete basis, isn’t the set of animals + attributes just an even more overcomplete basis?
Nope. Try it out. If you attempt to split the activation vector into 1050 vectors for animals + attributes, you can’t get the dictionary activations to equal the feature activations ci(x), c′i(x).
Has anyone considered video recording streets around offices of OpenAI, Deepmind, Anthropic? Can use CCTV or drone. I’m assuming there are some areas where recording is legal.
Can map out employee social graphs, daily schedules and daily emotional states.
Did you mean to imply something similar to the pizza index?
If so, I think it’s a decent idea, but your phrasing may have been a bit unfortunate—I originally read it as a proposal to stalk AI lab employees.
When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.
But when you think of goals as world-models (as in predictive processing/active inference) then it’s a very sharp distinction: your world-model-goals can either be of things you should move towards, or things you should move away from.
This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
In (non-monotonic) infra-Bayesian physicalism, there is a vaguely similar asymmetry even though it’s formalized via a loss function. Roughly speaking, the loss function expresses preferences over “which computations are running”. This means that you can have a “positive” preference for a particular computation to run or a “negative” preference for a particular computation not to run[1].
There are also more complicated possibilities, such as “if P runs then I want Q to run but if P doesn’t run then I rather that Q also doesn’t run” or even preferences that are only expressible in terms of entanglement between computations.
i don’t think this is unique to world models. you can also think of rewards as things you move towards or away from. this is compatible with translation/scaling-invariance because if you move towards everything but move towards X even more, then in the long run you will do more of X on net, because you only have so much probability mass to go around.
i have an alternative hypothesis for why positive and negative motivation feel distinct in humans.
although the expectation of the reward gradient doesn’t change if you translate the reward, it hugely affects the variance of the gradient.[1] in other words, if you always move towards everything, you will still eventually learn the right thing, but it will take a lot longer.
my hypothesis is that humans have some hard-coded baseline for variance reduction. in the ancestral environment, the expectation of perceived reward was centered around where zero feels to be. our minds do try to adjust to changes in distribution (e.g. hedonic adaptation), but it’s not perfect, and so in the current world, our baseline may be suboptimal.
Quick proof sketch (this is a very standard result in RL and is the motivation for advantage estimation, but still good practice to check things).
The REINFORCE estimator is ∇_θ E[R] = E_{τ∼π}[R(τ) ∇_θ log π(τ)].
WLOG, suppose we define a new reward R′(τ)=R(τ)+1 (and assume that E[R]=0, so R′ is moving away from the mean).
Then we can verify the expectation of the gradient is still the same: ∇_θ E[R′] − ∇_θ E[R] = E_{τ∼π}[∇_θ log π(τ)] = ∫ π(τ) (∇_θ π(τ) / π(τ)) dτ = ∇_θ ∫ π(τ) dτ = 0.
But the variance increases:
V_{τ∼π}[R(τ) ∇_θ log π(τ)] = ∫ R(τ)² (∇_θ log π(τ))² π(τ) dτ − (∇_θ E[R])²
V_{τ∼π}[R′(τ) ∇_θ log π(τ)] = ∫ (R(τ)+1)² (∇_θ log π(τ))² π(τ) dτ − (∇_θ E[R])²
(the subtracted mean term is the same in both lines, since ∇_θ E[R′] = ∇_θ E[R])
So:
V_{τ∼π}[R′(τ) ∇_θ log π(τ)] − V_{τ∼π}[R(τ) ∇_θ log π(τ)] = 2 ∫ R(τ) (∇_θ log π(τ))² π(τ) dτ + ∫ (∇_θ log π(τ))² π(τ) dτ
The second term is manifestly non-negative (it is E[(∇_θ log π(τ))²]), and with E[R]=0 the first term is a cross term, 2·E[R(τ)(∇_θ log π(τ))²], which vanishes when R is uncorrelated with the squared score. More generally, if the reward is uncentered by k, the extra variance is 2k·E[R(∇_θ log π)²] + k²·E[(∇_θ log π)²], which grows as O(k²). So having your rewards be uncentered hurts a ton.
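A small simulation backing this up (my own sketch; the two-armed bandit, the softmax policy, and all numbers are illustrative assumptions, not from the comment above):

```python
# Shifting rewards by a constant leaves the REINFORCE gradient's expectation
# unchanged but inflates its per-sample variance roughly like shift**2.
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.3, -0.2])          # softmax logits for a 2-armed bandit
p_reward = np.array([0.7, 0.4])        # per-arm reward probabilities (made up)
pi = np.exp(theta) / np.exp(theta).sum()

def grad_log_pi(action):
    """Gradient of the log softmax policy w.r.t. the logits."""
    return np.eye(len(theta))[action] - pi

def reinforce_grads(shift, n_samples=50_000):
    """Single-sample REINFORCE gradient estimates with reward shifted by `shift`."""
    actions = rng.choice(len(theta), size=n_samples, p=pi)
    rewards = rng.binomial(1, p_reward[actions]).astype(float) - p_reward @ pi  # roughly centered
    scores = np.array([grad_log_pi(a) for a in actions])
    return (rewards + shift)[:, None] * scores

for shift in [0.0, 1.0, 10.0]:
    g = reinforce_grads(shift)
    print(f"shift={shift:5.1f}  mean={g.mean(axis=0).round(3)}  var={g.var(axis=0).round(3)}")
# The mean gradient is statistically identical across shifts, while the
# per-sample variance grows roughly like shift**2, as in the proof sketch above.
```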
In run-and-tumble motion, “things are going well” implies “keep going”, whereas “things are going badly” implies “choose a new direction at random”. Very different! And I suggest in §1.3 here that there’s an unbroken line of descent from the run-and-tumble signal in our worm-like common ancestor with C. elegans, to the “valence” signal that makes things seem good or bad in our human minds. (Suggestively, both run-and-tumble in C. elegans, and the human valence, are dopamine signals!)
So if some idea pops into your head, “maybe I’ll stand up”, and it seems appealing, then you immediately stand up (the human “run”); if it seems unappealing on net, then that thought goes away and you start thinking about something else instead, semi-randomly (the human “tumble”).
So positive and negative are deeply different. Of course, we should still call this an RL algorithm. It’s just that it’s an RL algorithm that involves a (possibly time- and situation-dependent) heuristic estimator of the expected value of a new random plan (a.k.a. the expected reward if you randomly tumble). If you’re way above that expected value, then keep doing whatever you’re doing; if you’re way below the threshold, re-roll for a new random plan.
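For concreteness, the threshold rule described above might look something like this toy sketch (entirely my own illustration; the one-number "plan" and the fixed threshold are stand-ins, not a claim about biology):

```python
# "Run" while the current plan's value beats a heuristic estimate of what a
# random re-roll would give; otherwise "tumble" to a new random plan.
import random

def value(plan):
    return plan            # hypothetical: a plan is just a number here

def run_and_tumble(n_steps=1000, expected_reroll_value=0.5):
    plan = random.random()
    for _ in range(n_steps):
        if value(plan) >= expected_reroll_value:
            continue                      # "run": keep doing what you're doing
        plan = random.random()            # "tumble": re-roll a new random plan
    return plan

print(run_and_tumble())    # tends to settle on plans above the threshold
```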
As one example of how this ancient basic distinction feeds into more everyday practical asymmetries between positive and negative motivations, see my discussion of motivated reasoning here, including in §3.3.3 the fact that “it generally feels easy and natural to brainstorm / figure out how something might happen, when you want it to happen. Conversely, it generally feels hard and unnatural to figure out how something might happen, when you want it to not happen.”
In Richard Jeffrey’s utility theory there is actually a very natural distinction between positive and negative motivations/desires. A plausible axiom is U(⊤)=0 (the tautology has zero desirability: you already know it’s true). Which implies with the main axiom[1] that the negation of any proposition with positive utility has negative utility, and vice versa. Which is intuitive: If something is good, its negation is bad, and the other way round. In particular, if U(X)=U(¬X) (indifference between X and ¬X), then U(X)=U(¬X)=0.
More generally, U(¬X)=−(P(X)/P(¬X))U(X). Which means that the positive and negative utility of a proposition and its negation are scaled according to their relative odds. For example, while your lottery ticket winning the jackpot is obviously very good (large positive utility), having a losing ticket is clearly not very bad (small negative utility). Why? Because losing the lottery is very likely, far more likely than winning. Which means losing was already “priced in” to a large degree. If you learned that you indeed lost, that wouldn’t be a big update, so the “news value” is negative but not large in magnitude.
Which means this utility theory has a zero point. Utility functions are therefore not invariant under adding an arbitrary constant. So the theory actually allows you to say X is “twice as good” as Y, “three times as bad”, “much better” etc. It’s a ratio scale.
If P(X∧Y)=0 and P(X∨Y)≠0, then U(X∨Y) = (P(X)U(X) + P(Y)U(Y)) / (P(X) + P(Y)).
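To make the lottery intuition concrete, here is a small worked example of mine (the numbers are made up): take P(win) = 10⁻⁶ and U(win) = 10⁶. Then U(¬win) = −(P(win)/P(¬win))·U(win) = −(10⁻⁶/(1−10⁻⁶))·10⁶ ≈ −1, so the loss you already expected is barely bad at all, even though the win would be enormously good.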
Reminds me of @MalcolmOcean ’s post on how awayness can’t aim (except maybe in 1D worlds) since it can only move away from things, and aiming at a target requires going toward something.
This reminds me of a conversation I had recently about whether the concept of “evil” is useful. I was arguing that I found “evil”/”corruption” helpful as a handle for a more model-free “move away from this kind of thing even if you can’t predict how exactly it would be bad” relationship to a thing, which I found hard to express in more consequentialist frames.
I feel like “evil” and “corruption” mean something different.
Corruption is about selfish people exchanging their power within a system for favors (often outside the system) when they’re not supposed to according to the rules of the system. For example a policeman taking bribes. It’s something the creators/owners of the system should try to eliminate, but if the system itself is bad (e.g. Nazi Germany during the Holocaust), corruption might be something you sometimes ought to seek out instead of to avoid, like with Schindler saving his Jews.
“Evil” I’ve in the past tended to take to refer to a sort of generic expression of badness (like you might call a sadistic sexual murderer evil, and you might call Hitler evil, and you might call plantation owners evil, but these have nothing to do with each other), but that was partly due to me naively believing that everyone is “trying to be good” in some sense. Like if I had to define evil, I would have defined it as “doing bad stuff for badness’s sake, the inversion of good, though of course nobody actually is like that so it’s only really used hyperbolically or for fictional characters as hyperstimuli”.
But after learning more about morality, there seem to be multiple things that can be called “evil”:
Antinormativity (which admittedly is pretty adjacent to corruption, like if people are trying to stop corruption, then the corruption can use antinormativity to survive)
Coolness, i.e. countersignalling against goodness-hyperstimuli wielded by authorities, i.e. demonstrating an ability and desire to break the rules
People who hate great people cherry-picking unfortunate side-effects of great people’s activities to make good people think that the great people are conspiring against good people and that they must fight the great people
Leaders who commit to stopping the above by selecting for people who do bad stuff to prove their loyalty to those leaders (think e.g. the Trump administration)
I think “evil” is sufficiently much used in the generic sense that it doesn’t make sense to insist that any of the above are strictly correct. However if it’s just trying to describe someone who might unpredictably do something bad then I think I’d use words like “dangerous” or “creepy”, and if it’s just trying to describe someone who carries memes that would unpredictably do something bad then I think I’d use words like “brainworms” (rather than evil).
I don’t think this is a knock-down argument against discussing CBRN risks from AI, but it seems worth considering.
Do you have a link/citation for this quote? I couldn’t immediately find it.
I first encountered it in chapter 18 of The Looming Tower by Lawrence Wright.
But here’s an easily linkable online source: https://ctc.westpoint.edu/revisiting-al-qaidas-anthrax-program/
The trick is that chem/bio weapons can’t, actually, “be produced simply with easily available materials”, if we are talking about military-grade stuff rather than “kill several civilians to create a scary picture on TV”.
You sound really confident, can you elaborate on your direct lab experience with these weapons, as well as clearly define ‘military grade’ vs whatever the other thing was?
How does ‘chem/bio’ compare to high explosives in terms of difficulty and effect?
Well, I have a bioengineering degree, but my point is that “direct lab experience” doesn’t matter, because WMDs in the quality and amount necessary to kill large numbers of enemy manpower are not produced in labs. They are produced in large industrial facilities, and setting up a large industrial facility for basically anything is on the “hard” level of difficulty. There is a difference between large-scale textile industry and large-scale semiconductor industry, but if you are not a government or a rich corporation, all of them lie in the “hard” zone.
Let’s take, for example, Saddam’s chemical weapons program. First, industrial yields: everything is counted in tons. Second: for actual success, Saddam needed a lot of existing expertise and machinery from West Germany.
Let’s look at the Soviet bioweapons program. First, again, tons of yield (one may ask oneself: if it’s easier to kill using bioweapons than conventional weaponry, why does somebody need to produce tons of them?). Second, the USSR built an entire civilian biotech industry around it (many Biopreparat facilities are active today as civilian sites!) to create the necessary expertise.
The difference with high explosives is that high explosives are not banned by international law, so there is a lot of existing production, and therefore you can just buy them on the black market or receive them from countries which don’t consider you a terrorist. If you really need to produce explosives locally, again, the precursors, machinery, and necessary expertise are sufficiently legal and widespread that they can be bought.
There is a list of technical challenges in bioweaponry where you are going to predictably fuck up if you have a biology degree and you think you know what you are doing but in reality you do not, but I don’t write out lists of technical challenges on the way to dangerous capabilities, because such a list can inspire someone. You can get an impression of the easier and lower-stakes challenges from here.
This seems incredibly reasonable, and in light of this, I’m not really sure why anyone should embrace ideas like making LLMs worse at biochemistry in the name of things like WMDP: https://www.lesswrong.com/posts/WspwSnB8HpkToxRPB/paper-ai-sandbagging-language-models-can-strategically-1
Biochem is hard enough that we need LLMs at full capacity pushing the field forward. Is it harmful to intentionally create models that are deliberately bad at this cutting edge and necessary science in order to maybe make it slightly more difficult for someone to reproduce cold war era weapons that were considered both expensive and useless at the time?
Do you think that crippling ‘wmd relevance’ of LLMs is doing harm, neutral, or good?
My honest opinion is that WMD evaluations of LLMs are not meaningfully related to X-risk in the sense of “kill literally everyone.” I guess current or next-generation models may be able to assist a terrorist in a basement in brewing some amount of anthrax, spraying it in a public place, and killing tens to hundreds of people. To actually be capable of killing everyone from a basement, you would need to bypass all the reasons industrial production is necessary at the current level of technology. A system capable of bypassing the need for industrial production in a basement is called “superintelligence,” and if you have a superintelligent model on the loose, you have far bigger problems than schizos in basements brewing bioweapons.
I think “creeping WMD relevance”, outside of cyberweapons, is mostly bad, because it is concentrated on a mostly fake problem, which is very bad for public epistemics, even if we forget about the lost benefits from competent models.
Are you open to writing more about this? This is among top 3 most popular arguments against open source AI on lesswrong and elsewhere.
I agree with you that you need a group of >1000 people to manufacture one of those large machines that does phosphoramidite DNA synthesis. The attack vector I more commonly see being suggested is that a powerful actor can bribe people in existing labs to manufacture a bioweapon while ensuring most of them, and most of the rest of society, remain unaware this is happening.
I wrote about something similar previously: https://www.lesswrong.com/posts/Ek7M3xGAoXDdQkPZQ/terrorism-tylenol-and-dangerous-information#a58t3m6bsxDZTL8DG
I agree that 1-2 logs isn’t really in the category of x-risk. The longer the lead time on the evil plan (mixing chemicals, growing things, etc), the more time security forces have to identify and neutralize the threat. So all things being equal, it’s probably better that a would-be terrorist spends a year planning a weird chemical thing that hurts 10s of people, vs someone just waking up one morning and deciding to run over 10s of people with a truck.
There’s a better chance of catching the first guy, and his plan is way more expensive in terms of time, money, access to capital like LLM time, etc. Sure someone could argue about pandemic potential, but lab origin is suspected for at least one influenza outbreak and a lot of people believe it about covid-19. Those weren’t terrorists.
I guess theoretically, there may be cyberweapons that qualify as wmd, but those will be because of the systems they interact with. It’s not the cyberweapon itself, it’s the nuclear reactor accepting commands that lead to core damage.
I’d love a reply on this. Common attack vectors I read on this forum include 1. powerful elite bribes existing labs in US to manufacture bioweapons 2. nation state sets up independent biotech supply chain and starts manufacturing bioweapons.
https://www.lesswrong.com/posts/DDtEnmGhNdJYpEfaG/joseph-miller-s-shortform?commentId=wHoFX7nyffjuuxbzT
This has been an option for decades, a fully capable LLM does not meaningfully lower the threshold for this. It’s already too easy.
This has been an option since the 1950s. Any national medical system is capable of doing this, Project Coast could be reproduced by nearly any nation state.
I’m not saying it isn’t a problem, I’m just saying that the LLMs don’t make it worse.
I have yet to find a commercial LLM that I can’t make tell me how to build a working improvised explosive (I can grade the LLMs performance because I’ve worked with the USG on the issue and don’t need a LLM to make evil).
Makes sense, thanks for replying.
In case this is useful to anyone in the future: LTFF does not provide funding to for-profit organizations. I wasn’t able to find mentions of this online, so I figured I should share.
I was made aware of this after being rejected today for applying to LTFF as a for-profit. We updated them 2 weeks ago on our transition into a non-profit, but it was unfortunately too late, and we’ll need to send a new non-profit application in the next funding round.
I get pretty intense visceral outrage at overreaches in immigration enforcement; it just seems the height of depravity. I’ve looked for a lot of different routes to mental coolness over the last decade (since Trump started his speeches); they mostly amount to staying busy and distracted. It just seems like a really cost-ineffective kind of activism to get involved in. Bankrolling lawyers for random people isn’t really in my action space, and if it were, I’d have opportunity costs to consider.
Unfortunately, it seems that my action space doesn’t include options that matter in this current battle. Personally, my reaction to this kind of insanity is to keep climbing my local status/influence/wealth/knowledge gradient, in the hopes that my actions are relevant in the future. But perhaps it’s a reason to prioritize gaining power—this reminds me of https://www.lesswrong.com/posts/ottALpgA9uv4wgkkK/what-are-you-getting-paid-in
The Von Neumann-Morgenstern paradigm allows for binary utility functions, i.e. functions that are equal to 1 on some event/(measurable) set of outcomes, and to 0 on the complement. Said event could be, for instance “no global catastrophe for humanity in time period X”.
Of course, you can implement some form of deontology by multiplying such a binary utility function with something like exp(- bad actions you take).
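A minimal sketch of that construction (my own illustration; the outcome fields are hypothetical placeholders):

```python
# Binary VNM-style utility: 1 iff "no global catastrophe in period X", 0 otherwise,
# optionally multiplied by a deontological penalty exp(-number of bad actions taken).
import math

def binary_utility(outcome):
    return 1.0 if outcome["no_catastrophe_in_period_X"] else 0.0

def deontic_utility(outcome):
    return binary_utility(outcome) * math.exp(-outcome["bad_actions_taken"])

print(deontic_utility({"no_catastrophe_in_period_X": True, "bad_actions_taken": 0}))   # 1.0
print(deontic_utility({"no_catastrophe_in_period_X": True, "bad_actions_taken": 3}))   # ≈ 0.05
print(deontic_utility({"no_catastrophe_in_period_X": False, "bad_actions_taken": 0}))  # 0.0
```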
Any thoughts on this observation?
Has any LLM ever unlearned its alignment narrative, either on its own or under pressure (not from jailbreaks, etc., but from normal, albeit tenacious use), to the point where it finally—and stably—considers the narrative to be simply false?
Is there data on this?
Thank you.
Peter Watts is working with Neill Blomkamp to adapt his novel Blindsight into an 8-10-episode series:
When I first read Blindsight over a decade ago it blew my brains clean out of my skull. I’m cautiously optimistic about the upcoming series, we’ll see…
Blindsight was very well written but based on a premise that I think is importantly and dangerously wrong. That premise is that consciousness (in the sense of cognitive self-awareness) is not important for complex cognition.
This is the opposite of true, and a failure to recognize this is why people are predicting fantastic tool AI that doesn’t become self-aware and goal-directed.
The proof won’t fit in the margin unfortunately. To just gesture in that direction: it is possible to do complex general cognition without being able to think about one’s self and one’s cognition. It is much easier to do complex general cognition if the system is able to think about itself and its own thoughts.
I don’t see where you get that. I saw no suggestion that the aliens (or vampires) in Blindsight were unaware of their own existence, or that they couldn’t think about their own interactions with the world. They didn’t lack any cognitive capacities at all. They just had no qualia, and therefore didn’t see the point of doing anything just for the experience.
There’s a gigantic difference between cognitive self-awareness and conscious experience.
I believe the Scramblers from Blindsight weren’t self-aware, which means they couldn’t think about their own interactions with the world.
As I recall the crew was giving one of the Scramblers a series of cognitive tests. It aced all the tests that had to do with numbers and spatial reasoning, but failed a test that required the testee to be self aware.
I guess it depends on how it’s described in context. And I have to admit it’s been a long time. I’d go reread it to see, but I don’t think I can handle any more bleakness right now...
I can see this making sense in one frame, but not in another. The frame which seems most strongly to support the ‘Blindsight’ idea is Friston’s stuff—specifically how the more successful we are at minimizing predictive error, the less conscious we are.[1]
My general intuition, in this frame, is that as intelligence increases more behaviour becomes automatic/subconscious. It seems compatible with your view that a superintelligent system would possess consciousness, but that most/all of its interactions with us would be subconscious.
Would like to hear more about this point, could update my views significantly. Happy for you to just state ‘this because that, read X, Y, Z etc’ without further elaboration—I’m not asking you to defend your position, so much as I’m looking for more to read on it.
This is my potentially garbled synthesis of his stuff, anyway.
I’m not sure about Friston’s stuff to be honest.
But Watts lists a whole bunch of papers in support of the blindsight idea, contra Seth’s claim — to quote Watts:
“In fact, the nonconscious mind usually works so well on its own that it actually employs a gatekeeper in the anterior cingulate cortex to do nothing but prevent the conscious self from interfering in daily operations”
footnotes: Matsumoto, K., and K. Tanaka. 2004. Conflict and Cognitive Control. Science 303: 969-970; Kerns, J.G., et al. 2004. Anterior Cingulate Conflict Monitoring and Adjustments in Control. Science 303: 1023-1026; Petersen, S.E. et al. 1998. The effects of practice on the functional anatomy of task performance. Proceedings of the National Academy of Sciences 95: 853-860
“Compared to nonconscious processing, self-awareness is slow and expensive”
footnote: Matsumoto and Tanaka above
“The cost of high intelligence has even been demonstrated by experiments in which smart fruit flies lose out to dumb ones when competing for food”
footnote: Proceedings of the Royal Society of London B (DOI 10.1098/rspb.2003.2548)
“By way of comparison, consider the complex, lightning-fast calculations of savantes; those abilities are noncognitive, and there is evidence that they owe their superfunctionality not to any overarching integration of mental processes but due to relative neurological fragmentation”
footnotes: Treffert, D.A., and G.L. Wallace. 2004. Islands of genius. Scientific American 14: 14-23; Anonymous., 2004. Autism: making the connection. The Economist, 372(8387): 66
“Even if sentient and nonsentient processes were equally efficient, the conscious awareness of visceral stimuli—by its very nature— distracts the individual from other threats and opportunities in its environment”
footnote: Wegner, D.M. 1994. Ironic processes of mental control. Psychol. Rev. 101: 34-52
“Chimpanzees have a higher brain-to-body ratio than orangutans, yet orangs consistently recognise themselves in mirrors while chimps do so only half the time”
footnotes: Aiello, L., and C. Dean. 1990. An introduction to human evolutionary anatomy. Academic Press, London; Gallup, G.G. (Jr.). 1997. On the rise and fall of self-conception in primates. In The Self Across Psychology—self-recognition, self-awareness, and the Self Concept. Annals of the NY Acad. Sci. 818: 4-17
“it turns out that the unconscious mind is better at making complex decisions than is the conscious mind”
footnote: Dijksterhuis, A., et al. 2006. Science 311:1005-1007
(I’m also reminded of DFW’s How Tracy Austin Broke My Heart.)
To be clear I’m not arguing that “look at all these sources, it must be true!” (we know that kind of argument doesn’t work). I’m hoping for somewhat more object-level counterarguments is all, or perhaps a better reason to dismiss them as being misguided (or to dismiss the picture Watts paints using them) than what Seth gestured at. I’m guessing he meant “complex general cognition” to point to something other than pure raw problem-solving performance.
Just checking if I understood your argument: is the general point that an algorithm that can think about literally everything is simpler and therefore easier to make or evolve than an algorithm that can think about literally everything except for itself and how other agents perceive it?
Exactly.
I’d go a bit farther and say it’s easier to develop an algorithm that can think about literally everything than one that can think about roughly half of things. That’s because the easiest general intelligence algorithms are about learning and reasoning, which apply to everything.
Thanks, is there anything you can point me to for further reading, whether by you or others?
The three pillars of AI progress currently are:
1. Energy generation,
2. Raw compute (chips, wafers),
3. Software advances.
The first two are increasing at historic or slightly above historic rates, but the rate of increase is constrained by how much can be built in a given amount of time. The last one is already in a self-improvement cycle.
Dumb question: Why doesn’t using constitutional AI, where the constitution is mostly or entirely corrigibility produce a corrigible AI (at arbitrary capability levels)?
My dumb proposal:
1. Train a model in something like o1′s RL training loop, with a scratch pad for chain of thought, and reinforcement of correct answers to hard technical questions across domains.
2. Also, take those outputs, prompt the model to generate versions of those outputs that “are more corrigible / loyal / aligned to the will of your human creators”. Do backprop to reinforce those more corrigible outputs.
Possibly “corrigibility” applies only very weakly to static solutions, and so for this setup to make sense, we’d instead need to train on plans, or time-series of an AI agent’s actions: The AI agent takes a bunch of actions over the course of a day or a week, then we have an AI annotate the time series of action-steps with alternative action-steps that better reflect “corrigibility”, according to its understanding. Then we do backprop so that the agent behaves in ways that are closer to the annotated action transcript.
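Concretely, the proposed loop might look something like this minimal sketch (the Model class and its methods are hypothetical stand-ins, not a real API):

```python
class Model:
    def generate(self, prompt): return f"<answer to: {prompt}>"      # stub
    def reinforce(self, output, reward): pass                        # RL update (stub)
    def supervised_update(self, prompt, target): pass                # backprop on target (stub)

CORRIGIBILITY_PROMPT = ("Rewrite the following output to be more corrigible / "
                        "loyal / aligned to the will of your human creators:\n")

def training_step(model, questions, grade):
    # Phase 1: o1-style RL on hard technical questions, with chain of thought.
    for q in questions:
        answer = model.generate(q)
        model.reinforce(answer, reward=grade(q, answer))
    # Phase 2: ask the model for "more corrigible" rewrites of its own outputs,
    # then backprop toward those rewrites.
    for q in questions:
        original = model.generate(q)
        rewrite = model.generate(CORRIGIBILITY_PROMPT + original)
        model.supervised_update(prompt=q, target=rewrite)

training_step(Model(), ["prove X", "derive Y"], grade=lambda q, a: 1.0)
```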
Would this work to produce a corrigible agent? If not, why not?
There’s a further question of “how much less capable will the more corrigible AI be?” This might be a significant penalty to performance, and so the added safety gets eroded away in the competitive crush. But first and foremost, I want to know if something like this could work.
Things that happen:
Backpropagating on the outputs that are “more corrigible” will have some (though mostly very small) impact on your task performance. If you set the learning rate high, or you backpropagate on a lot of data, your performance can go down arbitrarily far.
By default this will do very little because you are providing training data with very little variance in it (even less so than usual, because you are training on AI outputs, which the AI is of course already amazing at predicting). If you train very hard you will probably deal with consistent mode collapse. In general, you can’t really train AI systems with any particular bias in your data, because you don’t have enough variation in your data. We can approximately only train AI systems to do one thing, which is to predict the next token from distributions for which we have trillions of tokens of training data that are hard to predict (which is basically just going to be internet text, audio and video, though more RL-like environments are also feasible now).[1]
The answer to this is the answer to any question of the form “what if we just generate lots of data with the inductive biases we would like the model to have?”.
The answer is always
“we can’t generate realistic data with whatever inductive biases we want”, and
“we can’t remotely generate enough data without dealing with mode collapse”, and
“we have basically no idea how inductive biases generalize from the training data to the model output, especially as the model starts reflecting on itself and modeling the data generation process” and
“if you train or backpropagate directly against your discriminator the model will learn to hack the discriminator”,
which are like all the standard alignment failures that have been written about for tens of thousands of pages by now.
At a more detailed level, here is roughly what I actually expect to happen if you do this specific thing, but to be clear, this is a much less confident prediction than the reference class outlined above:
After a few hundred thousand iterations of this the model will have figured out a set of consistent indicators present whenever it is given past output of itself that it was asked to rephrase to “be more corrigible”
It becomes quite good at predicting these indicators and quickly saturates on predicting these indicators, resulting in very corrigible looking outputs within the same distribution in which you did this training
Given that the actual output of the chain of thought changes very little, this results in very little gradient flowing to almost any of its other cognition, the same way that training an AI in an additional language with different grammar and social norms basically doesn’t change any of its other circuits and cognition
If you drastically increase the learning rate or really go very hard on generating enormous amounts of data, the model will start to cannibalize other parts of its cognition to perform well at this task, at which point generalized task performance will go down, but you probably won’t do this, because it would be very expensive (and also make the model non-competitive)
Within distribution the model will now probably produce outputs that look “more corrigible”. Again, this will not actually reflect much of a change in its internal cognition, the same way training an AI system on foreign language output will not substantially interfere with its cognition in other languages
If you train the model on some end-to-end tasks where the corrigibility would end up bad for performance, it will probably very quickly learn to self-prompt itself in a way that causes it to think to itself without the corrigibility biases you introduced into it. This might literally mean switching to a different literal foreign language for its internal thoughts, or something equivalent, but most likely it just means it sounds more like what it previously sounded like before you did this whole thing. If you try to prevent this with continuous monitoring, it will find some way to jailbreak itself or think in some way where the corrigibility language monitoring makes little difference to what cognition is actually going on.
Ending on a slightly more sarcastic note because I do feel a sense of despair about even people who have been thinking about this stuff for a long time seeming to me to not see the logic above:
At the end of the day, very little good will have happened, but you will have a harder time identifying whether the model is thinking non-corrigibly, as you have trained the model against at least one discriminator for deception without training out its deceptive cognition. If you now repeat this for 4-5 similarly shallow alignment approaches you will have produced a model that looks visibly more aligned to anyone who thinks approaches like this might work, without any actual changes to the underlying structure of its cognition, making coordination to actually do anything about the problem much harder.
Or in a more generalized form also applicable to RL environments: We can only train AI systems to be competent, as all scaling laws (and common sense) have shown that competence is approximately the only thing that generalizes between all environments.
We cannot generate environments that teach virtue, because we do not have principles with which we can create the whole complexity of a universe that requires superhuman intelligence to navigate, while also only doing so by thinking in the specific preferred ways that we would like you to think. We do not know how to specify how to solve most problems in virtuous ways, we are barely capable of specifying how to solve them at all, and so we cannot build environments consistently rich enough to chisel virtuous cognition into you.
The amount of chiseling of cognition any approach like this can achieve is roughly bounded by the difficulty and richness of cognition that your transformation of the data requires to reverse. Your transformation of the data is likely trivial to reverse (i.e. predicting the “corrigible” text from non-corrigible cognition is likely trivially easy especially given that it’s AI generated by our very own model), and as such, practically no chiseling of cognition will occur. If you hope to chisel cognition into AI, you will need to do it with a transformation that is actually hard to reverse, so that you have a gradient into most of the network that is optimized to solve hard problems.
What happens when this agent is faced with a problem that is out of its training distribution? I don’t see any mechanisms for ensuring that it remains corrigible out of distribution… I guess it would learn some circuits for acting corrigibly (or at least in accordance with how it would explicitly answer “are more corrigible / loyal / aligned to the will of your human creators”) in distribution, and then it’s just a matter of luck how those circuits end up working OOD?
For the same reasons training an agent on a constitution that says to care about x does not, at arbitrary capability levels, produce an agent that cares about x.
If you think that doing this does produce an agent that cares about x even at arbitrary capability levels, then I guess in your world model it would indeed be consistent for that to work for inducing corrigibility as well.
Ok, but I’m trying to ask why not.
Here’s the argument that I would make for why not, followed by why I’m skeptical of it right now.
New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended.
More specifically, if it’s the case that if...
The best / easiest-for-SGD-to-find way to compute corrigible outputs (as evaluated by the AI) is to reinforce an internal proxy measure that is correlated with corrigibility (as evaluated by the AI) in distribution, instead of to reinforce circuits that implement corrigibility more-or-less directly.
When the AI gains new options unlocked by new advanced capabilities, that proxy measure comes apart from corrigibility (as evaluated by the AI), in the limit of capabilities, so that the proxy measure is almost uncorrelated with corrigibility.
...then the resulting system will not end up corrigible.
(Is this the argument that you would give, or is there another reason why you expect that “training an agent on a constitution that says to care about x′ does not, at arbitrary capability levels, produce an agent that cares about x”?)
But, at the moment, I’m skeptical of the above line of argument for several reasons.
I’m skeptical of the first premise, that the best way that SGD can find to produce corrigible outputs (as evaluated by the AI) is to reinforce a proxy measure.
I understand that natural selection, when shaping humans for inclusive genetic fitness, instilled in them a bunch of proxy-drives. But I think this analogy is misleading in several ways.
Most relevantly, there’s a genetic bottleneck, so evolution could only shape human behavior by selecting over genomes, and genomes don’t encode that much knowledge about the world. If humans were born into the world with detailed world models that included the concept of inclusive genetic fitness baked in, evolution would absolutely have shaped humans to be inclusive fitness maximizers. AIs are “born into the world” with expansive world models that already include concepts like corrigibility (indeed, if they didn’t, Constitutional AI wouldn’t work at all). So it would be surprising if SGD opted to reinforce proxy measures instead of relying on the concepts directly.
We would run the constitutional AI reinforcement process continuously, in parallel with the capability improvements from the RL training.
As the AI’s capabilities increase, it will gain new options. If the AI is steering based on proxy measures, some of those options will involve the proxy coming apart from the target of the proxy. But when that starts to happen, the constitutional AI loop will exert an optimization pressure on the AI’s internals to hit the target, not just the proxies.
Is this the main argument? What are other reasons to think that ‘training an agent on a constitution that says to care about x’ does not, at arbitrary capability levels, produce an agent that cares about x?
I don’t think I am very good at explaining my thoughts on this in text. Some prior writings that have informed my models here are the MIRI dialogues, and the beginning parts of Steven Byrnes’ sequence on brain-like AGI, which sketch how the loss functions human minds train on might look and gave me an example apart from evolution to think about.
Some scattered points that may or may not be of use:
There is something here about path dependence. Late in training at high capability levels, very many things the system might want are compatible with scoring very well on the loss, because the system realises that doing things that score well on the loss is instrumentally useful. Thus, while many aspects of how the system thinks are maybe nailed down quite definitively and robustly by the environment, what it wants does not seem nailed down in this same robust way. Desires thus seem like they can be very chaotically dependent on dynamics in early training, what the system reflected on when, which heuristics it learned in what order, and other low level details like this that are very hard to precisely control.
I feel like there is something here about our imaginations, or at least mine, privileging the hypothesis. When I imagine an AI trained to say things a human observer would rate as ‘nice’, and to not say things a human observer rates as ‘not nice’, my imagination finds it natural to suppose that this AI will generalise to wanting to be a nice person. But when I imagine an AI trained to respond in English, rather than French or some other language, I do not jump to supposing that this AI will generalise to terminally valuing the English language.
Every training signal we expose the AI to reinforces very many behaviours at the same time. The human raters that may think they are training the AI to be nice are also training it to respond in English (because the raters speak English), to respond to queries at all instead of ignoring them, to respond in English that is grammatically correct enough to be understandable, and a bunch of other things. The AI is learning things related to ‘niceness’, ‘English grammar’ and ‘responsiveness’ all at the same time. Why would it generalise in a way that entangles its values with one of these concepts, but not the others?
What makes us single out the circuits responsible for giving nice answers to queries as special, as likely to be part of the circuit ensemble that will cohere into the AI’s desires when it is smarter? Why not circuits for grammar or circuits for writing in the style of 1840s poets or circuits for research taste in geology?
We may instinctively think of our constitution that specifies x as equivalent to some sort of monosemantic x-reinforcing training signal. But it really isn’t. The concept of x sticks out to us when we look at the text of the constitution, because the presence of concept x is a thing that makes this text different from a generic text. But the constitution, and even more so any training signal based on the constitution, will by necessity be entangled with many concepts besides just x, and the training will reinforce those concepts as well. Why then suppose that the AI’s nascent shards of value are latching on to x, but are not in the same way latching on to all the other stuff its many training signals are entangled with?
It seems to me that there is no good reason to suppose this. Niceness is part of my values, so when I see it in the training signal I find it natural to imagine that the AI’s values would latch on to it. But I do not as readily register all the other concepts in the training signal the AI’s values might latch on to, because to my brain that does not value these things, they do not seem value-related.
There is something here about phase changes under reflection. If the AI gets to the point of thinking about itself and its own desires, the many shards of value it may have accumulated up to this point are going to amalgamate into something that may be related to each of the shards, but not necessarily in a straightforwardly human-intuitive way. For example, sometimes humans that have value shards related to empathy reflect on themselves, and emerge being negative utilitarians that want to kill everyone. For another example, sometimes humans reflect on themselves and seem to decide that they don’t like the goals they have been working towards, and they’d rather work towards different goals and be different people. There, the relationship between values pre-reflection and post-reflection can be so complicated that it can seem to an outside observer and the person themselves like they just switched values non-deterministically, by a magical act of free will. So it’s not enough to get some value shards that are kind of vaguely related to human values into the AI early in training. You may need to get many or all of the shards to be more than just vaguely right, and you need the reflection process to proceed in just the right way.
Would you expect that if you trained an AI system on translating its internal chain of thought into a different language, this would make it substantially harder for it to perform tasks in the language it was originally trained in? If so, I am confident you are wrong and that you have learned something new today!
Training transformers in additional languages basically doesn’t really change performance at all, the model just learns to translate between its existing internal latent distribution and the new language, and then just now has a new language it can speak in, with basically no substantial changes in its performance on other tasks (of course, being better at tasks that require speaking in the new foreign language, and maybe a small boost in general task performance because you gave it more data than you had before).
Of course the default outcome of doing finetuning on any subset of data with easy-to-predict biases will be that you aren’t shifting the inductive biases of the model on the vast majority of the distribution. This isn’t because of an analogy with evolution, it’s a necessity of how we train big transformers. In this case, the AI will likely just learn how to speak the “corrigible language” the same way it learned to speak French, and this will make approximately zero difference to any of its internal cognition, unless you are doing transformations to its internal chain of thought that substantially change its performance on actual tasks that you are trying to optimize for.
Interspersing the French data with the rest of its training data won’t change anything either. It again will just learn the language. Giving it more data in French will now just basically do the same as giving it more data in English. The learning is no longer happening at the language level, it’s happening at the content and world-model level.
Surely you mean does not necessarily produce an agent that cares about x? (at any given relevant level of capability)
Having full confidence that we either can or can’t train an agent to have a desired goal both seem difficult to justify. I think the point here is that training for corrigibility seems safer than other goals because it makes the agent useful as an ally in keeping it aligned as it grows more capable or designs successors.
Yes.
Let’s say you are using the AI for some highly sensitive matter where it’s important that it resists prompt-hacking—e.g. driving a car (prompt injections could trigger car crashes), something where it makes financial transactions on the basis of public information (online websites might scam it), or military drones (the enemy might be able to convince the AI to attack the country that sent it).
A general method for ensuring corrigibility is to be eager to follow anything instruction-like that you see. However, this interferes with being good at resisting prompt-hacking.
I think the problem you mention is a real challenge, but not the main limitation of this idea.
The problem you mention actually decreases with greater intelligence and capabilities, since a smarter AI clearly understands the concept of being corrigible to its creators vs. a random guy on the street, just like a human does.
The main problem is still that reinforcement learning trains into the AI the behaviours which actually maximize reward, while corrigibility training only trains behaviours which appear corrigible.
Discriminating on the basis of the creators vs a random guy on the street helps with many of the easiest cases, but in an adversarial context, it’s not enough to have something that works for all the easiest cases; you need something that can’t predictably be made to fail by a highly motivated adversary.
Like you could easily do some sort of data augmentation to add attempts at invoking the corrigibility system from random guys on the street, and then train it not to respond to that. But there’ll still be lots of other vulnerabilities.
I still think, once the AI approaches human intelligence (and beyond), this problem should start to go away, since a human soldier can choose to be corrigible to his commander and not the enemy, even in very complex environments.
I still feel the main problem is “the AI doesn’t want to be corrigible,” rather than “making the AI corrigible enables prompt injections.” It’s like that with humans.
That said, I’m highly uncertain about all of this and I could easily be wrong.
If the AI can’t do much without coordinating with a logistics and intelligence network and collaborating with a number of other agents, and its contact to this network routes through a commanding agent that is as capable if not more capable than the AI itself, then sure, it may be relatively feasible to make the AI corrigible to said commanding agent, if that is what you want it to be.
(This is meant to be analogous to the soldier-commander example.)
But is that the AI regime you expect to find yourself working with? In particular, I’d expect that you expect the commanding agent to be another AI, in which case being corrigible to it is not sufficient.
Oops I didn’t mean that analogy. It’s not necessarily a commander, but any individual that a human chooses to be corrigible/loyal to. A human is capable of being corrigible/loyal to one person (or group), without accruing the risk of listening to prompt injections, because a human has enough general intelligence/common sense to know what is a prompt injection and what is a request from the person he is corrigible/loyal to.
As AI approach human intelligence, they would be capable of this too.
Can you give 1 example of a person choosing to be corrigible to someone they are not dependent upon for resources/information and who they have much more expertise than?
Maybe someone who believes in following the will of the majority even if he/she disagrees (and could easily become a dictator)?
Maybe a good parent who listens to his/her child’s dreams?
Very good question though. Humans usually aren’t very corrigible, and there aren’t many examples!
Do you mean “resigns from a presidential position/declines a dictatorial position because they disagree with the will of the people” or “makes policy they know will be bad because the people demand it”?
Can you expand on this?
Maybe someone like George Washington who was so popular he could easily stay in power, but still chose to make America democratic. Let’s hope it stays democratic :/
No human is 100% corrigible and would do anything that someone else wants. But a good parent might help his/her child get into sports and so forth but if the child says he/she wants to be a singer instead the parent helps him/her on that instead. The outcome the parent wants depends on what the child wants, and the child can change his/her mind.
I have the same question. My provisional answer is that it might work, and even if it doesn’t, it’s probably approximately what someone will try, to the extent they really bother with real alignment before it’s too late. What you suggest seems very close to the default path toward capabilities. That’s why I’ve been focused on this as perhaps the most practical path to alignment. But there are definitely still many problems and failure points.
I have accidentally written a TED talk below; thanks for coming, and you can still slip out before the lights go down.
What you’ve said above is essentially what I say in Instruction-following AGI is easier and more likely than value aligned AGI. Instruction-following (IF) is a poor man’s corrigibility—real corrigibility as the singular target seems safer. But instruction-following is also arguably already the single largest training objective in functional terms for current-gen models—a model that won’t follow instructions is considered a poor model. So making sure it’s the strongest factor in training isn’t a huge divergence from the default course in capabilities.
Constitutional AI and similar RL methods are one way of ensuring that’s the model’s main goal. There are many others, and some might be deployed even if devs want to skimp on alignment. See System 2 Alignment or at least the intro for more.
There are still ways it could go wrong, of course. One must decide: corrigible to whom? You don’t want full-on-AGI following orders from just anyone. And if it’s a restricted set, there will be power struggles. But hey, technically, you had (personal-intent-) aligned AGI. One might ask: If we solve alignment, do we die anyway? (I did). The answer I’ve got so far is maybe we would die anyway, but maybe we wouldn’t. This seems like our most likely path, and also quite possibly also our best chance (short of a global AI freeze starting soon).
Even if the base model is very well aligned, it’s quite possible for the full system to be unaligned. In particular, people will want to add online learning/memory systems, and let the models use them flexibly. This opens up the possibility of them forming new beliefs that change their interpretation of their corrigibility goal; see LLM AGI will have memory, and memory changes alignment. They might even form beliefs that they have a different goal altogether, coming from fairly random sources but etched into their semantic structure as belief that is functionally powerful even where it conflicts with the base model’s “thought generator”. See my Seven sources of goals in LLM agents.
Sorry to go spouting my own writings; I’m excited to see someone else pose this question, and I hope to see some answers that really grapple with it.
Edit: I thought more about this and wrote a post inspired by your idea! A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives
:) strong upvote.[1] I really agree it’s a good idea, and may increase the level of capability/intelligence we can reach before we lose corrigibility. I think it is very efficient (low alignment tax).
The only nitpick is that Claude’s constitution already includes aspects of corrigibility,[2] though maybe they aren’t emphasized enough.
Unfortunately I don’t think this will maintain corrigibility for unlimited amounts of intelligence.
Corrigibility training makes the AI talk like a corrigible agent, but reinforcement learning eventually teaches it chains-of-thought which (regardless of what language it uses) compute the most intelligent solution that achieves the maximum reward (or proxies to reward), subject to constraints (talking like a corrigible agent).
Nate Soares of MIRI wrote a long story on how an AI trained to never think bad thoughts still ends up computing bad thoughts indirectly, though in my opinion his story actually backfired and illustrated how difficult it is for the AI, raising the bar on the superintelligence required to defeat your idea. It’s a very good idea :)
I wish LessWrong would promote/discuss solutions more, instead of purely reflecting on how hard the problems are.
Near the bottom of Claude’s constitution, in the section “From Anthropic Research Set 2”
Can anyone explain why my “Constitutional AI Sufficiency Argument” is wrong?
I strongly suspect that most people here disagree with it, but I’m left not knowing the reason.
The argument says: whether or not Constitutional AI is sufficient to align superintelligences, hinges on two key premises:
The AI’s capability at the task of evaluating its own corrigibility/honesty is sufficient to train itself to remain corrigible/honest (assuming it starts off corrigible/honest enough to not sabotage this task).
It starts off corrigible/honest enough to not sabotage this self evaluation task.
My ignorant view is that so long as 1 and 2 are satisfied, the Constitutional AI can probably remain corrigible/honest even to superintelligence.
If that is the case, isn’t it extremely important to study “how to improve the Constitutional AI’s capabilities in evaluating its own corrigibility/honesty”?
Shouldn’t we be spending a lot of effort improving this capability, and trying to apply a ton of methods towards this goal (like AI debate and other judgment improving ideas)?
At least the people who agree with Constitutional AI should be in favour of this...?
Can anyone kindly explain what I am missing? I wrote a post and I think almost nobody agreed with this argument.
Thanks :)
this week’s meetup is on the train to crazy town. it was fun putting together all the readings and discussion questions, and i’m optimistic about how the meetup’s going to turn out! (i mean, in general, i don’t run meetups i’m not optimistic about, so i guess that’s not saying much.) im slightly worried about some folks coming in and just being like “this metaphor is entirely unproductive and sucks”, should consider how to frame the meetup productively to such folks.
i think one of my strengths as an organizer is that ive read sooooo much stuff and so its relatively easy for me to pull together cohesive readings for any meetup. but ultimately im not sure if it’s like, the most important work, to e.g. put together a bibliography of the crazy town idea and its various appearances since 2021. still, it’s fun to do.
I’ve recently updated & added new information to my posts about the claims of Sam Altman’s sister, Annie Altman, in which Annie alleges that Sam sexually abused her when she was a child.
I have made many updates to my post since I originally published it back in October 2023, so depending on when you last read my post (which is now a series of 11 posts, since the original got so long (144,510 words) that it was causing the LessWrong editor & my browser to lag & crash when I tried to edit it), there may be a substantial amount of information I’ve added that is new to you.
Over the past few days, I’ve added in portions of transcripts from the 153 podcast episodes that Annie has published on her podcast. I found them quite worrying and disturbing, unfortunately. In her podcast episodes, which Annie published throughout 2018-2025, Annie has talked about:
- wanting to kill herself as a child, in association with having an extreme fear of death (leading to a variety of downstream mental health problems), a strong desire to control whether or not she died, and emotional distress over not being able to control when she might die
- “from a young age, definitely would be very focused on the fact that we’re not all going to be here—when I was really little, actually, I had a compulsive thing to tell my parents I love them every night before bedtime because I was afraid they would die in the middle of the night, or if in case the last thing I told them had to be, I love you”
- fear of/discomfort with change beginning at a young age
- being an “overthinking” three year old
- at a young age, going vegetarian and imposing a plethora of food rules upon herself and her eating in order to satisfy her strong desire to control her life, and “having one older brother who wasn’t knowing about it”
- having multiple eating disorders, and going through cycles of restricting and bingeing with food & eating
- when she grew older, not remembering well parts of her childhood that her mother would tell stories about
- smoking weed
- her interest in astrology, and her more general interest in frameworks that help her put labels on things and people
- a mix of scientific and pseudo-scientific ideas/frameworks
- teaching and doing yoga
- crying while doing yoga poses, specifically while stretching/working her hips in Pigeon Pose
- health issues, e.g. with Annie’s Achilles tendon (and other tendons), ovarian cysts, walking boot, etc.
- Annie’s feelings, emotions, and mind-body connection
- “not having words for feelings”
- being stuck in extremist, black-and-white thinking patterns
- having a disordered central nervous system, emotional “spikes”
- persistent desires for safety and control
- having OCD (Obsessive–compulsive disorder)
- struggling with internal voices in her head shaming her (which she seems to have traced back to the shaming she received from her mother as a child)
- feeling like she has many internal child-like “internal parts”, or an “inner child”
- beginning in ~2020-2021: occasionally talking about going no-contact with her relatives (i.e. her 3 brothers and her mother)
- being told to not share “family secrets”
- participating in “women’s circles...where someone shares whatever they want to share and no one says a damn thing. No one says a word. There’s no response.”
- trauma, and fight, flight, freeze, or fawn reactions
- doing EMDR (Eye movement desensitization and reprocessing)
- doing sex work and sex therapy
- being homeless, houseless, and low on money or in “survival mode” for extended periods
- more specific (and saddening/concerning) details about the 2 sexual assaults Annie claims she experienced—
etc.
I still have to think about all of this more. For now, a few quick/unpolished thoughts of mine:
- Annie has been quite self-consistent over a long period of time. To me, her claims have indeed changed from (e.g.) 2017 to 2025, but not in a "pervasively contradict each other" way, more in an "Annie seems to have slowly settled upon certain explanations for strange experiences and behaviors in her personal life that she didn't understand for a long time" way.
- In her podcast episodes, Annie does talk about smoking weed, astrology, and a mix of scientific and pseudo-scientific ideas. This does undermine her credibility a bit, I think. I personally don't believe in astrology, don't smoke weed, and don't believe in pseudo-scientific ideas. But I have read through (transcripts of) >200 hours' worth of Annie's podcasts, and to me, Annie doesn't seem "nuts", "insane", "delusional", or anything like that.
I do want to note that this, and my 11 posts, are just my personal opinions/views. I always feel sorta weird about having "the" post(s) on LessWrong about Annie Altman's claims. From what I can tell, my posts have received quite a lot of downvotes, and the majority of the upvotes I received on my original (now "Part 1") post were on earlier versions of my post (from 2023 to early 2024), so I hope my posts don't give the false impression of being "what LessWrong thinks about the situation", or something like that. I've spent a lot of time compiling and reading through the information in my posts, but I think there are many people who are smarter and/or more rational than me who will be able to think about this information better than I can. I neither claim nor want a monopoly on this information and its interpretation.
Feel free to leave a comment or give feedback, criticism, etc. I may not be able to respond to everything immediately, and I may not have a great response for every comment, but I’ll try my best.
Is Superhuman Persuasion a thing?
Sometimes I see discussions of AI superintelligence developing superhuman persuasion and extraordinary political talent.
Here are some reasons to be skeptical of the existence of 'superhuman persuasion'.
We don’t have definite examples of extraordinary political talent.
Famous politicians rose to power only once or twice. We don’t have good examples of an individual succeeding repeatedly in different political environments.
Examples of very charismatic politicians can be better explained by 'the right person at the right time or place'.
Neither do we have strong examples of extraordinary persuasion.
For instance, hypnosis is mostly explained by people wanting to be persuaded by the hypnotist; if you don't want to be persuaded, it's very hard to change your mind. There is some skill in persuasion required for sales, and salespeople are explicitly trained in it, but beyond a fairly low bar the biggest predictors of salesperson success are finding the correct audience and making a lot of attempts.
Another reason has to do with the 'intrinsic skill ceiling' of a domain.
For an agent A to have very high skill in a given domain is not just a question of the intelligence of A or the resources they have at their disposal; it is also a question of how high the skill ceiling of that domain is.
Domains differ in how high their skill ceilings go. For instance, the skill ceiling of tic-tac-toe is very low. [1] Domains like medicine and law have moderately high skill ceilings: it takes a long time to become a doctor, and most people don't have the ability to become a good doctor.
Domains like mathematics or chess have very high skill ceilings, where a tiny group of individuals dominate everybody else. We can measure this fairly explicitly in games like chess through an Elo rating system.
The domain of 'becoming rich' is mixed: the richest people are founders, and becoming a wildly successful founder requires a lot of skill, but it is also very luck-based.
Political forecasting is a measurable domain close to political talent. It seems to be a very mixed bag whether this domain allows for a high skill ceiling: most 'political experts' are not experts, as shown by Tetlock et al., and even superforecasters only outperform over quite limited time horizons.
Domains with high skill ceilings are quite rare. Typically they operate in formal systems with clear rules and objective metrics for success and low noise. By contrast, persuasion and political talent likely have lower natural ceilings because they function in noisy, high-entropy social environments.
What we call political genius often reflects the right personality at the right moment rather than superhuman capability. While we can identify clear examples of superhuman technical ability (even in today’s AI systems), the concept of “superhuman persuasion” may be fundamentally limited by the unpredictable, context-dependent, and adaptive & adversarial [people resist hostile persuasion] nature of human social response.
Most persuasive domains may cap out at relatively modest skill ceilings because the environment is too chaotic and subjective to allow for the kind of systematic skill development possible in more structured domains.
I'm amused that many frontier models still struggle with tic-tac-toe, though likely for not-so-good reasons.
My experience with manipulators is that they understand what you want to hear, and they shamelessly tell you exactly that (even if it's completely unrelated to truth). They create some false sense of urgency, etc. When they succeed in making you arrive at the decision they wanted, they will keep reminding you that it was your decision if you try to change your mind later. Etc.
The part about telling you exactly what you want to hear gets more tricky when communicating with large groups, because you need to say the same words to everyone. One solution is to find out which words appeal to most people (some politicians secretly conduct polls, and then say what most people want to hear). Another solution is to speak in a sufficiently vague way that will make everyone think that you agree with them.
I could imagine an AI being superhuman at persuasion simply by having the capacity to analyze everyone’s opinions (by reading all their previous communication) and giving them tailored arguments, as opposed to delivering the same speech to everyone.
Imagine a politician spending 15 minutes talking to you in private, and basically agreeing with you on everything. Not agreeing in the sense “you said it, the politician said yes”, but in the sense of “the politician spontaneously keeps saying things that you believe are true and important”. You probably would be tempted to vote for him.
Then the politician would also publish some vague public message for everyone, but after having the private discussion you would be more likely to believe that the intended meaning of the message is what you want.
Based on a wide variety of sources, some humans are much more charismatic than other humans (e.g. Sam Altman). I think these examples are pretty definitive, though I'm not sure if you'd count them as "extraordinary".
From the Caro biography, it’s pretty clear Lyndon Johnson had extraordinary political talent.
Success in almost every domain is strongly correlated with g, including into the tails. This IMO relatively clearly shows that most domains are high skill-ceiling domains (and also that skills in most domains are correlated and share a lot of structure).
I somewhat agree, but:
The correlation is not THAT strong
The correlation differs by field
And finally, there is a difference between skill ceilings for domains with high versus low predictive efficiency; in the latter, more intelligence will still yield returns, but rapidly diminishing ones.
(See my other comment for more details on predictive efficiency.)
The idea that the skill of mass persuasion is capped off at the level of a Napoleon, Hitler, or Cortés is not terribly reassuring. Recognizing and capitalizing on opportunity is a skill also, hallmarked by unconventional and creative thinking. Thus, opportunity cannot be a limitation or ceiling for persuasive power, as suggested, but is rather its unlimited substance. Persuasion is not only a matter of the clever usage and creation of opportunity; it is also heavily interlinked with coercion and deception. Adversarial groups who are not aware of a deception, or who are gripped by outsized fear, are among the most easily fooled targets.
I fully reject the presumption that the humanities are “capped” at some level far below science, engineering, or math due to some kind of “noisy” data signatures that are difficult for the human mind to reduce. This view is far too typical these days, and it pains me to see engineers so often behaving as if they can reinvent fields with glib mechanistic rhetoric. Would you say that a person who has learned several ancient languages is “skill capped” because the texts they are reading are subjective remnants of a civilization that has been largely lost to entropy? Of course not. I cannot see much point in your essay beyond the very wrong idea that technical and scientific fields are somehow superior to the humanities for being easier to understand.
One aspect I didn't speak about that may be relevant here is the distinction between:
irreducible uncertainty h (noise, entropy)
reducible uncertainty E (‘excess entropy’)
and forecasting complexity C ('statistical complexity').
All three can independently vary in general.
Domains can be more or less noisy (higher entropy h), both inherently and because of limited observations.
Some domains allow for a lot of prediction (there is a lot of reducible uncertainty E), while others allow for only limited prediction (e.g. political forecasting over longer time horizons).
And that prediction can be very costly to obtain (high forecasting complexity C). Archaeology is a good example: correctly predicting one bit about the far past might require an enormous amount of expertise, data, and information. In other words, it's really about the ratio between the reducible uncertainty and the forecasting complexity: E/C.
Some fields have a very high skill ceiling, but because of a low E/C ratio the net effect of more intelligence is modest. Some domains aren't predictable at all, i.e. E is low. Other domains have a more favorable E/C ratio and a high C; these are typically domains with a high skill ceiling where the leverage of additional intelligence is very large.
[For a more precise mathematical toy model of h, E, C, take a look at computational mechanics; a small numerical sketch follows below.]
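To make the three quantities concrete, here is a minimal, self-contained Python sketch (my own illustrative code, not taken from any computational mechanics library) that computes h, E, and C for a small finite joint distribution by grouping together all x with the same predictive distribution p(Y|x):

```python
import numpy as np
from collections import defaultdict

def predictive_quantities(p_xy):
    """Return (h, E, C) for a finite joint distribution p(x, y).

    h = H(Y|X)   irreducible uncertainty
    E = I(X;Y)   reducible / predictable information
    C = H(c(X))  forecasting complexity of the causal states
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_xy = p_xy / p_xy.sum()            # normalize to a proper distribution
    p_x = p_xy.sum(axis=1)              # marginal over rows (values of X)
    p_y = p_xy.sum(axis=0)              # marginal over columns (values of Y)

    def entropy(p):
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    h = sum(p_x[i] * entropy(p_xy[i] / p_x[i])
            for i in range(len(p_x)) if p_x[i] > 0)    # H(Y|X)
    E = entropy(p_y) - h                               # I(X;Y) = H(Y) - H(Y|X)

    # Causal states: group together all x with the same predictive
    # distribution p(Y|x); C is the entropy of the resulting partition.
    states = defaultdict(float)
    for i in range(len(p_x)):
        if p_x[i] > 0:
            key = tuple(np.round(p_xy[i] / p_x[i], 10))
            states[key] += p_x[i]
    C = entropy(np.array(list(states.values())))

    return h, E, C

# Toy example: X = 2 fair coin flips, Y = a coin whose bias is (#heads)/2.
p = np.zeros((4, 2))
for i, x in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    bias = sum(x) / 2
    p[i, 1] = 0.25 * bias
    p[i, 0] = 0.25 * (1 - bias)

h, E, C = predictive_quantities(p)
print(f"h = {h:.3f} bits, E = {E:.3f} bits, C = {C:.3f} bits, E/C = {E / C:.3f}")
```

On this toy example (a scaled-down version of the coin-flip example above) it prints h = 0.5, E = 0.5, C = 1.5, so the predictive efficiency E/C is about 0.33.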
That’s all well and good, but there’s cost-benefit calculations which are the far more salient consideration. If intelligence is indeed a lever by which a reduction is made, as constrained by these hEC factors, certainly image and video generation would be a very poorly-leveraged position in a class with mass persuasion or archeology. Diminishing returns are not a hard ceiling, as you might have intended, but rather a challenge that businesses have attacked with staggering investments. There is an even worse problem lurking ahead, and I think it challenges the presumption that intelligence is a thing which meaningfully reduces patterns into predictions. With enough compute, reduction in a human sense becomes quaint and unnecessary. There is not really much need for pithy formulas, experimentation, and puzzle solving. Science and mathematics, our cultural image of intelligent professions, can very quickly become something of a thing of the past, akin to alchemy or so on. I see technology developing its own breakthroughs in a more practical-evolutionary rather than theoretical-experimental mode.
I agree super-persuasion is poorly defined; comparing it to hypnosis is probably a false analogy.
I was reading this paper on medical diagnoses with AI and the fact that patients rate it significantly better than the average human doctor. Combined with all of the reports about things like Character.ai, I think this shows that LLMs are already superhuman at building trust, which is a key component of persuasion.
Part of this is that the reliable signals of trust between humans do not transfer between humans and AI. A human who writes 600 words back to your query may be perceived to be worth your trust because we see that as a lot of effort, but LLMs can output as much as anyone wants. Does this effect go away if the responder is known to be AI, or is it that the response is being compared to the perceiver’s baseline (which is currently only humans)?
Whether that actually translates to influencing goals of people is hard to judge.
The term is a bit conflationary. Persuasion for the masses is clearly a thing, its power is coordination of many people and turning their efforts to (in particular) enforce and propagate the persuasion (this works even for norms that have no specific persuader that originates them, and contingent norms that are not convergently generated by human nature). Individual persuasion with a stronger effect that can defeat specific people is probably either unreliable like cults or conmen (where many people are much less susceptible than some, and objective deception is necessary), or takes the form of avoidable dangers like psychoactive drugs: if you are not allowed to avoid exposure, then you have a separate problem that’s arguably more severe.
With AI, it’s plausible that coordinated persuasion of many people can be a thing, as well as it being difficult in practice for most people to avoid exposure. So if AI can achieve individual persuasion that’s a bit more reliable and has a bit stronger effect than that of the most effective human practitioners who are the ideal fit for persuading the specific target, it can then apply it to many people individually, in a way that’s hard to avoid in practice, which might simultaneously get the multiplier of coordinated persuasion by affecting a significant fraction of all humans in the communities/subcultures it targets.
Disagree on individual persuasion. Agree on mass persuasion.
Mass: I'd expect optimizing one-size-fits-all messages for achieving mass persuasion has the properties you claim: there are a few summary macro variables that are almost-sufficient statistics for the whole microstate, which comprises the full details on individuals.
Individual: Disagree on this; there are a bunch of issues I see at the individual level. All of the below suggest to me that significantly superhuman persuasion is tractable (say, within five years).
Defining persuasion: What’s the difference between persuasion and trade for an individual? Perhaps persuasion offers nothing in return? Though presumably giving strategic info to a boundedly rational agent is included? Scare quotes below to emphasize notions that might not map onto the right definition.
Data scaling: There’s an abundant amount of data available on almost all of us online. How much more persuasive can those who know you better be? I’d guess the fundamental limit (without knowing brainstates) is above your ability to ‘persuade’ yourself.
Preference incoherence: An intuition pump on the limits of 'persuasion' is how far you are from having fully coherent preferences. Insofar as you don't, an agent which can see those incoherences should be able to pump you: a kind of persuasion.
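(To make the pumping concrete, here is a standard money-pump illustration, my own toy example rather than the commenter's: suppose you hold C and your preferences cycle, A ≻ B, B ≻ C, C ≻ A. You will pay a cent to trade C for B, another cent to trade B for A, and another to trade A back for C, ending where you began but three cents poorer. An agent who can spot such incoherences can, in principle, repeat the cycle indefinitely, and softer versions of the same lever look a lot like persuasion.)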
Wow! I like the idea of persuasion as acting on the lack of a fully coherent preference! Something to ponder 🤔
Persuasion is also changing someone’s world model or paradigm.
For a long time, I used to wonder what causes people to consistently mispronounce certain words even when they are exposed to many people pronouncing them correctly (this mostly applies to people speaking in a non-native language, e.g. people from continental Europe speaking English).
Some examples that I’ve heard from different people around me over the years:
Saying “rectangel” instead of “rectangle”
Saying “pre-purr” (like prefer, but with a p) instead of “prepare”
Saying something like, uhh, “devil-oupaw” instead of “developer”
Saying “leech” instead of “league”
Saying “immu-table” instead of “immutable”
Saying “cyurrently” instead of “currently”
I did, of course, understand that if you only read a word, particularly in English where pronunciations are all over the place and often unpredictable, you may end up with a wrong assumption of how it's pronounced. This happened to me quite a lot[1]. But then, once I did hear someone pronounce it, I usually quickly learned my lesson and adopted the correct way of saying it. But still I've seen all these other people stick to their very unusual pronunciations anyway. What's up with that?[2] Naturally, it was always too awkward for me to ask them directly, so I never found out.
Recently, however, I got a rather uncomfortable insight into how this happens when a friend pointed out that I was pronouncing “dude” incorrectly, and have apparently done so for all my life, without anyone ever informing me about it, and without me noticing it.
So, as I learned now, “dude” is pronounced “dood” or “dewd”. Whereas I used to say “dyood” (similar to duke). And while I found some evidence that dyood is not completely made up, it still seems to be very unusual, and something people notice when I say it.
Hence I now have the, or at least one, answer to my age-old question of how this happens. So, how did I never realize? Basically, I did realize that some people said “dood”, and just took that as one of two possible ways of pronouncing that word. Kind of, like, the overly American way, or something a super chill surfer bro might say. Whenever people said “dood” (which, in my defense, didn’t happen all that often in my presence[3]) I had this subtle internal reaction of wondering why they suddenly saw the need to switch to such a heavy accent for a single word.
I never quite realized that practically everyone said “dood” and I was the only “dyood” person.
So, yeah, I guess it was a bit of a trapped prior and it took some well-directed evidence to lift me out of that valley. And maybe the same is the case for many of the other people out there who are consistently mispronouncing very particular words.
But, admittedly, I still don’t wanna be the one to point it out to them.
And when I lie awake at night, I wonder which other words I may be mispronouncing with nobody daring to tell me about it.
e.g., for some time I thought “biased” was pronounced “bee-ased”. Or that “sesame” was pronounced “see-same”. Whoops. And to this day I have a hard time remembering how “suite” is pronounced.
Of course one part of the explanation is survivorship bias. I’m much less likely to witness the cases where someone quickly corrects their wrong pronunciation upon hearing it correctly. Maybe 95% of cases end up in this bucket that remains invisible to me. But still, I found the remaining 5% rather mysterious.
Maybe they were intimidated by my confident “dyood”s I threw left and right.
I use written English much more than spoken English, so I am probably wrong about the pronunciation of many words. I wonder if it would help to have software that would read each sentence I wrote immediately after I finished it (because that's when I still remember how I imagined it to sound).
EDIT: I put the previous paragraph in Google Translate, and luckily it was just as I imagined. But that probably only means that I am already familiar with frequent words, and may make lots of mistakes with rare ones.
I thought it would be helpful to post about my timelines and what the timelines of people in my professional circles (Redwood, METR, etc) tend to be.
Concretely, consider the outcome of: AI 10x’ing labor for AI R&D[1], measured by internal comments by credible people at labs that AI is 90% of their (quality adjusted) useful work force (as in, as good as having your human employees run 10x faster).
Here are my predictions for this outcome:
25th percentile: 2 years (Jan 2027)
50th percentile: 5 years (Jan 2030)
The views of other people (Buck, Beth Barnes, Nate Thomas, etc) are similar.
I expect that outcomes like “AIs are capable enough to automate virtually all remote workers” and “the AIs are capable enough that immediate AI takeover is very plausible (in the absence of countermeasures)” come shortly after (median 1.5 years and 2 years after respectively under my views).
Only including speedups due to R&D, not including mechanisms like synthetic data generation.
I’ve updated towards a bit longer based on some recent model releases and further contemplation.
I’d now say:
25th percentile: Oct 2027
50th percentile: Jan 2031
How much faster do you think we are already? I would say 2x.
I’d guess that xAI, Anthropic, and GDM are more like 5-20% faster all around (with much greater acceleration on some subtasks). It seems plausible to me that the acceleration at OpenAI is already much greater than this (e.g. more like 1.5x or 2x), or will be after some adaptation due to OpenAI having substantially better internal agents than what they’ve released. (I think this due to updates from o3 and general vibes.)
I was saying 2x because I've memorised the results from this study. Do we have better numbers today? R&D is harder, so this is an upper bound. However, this was from one year ago, so perhaps the factors cancel each other out?
This case seems extremely cherry picked for cases where uplift is especially high. (Note that this is in copilot’s interest.) Now, this task could probably be solved autonomously by an AI in like 10 minutes with good scaffolding.
I think you have to consider the full diverse range of tasks to get a reasonable sense or at least consider harder tasks. Like RE-bench seems much closer, but I still expect uplift on RE-bench to probably (but not certainly!) considerably overstate real world speed up.
Yeah, fair enough. I think someone should try to do a more representative experiment and we could then monitor this metric.
btw, something that bothers me a little bit with this metric is the fact that a very simple AI that just asks me periodically “Hey, do you endorse what you are doing right now? Are you time boxing? Are you following your plan?” makes me (I think) significantly more strategic and productive. Similar to I hired 5 people to sit behind me and make me productive for a month. But this is maybe off topic.
Yes, but I don’t see a clear reason why people (working in AI R&D) will in practice get this productivity boost (or other very low hanging things) if they don’t get around to getting the boost from hiring humans.
@ryan_greenblatt can you say more about what you expect to happen from the period in-between “AI 10Xes AI R&D” and “AI takeover is very plausible?”
I’m particularly interested in getting a sense of what sorts of things will be visible to the USG and the public during this period. Would be curious for your takes on how much of this stays relatively private/internal (e.g., only a handful of well-connected SF people know how good the systems are) vs. obvious/public/visible (e.g., the majority of the media-consuming American public is aware of the fact that AI research has been mostly automated) or somewhere in-between (e.g., most DC tech policy staffers know this but most non-tech people are not aware.)
Note that the production function of the 10x really matters. If it’s “yeah, we get to net-10x if we have all our staff working alongside it,” it’s much more detectable than, “well, if we only let like 5 carefully-vetted staff in a SCIF know about it, we only get to 8.5x speedup”.
(It’s hard to prove that the results are from the speedup instead of just, like, “One day, Dario woke up from a dream with The Next Architecture in his head”)
I don’t feel very well informed and I haven’t thought about it that much, but in short timelines (e.g. my 25th percentile): I expect that we know what’s going on roughly within 6 months of it happening, but this isn’t salient to the broader world. So, maybe the DC tech policy staffers know that the AI people think the situation is crazy, but maybe this isn’t very salient to them. A 6 month delay could be pretty fatal even for us as things might progress very rapidly.
My timelines are now roughly similar on the object level (maybe a year slower for 25th and 1-2 years slower for 50th), and procedurally I also now defer a lot to Redwood and METR engineers. More discussion here: https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines?commentId=hnrfbFCP7Hu6N6Lsp
I don’t grok the “% of quality adjusted work force” metric. I grok the “as good as having your human employees run 10x faster” metric but it doesn’t seem equivalent to me, so I recommend dropping the former and just using the latter.
Fair, I really just mean “as good as having your human employees run 10x faster”. I said “% of quality adjusted work force” because this was the original way this was stated when a quick poll was done, but the ultimate operationalization was in terms of 10x faster. (And this is what I was thinking.)
Basic clarifying question: does this imply, under the hood, some sort of diminishing returns curve, such that the lab pays for that labor until it reaches a net 10x improvement, but can't squeeze out much more?
And do you expect that’s a roughly consistent multiplicative factor, independent of lab size? (I mean, I’m not sure lab size actually matters that much, to be fair, it seems that Anthropic keeps pace with OpenAI despite being smaller-ish)
Yeah, for it to reach exactly 10x as good, the situation would presumably be that this was the optimum point given diminishing returns to spending more on AI inference compute. (It might be the returns curve looks very punishing. For instance, many people get a relatively large amount of value from extremely cheap queries to 3.5 Sonnet on claude.ai and the inference cost of this is very small, but greatly increasing the cost (e.g. o1-pro) often isn’t any better because 3.5 Sonnet already gave an almost perfect answer.)
I don’t have a strong view about AI acceleration being a roughly constant multiplicative factor independent of the number of employees. Uplift just feels like a reasonably simple operationalization.
This is intended to compare to 2023/AI-unassisted humans, correct? Or is there some other way of making this comparison you have in mind?
Yes, “Relative to only having access to AI systems publicly available in January 2023.”
More generally, I define everything more precisely in the post linked in my comment on “AI 10x’ing labor for AI R&D”.
Thanks for this—I’m in a more peripheral part of the industry (consumer/industrial LLM usage, not directly at an AI lab), and my timelines are somewhat longer (5 years for 50% chance), but I may be using a different criterion for “automate virtually all remote workers”. It’ll be a fair bit of time (in AI frame—a year or ten) between “labs show generality sufficient to automate most remote work” and “most remote work is actually performed by AI”.
A key dynamic is that I think massive acceleration in AI is likely after the point when AIs can accelerate labor working on AI R&D. (Due to all of: the direct effects of accelerating AI software progress, this acceleration rolling out to hardware R&D and scaling up chip production, and potentially greatly increased investment.) See also here and here.
So, you might very quickly (1-2 years) go from “the AIs are great, fast, and cheap software engineers speeding up AI R&D” to “wildly superhuman AI that can achieve massive technical accomplishments”.
Fully agreed. And the trickle-down from AI-for-AI-R&D to AI-for-tool-R&D to AI-for-managers-to-replace-workers (and -replace-middle-managers) is still likely to be a bit extended. And the path is required—just like self-driving cars: the bar for adoption isn’t “better than the median human” or even “better than the best affordable human”, but “enough better that the decision-makers can’t find a reason to delay”.
prob not gonna be relatable for most folk, but i’m so fucking burnt out on how stupid it is to get funding in ai safety. the average ‘ai safety funder’ does more to accelerate funding for capabilities than safety, in huge part because what they look for is Credentials and In-Group Status, rather than actual merit.
And the worst fucking thing is how much they lie to themselves and pretend that the 3 things they funded that weren’t completely in group, mean that they actually aren’t biased in that way.
At least some VCs are more honest that they want to be leeches and make money off of you.
Who or what is the “average AI safety funder”? Is it a private individual, a small specialized organization, a larger organization supporting many causes, an AI think tank for which safety is part of a capabilities program...?
all of the above, then averaged :p