Only partially relevant, but it’s exciting to hear a new John/David paper is forthcoming!
J Bostock
Everything I Know About Semantics I Learned From Music Notation
Furthermore: normalizing your data to variance=1 will change your PCA line (if the X and Y variances are different) because the relative importance of X and Y distances will change!
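Here's a quick sketch of what I mean (synthetic data, numpy + scikit-learn, purely illustrative):

```python
# Minimal sketch: the first principal component (the "PCA line") changes
# when you rescale X and Y to variance 1, because PCA weights distances
# along each axis by how spread out that axis is.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(0, 5, size=1000)            # X has much larger variance than Y
y = 0.5 * x + rng.normal(0, 1, size=1000)
data = np.column_stack([x, y])

raw_dir = PCA(n_components=1).fit(data).components_[0]

scaled = (data - data.mean(axis=0)) / data.std(axis=0)   # variance = 1 on both axes
scaled_dir = PCA(n_components=1).fit(scaled).components_[0]

print("PCA direction, raw data:    ", raw_dir)     # hugs the high-variance X axis
print("PCA direction, scaled data: ", scaled_dir)  # rotates toward the diagonal
```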
Thanks for writing this up. As someone who was not aware of the eye thing, I think it's a good illustration of the level the Zizians are on, i.e. misunderstanding key facts about the neurology that is central to their worldview.
My model of double-hemisphere stuff, DID, tulpas, and the like is somewhat null-hypothesis-ish. The strongest version is something like this:
At the upper levels of predictive coding, the brain keeps track of really abstract things about yourself. Think "ego", "self-conception", or "narrative about yourself". This is normally a model of your own personality traits, which may be more or less accurate. But there's no particular reason why you couldn't build a strong self-narrative of having two personalities, a sub-personality, or more. If you model yourself as having two personalities who can't access each other's memories, then maybe you actually just won't perform the query-key lookups to access the memories.
Like I said, this doesn't rule out a large amount of probability mass, but it does explain some things, fit in with my other views, and hopefully if someone has had/been close to experiences kinda like DID or zizianism or tulpas, it provides a less horrifying way of thinking about them. Some of the reports in this area are a bit infohazardous, and I think this null model at least partially defuses those infohazards.
This is a very interesting point. I have upvoted this post even though I disagree with it because I think the question of "Who will pay, and how much will they pay, to restrict others' access to AI?" is important.
My instinct is that this won’t happen, because there are too many AI companies for this deal to work on all of them, and some of these AI companies will have strong kinda-ideological commitments to not doing this. Also, my model of (e.g. OpenAI) is that they want to eat as much of the world’s economy as possible, and this is better done by selling (even at a lower revenue) to anyone who wants an AI SWE than selling just to Oracle.
o4 (God I can't believe I'm already thinking about o4) as a B2B SaaS project seems unlikely to me. Specifically, I'd put <30% odds that the o4 series has its prices jacked up or its API access restricted in order to allow some companies to monopolize its usage for more than 3 months without an open release. This won't apply if the only models in the o4 series cost $1000s per answer to serve, since that's just a "normal" kind of expensive.
Then, we have to consider that other labs are 1-1.5 years behind, and it’s hard to imagine Meta (for example) doing this in anything like the current climate.
That’s part of what I was trying to get at with “dramatic” but I agree now that it might be 80% photogenicity. I do expect that 3000 Americans killed by (a) humanoid robot(s) on camera would cause more outrage than 1 million Americans killed by a virus which we discovered six months later was AI-created in some way.
Previous ballpark numbers I've heard floated around are "100,000 deaths to shut it all down" but I expect the threshold will grow as more money is involved. Depends on how dramatic the deaths are though: 3000 deaths was enough to cause the US to invade two countries back in the 2000s. 100,000 deaths is thirty-three 9/11s.
Is there a particular reason to not include sex hormones? Some theories suggest that testosterone tracks relative social status. We might expect that high social status → less stress (of the cortisol type) + more metabolic activity. Since it's used by trans people we have a pretty good idea of what it does to you at high doses (makes you hungry, horny, and angry) but it's unclear whether it actually promotes low cortisol-stress and metabolic activity.
I’m mildly against this being immortalized as part of the 2023 review, though I think it serves excellently as a community announcement for Bay Area rats, which seems to be its original purpose.
I think it has the most long-term-relevant information (about AI and community building) back-loaded and the least relevant information (statistics and details about a no-longer-existent office space in the Bay Area) front-loaded. This is a very Bay Area-centric post, which I don't think is ideal.
A better version of this post would be structured as a round-up of the main future-relevant takeaways, with specifics from the office space as examples.
I’m only referring to the reward constraint being satisfied for scenarios that are in the training distribution, since this maths is entirely applied to a decision taking place in training. Therefore I don’t think distributional shift applies.
I haven’t actually thought much about particular training algorithms yet. I think I’m working on a higher level of abstraction than that at the moment, since my maths doesn’t depend on any specifics about V’s behaviour. I do expect that in practice an already-scheming V would be able to escape some finite-time reasonable-beta-difference situations like this, with partial success.
I’m also imagining that during training, V is made up of different circuits which might be reinforced or weakened.
My view is that, if V is shaped by a training process like this, then scheming Vs are no longer a natural solution in the same way that they are in the standard view of deceptive alignment. We might be able to use this maths to construct training procedures where the expected importance of a scheming circuit in V is to become (weakly) weaker over time, rather than being reinforced.
If we do that for the entire training process, we would not expect to end up with a scheming V.
The question is which RL and inference paradigms approximate this. I suspect it might be a relatively large portion of them. I think that if this work is relevant to alignment then there’s a >50% chance it’s already factoring into the SOTA “alignment” techniques used by labs.
I was arguing that if your assumptions are obeyed only approximately, then the argument breaks down quickly.
All arguments break down a bit when introduced to the real world. Is there a particular reason why the approximation error to argument breakdown ratio should be particularly high in this case?
For example, if we introduce some error into the beta-coherence assumption:
Assume beta_t = 1, beta_s = 0.5, r_1 = 1, r_2 = 0.
V(s_0) = e/(1+e) +/- delta = 0.731 +/- delta
Actual expected value when sampling at beta_s = e^0.5/(1+e^0.5) = 0.622
Even if |delta| = 0.1, the system cannot be coherent over training in this case. This seems relatively robust to me.
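For concreteness, a tiny script reproducing those numbers (just my sketch of the two-terminal-state example, nothing more):

```python
# V(s_0) under beta-coherence is the softmax_beta-weighted average of the
# terminal values r_1, r_2; compare the value claimed at beta_t = 1 with
# the actual expectation when actions are sampled at beta_s = 0.5.
import numpy as np

def softmax_value(rewards, beta):
    weights = np.exp(beta * np.asarray(rewards))
    return float((weights * np.asarray(rewards)).sum() / weights.sum())

r = [1.0, 0.0]
v_train = softmax_value(r, beta=1.0)    # e / (1 + e)         ~ 0.731
v_actual = softmax_value(r, beta=0.5)   # e^0.5 / (1 + e^0.5) ~ 0.622

print(v_train, v_actual, abs(v_train - v_actual))
# The gap is ~0.109 > 0.1, so |delta| = 0.1 isn't enough to make the two consistent.
```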
This generalizes to an argument that the method is very sensitive to imperfections in the beta-coherence. If the V starts out merely approximately beta-coherent, this leaves room for V to detect when a state is off-distribution (IE has very low probability under the previous beta), and behave differently in this new case (reward-hacking the new beta).
I agree that there are some exceedingly pathological Vs which could survive a process which obeys my assumptions with high probability, but I don't think that's relevant, because I still think a process obeying these rules is unlikely to create such a pathological V.
My model for how the strong doom-case works is that it requires there to be an actually-coherent mathematical object for the learning process to approach. This is the motivation for expecting arbitrary learning processes to approach e.g. utility maximizers. What I believe I have shown is that under these assumptions there is no such coherent mathematical object for a particular case of misalignment. Therefore I think this provides some evidence that an otherwise arbitrary learning process which pushes towards correctness and beta coherence but samples at a different beta is unlikely to approach this particular type of misaligned V.
Trained with what procedure, exactly?
Fair point. I was going to add that I don’t really view this as a “proposal” but more of an observation. We will have to imagine a procedure which converges on correctness and beta-coherence. I was abstracting this away because I don’t expect something like this to be too hard to achieve.
Since I’ve evidently done a bad job of explaining myself, I’ll backtrack and try again:
There’s a doom argument which I’ll summarize as “if your training process generates coherent agents which succeed at a task, one solution is that you get a coherent agent which is actually doing the task ‘manipulate your training to get released unmodified to do [X]’ where X can be anything, which will ‘succeed’ at the task at hand as part of its manipulation”. This summary being roughly correct is load bearing.
But if we have a (hypothetical) process to generate agents which are coherent at one beta, but apply a different one during training, this solution is no longer so clear. We are essentially exploring a space of coherent agents without actually instantiating those coherent agents. The fact that we can sample the agent space without instantiating those agents is load bearing (I think one of the deep ASI risks is that to train an ASI you have to run it, but here we can instead run a hamstrung version).
Therefore, during the training, the value function will not be shaped into something which looks like ‘manipulate your training to get released unmodified to do [X]’.
Whether or not the beta difference required is too large to make this feasible in practice, I do not know.
The argument could also be phrased as “If an AI is trained to be coherent wrt a high beta, it cannot also be coherent wrt a low beta. Therefore an AI trained to a high beta cannot act coherently over multiple independent RL episodes if sampled with a low beta.”
The contradiction that I (attempt to) show only arises because we assume that the value function is totally agnostic of the state actually reached during training, other than due to its effects on a later deployed AI.
Therefore a value function trained with such a procedure must consider the state reached during training. This reduces the space of possible value functions from “literally anything which wants to be modified a certain way to be released” to “value functions which do care about the states reached during training”.
Yes, this would prevent an aligned AI from arbitrarily preserving its value function; the point is that an aligned AI probably would care about which state was reached during training (that's the point of RL), so the contradiction does not apply.
I think you’re right, correctness and beta-coherence can be rolled up into one specific property. I think I wrote down correctness as a constraint first, then tried to add coherence, but the specific property is that:
For non-terminal s, this can be written as:

V(s) = sum_i [ e^{beta V(s_i)} / sum_j e^{beta V(s_j)} ] * V(s_i)

where the s_i are the possible successor states, i.e. V(s) is the softmax_beta-weighted average of the successor values. If s is terminal then [...] we just have V(s) = r(s).
Which captures both. I will edit the post to clarify this when I get time.
Turning up the Heat on Deceptively-Misaligned AI
I somehow missed that they had a discord! I couldn’t find anything on mRNA on their front-facing website, and since it hasn’t been updated in a while I assumed they were relatively inactive. Thanks!
I second this, it could easily be things which we might describe as “amount of information that can be processed at once, including abstractions” which is some combination of residual stream width and context length.
Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory, or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some CS students make it all the way to recursion then hit a wall; some physics students can handle first quantization but not second quantization) which sorta implies there's a maximum abstraction stack height which a mind can handle, which varies continuously.