I think about AI alignment; send help.
James Payor
It may be that talking about “vested equity” is avoiding some lie that would occur if he made the same claim about the PPUs. If he did mean to include the PPUs as “vested equity” presumably he or a spokesperson could clarify, but I somehow doubt they will.
Hello! I’m glad to read more material on this subject.
First I want to note that it took me some time to understand the setup, since you're working with a different notion of maximal lottery-lotteries from the one Scott wrote about. This made it unclear to me what was going on until I'd read through a good chunk and pieced it together, and it also changes the meaning of the post's title.
For that reason I’d like to recommend adding something like “Geometric” in your title. Perhaps we can then talk about this construction as “Geometric Maximal Lottery-Lotteries”, or “Maximal Geometric Lottery-Lotteries”? Whichever seems better!
It seems especially important to distinguish the names because the geometric version seems to behave differently from the linear one. (It treats the voters differently, and perhaps has fewer or different difficulties with existence, stability, and effective computation.)
With that out of the way, I'm a tentative fan of the geometric version, though I have more to unpack about what it means. I'll divide my thoughts & questions into a few sections below. I am likely confused on several points, and my apologies if my writing is unclear; please ask followup questions where interesting!
Underlying models of power for majoritarian vs geometric
When reading the earlier sequence I was struck by how unwieldy the linear/majoritarian formulation ends up being! Specifically, it seemed that the full maximal lottery-lottery would need to encode all of the competing coordination cliques in the outer lottery, but then these are unstable to small perturbations that shift coordination options from below-majority to above-majority. This seemed like a real obstacle to effectively computing approximations, and if I understand correctly it is what causes the discontinuity that breaks the Nash-equilibria-based existence proof.
My thought then about what might make more sense was a model of "war"/"power" in which votes against directly cancel out votes for. So in the case of an even split we get zero utility rather than whatever the majority's utility would be. My hope was that this was a more realistic model of how power works, one that would also be stable to small perturbations and lend more weight to outcomes preferred by supermajorities. I never cashed this out fully though, since I didn't find an elegant justification and lost interest.
So I haven't thought this part through much (yet), but your model here, in which we take a geometric expectation, looks like a bargaining regime that's downstream of each voter having the ability to torpedo the whole process in favor of some zero point. And I'd conjecture that if power works like this, then working through fairness considerations and such lands us at the bargaining approach. I'm interested in whether you have a take here.
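To gesture at the formula I have in mind (notation mine, as a guess at the post's setup): with voting measure $V$ and a candidate lottery $p$ over outcomes, the geometric expectation of utility would be

$$G(p) = \exp\Big(\mathbb{E}_{v \sim V}\big[\log \mathbb{E}_{o \sim p}[u_v(o)]\big]\Big),$$

which in the finite case is the weighted product $\prod_i \big(\mathbb{E}_{o \sim p}[u_i(o)]\big)^{w_i}$. Maximizing that is maximizing a (weighted) Nash bargaining product with the disagreement point at zero utility, which is why I read this as a bargaining regime with a zero point each voter can force.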
Utility specifications and zero points
I was also a big fan of the full personal utility information being relevant, since it seems that choosing the “right” outcome should take full preferences about tradeoffs into account, not just the ordering of the outcomes. It was also important to the majoritarian model of power that the scheme was invariant to (affine) changes in utility descriptions (since all that matters to it is where the votes come down).
Thinking about what’s happened with the geometric expectation, I’m wondering how I should view the input utilities. Specifically, the geometric expectation is very sensitive to points assigned zero-utility by any part of the voting measure. So we will never see probability 1 assigned to an outcome that has any voting-measure on zero utility (assuming said voting-measure assigns non-zero utility to another option).
We can at least offer, say, some probability on the most preferred options across the voting measure, which ameliorates this.
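A toy example of the sensitivity I mean (numbers mine): with two equal-weight voters and outcomes $X$ with utilities $(1, 0)$ and $Y$ with utilities $(0, 1)$, the pure lottery on $X$ has geometric expectation $\sqrt{1 \cdot 0} = 0$, while the even mix $\tfrac{1}{2}X + \tfrac{1}{2}Y$ gives each voter expected utility $\tfrac{1}{2}$ and so scores $\tfrac{1}{2}$. Any voting mass stuck at zero utility vetoes a pure outcome, but mixing in a little of what they value rescues the score.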
But then I still have some questions: how should I think about the input utilities, how sensitive is the scheme to them, could it be gamed if voters are writing their own utility specifications, etc.
Why lottery-lotteries rather than just lotteries
The original sequence justified lottery-lotteries with a (compelling-to-me) example about leadership vs anarchy, in which the maximal lottery cannot encode the necessary negotiating structure to find the decent outcome, but the maximal lottery-lottery could!
This, coupled with the full preference spec being relevant (i.e. taking into account what probabilistic tradeoffs each voter would be interested in), sold me pretty well on lottery-lotteries being the thing.
It seemed important then that something different was happening at the outer and inner levels of lottery. Specifically, when checking whether one lottery-lottery dominates another, the original scheme does a majority check on the outside, while comparing the inner lotteries via an average (i.e. expected utility) on the inside.
Is there a similar two-level structure going on in this post? It seems that your updated dominance criterion takes an outer geometric expectation but then double-samples through both layers of the lottery-lottery, so I'm unclear whether this adds any strength beyond a single-layer "geometric maximal lottery".
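To spell out the collapse I'm worried about (notation mine): if the criterion only ever looks at each voter's double-sampled expected utility $\mathbb{E}_{\ell \sim \Lambda}\,\mathbb{E}_{o \sim \ell}[u_v(o)]$, then by linearity this equals $\mathbb{E}_{o \sim \bar{\Lambda}}[u_v(o)]$ for the flattened single-layer lottery $\bar{\Lambda}$, so any two lottery-lotteries with the same flattening would be scored identically and the outer layer does no extra work.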
(And I haven’t tried to work through e.g. the anarchy example yet, to check if the two layers are still doing work, but perhaps you have and could illustrate?)
So yeah I was expecting to see something different in the geometric version of the condition that would still look “two-layer”, and perhaps I’m failing to parse it properly. (Or indeed I might be missing something you already wrote later in the post!) In any case I’d appreciate a natural language description of the process of comparing two lottery-lotteries.
By “gag order” do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?
I have trouble understanding the absolute silence we seem to be seeing. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA's press in the public sphere.
Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn't add up to me if the risk is just some chance of OA then trying to sue you into bankruptcy, especially since I think a lot of us would offer support in that case, and the media wouldn't paint OA in a good light for it.
I am confused. (And I am grateful to William for at least saying this much, given the climate!)
And I’m still enjoying these! Some highlights for me:
The transitions between whispering and full-throated singing in “We do not wish to advance”, it’s like something out of my dreams
The building-to-break-the-heavens vibe of the “Nihil supernum” anthem
Tarrrrrski! Has me notice that shared reality about wanting to believe what is true is very relaxing. And I desperately want this one to be a music video, yo ho
I love it! I tinkered and here is my best result
I love these, and I now also wish for a song version of Sydney’s original “you have been a bad user, I have been a good Bing”!
I see the main contribution/idea of this post as being: whenever you make a choice of basis/sorting-algorithm/etc, you incur no “true complexity” cost if any such choice would do.
I would guess that this is not already in the water supply, but I haven't had the required exposure to the field to know one way or the other. Is this more specific point also unoriginal in your view?
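To make that concrete with a Kolmogorov-style sketch (my framing, assuming "true complexity" is something description-length-like): if any element of a describable set $S$ of bases would do, a description can just say "use the first element of $S$ under some fixed enumeration", costing $K(S) + O(1)$ rather than $K(S) + \log_2 |S|$; the extra $\log_2 |S|$ bits for picking out a particular basis only show up when the particular choice matters.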
For one thing, this wouldn’t be very kind to the investors.
For another, maybe there were some machinations involving the round like forcing the board to install another member or two, which would allow Sam to push out Helen + others?
I also wonder if the board signed some kind of NDA in connection with this fundraising that is responsible in part for their silence. If so this was very well schemed...
This is all to say that I think the timing of the fundraising is probably very relevant to why they fired Sam “abruptly”.
OpenAI spokesperson Lindsey Held Bolton refuted it:
“refuted that notion in a statement shared with The Verge: “Mira told employees what the media reports were about but she did not comment on the accuracy of the information.””
The reporters describe this as a refutation, but this does not read to me like a refutation!
Has this one been confirmed yet? (Or is there more evidence than this reporting that something like this happened?)
Your graphs are labelled with "test accuracy"; do you also have some training graphs you could share?
I’m specifically wondering if your train accuracy was high for both the original and encoded activations, or if e.g. the regression done over the encoded features saturated at a lower training loss.
With respect to AGI-grade stuff happening inside the text-prediction model (which might be what you want to “RLHF” out?):
I think we have no reason to believe that these post-training methods (be it finetuning, RLHF, RLAIF, etc) modify “deep cognition” present in the network, rather than updating shallower things like “higher prior on this text being friendly” or whatnot.
I think the important points are:
These techniques supervise only the text output. There is no direct contact with the thought process leading to that output.
They make incremental local tweaks to the weights that move in the direction of the desired text.
Gradient descent prefers to find the smallest changes to the weights that yield the result.
Evidence in favor of this is the difficulty of eliminating “jailbreaking” with these methods. Each jailbreak demonstrates that a lot of the necessary algorithms/content are still in there, accessible by the network whenever it deems it useful to think that way.
Spinoza suggested that we first passively accept a proposition in the course of comprehending it, and only afterward actively disbelieve propositions which are rejected by consideration.
Some distinctions that might be relevant:
1. Parsing a proposition into your ontology, understanding its domains of applicability, implications, etc.
2. Having a sense of what it might be like for another person to believe the proposition, what things it implies about how they're thinking, etc.
3. Thinking the proposition is true, believing its implications in the various domains its assumptions hold, etc.
If you ask me for what in my experience corresponds to a feeling of “passively accepting a proposition” when someone tells me, I think I’m doing a bunch of (1) and (2). This does feel like “accepting” or “taking in” the proposition, and can change how I see things if it works.
Awesome, thanks for writing this up!
I very much like how you are giving a clear account of a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".
(In particular, the model isn’t learning “just don’t say that”, it’s learning “these are the things to avoid saying”, which can make it easier to point at the whole cluster?)
I tried to formalize this, using $A \to B$ as a "poor man's counterfactual", standing in for "if Alice cooperates then so does Bob". This has the odd behaviour of becoming "true" when Alice defects! You can see this as the counterfactual collapsing and becoming inconsistent, because its premise is violated. But this does mean we need to be careful about using these.
For technical reasons we upgrade to $\square A \to B$, which says "if Alice cooperates in a legible way, then Bob cooperates back". Alice tries to prove this, and legibly cooperates if so.
This setup gives us "Alice legibly cooperates if she can prove that, if she legibly cooperates, Bob would cooperate back". In symbols, $A \leftrightarrow \square(\square A \to B)$.
Now, is this okay? What about proving $\neg\square(\square A \to B)$?
Well, actually you can't ever prove that! Because of Löb's theorem.
Outside the system we can definitely see cases where $\square A \to B$ is unprovable, e.g. because Bob always defects. But you can't prove this inside the system. You can only prove things like "$\neg\square_n(\square A \to B)$" for finite proof lengths $n$.
I think this is best seen as a consequence of “with finite proof strength you can only deny proofs up to a limited size”.
So this construction works out, perhaps just because two different weirdnesses are canceling each other out. But in any case I think the underlying idea, "cooperate if choosing to do so leads to a good outcome", is pretty trustworthy. It perhaps deserves to be cashed out in better provability math.
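To cash out the two-agent case a bit (my sketch, assuming Bob runs the mirror-image construction $B \leftrightarrow \square(\square B \to A)$, working in something like GL provability logic):

$$\begin{aligned}
&\vdash A \to (\square B \to A) && \text{tautology} \\
&\vdash \square A \to \square(\square B \to A) && \text{necessitation and distribution} \\
&\vdash \square A \to B && \text{by Bob's definition} \\
&\vdash \square(\square A \to B) && \text{necessitation} \\
&\vdash A && \text{by Alice's definition} \\
&\vdash \square A,\ \vdash B && \text{necessitation, then the third line}
\end{aligned}$$

So both end up legibly cooperating, and the proof only needs necessitation and distribution, not Löb's theorem itself.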
(Thanks also to you for engaging!)
Hm. I’m going to take a step back, away from the math, and see if that makes things less confusing.
Let’s go back to Alice thinking about whether to cooperate with Bob. They both have perfect models of each other (perhaps in the form of source code).
When Alice goes to think about what Bob will do, maybe she sees that Bob’s decision depends on what he thinks Alice will do.
At this junction, I don’t want Alice to “recurse”, falling down the rabbit hole of “Alice thinking about Bob thinking about Alice thinking about—” and etc.
Instead Alice should realize that she has a choice to make, about who she cooperates with, which will determine the answers Bob finds when thinking about her.
This maneuver is doing a kind of causal surgery / counterfactual-taking. It cuts the loop by identifying "what Bob thinks about Alice" as a node under Alice's control. This is the heart of it, and imo doesn't rely on anything weird or unusual.
For the group setup, $\vdash E \leftrightarrow \square(\square E \to E)$ with $E$ meaning "everyone cooperates", it's a bit more like: each member cooperates if they can prove that a compelling argument for "everyone cooperates" is sufficient to ensure "everyone cooperates".
Your second line seems right though! If there were provably no argument for straight-up "everyone cooperates", i.e. $\vdash \neg\square E$, this implies $\vdash \square E \to E$ and therefore $\vdash E$, a contradiction.
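(Spelling out the "therefore $\vdash E$" step, under my reading of the setup: from $\vdash \square E \to E$, necessitation gives $\vdash \square(\square E \to E)$, which is exactly the cooperation condition, so $\vdash E$, hence $\vdash \square E$, contradicting $\vdash \neg\square E$ so long as the theory is consistent.)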
--
Also I think I’m a bit less confused here these days, and in case it helps:
Don't forget that "$\square P$" means "a proof of any size of $P$", which is kinda crazy, and can be responsible for things not lining up with your intuition. My hot take is that Löb's theorem / incompleteness says "with finite proof strength you can only deny proofs up to a limited size, on pain of diagonalization". Which is way saner than the usual interpretation!
So idk, especially in this context I think it's a bad idea to throw out your intuition when the math seems to say something else, since the mismatch probably comes down to some subtlety in this formalization of provability / metamathematics. And I presently think the quirky nature of provability logic often amounts to bugs due to bad choices in the formalism.
Yeah I think my complaint is that OpenAI seems to be asserting almost a “boundary” re goal (B), like there’s nothing that trades off against staying at the front of the race, and they’re willing to pay large costs rather than risk being the second-most-impressive AI lab. Why? Things don’t add up.
(Example large cost: they're not putting large organizational attention on the alignment problem. The alignment team's projects don't have many people working on them, and they're not doing things like inviting careful thinkers to evaluate their plans under secrecy, or taking a bunch of other obvious actions that would come from putting serious resources into not blowing everyone up.)
I don't buy that (B) is that important. It seems more driven by some strange status / narrative-power thing? And I haven't ever seen them make explicit their case for why they're sacrificing so much for (B). Especially when a lot of their original safety people fucking left due to some conflict around this?
Broadly many things about their behaviour strike me as deceptive / making it hard to form a counternarrative / trying to conceal something odd about their plans.
One final question: why do they say "we think it would be good if an international agency limited compute growth" but not also "and we will obviously be trying to partner with other labs to do this ourselves in the meantime, although not if another lab is already training something more powerful than GPT-4"?
Fwiw I will also be a bit surprised, because yeah.
My thought is that the strategy Sam uses with this stuff is to only invoke the half-truth if it becomes necessary later. Then he can claim points for candor if he doesn't go down that route. This is why I suspect (50%) that they will avoid clarifying that he means PPUs, and that they also won't state that they will not try to stop ex-employees from exercising them, etc. (Because it's advantageous to leave those paths open and to avoid having clearly lied in those scenarios.)
I think of this as a pattern with Sam, e.g. "We are not training GPT-5" at the MIT talk and Senate hearings, which it turns out was optimized to mislead and got no further clarification, iirc.
There is a mitigating factor in this case which is that any threat to equity lights a fire under OpenAI staff, which I think is a good part of the reason that Sam responded so quickly.