Our current big stupid: not preparing for 40% agreement
Epistemic status: lukewarm take from the gut (not brain) that feels rightish
The “Big Stupid” of the AI doomers of 2013–2023 was that AI nerds’ solution to the problem “How do we stop people from building dangerous AIs?” was “research how to build AIs”. Methods normal people would consider to stop people from building dangerous AIs, like asking governments to make it illegal to build dangerous AIs, were considered gauche. When the public turned out to be somewhat receptive to the idea of regulating AIs, doomers were unprepared.
Take: The “Big Stupid” of right now is still the same thing. (We’ve not corrected enough). Between now and transformative AGI we are likely to encounter a moment where 40% of people realize AIs really could take over (say if every month another 1% of the population loses their job). If 40% of the world were as scared of AI loss-of-control as you, what could the world do? I think a lot! Do we have a plan for then?
Almost every LessWrong post on AIs is about analyzing AIs. Almost none are about how, given widespread public support, people and governments could stop bad AIs from being built.
[Example: if 40% of people were as worried about AI as I am, the US would treat GPU manufacture like uranium enrichment. And fortunately GPU manufacture is hundreds of times harder than uranium enrichment! We should be nerding out researching integrated circuit supply chains, choke points, foundry logistics in jurisdictions the US can’t unilaterally sanction, that sort of thing.]
TL;DR: stopping deadly AIs from being built needs less research on AIs and more research on how to stop AIs from being built.
the problem “How do we stop people from building dangerous AIs?” was “research how to build AIs”.
Not quite. It was to research how to build friendly AIs. We haven’t succeeded yet. What research progress we have made points to the problem being harder than initially thought, and capabilities turned out to be easier than most of us expected as well.
Methods normal people would consider to stop people from building dangerous AIs, like asking governments to make it illegal to build dangerous AIs, were considered gauche.
Considered by whom? Rationalists? The public? The public would not have been so supportive before ChatGPT, because hardly anyone expected general AI so soon, if they thought about the topic at all. It wasn’t an option at the time. Talking about this at all was weird, or at least niche, certainly not something one could reasonably expect politicians to care about. That has changed, but only recently.
I don’t particularly disagree with your prescription in the short term, just your history. That said, politics isn’t exactly our strong suit.
But even if we get a pause, this only buys us some time. In the long(er) term, I think either the Singularity or some kind of existential catastrophe is inevitable. Those are the attractor states. Our current economic growth isn’t sustainable without technological progress to go with it. Without that, we’re looking at civilizational collapse. But with that, we’re looking at ever widening blast radii for accidents or misuse of more and more powerful technology. Either we get smarter about managing our collective problems, or they will eventually kill us. Friendly AI looked like the way to do that. If we solve that one problem, even without world cooperation, it solves all the others for us. It’s probably not the only way, but it’s not clear the alternatives are any easier. What would you suggest?
I can think of three alternatives.
First, the most mundane (but perhaps most difficult), would be an adequate world government. This would be an institution that could easily solve climate change, ban nuclear weapons (and wars in general), etc. Even modern stable democracies are mostly not competent enough. Autocracies are an obstacle, and some of them have nukes. We are not on track to get this any time soon, and much of the world is not on board with it, but I think progress in the area of good governance and institution building is worthwhile. Charter cities are among the things I see discussed here.
Second might be intelligence enhancement through brain-computer interfaces. Neuralink exists, but it’s early days. So far, it’s relatively low bandwidth. Probably enough to restore some sight to the blind and some action to the paralyzed, but not enough to make us any smarter. It might take AI assistance to get to that point any time soon, but current AIs are not able, and future ones will be even more of a risk. This would certainly be of interest to us.
Third would be intelligence enhancement through biotech/eugenics. I think this looks like encouraging the smartest to reproduce more rather than the misguided and inhumane attempts of the past to remove the deplorables from the gene pool. Biotech can speed this up with genetic screening and embryo selection. This seems like the approach most likely to actually work (short of actually solving alignment), but this would still take a generation or two at best. I don’t think we can sustain a pause that long. Any enforcement regime would have too many holes to work indefinitely, and civilization is still in danger for the other reasons. Biological enhancement is also something I see discussed on LessWrong.
There are some efforts in the governance space and in the space of public awareness, but there should and can be much, much more.
My read of these survey results is:
AI Alignment researchers are optimistic people by nature. Despite this, most of them don’t think we’re on track to solve alignment in time, and they are split on whether we will even make significant progress. Most of them also support pausing AI development to give alignment research time to catch up.
As for what to actually do about it: There are a lot of options, but I want to highlight PauseAI. (Disclosure: I volunteer with them. My involvement brings me no monetary benefit, and no net social benefit.) Their Discord server is highly active and engaged and is peopled with alignment researchers, community- and mass-movement organizers, experienced protesters, artists, developers, and a swath of regular people from around the world. They play the inside and outside game, both doing public outreach and also lobbying policymakers.
On that note, I also want to put a spotlight on the simple action of sending emails to policymakers. Doing so and following through is extremely OP (i.e. has much more utility than you might expect), and can result in face-to-face meetings to discuss the nature of AI x-risk and what they can personally do about it. Genuinely, my model of a world in 2040 that contains humans is almost always one in which a lot more people sent emails to politicians.
I promise I won’t just continue to re-post a bunch of papers, but this one seems relevant to many around these parts. In particular @Elizabeth (also, sorry if you dislike being at-ed like that).
Food preferences significantly influence dietary choices, yet understanding natural dietary patterns in populations remains limited. Here we identify four dietary subtypes by applying data-driven approaches to food-liking data from 181,990 UK Biobank participants: ‘starch-free or reduced-starch’ (subtype 1), ‘vegetarian’ (subtype 2), ‘high protein and low fiber’ (subtype 3) and ‘balanced’ (subtype 4). These subtypes varied in diverse brain health domains. The individuals with a balanced diet demonstrated better mental health and superior cognitive functions relative to the other three subtypes. Compared with subtype 4, subtype 3 displayed lower gray matter volumes in regions such as the postcentral gyrus, while subtype 2 showed higher volumes in the thalamus and precuneus. Genome-wide association analyses identified 16 genes differing between subtype 3 and subtype 4, enriched in biological processes related to mental health and cognition. These findings provide new insights into naturally developed dietary patterns, highlighting the importance of a balanced diet for brain health.
Epistemic Note: Many highly respected community members with substantially greater decision-making experience (and LessWrong karma) presumably disagree strongly with my conclusion.
Premise 1: It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.
Premise 2: This was the default outcome.
Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.
Premise 3: Without repercussions for terrible decisions, decision makers have no skin in the game.
To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties.
This must include Holden Karnofsky and Paul Christiano, both of whom were closely involved.
To quote OpenPhil: “OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela.”
To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties.
I like a lot of this post, but the sentence above seems very out of touch to me. Who are these third parties who are completely objective? Why is objective the adjective here, instead of “good judgement” or “predicted this problem at the time”?
I downvoted this comment because it felt uncomfortably scapegoat-y to me. If you think the OpenAI grant was a big mistake, it’s important to have a detailed investigation of what went wrong, and that sort of detailed investigation is most likely to succeed if you have cooperation from people who are involved. I’ve been reading a fair amount about what it takes to instill a culture of safety in an organization, and nothing I’ve seen suggests that scapegoating is a good approach.
Writing a postmortem is not punishment—it is a learning opportunity for the entire company.
...
Blameless postmortems are a tenet of SRE culture. For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the “wrong” thing prevails, people will not bring issues to light for fear of punishment.
Blameless culture originated in the healthcare and avionics industries where mistakes can be fatal. These industries nurture an environment where every “mistake” is seen as an opportunity to strengthen the system. When postmortems shift from allocating blame to investigating the systematic reasons why an individual or team had incomplete or incorrect information, effective prevention plans can be put in place. You can’t “fix” people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.
...
Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization [Boy13].
...
We can say with confidence that thanks to our continuous investment in cultivating a postmortem culture, Google weathers fewer outages and fosters a better user experience.
If you start with the assumption that there was a moral failing on the part of the grantmakers, and you are wrong, there’s a good chance you’ll never learn that.
If you start with the assumption that there was a moral failing on the part of the grantmakers, and you are wrong, there’s a good chance you’ll never learn that.
I think you are misinterpreting the grandparent comment. I do not read any mention of a ‘moral failing’ in that comment. You seem worried because of the commenter’s clear description of what they think would be a sensible step for us to take given what they believe are egregious flaws in the decision-making processes of the people involved. I don’t think there’s anything wrong with such claims.
Again: You can care about people while also seeing their flaws and noticing how they are hurting you and others you care about. You can be empathetic to people having flawed decision making and care about them, while also wanting to keep them away from certain decision-making positions.
If you think the OpenAI grant was a big mistake, it’s important to have a detailed investigation of what went wrong, and that sort of detailed investigation is most likely to succeed if you have cooperation from people who are involved.
Oh, interesting. Who exactly do you think influential people like Holden Karnofsky and Paul Christiano are accountable to? This “detailed investigation” you speak of, and this notion of a “blameless culture”, makes a lot of sense when you are the head of an organization conducting an investigation into the systematic mistakes made by people who work for you, and who you are responsible for. I don’t think this situation is similar enough that you can use these intuitions blindly without thinking through the actual causal factors involved.
Note that I don’t necessarily endorse the grandparent comment claims. This is a complex situation and I’d spend more time analyzing it and what occurred.
Enforcing social norms to prevent scapegoating also destroys information that is valuable for accurate credit assignment and causally modelling reality.
I read the Ben Hoffman post you linked. I’m not finding it very clear, but the gist seems to be something like: Statements about others often import some sort of good/bad moral valence; trying to avoid this valence can decrease the accuracy of your statements.
If OP was optimizing purely for descriptive accuracy, disregarding everyone’s feelings, that would be one thing. But the discussion of “repercussions” before there’s been an investigation goes into pure-scapegoating territory if you ask me.
I do not read any mention of a ‘moral failing’ in that comment.
If OP wants to clarify that he doesn’t think there was a moral failing, I expect that to be helpful for a post-mortem. I expect some other people besides me also saw that subtext, even if it’s not explicit.
You can be empathetic to people having flawed decision making and care about them, while also wanting to keep them away from certain decision-making positions.
“Keep people away” sounds like moral talk to me. If you think someone’s decision-making is actively bad, i.e. you’d be better off reversing any advice from them, then maybe you should keep them around so you can do that! But more realistically, someone who’s fucked up in a big way will probably have learned from that, and functional cultures don’t throw away hard-won knowledge.
Imagine a world where AI is just an inherently treacherous domain, and we throw out the leadership whenever they make a mistake. So we get a continuous churn of inexperienced leaders in an inherently treacherous domain—doesn’t sound like a recipe for success!
Oh, interesting. Who exactly do you think influential people like Holden Karnofsky and Paul Christiano are accountable to? This “detailed investigation” you speak of, and this notion of a “blameless culture”, makes a lot of sense when you are the head of an organization conducting an investigation into the systematic mistakes made by people who work for you, and who you are responsible for. I don’t think this situation is similar enough that you can use these intuitions blindly without thinking through the actual causal factors involved.
I agree that changes things. I’d be much more sympathetic to the OP if they were demanding an investigation or an apology.
But the discussion of “repercussions” before there’s been an investigation goes into pure-scapegoating territory if you ask me.
Just to be clear, OP themselves seem to think that what they are saying will have little effect on the status quo. They literally called it “Very Spicy Take”. Their intention was to allow them to express how they felt about the situation. I’m not sure why you find this threatening, because again, the people they think ideally wouldn’t continue to have influence over AI safety related decisions are incredibly influential and will very likely continue to have the influence they currently possess. Almost everyone else in this thread implicitly models this fact as they are discussing things related to the OP comment.
There is not going to be any scapegoating that will occur. I imagine that everything I say is something I would say in person to the people involved, or to third parties, and not expect any sort of coordinated action to reduce their influence—they are that irreplaceable to the community and to the ecosystem.
So basically, I think it is a bad idea and you think we can’t do it anyway. In that case let’s stop calling for it, and call for something more compassionate and realistic like a public apology.
I’ll bet an apology would be a more effective way to pressure OpenAI to clean up its act anyways. Which is a better headline—“OpenAI cofounder apologizes for their role in creating OpenAI”, or some sort of internal EA movement drama? If we can generate a steady stream of negative headlines about OpenAI, there’s a chance that Sam is declared too much of a PR and regulatory liability. I don’t think it’s a particularly good plan, but I haven’t heard a better one.
Can you not be close friends with someone while also expecting them to be bad at self-control when it comes to alcohol? Or perhaps they are great at technical stuff like research but pretty bad at negotiation, especially when dealing with experienced adversarial actors such as VCs?
If you think someone’s decisionmaking is actively bad, i.e. you’d better off reversing any advice from them, then maybe you should keep them around so you can do that!
It is not that people’s decision-making skill is optimized such that you can consistently reverse their opinions to get something that accurately tracks reality. If that were the case, they would implicitly be tracking reality very well already. Reversed stupidity is not intelligence.
But more realistically, someone who’s fucked up in a big way will probably have learned from that, and functional cultures don’t throw away hard-won knowledge.
Again you seem to not be trying to track the context of our discussion here. This advice again is usually said when it comes to junior people embedded in an institution, because the ability to blame someone and / or hold them responsible is a power that senior / executive people hold. This attitude you describe makes a lot of sense when it comes to people who are learning things, yes. I don’t know if you can plainly bring it into this domain, and you even acknowledge this in the next few lines.
Imagine a world where AI is just an inherently treacherous domain, and we throw out the leadership whenever they make a mistake.
I think it is incredibly unlikely that the rationalist community has an ability to ‘throw out’ the ‘leadership’ involved here. I find this notion incredibly silly, given the amount of influence OpenPhil has over the alignment community, especially through their funding (including the pipeline, such as MATS).
It is not that people’s decision-making skill is optimized such that you can consistently reverse their opinions to get something that accurately tracks reality. If that were the case, they would implicitly be tracking reality very well already. Reversed stupidity is not intelligence.
Sure, I think this helps tease out the moral valence point I was trying to make. “Don’t allow them near” implies their advice is actively harmful, which in turn suggests that reversing it could be a good idea. But as you say, this is implausible. A more plausible statement is that their advice is basically noise—you shouldn’t pay too much attention to it. I expect OP would’ve said something like that if they were focused on descriptive accuracy rather than scapegoating.
Another way to illuminate the moral dimension of this conversation: If we’re talking about poor decision-making, perhaps MIRI and FHI should also be discussed? They did a lot to create interest in AGI, and MIRI failed to create good alignment researchers by its own lights. Now after doing advocacy off and on for years, and creating this situation, they’re pivoting to 100% advocacy.
Could MIRI be made up of good people who are “great at technical stuff”, yet apt to shoot themselves in the foot when it comes to communicating with the public? It’s hard for me to imagine an upvoted post on this forum saying “MIRI shouldn’t be allowed anywhere near AI safety communications”.
What about large numbers of people working at OpenAI directly on capabilities for many years? (Which is surely worth far more than $30 million.)
Separately, this grant seems to have been done to influence the governance at OpenAI, not to make OpenAI go faster. (Directly working on capabilities seems modestly more accelerating and risky than granting money in exchange for a partnership.)
(ETA: TBC, there is a relationship between the grant and people working at OpenAI on capabilities: the grant was associated with a general vague endorsement of trying to play inside game at OpenAI.)
> We expect the primary benefits of this grant to stem from our partnership with OpenAI, rather than simply from contributing funding toward OpenAI’s work. While we would also expect general support for OpenAI to be likely beneficial on its own, the case for this grant hinges on the benefits we anticipate from our partnership, particularly the opportunity to help play a role in OpenAI’s approach to safety and governance issues.
So the case for the grant wasn’t “we think it’s good to make OAI go faster/better”.
Why do you think the grant was bad? E.g. I don’t think “OAI is bad” would suffice to establish that the grant was bad.
On a meta note, IF proposition 2 is true, THEN the best way to tell this would be if people had been saying so AT THE TIME. If instead, actually everyone at the time disagreed with proposition 2, then it’s not clear that there’s someone “we” know to hand over decision making power to instead. Personally, I was pretty new to the area, and as a Yudkowskyite I’d probably have reflexively decried giving money to any sort of non-X-risk-pilled non-alignment-differential capabilities research. But more to the point, as a newcomer, I wouldn’t have tried hard to have independent opinions about stuff that wasn’t in my technical focus area, or to express those opinions with much conviction, maybe because it seemed like Many Highly Respected Community Members With Substantially Greater Decision Making Experience would know far better, and would not have the time or the non-status to let me in on the secret subtle reasons for doing counterintuitive things. Now I think everyone’s dumb and everyone should say their opinions a lot so that later they can say that they’ve been saying this all along. I’ve become extremely disagreeable in the last few years, I’m still not disagreeable enough, and approximately no one I know personally is disagreeable enough.
In 2019, OpenAI restructured to ensure that the company could raise capital in pursuit of this mission, while preserving the nonprofit’s mission, governance, and oversight. The majority of the board is independent, and the independent directors do not hold equity in OpenAI.
A serious effective altruism movement would clean house. Everyone who pushed the ‘work with AI capabilities company’ line should retire or be forced to retire. There is no need to blame anyone for mistakes, the decision makers had reasons. But they chose wrong and should not continue to be leaders.
Do you think that whenever anyone makes a decision that ends up being bad ex-post they should be forced to retire?
Doesn’t this strongly disincentivize making positive EV bets which are likely to fail?
Edit: I interpreted this comment as a generic claim about how the EA community should relate to things which went poorly ex-post, I now think this comment was intended to be less generic.
Not OP, but I take the claim to be “endorsing getting into bed with companies on-track to make billions of dollars profiting from risking the extinction of humanity in order to nudge them a bit, is in retrospect an obviously doomed strategy, and yet many self-identified effective altruists trusted their leadership to have secret good reasons for doing so and followed them in supporting the companies (e.g. working there for years including in capabilities roles and also helping advertise the company jobs). now that a new consensus is forming that it indeed was obviously a bad strategy, it is also time to have evaluated the leadership’s decision as bad at the time of making the decision and impose costs on them accordingly, including loss of respect and power”.
So no, not disincentivizing making positive EV bets, but updating about the quality of decision-making that has happened in the past.
endorsing getting into bed with companies on-track to make billions of dollars profiting from risking the extinction of humanity in order to nudge them a bit
I interpreted the comment as being more general than this. (As in, if someone does something that works out very badly, they should be forced to resign.)
Upon rereading the comment, it reads as less generic than my original interpretation. I’m not sure if I just misread the comment or if it was edited. (Would be nice to see the original version if actually edited.)
(Edit: Also, you shouldn’t interpret my comment as an endorsement of or agreement with the rest of the content of Ben’s comment.)
By “positive EV bets” I meant positive EV with respect to shared values, not with respect to personal gain.
Edit: Maybe your view is that leaders should take these bets anyway even though they know they are likely to result in a forced retirement. (E.g. ignoring the disincentive.) I was actually thinking of the disincentive effect as: you are actually a good leader, so you remaining in power would be good; therefore you should avoid actions that result in you losing power for unjustified reasons; therefore you should avoid making positive EV bets (as making these bets is now overall negative EV, since they will result in a forced leadership transition, which is bad). More minimally, you strongly select for leaders who don’t make such bets.
“ETA” commonly is short for “estimated time of arrival”. I understand you are using it to mean “edited” but I don’t quite know what it is short for, and also it seems like using this is just confusing for people in general.
I would be happy to defend roughly the position above (I don’t agree with all of it, but agree with roughly something like “the strategy of trying to play the inside game at labs was really bad, failed in predictable ways, and has deeply eroded trust in community leadership due to the adversarial dynamics present in such a strategy and many people involved should be let go”).
I do think most people who disagree with me here are under substantial confidentiality obligations and de-facto non-disparagement obligations (such as really not wanting to imply anything bad about Anthropic or wanting to maintain a cultivated image for policy purposes) so that it will be hard to find a good public debate partner, but it isn’t impossible.
Are you just referring to the profit incentive conflicting with the need for safety, or something else?
I’m struggling to see how we get aligned AI without “inside game at labs” in some way, shape, or form.
My sense is that evaporative cooling is the biggest thing which went wrong at OpenAI. So I feel OK about e.g. Anthropic if it’s not showing signs of evaporative cooling.
I have indeed been publicly advocating against the inside game strategy at labs for many years (going all the way back to 2018), predicting it would fail due to incentive issues and have large negative externalities due to conflict of interest issues. I could dig up my comments, but I am confident almost anyone who I’ve interfaced with at the labs, or who I’ve talked to about any adjacent topic in leadership would be happy to confirm.
For me, the key question in situations when leaders made a decision with really bad consequences is, “How did they engage with criticism and opposing views?”
If they did well on this front, then I don’t think it’s at all mandatory to push for leadership changes (though certainly, the worse someone’s track record gets, the more that speaks against them).
By contrast, if leaders tried to make the opposition look stupid or if they otherwise used their influence to dampen the reach of opposing views, then being wrong later is unacceptable.
Basically, I want to allow for a situation where someone was like, “this is a tough call and I can see reasons why others wouldn’t agree with me, but I think we should do this,” and then ends up being wrong, but I don’t want to allow situations where someone is wrong after having expressed something more like, “listen to me, I know better than you, go away.”
In the first situation, it might still be warranted to push for leadership changes (esp. if there’s actually a better alternative), but I don’t see it as mandatory.
The author of the original short form says we need to hold leaders accountable for bad decisions because otherwise the incentives are wrong. I agree with that, but I think it’s being too crude to tie incentives to whether a decision looks right or wrong in hindsight. We can do better and evaluate how someone went about making a decision and how they handled opposing views. (Basically, if opposing views aren’t loud enough that you’d have to actively squish them using your influence illegitimately, then the mistake isn’t just yours as the leader; it’s also that the situation wasn’t significantly obvious to others around you.) I expect that everyone who has strong opinions on things and is ambitious and agenty in a leadership position is going to make some costly mistakes. The incentives shouldn’t be such that leaders shy away from consequential interventions.
I just realized that Paul Christiano and Dario Amodei both probably have signed non-disclosure + non-disparagement contracts since they both left OpenAI.
That impacts how I’d interpret Paul’s (and Dario’s) claims and opinions (or the lack thereof), that relates to OpenAI or alignment proposals entangled with what OpenAI is doing. If Paul has systematically silenced himself, and a large amount of OpenPhil and SFF money has been mis-allocated because of systematically skewed beliefs that these organizations have had due to Paul’s opinions or lack thereof, well. I don’t think this is the case though—I expect Paul, Dario, and Holden all seem to have converged on similar beliefs (whether they track reality or not) and have taken actions consistent with those beliefs.
Regarding the situation at OpenAI, I think it’s important to keep a few historical facts in mind:
The AI alignment community has long stated that an ideal FAI project would have a lead over competing projects. See e.g. this post:
Requisite resource levels: The project must have adequate resources to compete at the frontier of AGI development, including whatever mix of computational resources, intellectual labor, and closed insights are required to produce a 1+ year lead over less cautious competing projects.
The scaling hypothesis wasn’t obviously true around the time OpenAI was founded. At that time, it was assumed that regulation was ineffectual because algorithms can’t be regulated. It’s only now, when GPUs are looking like the bottleneck, that the regulation strategy seems viable.
What happened with OpenAI? One story is something like:
AI safety advocates attracted a lot of attention in Silicon Valley with a particular story about AI dangers and what needed to be done.
Part of this story involved an FAI project with a lead over competing projects. But the story didn’t come with easy-to-evaluate criteria for whether a leading project counted as a good “FAI project” or a bad “UFAI project”. Thinking about AI alignment is epistemically cursed; people who think about the topic independently rarely arrive at similar models.
OpenAI hired employees with a distribution of beliefs about AI alignment difficulty, some of whom may be motivated primarily by greed or power-seeking.
At a certain point, that distribution got “truncated” with the formation of Anthropic.
Presumably at this point, every major project thinks it’s best if they win, due to self-serving biases.
Some possible lessons:
Do more message red-teaming. If an organization like AI Lab Watch had been founded 10+ years ago, and had been baked into the AI safety messaging along with “FAI project needs a clear lead”, then we could’ve spent the past 10 years building consensus on how to anoint one or just a few “FAI projects”. And the campaign for AI Pause could instead be a campaign to “pause all AGI projects except the anointed FAI project”. So—when we look back in 10 years on the current messaging, what mistakes will seem obvious in hindsight? And if this situation is partially a result of MIRI’s messaging in the past, perhaps we should ask hard questions about their current pivot towards messaging? (Note: I could be accused of grinding my personal axe here, because I’m rather dissatisfied with current AI Pause messaging.)
Assume AI acts like a magnet for greedy power-seekers. Make decisions accordingly.
We propose a new framework for reasoning about information in complex systems. Our foundation is based on a variational extension of Shannon’s information theory that takes into account the modeling power and computational constraints of the observer. The resulting \emph{predictive V-information} encompasses mutual information and other notions of informativeness such as the coefficient of determination. Unlike Shannon’s mutual information and in violation of the data processing inequality, V-information can be created through computation. This is consistent with deep neural networks extracting hierarchies of progressively more informative features in representation learning. Additionally, we show that by incorporating computational constraints, V-information can be reliably estimated from data even in high dimensions with PAC-style guarantees. Empirically, we demonstrate predictive V-information is more effective than mutual information for structure learning and fair representation learning.
My reading is that their definition of conditional predictive entropy is the natural generalization of Shannon’s conditional entropy when the ways you can condition on data are restricted to functions from a particular class. The corresponding generalization of mutual information then measures how much more predictable Y becomes given the evidence X, compared to no evidence.
For example, the goal of public-key cryptography cannot be to make the mutual information between the plaintext and the (public key, ciphertext) pair zero while maintaining maximal mutual information between the ciphertext and the plaintext given the private key, since this is impossible.
Cryptography instead assumes everyone involved can only condition their probability distributions on the data they have using polynomial-time algorithms. In that circumstance you can minimize the predictability of the plaintext given the public key and ciphertext, while maximizing its predictability given the private key and ciphertext.
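This desideratum can be written compactly in the paper’s V-information notation (my rendering, not a quote from the paper; $V_{\mathrm{poly}}$ is my label for the class of polynomial-time predictors, with $M$, $C$, $K_{\mathrm{pub}}$, $K_{\mathrm{priv}}$ denoting plaintext, ciphertext, and the two keys):

```latex
% Desired properties of public-key encryption, stated as V-information,
% with V_poly = predictors computable in polynomial time (my notation).
I_{V_{\mathrm{poly}}}\!\big((C, K_{\mathrm{pub}}) \to M\big) \approx 0
\qquad\text{and}\qquad
I_{V_{\mathrm{poly}}}\!\big((C, K_{\mathrm{priv}}) \to M\big) \approx H_{V_{\mathrm{poly}}}(M \mid \varnothing)
```

Shannon mutual information cannot express this: for typical schemes an unbounded observer could brute-force the private key, so $I((C, K_{\mathrm{pub}}); M)$ is large; only the computational restriction on $V$ makes the first quantity small.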
More mathematically, they assume you can only implement functions from your data to your conditioned probability distributions in the set of functions V, with the property that for any possible probability distribution you are able to output given the right set of data, you also have the choice of simply outputting the probability distribution without looking at the data. In other words, if you can represent it, you can output it. This corresponds to equation (1).
The Shannon entropy of a random variable Y given X is

$$H(Y\mid X) = -\iint p(x,y)\,\log p(y\mid x)\;dx\,dy$$

Thus, the predictive entropy of a random variable Y given X, when one can only condition using functions in V, is

$$H_V(Y\mid X) = \inf_{f\in V}\; -\iint p(x,y)\,\log f(y\mid x)\;dx\,dy$$

where $f(y\mid x) = f[x](y)$, if we’d like to use the notation of the paper.

Using this we can define predictive information, which as said before answers the question “how much more predictable is Y after we get the information X compared to no information?”, by

$$I_V(X \to Y) = H_V(Y\mid \varnothing) - H_V(Y\mid X)$$
which they also show can be empirically well estimated by the naive data sampling method (i.e. replacing the expectations in definition 2 with empirical samples).
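A minimal numeric sketch of these definitions, with V taken to be linear-Gaussian predictors (my illustrative choice of class, not the paper’s only option; for this V the estimate reduces to $-\tfrac{1}{2}\log(1-R^2)$, the coefficient-of-determination connection the abstract mentions):

```python
import numpy as np

def empirical_v_information(x, y):
    """Empirically estimate I_V(X -> Y) with V = linear-Gaussian predictors.

    H_V(Y | empty) uses the best constant Gaussian; H_V(Y | X) uses the best
    linear-Gaussian f[x](y). The 0.5*log(2*pi*e*var) entropy terms cancel,
    leaving a log ratio of variances.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    var_marginal = y.var()                       # best constant predictor
    A = np.column_stack([x, np.ones_like(x)])    # linear fit with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    var_residual = (y - A @ coef).var()          # best linear predictor
    return 0.5 * np.log(var_marginal / var_residual)

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y_dep = x + rng.normal(size=10_000)    # Y depends on X
y_indep = rng.normal(size=10_000)      # Y independent of X
iv_dep = empirical_v_information(x, y_dep)
iv_indep = empirical_v_information(x, y_indep)
print(iv_dep, iv_indep)  # roughly 0.35 (= 0.5*log 2) and roughly 0
```

Note that the in-sample estimate is always nonnegative here, since V contains the constant predictor; this is the “optional ignorance” property from equation (1).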
I’ve long been a skeptic of scaling LLMs to AGI*. I fundamentally don’t understand how this is even possible. It must be said that very smart people give this view credence: davidad, dmurfet. On the other side are Vanessa Kosoy and Steven Byrnes. When pushed, proponents don’t actually defend the position that a large enough transformer will create nanotech or even obsolete their own jobs. They usually mumble something about scaffolding.
I won’t get into this debate here, but I do want to note that my timelines have lengthened, primarily because some of the never-clearly-stated but heavily implied AI developments promised by proponents of very short timelines have not materialized. To be clear, it has only been a year since gpt-4 was released, and gpt-5 is around the corner, so perhaps this conclusion is premature. Still, my timelines are lengthening.
A year ago, when gpt-3 came out, progress was blindingly fast. Part of the case for short timelines came from a sense of “if we got surprised so hard by gpt-2 to gpt-3, we are completely uncalibrated; who knows what comes next?”
People seemed surprised by gpt-4 in a way that seemed uncalibrated to me. gpt-4 performance was basically in line with what one would expect if the scaling laws continued to hold. At the time it was already clear that the only really important drivers were compute and data, and that we would run out of both shortly after gpt-4. Scaling proponents suggested this was only the beginning, that there was a whole host of innovation that would be coming. Whispers of mesa-optimizers and simulators.
One year in: Chain-of-thought doesn’t actually improve things that much. External memory and super context lengths ditto. A whole list of proposed architectures seem to serve solely as a paper mill. Every month there is new hype about the latest LLM or image model. Yet they never deviate from expectations based on simple extrapolation of the scaling laws. There is only one thing that really seems to matter and that is compute and data. We have about 3 more OOMs of compute to go. Data may be milked another OOM.
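A minimal sketch of what “simple extrapolation of the scaling laws” looks like, using the Chinchilla parametric fit (the coefficients are the published Hoffmann et al. 2022 values; the model/data sizes plugged in are illustrative round numbers, not real figures for any model):

```python
# Naive extrapolation of the Chinchilla scaling-law fit
# L(N, D) = E + A/N^alpha + B/D^beta  (Hoffmann et al. 2022).
# Treating the fit as exact far outside the measured range is
# precisely the "naive" part of naive extrapolation.
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Illustrative guesses, not known figures for any actual model:
today = chinchilla_loss(1e12, 1e13)   # ~1 trillion params, ~10T tokens
later = chinchilla_loss(1e13, 1e14)   # one more OOM of params and data
print(today, later)  # roughly 1.82 and 1.75: better, with diminishing returns
```

The irreducible term E dominates as both OOMs of compute and of data are spent, which is one way to state the “only compute and data matter, and we have ~3 OOMs left” picture.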
A big question will be whether gpt-5 will suddenly make agentGPT work (and to what degree). It would seem that gpt-4 is in many ways far more capable than (most or all) humans, yet agentGPT is curiously bad.
All-in-all AI progress** is developing according to the naive extrapolations of Scaling Laws but nothing beyond that. The breathless twitter hype about new models is still there but it seems to be believed more at a simulacra level higher than I can parse.
Does this mean we’ll hit an AI winter? No. In my model there may be only one remaining roadblock to ASI (and I suspect I know what it is). That innovation could come at any time. I don’t know how hard it is, but I suspect it is not too hard.
* the term AGI seems to denote vastly different things to different people in a way I find deeply confusing. I notice that the thing that I thought everybody meant by AGI is now being called ASI. So when I write AGI, feel free to substitute ASI.
** or better, AI congress
addendum: since I’ve been quoted in dmurfet’s AXRP interview as believing that there are certain kinds of reasoning that cannot be represented by transformers/LLMs I want to be clear that this is not really an accurate portrayal of my beliefs. e.g. I don’t think transformers don’t truly understand, are just a stochastic parrot, or in other ways can’t engage in the abstract reasoning that humans do. I think this is clearly false, as seen by interacting with any frontier model.
State-of-the-art models such as Gemini aren’t LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.
I don’t recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
Chain-of-thought prompting makes models much more capable. In the original paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, PaLM 540B with standard prompting only solves 18% of problems but 57% of problems with chain-of-thought prompting.
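To make the mechanism concrete, here is roughly how the two prompting conditions differ (the exemplar text below is my paraphrase in the style of Wei et al. 2022, not a verbatim quote from the paper):

```python
# Standard prompting vs. chain-of-thought prompting: the only difference
# is that the few-shot exemplar includes worked reasoning for the model
# to imitate before its final answer.
question = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis "
            "balls each. How many tennis balls does he have now?")

standard_prompt = f"Q: {question}\nA:"

cot_exemplar = (
    "Q: There are 3 cars in the lot and 2 more arrive. "
    "How many cars are in the lot?\n"
    "A: There are 3 cars already. 2 more arrive. 3 + 2 = 5. "
    "The answer is 5.\n"
)
cot_prompt = cot_exemplar + f"Q: {question}\nA:"
print(cot_prompt)
```

Nothing about the model changes between the two conditions; the 18% vs. 57% gap on GSM8K comes entirely from the exemplar eliciting step-by-step reasoning.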
I expect the use of agent features such as reflection will lead to similar large increases in capabilities as well in the near future.
I just asked GPT-4 a GSM8K problem and I agree with your point. I think what’s happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default, so it’s no longer necessary to explicitly ask it to reason step-by-step. Though if you ask it to “respond with just a single number” to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.
When pushed proponents don’t actually defend the position that a large enough transformer will create nanotech
Can you expand on what you mean by “create nanotech?” If improvements to our current photolithography techniques count, I would not be surprised if (scaffolded) LLMs could be useful for that. Likewise for getting bacteria to express polypeptide catalysts for useful reactions, and even maybe figure out how to chain several novel catalysts together to produce something useful (again, referring to scaffolded LLMs with access to tools).
If you mean that LLMs won’t be able to bootstrap from our current “nanotech only exists in biological systems and chip fabs” world to Drexler-style nanofactories, I agree with that, but I expect things will get crazy enough that I can’t predict them long before nanofactories are a thing (if they ever are).
or even obsolete their job
Likewise, I don’t think LLMs can immediately obsolete all of the parts of my job. But they sure do make parts of my job a lot easier. If you have 100 workers that each spend 90% of their time on one specific task, and you automate that task, that’s approximately as useful as fully automating the jobs of 90 workers. “Human-equivalent” is one of those really leaky abstractions—I would be pretty surprised if the world had any significant resemblance to the world of today by the time robotic systems approached the dexterity and sensitivity of human hands for all of the tasks we use our hands for, whereas for the task of “lift heavy stuff” or “go really fast” machines left us in the dust long ago.
Iterative improvements on the timescale we’re likely to see are still likely to be pretty crazy by historical standards. But yeah, if your timelines were “end of the world by 2026” I can see why they’d be lengthening now.
My timelines were not 2026. In fact, I made bets against doomers 2-3 years ago, one will resolve by next year.
I agree iterative improvements are significant. This falls under “naive extrapolation of scaling laws”.
By nanotech I mean something akin to drexlerian nanotech or something similarly transformative in the vicinity. I think it is plausible that a true ASI will be able to make rapid progress (perhaps on the order of a few years or a decade) on nanotech.
I suspect that people that don’t take this as a serious possibility haven’t really thought through what AGI/ASI means + what the limits and drivers of science and tech really are; I suspect they are simply falling prey to status-quo bias.
With scale, there is visible improvement in the difficulty of novel-to-chatbot ideas/details that it’s possible to explain in-context, things like issues with the code it’s writing. If a chatbot is below some threshold of situational awareness of a task, no scaffolding can keep it on track, but for a better chatbot trivial scaffolding might suffice. Many people can’t google for a solution to a technical issue; the difference between them and those who can is often subtle.
So a modest amount of scaling alone seems plausibly sufficient for making chatbots that can do whole jobs almost autonomously. If this works, 1–2 OOMs more of scaling becomes both economically feasible and more likely to be worthwhile. LLMs think much faster, so they only need to be barely smart enough to help with clearing those remaining roadblocks.
At this moment in time, it seems scaffolding tricks haven’t really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the RLHFed base model can do the task.
it seems scaffolding tricks haven’t really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the RLHFed base model can do the task.
That’s what I’m also saying above (in case you are stating what you see as a point of disagreement). This is consistent with scaling-only short timeline expectations. The crux for this model is current chatbots being already close to autonomous agency and to becoming barely smart enough to help with AI research. Not them directly reaching superintelligence or having any more room for scaling.
What I don’t get about this position:
If it were indeed just scaling, what’s AI research for? There would be nothing to discover; just scale more compute. Sure, you can maybe improve the speed of deploying compute a little, but at its core it seems like a story that’s in conflict with itself.
My view is that there’s huge algorithmic gains in peak capability, training efficiency (less data, less compute), and inference efficiency waiting to be discovered, and available to be found by a large number of parallel research hours invested by a minimally competent multimodal LLM powered research team. So it’s not that scaling leads to ASI directly, it’s:
scaling leads to brute forcing the LLM agent across the threshold of AI research usefulness
Using these LLM agents in a large research project can lead to rapidly finding better ML algorithms and architectures.
Training these newly discovered architectures at large scales leads to much more competent automated researchers.
This process repeats quickly over a few months or years.
This process results in AGI.
AGI, if instructed (or allowed, if it’s agentically motivated on its own to do so) to improve itself will find even better architectures and algorithms.
This process can repeat until ASI. The resulting intelligence / capability / inference speed goes far beyond that of humans.
Note that this process isn’t inevitable, there are many points along the way where humans can (and should, in my opinion) intervene. We aren’t disempowered until near the end of this.
My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I’m quite confident aren’t going to spread the details. It’s a hard thing to discuss in detail without sharing capabilities thoughts. If I don’t give details or cite sources, then… it’s just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you’d like to bet on it, I’m open to showing my confidence in my opinion by betting that the world turns out how I expect it to.
The story involves phase changes. Just scaling is what’s likely to be available to human developers in the short term (a few years), it’s not enough for superintelligence. Autonomous agency secures funding for a bit more scaling. If this proves sufficient to get smart autonomous chatbots, they then provide speed to very quickly reach the more elusive AI research needed for superintelligence.
It’s not a little speed, it’s a lot of speed, serial speedup of about 100x plus running in parallel. This is not as visible today, because current chatbots are not capable of doing useful work with serial depth, so the serial speedup is not in practice distinct from throughput and cost. But with actually useful chatbots it turns decades to years, software and theory from distant future become quickly available, non-software projects get to be designed in perfect detail faster than they can be assembled.
In my mainline model there are only a few innovations needed, perhaps only a single big one, to produce an AGI architecture which, just as the Turing machine sits at the top of the Chomsky hierarchy, will be basically the optimal architecture given resource constraints. There are probably some minor improvements to do with bridging the gap between the theoretically optimal architecture and the actual architecture, or parts of the algorithm that can be indefinitely improved but with diminishing returns (these probably exist, per Levin; matrix multiplication is possibly one of these). On the whole I expect AI research to be very chunky.
Indeed, we’ve seen that there was really just one big idea behind all current AI progress: scaling, specifically scaling GPUs on maximally large undifferentiated datasets. There were some minor technical innovations needed to pull this off, but on the whole that was the clincher.
Of course, I don’t know. Nobody knows. But I find this the most plausible guess based on what we know about intelligence, learning, theoretical computer science and science in general.
There are two kinds of relevant hypothetical innovations: those that enable chatbot-led autonomous research, and those that enable superintelligence. It’s plausible that there is no need for (more of) the former, so that mere scaling through human efforts will lead to such chatbots in a few years regardless. (I think it’s essentially inevitable that there is currently enough compute that with appropriate innovations we can get such autonomous human-scale-genius chatbots, but it’s unclear if these innovations are necessary or easy to discover.) If autonomous chatbots are still anything like current LLMs, they are very fast compared to humans, so they quickly discover remaining major innovations of both kinds.
In principle, even if innovations that enable superintelligence (at scale feasible with human efforts in a few years) don’t exist at all, extremely fast autonomous research and engineering still lead to superintelligence, because they greatly accelerate scaling. Physical infrastructure might start scaling really fast using pathways like macroscopic biotech even if drexlerian nanotech is too hard without superintelligence or impossible in principle. Drosophila biomass doubles every 2 days, small things can assemble into large things.
Wasn’t the surprising thing about GPT-4 that scaling laws did hold? Before this many people expected scaling laws to stop before such a high level of capabilities. It doesn’t seem that crazy to think that a few more OOMs could be enough for greater than human intelligence. I’m not sure that many people predicted that we would have much faster than scaling law progress (at least until ~human intelligence AI can speed up research)? I think scaling laws are the extreme rate of progress which many people with short timelines worry about.
To some degree yes, they were not guaranteed to hold. But by that point they held for over 10 OOMs iirc and there was no known reason they couldn’t continue.
This might be the particular twitter bubble I was in but people definitely predicted capabilities beyond simple extrapolation of scaling laws.
I’m surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don’t care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.
In the spirit of trying to understand what actually went wrong here—IIRC, OpenAI didn’t expect ChatGPT to blow up the way it did. Seems like they were playing a strategy of “release cool demos” as opposed to “create a giant competitive market”.
Half a year ago, I’d have guessed that OpenAI leadership, while likely misguided, was essentially well-meaning and driven by a genuine desire to confront a difficult situation.
The recent series of events has made me update significantly against the general trustworthiness and general epistemic reliability of Altman and his circle.
While my overall view of OpenAI’s strategy hasn’t really changed, my likelihood of them possibly “knowing better” has dramatically gone down now.
You continue to model OpenAI as this black-box monolith instead of trying to unravel the dynamics inside it and understand the incentive structures that lead these things to occur. It’s a common pattern I notice in the way you interface with certain parts of reality.
I don’t consider OpenAI as responsible for this as much as Paul Christiano and Jan Leike and his team. Back in 2016 or 2017, when they initiated and led research into RLHF, they focused on LLMs because they expected that LLMs would be significantly more amenable to RLHF. That expectation drove the focus on LLMs, which made it almost inevitable that instruction-tuning would be tried on them, incrementally building up models that deliver mundane utility. It was extremely predictable that Sam Altman and OpenAI would leverage this unexpected success to gain more investment and translate that into more researchers and compute. But Sam Altman and Greg Brockman aren’t researchers, and they didn’t figure out a path that minimized ‘capabilities overhang’—Paul Christiano did. More importantly, this is not mutually exclusive with OpenAI using the additional resources for both capabilities research and (what they call) alignment research. While you might consider everything they do as effectively capabilities research, the point I am making is that this is still consistent with the hypothesis that, while they are misguided, they are roughly doing the best they can given their incentives.
What really changed my perspective here was the fact that Sam Altman seems to have been systematically destroying extremely valuable information about how we could evaluate OpenAI. Specifically, this non-disparagement clause that ex-employees cannot even mention without falling afoul of this contract, is something I didn’t expect (I did expect non-disclosure clauses but not something this extreme). This meant that my model of OpenAI was systematically too optimistic about how cooperative and trustworthy they are and will be in the future. In addition, if I was systematically deceived about OpenAI due to non-disparagement clauses that cannot even be mentioned, I would expect that something similar to also be possible when it comes to other frontier labs (especially Anthropic, but also DeepMind) due to things similar to this non-disparagement clause. In essence, I no longer believe that Sam Altman (for OpenAI is nothing but his tool now) is doing the best he can to benefit humanity given his incentives and constraints. I expect that Sam Altman is entirely doing whatever he believes will retain and increase his influence and power, and this includes the use of AGI, if and when his teams finally achieve that level of capabilities.
This is the update I expect people are making. It is about being systematically deceived at multiple levels. It is not about “OpenAI being irresponsible”.
Sometimes I forget to take a dose of methylphenidate. As my previous dose fades away, I start to feel much worse than baseline. I then think “Oh no, I’m feeling so bad, I will not be able to work at all.”
But then I remember that I forgot to take a dose of methylphenidate and instantly I feel a lot better.
Usually, one of the worst things when I’m feeling down is that I don’t know why. But now, I’m in this very peculiar situation where putting or not putting some particular object into my mouth is the actual cause. It’s hard to imagine something more tangible.
Knowing the cause makes me feel a lot better. Even when I don’t take the next dose, and still feel drowsy, it’s still easy for me to work. Simply knowing why you feel a particular way seems to make a huge difference.
Wait, some of y’all were still holding your breaths for OpenAI to be net-positive in solving alignment?
After the whole “initially having to be reminded alignment is A Thing”? And going back on its word to go for-profit? And spinning up a weird and opaque corporate structure? And people being worried about Altman being power-seeking? And everything to do with the OAI board debacle? And OAI Very Seriously proposing what (still) looks to me to be like a souped-up version of Baby Alignment Researcher’s Master Plan B (where A involves solving physics and C involves RLHF and cope)? That OpenAI? I just want to be very sure. Because if it took the safety-ish crew of founders resigning to get people to finally pick up on the issue… it shouldn’t have. Not here. Not where people pride themselves on their lightness.
My current perspective is that criticism of AGI labs is an under-incentivized public good. I suspect there’s a disproportionate amount of value that people could have by evaluating lab plans, publicly criticizing labs when they break commitments or make poor arguments, talking to journalists/policymakers about their concerns, etc.
Some quick thoughts:
Soft power– I think people underestimate how strong the “soft power” of labs is, particularly in the Bay Area.
Jobs– A large fraction of people getting involved in AI safety are interested in the potential of working for a lab one day. There are some obvious reasons for this– lots of potential impact from being at the organizations literally building AGI, big salaries, lots of prestige, etc.
People (IMO correctly) perceive that if they acquire a reputation for being critical of labs, their plans, or their leadership, they will essentially sacrifice the ability to work at the labs.
So you get an equilibrium where the only people making (strong) criticisms of labs are those who have essentially chosen to forgo their potential of working there.
Money– The labs and Open Phil (which has been perceived, IMO correctly, as investing primarily into metastrategies that are aligned with lab interests) have an incredibly large share of the $$$ in the space. When funding became more limited, this became even more true, and I noticed a very tangible shift in the culture & discourse around labs + Open Phil.
Status games//reputation– Groups who were more inclined to criticize labs and advocate for public or policymaker outreach were branded as “unilateralist”, “not serious”, and “untrustworthy” in core EA circles. In many cases, there were genuine doubts about these groups, but my impression is that these doubts got amplified/weaponized in cases where the groups were more openly critical of the labs.
Subjectivity of “good judgment”– There is a strong culture of people getting jobs/status for having “good judgment”. This is sensible insofar as we want people with good judgment (who wouldn’t?), but the assessment is often so subjective that people end up quite afraid to voice opinions that go against mainstream views and metastrategies (particularly those endorsed by labs + Open Phil).
Anecdote– Personally, I found my ability to evaluate and critique labs + mainstream metastrategies substantially improved when I spent more time around folks in London and DC (who were less closely tied to the labs). In fairness, I suspect that if I had lived in London or DC *first* and then moved to the Bay Area, it’s plausible I would’ve had a similar feeling but in the “reverse direction”.
With all this in mind, I find myself more deeply appreciating folks who have publicly and openly critiqued labs, even in situations where the cultural and economic incentives to do so were quite weak (relative to staying silent or saying generic positive things about labs).
Examples: Habryka, Rob Bensinger, CAIS, MIRI, Conjecture, and FLI. More recently, @Zach Stein-Perlman, and of course Jan Leike and Daniel K.
Noticing good stuff labs do, not just criticizing them, is often helpful. I wish you thought of this work more as “evaluation” than “criticism.”
It’s often important for evaluation to be quite truth-tracking. Criticism isn’t obviously good by default.
Edit:
3. I’m pretty sure OP likes good criticism of the labs; no comment on how OP is perceived. And I think I don’t understand your “good judgment” point. Feedback I’ve gotten on AI Lab Watch from senior AI safety people has been overwhelmingly positive, and of course there’s a selection effect in what I hear, but I’m quite sure most of them support such efforts.
4. Conjecture (not exclusively) has done things that frustrated me, including in dimensions like being “‘unilateralist,’ ‘not serious,’ and ‘untrustworthy.’” I think most criticism of Conjecture-related advocacy is legitimate and not just because people are opposed to criticizing labs.
5. I do agree on “soft power” and some of “jobs.” People often don’t criticize the labs publicly because they’re worried about negative effects on them, their org, or people associated with them.
Agreed— my main point here is that the marketplace of ideas undervalues criticism.
I think one perspective could be “we should all just aim to do objective truth-seeking”, and as stated I agree with it.
The main issue with that frame, imo, is that it’s very easy to forget that the epistemic environment can be tilted in favor of certain perspectives.
EG I think it can be useful for “objective truth-seeking efforts” to be aware of some of the culture/status games that underincentivize criticism of labs & amplify lab-friendly perspectives.
RE 3:
Good to hear that responses have been positive to lab watch. My impression is that this is a mix of: (a) lab watch doesn’t really threaten the interests of labs (especially Anthropic, which is currently winning & currently the favorite lab among senior AIS ppl), (b) the tides have been shifting somewhat and it is genuinely less taboo to criticize labs than a year ago, and (c) EAs respond more positively to criticism that feels more detailed/nuanced (look I have these 10 categories, let’s rate the labs on each dimension) than criticisms that are more about metastrategy (e.g., challenging the entire RSP frame or advocating for policymaker outreach).
RE 4: I haven’t heard anything about Conjecture that I’ve found particularly concerning. Would be interested in you clarifying (either here or via DM) what you’ve heard. (And clarification note that my original point was less “Conjecture hasn’t done anything wrong” and more “I suspect Conjecture will be more heavily scrutinized and examined and have a disproportionate amount of optimization pressure applied against it given its clear push for things that would hurt lab interests.”)
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders, but providing some money, both to reward the act and to incentivize future safety people not to sign NDAs, would have very high value.
@Daniel Kokotajlo If you indeed avoided signing an NDA, would you be able to share how much you passed up as a result of that? I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.
I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.
I wonder if it might be more effective to fund legal action against OpenAI than to compensate individual ex-employees for refusing to sign an NDA. Trying to take vested equity away from ex-employees who refuse to sign an NDA sounds likely to not hold up in court, and if we can establish a legal precedent that OpenAI cannot do this, that might make other ex-employees much more comfortable speaking out against OpenAI than the possibility that third parties might fundraise to partially compensate them for lost equity would be (a possibility you might not even be able to make every ex-employee aware of). The fact that this would avoid financially rewarding OpenAI for bad behavior is also a plus. Of course, legal action is expensive, but so is the value of the equity that former OpenAI employees have on the line.
Yeah, at the time I didn’t know how shady some of the contracts here were. I do think funding a legal defense is a marginally better use of funds (though my guess is funding both is worth it).
@habryka, Would you reply to this comment if there’s an opportunity to donate to either? Another person and I are interested, and others could follow this comment too if they wanted to
(only if it’s easy for you, I don’t want to add an annoying task to your plate)
To clarify: I did sign something when I joined the company, so I’m still not completely free to speak (still under confidentiality obligations). But I didn’t take on any additional obligations when I left.
Unclear how to value the equity I gave up, but it probably would have been about 85% of my family’s net worth at least. But we are doing fine, please don’t worry about us.
Mostly for @habryka’s sake: it sounds like you are likely describing your unvested equity, or possibly equity that gets clawed back on quitting. Neither of which is (usually) tied to signing an NDA on the way out the door—they’d both be lost simply due to quitting.
The usual arrangement is some extra severance payment tied to signing something on your way out the door, and that’s usually way less than the unvested equity.
My current best guess is that actually cashing out the vested equity is tied to an NDA, but I am really not confident. OpenAI has a bunch of really weird equity arrangements.
Can you speak to any, let’s say, “hypothetical” specific concerns that somebody who was in your position at a company like OpenAI might have had that would cause them to quit in a similar way to you?
I think the board must be thinking about how to get some independence from Microsoft, and there are not many entities who can counterbalance one of the biggest companies in the world. The government’s intelligence and defence industries are some of them (as are Google, Meta, Apple, etc). But that move would require secrecy: to stop nationalistic race conditions, to satisfy contractual obligations, and to avoid a backlash.
EDIT: I’m getting a few disagrees, would someone mind explaining why they disagree with these wild speculations?
The latter. Yeah idk whether the sacrifice was worth it but thanks for the support. Basically I wanted to retain my ability to criticize the company in the future. I’m not sure what I’d want to say yet though & I’m a bit scared of media attention.
I’d be interested in hearing peoples’ thoughts on whether the sacrifice was worth it, from the perspective of assuming that counterfactual Daniel would have used the extra net worth altruistically. Is Daniel’s ability to speak more freely worth more than the altruistic value that could have been achieved with the extra net worth?
(Note: Regardless of whether it was worth it in this case, simeon_c’s reward/incentivization idea may be worthwhile as long as there are expected to be some cases in the future where it’s worth it, since the people in those future cases may not be as willing as Daniel to make the altruistic personal sacrifice, and so we’d want them to be able to retain their freedom to speak without it costing them as much personally.)
I think having signed an NDA (and especially a non-disparagement agreement) from a major capabilities company should probably rule you out of any kind of leadership position in AI Safety, and especially any kind of policy position. Given that I think Daniel has a pretty decent chance of doing either or both of these things, and that work is very valuable and constrained on the kind of person that Daniel is, I would be very surprised if this wasn’t worth it on altruistic grounds.
Edit: As Buck points out, different non-disclosure-agreements can differ hugely in scope. To be clear, I think non-disclosure-agreements that cover specific data or information you were given seems fine, but non-disclosure-agreements that cover their own existence, or that are very broadly worded and prevent you from basically talking about anything related to an organization, are pretty bad. My sense is the stuff that OpenAI employees are asked to sign when they leave are very constraining, but my guess is the kind of stuff that people have to sign for a small amount of contract work or for events are not very constraining, though I would definitely read any contract carefully in this space.
Strong disagree re signing non-disclosure agreements (which I’ll abbreviate as NDAs). I think it’s totally reasonable to sign NDAs with organizations; they don’t restrict your ability to talk about things you learned other ways than through the ways covered by the NDA. And it’s totally standard to sign NDAs when working with organizations. I’ve signed OpenAI NDAs at least three times, I think—once when I worked there for a month, once when I went to an event they were running, once when I visited their office to give a talk.
I think non-disparagement agreements are way more problematic. At the very least, signing secret non-disparagement agreements should probably disqualify you from roles where your silence re an org might be interpreted as a positive sign.
It might be good on the current margin to have a norm of publicly listing any non-disclosure agreements you have signed (e.g. on one’s LW profile), and the rough scope of them, so that other people can model what information you’re committed to not sharing, and highlight if it is related to anything beyond the details of technical research being done (e.g. if it is about social relationships or conflicts or criticism).
I have added the one NDA that I have signed to my profile.
But everyone has lots of duties to keep secrets or preserve privacy and the ones put in writing often aren’t the most important. (E.g. in your case.)
I’ve signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.
I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.
I agree with this overall point, although I think “trade secrets” in the domain of AI can be relevant for people having surprising timelines views that they can’t talk about.
My understanding is that the extent of NDAs can differ a lot between different implementations, so it might be hard to speak in generalities here. From the revealed behavior of people I poked here who have worked at OpenAI full-time, the OpenAI NDAs seem very comprehensive and limiting. My guess is also the NDAs for contractors and for events are a very different beast and much less limiting.
Also just the de-facto result of signing non-disclosure-agreements is that people don’t feel comfortable navigating the legal ambiguity and default very strongly to not sharing approximately any information about the organization at all.
Maybe people would do better things here with more legal guidance, and I agree that you don’t generally seem super constrained in what you feel comfortable saying, but like I sure now have run into lots of people who seem constrained by NDAs they signed (even without any non-disparagement component). Also, if the NDA has a gag clause that covers the existence of the agreement, there is no way to verify the extent of the NDA, and that makes navigating this kind of stuff super hard and also majorly contributes to people avoiding the topic completely.
What are your timelines like? How long do YOU think we have left?
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self’s creation. However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
One AGI CEO hasn’t gone THAT crazy (yet), but is quite sure that the November 2024 election will be meaningless because pivotal acts will have already occurred that make nation state elections visibly pointless.
Also I know many normies who can’t really think probabilistically and mostly aren’t worried at all about any of this… but one normie who can calculate is pretty sure that we have AT LEAST 12 years (possibly because his retirement plans won’t be finalized until then). He also thinks that even systems as “mere” as TikTok will be banned before the November 2024 election because “elites aren’t stupid”.
I think I’m more likely to be better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”
new observations > new thoughts when it comes to calibrating yourself.
The best calibrated people are people who get lots of interaction with the real world, not those who think a lot or have a complicated inner model. Tetlock’s super forecasters were gamblers and weathermen.
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self’s creation
Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I’d give it at least even odds, with most of the remaining chances being meditation or Eastern religions...
Wait, you know smart people who have NOT, at some point in their life: (1) taken a psychedelic NOR (2) meditated, NOR (3) thought about any of buddhism, jainism, hinduism, taoism, confucianism, etc???
To be clear to naive readers: psychedelics are, in fact, non-trivially dangerous.
I personally worry I already have “an arguably-unfair and a probably-too-high share” of “shaman genes” and I don’t feel I need exogenous sources of weirdness at this point.
But in the SF bay area (and places on the internet memetically downstream from IRL communities there) a lot of that is going around, memetically (in stories about) and perhaps mimetically (via monkey see, monkey do).
The first time you use a serious one you’re likely getting a permanent modification to your personality (+0.5 stddev to your Openness?) and arguably/sorta each time you do a new one, or do a higher dose, or whatever, you’ve committed “1% of a personality suicide” by disrupting some of your most neurologically complex commitments.
To a first approximation my advice is simply “don’t do it”.
HOWEVER: this latter consideration actually suggests: anyone seriously and truly considering suicide should perhaps take a low dose psychedelic FIRST (with at least two loving tripsitters and due care) since it is also maybe/sorta “suicide” but it leaves a body behind that most people will think is still the same person and so they won’t cry very much and so on?
To calibrate this perspective a bit, I also expect that even if cryonics works, it will also cause an unusually large amount of personality shift. A tolerable amount. An amount that leaves behind a personality that is similar-enough-to-the-current-one-to-not-have-triggered-a-ship-of-theseus-violation-in-one-modification-cycle. Much more than a stressful day and then bad nightmares and a feeling of regret the next day, but weirder. With cryonics, you might wake up to some effects that are roughly equivalent to “having taken a potion of youthful rejuvenation, and not having the same birthmarks, and also learning that you’re separated-by-disjoint-subjective-deaths from LOTS of people you loved when you experienced your first natural death” for example. This is a MUCH BIGGER CHANGE than just having a nightmare and waking up with a change of heart (and most people don’t have nightmares and changes of heart every night (at least: I don’t and neither do most people I’ve asked)).
A good “axiological practice” (which I don’t know of anyone working on except me (and I’m only doing it a tiny bit, not with my full mental budget)) is sort of an idealized formal praxis for making yourself robust to “humanely heartful emotional changes”(?) and changing only in <PROPERTY-NAME-TBD> ways from such events.
(Edited to add: Current best candidate name for this property is: “WISE” but maybe “healthy” works? (It depends on whether the Stoics or Nietzsche were “more objectively correct” maybe? The Stoics, after all, were erased and replaced by Platonism-For-The-Masses (AKA “Christianity”) so if you think that “staying implemented in physics forever” is critically important then maybe “GRACEFUL” is the right word? (If someone says “vibe-alicious” or “flowful” or “active” or “strong” or “proud” (focusing on low latency unity achieved via subordination to simply and only power) then they are probably downstream of Heidegger and you should always be ready for them to change sides and submit to metaphorical Nazis, just as Heidegger subordinated himself to actual Nazis without really violating his philosophy at all.)))
I don’t think that psychedelics fit neatly into EITHER category. Drugs in general are akin to wireheading, except wireheading is when something reaches into your brain to overload one or more of your positive-value-tracking-modules (as a trivially semantically invalid shortcut to achieving positive value “out there” in the state-of-affairs that your tracking modules are trying to track), but actual humans have LOTS of <thing>-tracking-modules, and culture and science barely have any RIGOROUS vocabulary for any of them.
Note that many of these neurological <thing>-tracking-modules were evolved.
Also, many of them will probably be “like hands” in terms of AI’s ability to model them.
This is part of why AI’s should be existentially terrifying to anyone who is spiritually adept.
AI that sees the full set of causal paths to modifying human minds will be “like psychedelic drugs with coherent persistent agendas”. Humans have basically zero cognitive security systems. Almost all security systems are culturally mediated, and then (absent complex interventions) lots of the brain stuff freezes in place around the age of puberty, and then other stuff freezes around 25, and so on. This is why we protect children from even TALKING to untrusted adults: they are too plastic and not savvy enough. (A good heuristic for the lowest level of “infohazard” is “anything you wouldn’t talk about in front of a six year old”.)
Humans are sorta like a bunch of unpatchable computers, exposing “ports” to the “internet”, where each of our port numbers is simply a lightly salted semantic hash of an address into some random memory location that stores everything, including our operating system.
Your word for “drugs” and my word for “drugs” don’t point to the same memory addresses in the computers implementing our souls. Also our souls themselves don’t even have the same nearby set of “documents” (because we just have different memories n’stuff)… but the word “drugs” is not just one of the ports… it is a port that deserves a LOT of security hardening.
The bible said ~”thou shalt not suffer a ‘pharmakeia’ to live” for REASONS.
Wondering why this has so many disagreement votes. Perhaps people don’t like to see the serious topic of “how much time do we have left”, alongside evidence that there’s a population of AI entrepreneurs who are so far removed from consensus reality, that they now think they’re living in a simulation.
(edit: The disagreement for @JenniferRM’s comment was at something like −7. Two days later, it’s at −2)
It could just be because it reaches a strong conclusion on anecdotal/clustered evidence (e.g. it might say more about her friend group than anything else). Along with claims to being better calibrated for weak reasons—which could be true, but seems not very epistemically humble.
Full disclosure I downvoted karma, because I don’t think it should be top reply, but I did not agree or disagree.
But Jen seems cool, I like weird takes, and downvotes are not a big deal—just a part of a healthy contentious discussion.
For most of my comments, I’d almost be offended if I didn’t say something surprising enough to get a “high interestingness, low agreement” voting response. Excluding speech acts, why even say things if your interlocutor or full audience can predict what you’ll say?
And I usually don’t offer full clean proofs in direct word. Anyone still pondering the text at the end, properly, shouldn’t “vote to agree”, right? So from my perspective… its fine and sorta even working as intended <3
However, also, this is currently the top-voted response to me, and if William_S himself reads it I hope he answers here, if not with text then (hopefully? even better?) with a link to a response elsewhere?
((EDIT: Re-reading everything above this point, I notice that I totally left out the “basic take” that might go roughly like “Kurzweil, Altman, and Zuckerberg are right about compute hardware (not software or philosophy) being central, and there’s a compute bottleneck rather than a compute overhang, so the speed of history will KEEP being about datacenter budgets and chip designs, and those happen on 6-to-18-month OODA loops that could actually fluctuate based on economic decisions, and therefore it’s maybe 2026, or 2028, or 2030, or even 2032 before things pop, depending on how and when billionaires and governments decide to spend money”.))
Pulling honest posteriors from people who’ve “seen things we wouldn’t believe” gives excellent material for trying to perform aumancy… work backwards from their posteriors to possible observations, and then forwards again, toward what might actually be true :-)
I assume timelines are fairly long or this isn’t safety related. I don’t see a point in keeping PPUs or even caring about NDA lawsuits which may or may not happen and would take years in a short timeline or doomed world.
I think having a probability distribution over timelines is the correct approach. Like, in the comment above:
I think I’m more likely to be better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
Even in probabilistic terms, the evidence of OpenAI members respecting their NDAs makes it more likely that this was some sort of political infighting (EA related) than sub-year takeoff timelines. I would be open to a 1 year takeoff, I just don’t see it happening given the evidence. OpenAI wouldn’t need to talk about raising trillions of dollars, companies wouldn’t be trying to commoditize their products, and the employees who quit OpenAI would speak up.
Political infighting is in general just more likely than very short timelines, which would run counter to most prediction markets on the matter. Not to mention, given it’s already happened with the firing of Sam Altman, it’s far more likely to have happened again.
If there was a probability distribution of timelines, the current events indicate sub 3 year ones have negligible odds. If I am wrong about this, I implore the OpenAI employees to speak up. I don’t think normies misunderstand probability distributions, they just usually tend not to care about unlikely events.
No, OpenAI (assuming that it is a well-defined entity) also uses a probability distribution over timelines.
(In reality, every member of its leadership has their own probability distribution, and this translates to OpenAI having a policy and behavior formulated approximately as if there is some resulting single probability distribution).
The important thing is, they are uncertain about timelines themselves (in part, because no one knows how perplexity translates to capabilities, in part, because there might be difference with respect to capabilities even with the same perplexity, if the underlying architectures are different (e.g. in-context learning might depend on architecture even with fixed perplexity, and we do see a stream of potentially very interesting architectural innovations recently), in part, because it’s not clear how big is the potential of “harness”/”scaffolding”, and so on).
This does not mean there is no political infighting. But it’s on the background of them being correctly uncertain about true timelines...
Compute-wise, inference demands are huge and growing with the popularity of the models (look how much Facebook did to make Llama 3 more inference-efficient).
So if they expect models to become useful enough for almost everyone to want to use them, they should worry about compute, assuming they do want to serve people like they say they do (I am not sure how this looks for very strong AI systems; they will probably be gradually expanding access, and the speed of that expansion might vary).
However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
Why at most one of them can be meaningfully right?
Would not a simulation typically be “a multi-player game”?
(But yes, if they assume that their “original self” was the sole creator (?), then they would all be some kind of “clones” of that particular “original self”. Which would surely increase the overall weirdness.)
These are valid concerns! I presume that if “in the real timeline” there was a consortium of AGI CEOs who agreed to share costs on one run, and fiddled with their self-inserts, then they… would have coordinated more? (Or maybe they’re trying to settle a bet on how the Singularity might counterfactually have happened in the event of this or that person experiencing this or that coincidence? But in that case I don’t think the self-inserts would be allowed to say they’re self-inserts.)
Like why not re-roll the PRNG, to censor out the counterfactually simulable timelines that included me hearing from any of the REAL “self-inserts of the consortium of AGI CEOs” (and so I only hear from “metaphysically spurious” CEOs)??
Or maybe the game engine itself would have contacted me somehow to ask me to “stop sticking causal quines in their simulation” and somehow I would have been induced by such contact to not publish this?
Mostly I presume AGAINST “coordinated AGI CEO stuff in the real timeline” along any of these lines because, as a type, they often “don’t play well with others”. Fucking oligarchs… maaaaaan.
It seems like a pretty normal thing, to me, for a person to naturally keep track of simulation concerns as a philosophic possibility (it’s kinda basic “high school theology”, right?)… which might become one’s “one track reality narrative” as a sort of “stress induced psychotic break away from a properly metaphysically agnostic mental posture”?
That’s my current working psychological hypothesis, basically.
But to the degree that it happens more and more, I can’t entirely shake the feeling that my probability distribution over “the time T of a pivotal act occurring” (distinct from when I anticipate I’ll learn that it happened, which of course must be LATER than both T and later than now) shouldn’t just include times in the past, but should actually be a distribution over complex numbers or something...
...but I don’t even know how to do that math? At best I can sorta see how to fit it into exotic grammars where it “can have happened counterfactually” or so that it “will have counterfactually happened in a way that caused this factually possible recurrence” or whatever. Fucking “plausible SUBJECTIVE time travel”, fucking shit up. It is so annoying.
Like… maybe every damn crazy AGI CEO’s claims are all true except the ones that are mathematically false?
How the hell should I know? I haven’t seen any not-plausibly-deniable miracles yet. (And all of the miracle reports I’ve heard were things I was pretty sure the Amazing Randi could have duplicated.)
All of this is to say, Hume hasn’t fully betrayed me yet!
Mostly I’ll hold off on performing normal updates until I see for myself, and hold off on performing logical updates until (again!) I see a valid proof for myself <3
Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?
(I would guess a “no comment” or lack of response or something to that degree implies a “yes” with reasonably high probability. Also, you might be interested in this link about the U.S. labor board deciding that NDA’s offered during severance agreements that cover the existence of the NDA itself have been ruled unlawful by the National Labor Relations Board when deciding how to respond here)
Kelsey Piper now reports: “I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it.”
My layman’s understanding is that managerial employees are excluded from that ruling, unfortunately. Which I think applies to William_S if I read his comment correctly. (See Pg 11, in the “Excluded” section in the linked pdf in your link)
I think one key point that is missing is this: regardless of whether the NDA and the subsequent gag order is legitimate or not; William would still have to spend thousands of dollars on a court case to rescue his rights. This sort of strong-arm litigation has become very common in the modern era. It’s also just… very stressful. If you’ve just resigned from a company you probably used to love, you likely don’t want to fish all of your old friends, bosses and colleagues into a court case.
Edit: also, if William left for reasons involving AGI safety—maybe entering into (what would likely be a very public) court case would be counterproductive to his reason for leaving? You probably don’t want to alarm the public by couching existential threats in legal jargon. American judges have the annoying tendency to valorise themselves as celebrities when confronting AI (see Musk v OpenAI).
Are you familiar with USA NDA’s? I’m sure there are lots of clauses that have been ruled invalid by case law? In many cases, non-lawyers have no ideas about these, so you might be able to make a difference with very little effort. There is also the possibility that valuable OpenAI shares could be rescued?
If you haven’t seen it, check out this thread where one of the OpenAI leavers did not sign the gag order.
(1) Invalidity of the NDA does not guarantee William will be compensated after the trial. Even if he is, his job prospects may be hurt long-term.
(2) States have different laws on whether the NLRA trumps internal company memorandums. More importantly, labour disputes are traditionally solved through internal bargaining. Presumably, the collective bargaining ‘hand-off’ involving NDAs and gag orders at this level will waive subsequent litigation in district courts. The precedent Habryka offered refers to hostile severance agreements only, not the waiving of the dispute mechanism itself.
I honestly wish I could use this dialogue as a discrete communication to William on a way out, assuming he needs help, but I re-affirm my previous worries on the costs.
I also add here, rather cautiously, that there are solutions. However, it would depend on whether William was an independent contractor, how long he worked there, whether it actually involved a trade secret (as others have mentioned) and so on. The whole reason NDA’s tend to be so effective is because they obfuscate the material needed to even know or be aware of what remedies are available.
I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees when asked whether they signed a NDA which also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).
EDIT: Kelsey Piper has confirmed that there is an OA NDA with a gag order, and violation forfeits all equity—including fully vested equity. This implies that since you would assume Ilya Sutskever would have received many PPUs & would be holding them as much as possible, Sutskever might have had literally billions of dollars at stake based on how he quit and what he then, say, tweeted… (PPUs which can only be sold in the annual OA-controlled tender offer.)
It turns out there’s a very clear reason for that. I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it. If a departing employee declines to sign the document, or if they violate it, they can lose all vested equity they earned during their time at the company, which is likely worth millions of dollars....While nondisclosure agreements aren’t unusual in highly competitive Silicon Valley, putting an employee’s already-vested equity at risk for declining or violating one is. For workers at startups like OpenAI, equity is a vital form of compensation, one that can dwarf the salary they make. Threatening that potentially life-changing money is a very effective way to keep former employees quiet. (OpenAI did not respond to a request for comment.)
By “gag order” do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?
I have trouble understanding the absolute silence we seem to be having. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA’s press in the public sphere.
Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn’t add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, especially since I think a lot of us would offer support in that case, and the media wouldn’t paint OA in a good light for it.
I am confused. (And I am grateful to William for at least saying this much, given the climate!)
I would guess that there isn’t a clear smoking gun that people aren’t sharing because of NDAs, just a lot of more subtle problems that add up to leaving (and in some cases saying OpenAI isn’t being responsible etc).
This is consistent with the observation of the board firing Sam but not having a clear crossed line to point at for why they did it.
It’s usually easier to notice when the incentives are pointing somewhere bad than to explain what’s wrong with them, and it’s easier to notice when someone is being a bad actor than it is to articulate what they did wrong. (Both of these run a higher risk of false positives relative to more crisply articulatable problems.)
The lack of leaks could just mean that there’s nothing interesting to leak. Maybe William and others left OpenAI over run-of-the-mill office politics and there’s nothing exceptional going on related to AI.
Rest assured, there is plenty that could leak at OA… (And might were there not NDAs, which of course is much of the point of having them.)
For a past example, note that no one knew that Sam Altman had been fired as YC CEO for similar reasons as at OA, until the extreme aggravating factor of the OA coup 5 years later. That was certainly more than ‘run of the mill office politics’, I’m sure you’ll agree, but if that could be kept secret, surely lesser things now could be kept secret well past 2029?
At least one of them has explicitly indicated they left because of AI safety concerns, and this thread seems to be insinuating some concern—Ilya Sutskever’s conspicuous silence has become a meme, and Altman recently expressed that he is uncertain of Ilya’s employment status. There still hasn’t been any explanation for the boardroom drama last year.
If it was indeed run-of-the-mill office politics and all was well, then something to the effect of “our departures were unrelated, don’t be so anxious about the world ending, we didn’t see anything alarming at OpenAI” would obviously help a lot of people and also be a huge vote of confidence for OpenAI.
It seems more likely that there is some (vague?) concern but it’s been overridden by tremendous legal/financial/peer motivations.
Profit Participation Units (PPUs) represent a unique compensation method, distinct from traditional equity-based rewards. Unlike shares, stock options, or profit interests, PPUs don’t confer ownership of the company; instead, they offer a contractual right to participate in the company’s future profits.
Does anyone know if it’s typically the case that people under gag orders about their NDAs can talk to other people who they know signed the same NDAs? That is, if a bunch of people quit a company and all have signed self-silencing NDAs, are they normally allowed to talk to each other about why they quit and commiserate about the costs of their silence?
From my perspective, the only thing that keeps the OpenAI situation from being all kinds of terrible is that I continue to think they’re not close to human-level AGI, so it probably doesn’t matter all that much.
This is also my take on AI doom in general; my P(doom|AGI soon) is quite high (>50% for sure), but my P(AGI soon) is low. In fact it decreased in the last 12 months.
On many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with “wrong” reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL’s return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design. We demonstrate that this surprising robustness property is attributable to an interplay between the notion of pessimism in offline RL algorithms and certain implicit biases in common data collection practices. As we prove in this work, pessimism endows the agent with a “survival instinct”, i.e., an incentive to stay within the data support in the long term, while the limited and biased data coverage further constrains the set of survival policies...
Our empirical and theoretical results suggest a new paradigm for RL, whereby an agent is nudged to learn a desirable behavior with imperfect reward but purposely biased data coverage.
But I heard that some people found these results “too good to be true”, with some dismissing it instantly as wrong or mis-stated. I find this ironic, given that the paper was recently published in a top-tier AI conference. Yes, papers can sometimes be bad, but… seriously? You know the thing where lotsa folks used to refuse to engage with AI risk cuz it sounded too weird, without even hearing the arguments? … Yeaaah, absurdity bias.
Anyways, the paper itself is quite interesting. I haven’t gone through all of it yet, but I think I can give a good summary. The github.io is a nice (but nonspecific) summary.
Summary
It’s super important to remember that we aren’t talking about PPO. Boy howdy, we are in a different part of town when it comes to these “offline” RL algorithms (which train on a fixed dataset, rather than generating more of their own data “online”). ATAC, PSPI, what the heck are those algorithms? The important-seeming bits:
Many offline RL algorithms pessimistically initialize the value of unknown states
“Unknown” means: “Not visited in the offline state-action distribution”
Pessimistic means those are assigned a super huge negative value (this is a bit simplified)
Because future rewards are discounted, reaching an unknown state-action pair is bad if it happens soon and less bad if it happens farther in the future
So on an all-zero reward function, a model-based RL policy will learn to stay within the demonstrated state-action pairs for as long as possible (“length bias”)
In the case of the gridworld, this means staying on the longest demonstrated path, even if the red lava is rewarded and the yellow key is penalized.
In the case of Hopper, I’m not sure how they represented the states, but if they used non-tabular policies, this probably looks like “repeat the longest portion of demonstrated policies without falling over” (because falling over leads to the pessimistic penalty, and, due to length bias, most of the data looked like successful walking, so that kind of data is least likely to be penalized).
On a negated reward function (which e.g. penalizes the Hopper for staying upright and rewards for falling over), if falling over still leads to a terminal/unknown state-action, that leads to a huge negative penalty. So it’s optimal to keep hopping whenever
Reward(falling over) + γ ⋅ (Pessimism pen.) < (1 / (1 − γ)) ⋅ (Pen. for being upright)
For example, if the original per-timestep reward for staying upright was 1, and the original penalty for falling over was −1, then now the policy gets penalized for staying upright and rewarded for falling over! At γ=.9, it’s therefore optimal to stay upright whenever
1 + .9 ⋅ Pessimism < 10 ⋅ (−1) = −10
which holds whenever the pessimistic penalty is −11/.9 ≈ −12.2 or lower, i.e. at least about 12.2 in magnitude. That’s not too high, is it? (When I was in my graduate RL class, we’d initialize the penalties to −1000!)
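The dynamics above can be reproduced in a few lines of tabular code. This is a toy illustration under my own assumptions (a 5-state chain, one demonstrated action per state, one “off-support” action, a hand-picked pessimism value), not the paper’s actual algorithms like ATAC or PSPI:

```python
import numpy as np

# Pessimistic offline value iteration on a 5-state chain with all-zero
# rewards. Only the action "follow the demonstrated path" (a=1) appears
# in the dataset; any other action (a=0) leaves the data support and is
# assigned a pessimistic value.
GAMMA = 0.9
PESSIMISM = -1000.0   # value assigned to unknown state-action pairs
N = 5                 # states 0..4 were visited in the offline dataset

Q = np.zeros((N, 2))  # Q[s, a]; all true rewards are zero
for _ in range(200):  # value iteration to convergence
    V = Q.max(axis=1)
    for s in range(N):
        # a=0: immediately leaves the support -> discounted pessimism.
        Q[s, 0] = GAMMA * PESSIMISM
        # a=1: zero reward, move right; the end of the demonstrated
        # path also falls off-support.
        Q[s, 1] = GAMMA * (PESSIMISM if s + 1 >= N else V[s + 1])

policy = Q.argmax(axis=1)
print(policy)  # prefers the demonstrated action wherever it delays pessimism
```

Because discounting pushes the inevitable off-support penalty further into the future, the learned policy takes the demonstrated action at every state except the last one (where both actions are equally pessimistic): the “survival instinct” with zero reward everywhere.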
Significance
DPO, for example, is an offline RL algorithm. It’s plausible that frontier models will be trained using that algorithm. So, these results are more relevant if future DPO variants use pessimism and if the training data (e.g. example user/AI interactions) last for more turns when they’re actually helpful for the user.
While it may be tempting to dismiss these results as irrelevant because “length won’t perfectly correlate with goodness so there won’t be positive bias”, I think that would be a mistake. When analyzing the performance and alignment properties of an algorithm, I think it’s important to have a clear picture of all relevant pieces of the algorithm. The influence of length bias and the support of the offline dataset are additional available levers for aligning offline RL-trained policies.
To close on a familiar note, this is yet another example of how “reward” is not the only important quantity to track in an RL algorithm. I also think it’s a mistake to dismiss results like this instantly; this offers an opportunity to reflect on what views and intuitions led to the incorrect judgment.
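Since DPO comes up above as “an offline RL algorithm”, here is a minimal sketch of its loss for a single preference pair. The log-probabilities are made-up numbers standing in for outputs of the trained policy and the frozen reference model; beta is the usual KL-strength hyperparameter:

```python
import numpy as np

# Sketch of the DPO loss for one (chosen, rejected) preference pair.
# Each response's implicit reward is beta * log(pi(y|x) / pi_ref(y|x)).
def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log sigmoid(margin)

# Illustrative numbers: the policy already slightly favors the chosen
# response, so the loss falls below log(2), its value at zero margin.
loss = dpo_loss(-5.0, -7.0, -6.0, -6.5)
print(loss)
```

Note that nothing here samples from the model: the loss is computed entirely from a fixed preference dataset, which is what makes DPO “offline” in the sense discussed above.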
Don’t you mean future values? Also, AFAICT, the only thing going on here that separates online from offline RL is that offline RL algorithms shape the initial value function to give conservative behaviour. And so you get conservative behaviour.
One lesson you could take away from this is “pay attention to the data, not the process”—this happened because the data had longer successes than failures. If successes were more numerous than failures, many algorithms would have imitated those as well with null reward.
The paper sounds fine quality-wise to me, I just find it implausible that it’s relevant for important alignment work, since the proposed mechanism is mainly an aversion to building new capabilities.
@Jozdien sent me this paper, and I dismissed it with a cursory glance, thinking if they had to present their results using the “safe” shortening in the context they used that shortening, their results can’t be too much to sneeze at. Reading your summary, the results are minorly more impressive than I was imagining, but still in the same ballpark I think. I don’t think there’s much applicability to the safety of systems though? If I’m reading you right, you don’t get guarantees for situations for which the model is very out-of-distribution, but still behaving competently, since it hasn’t seen tabulation sequences there.
Where the results are applicable, they definitely seem like they give mixed, probably mostly negative signals. If (say) I have a stop button, and I reward my agent for shutting itself down if I press that stop button, don’t these results say that the agent won’t shut down, for the same reasons the hopper won’t fall over, even if my reward function has rewarded it for falling over in such situation? More generally, this seems to tell us in such situations we have marginally fewer degrees of freedom with which we can modify a model’s goals than we may have thought, since the stay-on-distribution aspect dominates over the reward aspect. On the other hand, “staying on distribution” is in a sense a property we do want! Is this sort of “staying on distribution” the same kind of “staying on distribution” as that used in quantilization? I don’t think so.
More generally, it seems like whether more or less sensitivity to reward, architecture, and data on the part of the functions neural networks learn is better or worse for alignment is an open problem.
Several dozen people now presumably have Lumina in their mouths. Can we not simply crowdsource some assays of their saliva? I would chip money in to this. Key questions around ethanol levels, aldehyde levels, antibacterial levels, and whether the organism itself stays colonized at useful levels.
Any recommendations on how I should do that? You may assume that I know what a gas chromatograph is and what a Petri dish is and why you might want to use either or both of those for data collection, but not that I have any idea of how to most cost-effectively access either one as some rando who doesn’t even have a MA in Chemistry.
Lumina is incredibly cheap right now. I pre-ordered for 250 USD. Even genuinely quite poor people I know don’t find the price off-putting (poor in the sense of absolutely poor for the country they live in). I have never met a single person who decided not to try Lumina because the price was high. If they pass, it’s always because they think it’s risky.
I think Romeo is thinking of checking a bunch of mediators of risk (like aldehyde levels) as well as of function (like whether the organism stays colonised).
Maybe I’m late to the conversation, but has anyone thought through what happens when Lumina colonizes the mouths of other people? Mouth bacteria are important for things like conversion of nitrate to nitrite for nitric oxide production. How do we know the lactic acid metabolism isn’t important, or that Lumina won’t outcompete other strains important for overall health?
1. We inhabit this real material world, the one which we perceive all around us (and which somehow gives rise to perceptive and self-conscious beings like us).
2. Though not all of our perceptions conform to a real material world. We may be fooled by things like illusions or hallucinations or dreams that mimic perceptions of this world but are actually all in our minds.
3. Indeed if you examine your perceptions closely, you’ll see that none of them actually give you representations of the material world, but merely reactions to it.
4. In fact, since the only evidence we have is of perceptions, the “material world” is more of a metaphysical hypothesis we use to explain patterns in our perceptions, not something we can vouch for as actually existing.
5. Since this hypothesis is untestable, it is best put aside when we consider what actually exists. The “material world” is not a thing, but a framework and vocabulary useful for discussing regularities in what is really real.
6. What is really “real”—what the word “real” means—is our moment to moment perceptions and interpretations, which appear to us in the form of a material world that we inhabit.
In (3), the word “merely” is doing a lot of unexamined work. My perceptions have a coherence to them, an obstinate coherence that I cannot wish away. I reach out for the coffee cup that I see, and it shows up to my sense of touch. What does it mean to call this a “mere” response, when it maintains a persistent similarity of structure to my idea of what is out there—that is, it is a representation of it.
In (4), if the hypothesis explains the perceptions, the perceptions are evidence for the hypothesis. These are two different ways of saying the same thing.
A hypothesis that explains the perceptions can be a just-so story. For any set of perceptions ζ, there may be a vast number of hypotheses that explain those perceptions. How do you choose among them?
In other words, if f() and g() both explain ζ equally well, but are incompatible in all sorts of other ways for which you do not have perceptions to distinguish them, ζ may be “evidence for the hypothesis” f and ζ may be “evidence for the hypothesis” g, but ζ offers no help in determining whether f or g is truer. Consider e.g. f is idealism, g is realism, or some other incompatible metaphysical positions that start with our perceptions and speculate from there.
An author I read recently compared this obstinate coherence of our perceptions to a GUI. When I move my mouse pointer to a file, click, and drag that file into another folder, I’m doing something that has predictable results, and that is similar to other actions I’ve performed in the past, and that plays nicely with my intuitions about objects and motion and so forth. But it would be a mistake for me to then extrapolate from this and assume that somewhere on my hard drive or in my computer memory is a “file” which I have “dragged” “into” a “folder”. My perceptions via the interface may have consistency and practical utility, but they are not themselves a reliable guide to the actual state of the world.
Obstinate coherence and persistent similarity of structure are intriguing but they are limited in how much they can explain by themselves.
Dragging files around in a GUI is a familiar action that does known things with known consequences. Somewhere on the hard disc (or SSD, or somewhere in the cloud, etc.) there is indeed a “file” which has indeed been “moved” into a “folder”, and taking off those quotation marks only requires some background knowledge (which in fact I have) of the lower-level things that are going on and which the GUI presents to me through this visual metaphor.
Some explanations work better than others. The idea that there is stuff out there that gives rise to my perceptions, and which I can act on with predictable results, seems to me the obvious explanation that any other contender will have to do a great deal of work to topple from the plinth. The various philosophical arguments over doctrines such as “idealism”, “realism”, and so on are more like a musical recreation (see my other comment) than anything to take seriously as a search for truth. They are hardly the sort of thing that can be right or wrong, and to the extent that they are, they are all wrong.
It sounds like you want to say things like “coherence and persistent similarity of structure in perceptions demonstrates that perceptions are representations of things external to the perceptions themselves” or “the idea that there is stuff out there seems the obvious explanation” or “explanations that work better than others are the best alternatives in the search for truth” and yet you also want to say “pish, philosophy is rubbish; I don’t need to defend an opinion about realism or idealism or any of that nonsense”. In fact what you’re doing isn’t some alternative to philosophy, but a variety of it.
Some philosophy is rubbish. Quite a lot, I believe. And with a statement such as “perceptions are caused by things external to the perceptions themselves”, which I find unremarkable in itself as a prima facie obvious hypothesis to run with, there is a tendency for philosophers to go off the rails immediately by inventing precise definitions of words such as “perceptions”, “are”, and “caused”, and elaborating all manner of quibbles and paradoxes. Hence the whole tedious catalogue of realisms.
Science did not get anywhere by speculating on whether there are four or five elements and arguing about their natures.
There’s a soft patch around 5 and 6. Why is testability important? It’s a characteristic of science, but science assumes an external world. It’s not a characteristic of philosophy—good explanation is enough in philosophy, and the general posit of some sort of external world does explanatory work. And it’s separate from the specific posit that the external world is knowable in some particular way.
It’s a characteristic of philosophy, too, at least according to the positivists. If you’re humoring a metaphysical theory that could not even in theory be confirmed or falsified by some possible observation, they suggest that you’re really engaging in mythmaking or poetry or something, not philosophy.
A lot of philosophy is like that. Or perhaps it is better compared to music. Music sounds meaningful, but no-one has explained what it means. Even so, much philosophy sounds meaningful, consisting of grammatical sentences with a sense of coherence, but actually meaning nothing. This is why there is no progress in philosophy, any more than there is in music. New forms can be invented and other forms can go out of fashion, but the only development is the ever-greater sprawl of the forest.
“Before a man studies Zen, to him mountains are mountains and waters are waters; after he gets an insight into the truth of Zen through the instruction of a good master, mountains to him are not mountains and waters are not waters; but after this when he really attains to the abode of rest, mountains are once more mountains and waters are waters.”
(D. T. Suzuki, Essays in Zen Buddhism, First Series, 1926, London; New York: Published for the Buddhist Society, London by Rider, p. 24.)
What loop? They are all various viewpoints on the nature of reality, not steps you have to go through in some order or anything. (1) is a more useful viewpoint than the rest, and you can adopt that one for 99%+ of everything you think about and only care about the rest as basically ideas to toy with rather than live by.
I don’t know about you (assuming you even exist in any sense other than my perception of words on a screen), but to me a model that an external reality exists beyond what I can perceive is amazingly useful for essentially everything. Even if it might not be actually true, it explains my perceptions to a degree that seems incredible if it were not even partly true. Even most of the apparent exceptions in (2) are well explained by it once your physical model includes much of how perception works.
So while (4) holds, it’s to such a powerful degree that (2) to (6) are essentially identical to (1).
On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
each month of lead that the leader started out with would correspond to 15,000 human researchers working for 15 months.
Despite this promise, we don’t seem to have much knowledge of when such automated AI safety R&D might happen, in which order the relevant capabilities might appear (including compared to various dangerous capabilities), etc. AFAICT, we don’t seem to be trying very hard either, neither at prediction nor at elicitation of such capabilities.
One potential explanation I heard for this apparent missing mood of FOMO on automated AI safety R&D is the idea / worry that the relevant automated AI safety R&D capabilities would only appear when models are already dangerous, or too close to being dangerous. This seems plausible to me, but it would still seem like a strange way to proceed given the uncertainty.
For contrast, consider how one might go about evaluating dangerous capabilities and trying to anticipate them ahead of time. From GDM’s recent Introducing the Frontier Safety Framework:
The Framework has three key components:
Identifying capabilities a model may have with potential for severe harm. To do this, we research the paths through which a model could cause severe harm in high-risk domains, and then determine the minimal level of capabilities a model must have to play a role in causing such harm. We call these “Critical Capability Levels” (CCLs), and they guide our evaluation and mitigation approach.
Evaluating our frontier models periodically to detect when they reach these Critical Capability Levels. To do this, we will develop suites of model evaluations, called “early warning evaluations,” that will alert us when a model is approaching a CCL, and run them frequently enough that we have notice before that threshold is reached.
Applying a mitigation plan when a model passes our early warning evaluations. This should take into account the overall balance of benefits and risks, and the intended deployment contexts. These mitigations will focus primarily on security (preventing the exfiltration of models) and deployment (preventing misuse of critical capabilities).
I see no reason why, in principle, a similar high-level approach couldn’t be analogously taken for automated AI safety R&D capabilities, yet almost nobody seems to be working on this, at least publicly (in fact, I am currently working on related topics; I’m still very surprised by the overall neglectedness, made even more salient by current events).
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals).
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
This seems good w.r.t. automated AI safety potentially ‘piggybacking’, but bad for differential progress.
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
Sure, though wouldn’t this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time?
mention seem to me like they could be very important to deploy at scale ASAP
Why think this is important to measure or that this already isn’t happening?
E.g., on the current model organism related project I’m working on, I automate inspecting reasoning traces in various ways. But I don’t feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn’t more important than other tips for doing LLM research better).
Intuitively, I’m thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this ‘race’, corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence).
W.r.t. measurement, I think it would be good regardless of whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem good even if automated ML R&D is already happening. In particular, information about how successful auto AI safety R&D would be (and e.g. what the scaling curves look like vs. those for DCs) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:
Machine Learning R&D level 1: Could significantly accelerate AI research at a cutting-edge lab if deployed widely, e.g. improving the pace of algorithmic progress by 3X, or comparably accelerate other AI research groups.
Machine Learning R&D level 2: Could fully automate the AI R&D pipeline at a fraction of human labor costs, potentially enabling hyperbolic growth in AI capabilities.
I think that suddenly starting to using written media (even journals), in an environment without much or any guidance, is like pressing too hard on the gas; you’re gaining incredible power and going from zero to one on things faster than you ever have before.
Depending on their environment and what they’re interested in starting out, some people might learn (or be shown) how to steer quickly, whereas others might accumulate/scaffold really lopsided optimization power and crash and burn (e.g. getting involved in tons of stuff at once that upon reflection was way too much for someone just starting out).
This seems incredibly interesting to me. Googling “White-boarding techniques” only gives me results about digitally shared idea spaces. Is this what you’re referring to?
I’d love to hear more on this topic.
The release had a very broad definition of the company (including officers, directors, shareholders, etc.), but a fairly reasonable scope of the claims I was releasing. So far, so good. But then it included a general non-disparagement provision, which basically said I couldn’t say anything bad about the company, which, by itself, is also fairly typical and reasonable.
Given the way the contract is worded it might be worth checking whether executing your own “general release” (without a non-disparagement agreement in it) would be sufficient, but I’m not a lawyer and maybe you need the counterparty to agree to it for it to count.
And as a matter of industry practice, this is of course an extremely non-standard requirement for retaining vested equity (or equity-like instruments), whereas it’s pretty common when receiving an additional severance package. (Though even in those cases I haven’t heard of any such non-disparagement agreement that was itself covered by a non-disclosure agreement… but would I have?)
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge/familiarity of the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncrasies of your preference data. There is a reason few labs have done RLHF successfully.
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Over the past few months my team has been sailing against the wind. Sometimes we were struggling for compute and it was getting harder and harder to get this crucial research done.
While this worked well, the model failed on even a slightly more complicated problem. One Twitter user suggested just adding a random ‘iPhone 15’ to the book text and then asking the model if there is anything in the book that seems out of place. And the model failed to locate it.
The same was the case when the model was asked to summarize a 30-minute Mr. Beast video (over 300k tokens). It generated the summary but many people who had watched the video pointed out that the summary was mostly incorrect.
So while on paper this looked like a huge leap forward for Google, it seems that in practice it’s not performing as well as they might have hoped.
But is this due to limitations of RLHF training, or something else?
I have access to Gemini 1.5 Pro. Willing to run experiments if you provide me with an exact experiment to run, plus cover what they charge me (I’m assuming it’s paid, I haven’t used it yet).
RLHF with humans might also soon get obsoleted by things like online DPO, where another chatbot produces preference data for on-policy responses of the tuned model and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even the weak-to-strong setting won’t necessarily be important: tuning of a stronger model could get bootstrapped by a weaker model (where currently SFT from an obviously off-policy instruct dataset seems to suffice), and then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions (a “constitution”). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, plus possibly some feedback on them from spot-checking what’s going on as a result of choosing particular instructions.
My guess is that we’re currently effectively depending on generalization. So “Good” from your decomposition. (Though I think depending on generalization will produce big issues if the model is scheming, so I would prefer avoiding this.)
since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch
It’s plausible to me that after doing a bunch of RLHF on short contexts, RLHF on long contexts is extremely sample efficient (when well tuned), such that only (e.g.) 1,000s of samples suffice[1]. If you have a $2,000,000 budget for long-context RLHF and need only 1,000 samples, you can spend $2,000 per sample. This gets you perhaps (e.g.) 10 hours of an experienced software engineer’s time, which might suffice for good long-context supervision without necessarily needing any fancy scalable oversight approaches. (That said, people will probably use another LLM by default when trying to determine the reward if they’re spending this long: recursive reward modeling seems almost certain by default if we’re assuming that people spend this much time labeling.)
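Spelling out the budget arithmetic in that comment (the dollar figures are the comment’s hypotheticals; the ~$200/hour engineer rate is my assumption, backed out from “$2,000 per sample” buying “10 hours”):

```python
budget = 2_000_000    # hypothetical long-context RLHF labeling budget ($)
n_samples = 1_000     # samples assumed to suffice after short-context RLHF
per_sample = budget / n_samples          # dollars available per label

engineer_rate = 200   # assumed hourly rate of an experienced engineer ($)
hours_per_label = per_sample / engineer_rate

print(per_sample, hours_per_label)  # 2000.0 dollars, 10.0 hours per label
```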
That said, I doubt that anyone has actually started doing extremely high effort data labeling like this, though plausibly they should...
From a previous comment: [...] This seems to be evidence that RLHF does not tend to generalize well out-of-distribution
It’s some evidence, but exploiting a reward model seems somewhat orthogonal to generalization out of distribution: exploitation is heavily selected for.
(Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.)
I think experiments on sample efficiency of RLHF when generalizing to a new domain could be very important and are surprisingly underdone from my perspective (at least I’m not aware of interesting results). Even more important is sample efficiency in cases where you have a massive number of weak labels, but a limited number of high quality labels. It seems plausible to me that the final RLHF approach used will look like training the reward model on a combination of 100,000s of weak labels and just 1,000 very high quality labels. (E.g. train a head on the weak labels and then train another head to predict the difference between the weak label and the strong label.) In this case, we could spend a huge amount of time on each label. E.g., with 100 skilled employees we could spend 5 days on each label and still be done in 50 days, which isn’t too bad of a delay. (If we’re fine with these labels trickling in for online training, the delay could be even smaller.)
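The parenthetical two-head scheme can be sketched as a toy (pure Python; the linear “heads”, the labeler bias, and the sample counts are all made-up stand-ins for illustration, not anyone’s actual setup):

```python
import random

random.seed(0)

# Toy setup: each example has a scalar feature x; the true (strong) label
# is 2*x, while the cheap weak labeler is systematically biased by +0.5.
def strong_label(x): return 2.0 * x
def weak_label(x):   return 2.0 * x + 0.5

# A "head" here is a scalar least-squares fit -- a stand-in for a
# reward-model head; the two-head structure is the point, not the model.
def fit_linear(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var = sum((x - mx) ** 2 for x, _ in pairs)
    w = cov / var
    return w, my - w * mx

# Head 1: trained on many cheap weak labels.
weak_data = [(x, weak_label(x)) for x in (random.uniform(0, 1) for _ in range(1000))]
w1, b1 = fit_linear(weak_data)

# Head 2: trained on only a handful of expensive strong labels, predicting
# the *difference* between the strong label and head 1's output.
corr_data = [(x, strong_label(x) - (w1 * x + b1))
             for x in (random.uniform(0, 1) for _ in range(10))]
w2, b2 = fit_linear(corr_data)

def combined_reward(x):
    return (w1 * x + b1) + (w2 * x + b2)

# The correction head removes the weak labeler's bias.
print(abs(combined_reward(0.3) - strong_label(0.3)))  # close to 0
```

In this toy the weak labeler’s error is a constant offset, so ten strong labels recover it exactly; the hope in the comment is that the realistic analogue (a low-complexity correction on top of abundant weak supervision) is similarly cheap to learn.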
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”?
I’m skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn’t what the comment appears to be saying.
Note that the reward model is of the same scale as the base model, so the relative scale should be the same.
This also contradicts results from an earlier paper by Leo Gao. I think this paper is considerably more reliable than the comment overall, so I’m inclined to believe the paper or think that I’m misunderstanding the comment.
Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well tuned hyperparameters) and I think I’ve heard various things that confirm this.
Our current big stupid: not preparing for 40% agreement
Epistemic status: lukewarm take from the gut (not brain) that feels rightish
The “Big Stupid” of the AI doomers 2013-2023 was that AI nerds’ solution to the problem “How do we stop people from building dangerous AIs?” was “research how to build AIs”. Methods normal people would consider to stop people from building dangerous AIs, like asking governments to make it illegal to build dangerous AIs, were considered gauche. When the public turned out to be somewhat receptive to the idea of regulating AIs, doomers were unprepared.
Take: The “Big Stupid” of right now is still the same thing. (We’ve not corrected enough). Between now and transformative AGI we are likely to encounter a moment where 40% of people realize AIs really could take over (say if every month another 1% of the population loses their job). If 40% of the world were as scared of AI loss-of-control as you, what could the world do? I think a lot! Do we have a plan for then?
Almost every LessWrong post on AIs is about analyzing AIs. Almost none are about how, given widespread public support, people/governments could stop bad AIs from being built.
[Example: if 40% of people were as worried about AI as I am, the US would treat GPU manufacture like uranium enrichment. And fortunately GPU manufacture is hundreds of times harder than uranium enrichment! We should be nerding out researching integrated circuit supply chains, choke points, foundry logistics in jurisdictions the US can’t unilaterally sanction, that sort of thing.]
TLDR, stopping deadly AIs from being built needs less research on AIs and more research on how to stop AIs from being built.
*My research included 😬
Not quite. It was to research how to build friendly AIs. We haven’t succeeded yet. What research progress we have made points to the problem being harder than initially thought, and capabilities turned out to be easier than most of us expected as well.
Considered by whom? Rationalists? The public? The public would not have been so supportive before ChatGPT, because most everybody didn’t expect general AI so soon, if they thought about the topic at all. It wasn’t an option at the time. Talking about this at all was weird, or at least niche, certainly not something one could reasonably expect politicians to care about. That has changed, but only recently.
I don’t particularly disagree with your prescription in the short term, just your history. That said, politics isn’t exactly our strong suit.
But even if we get a pause, this only buys us some time. In the long(er) term, I think either the Singularity or some kind of existential catastrophe is inevitable. Those are the attractor states. Our current economic growth isn’t sustainable without technological progress to go with it. Without that, we’re looking at civilizational collapse. But with that, we’re looking at ever widening blast radii for accidents or misuse of more and more powerful technology. Either we get smarter about managing our collective problems, or they will eventually kill us. Friendly AI looked like the way to do that. If we solve that one problem, even without world cooperation, it solves all the others for us. It’s probably not the only way, but it’s not clear the alternatives are any easier. What would you suggest?
I can think of three alternatives.
First, the most mundane (but perhaps most difficult), would be an adequate world government. This would be an institution that could easily solve climate change, ban nuclear weapons (and wars in general), etc. Even modern stable democracies are mostly not competent enough. Autocracies are an obstacle, and some of them have nukes. We are not on track to get this any time soon, and much of the world is not on board with it, but I think progress in the area of good governance and institution building is worthwhile. Charter cities are among the things I see discussed here.
Second might be intelligence enhancement through brain-computer interfaces. Neuralink exists, but it’s early days. So far, it’s relatively low bandwidth. Probably enough to restore some sight to the blind and some action to the paralyzed, but not enough to make us any smarter. It might take AI assistance to get to that point any time soon, but current AIs are not able, and future ones will be even more of a risk. This would certainly be of interest to us.
Third would be intelligence enhancement through biotech/eugenics. I think this looks like encouraging the smartest to reproduce more rather than the misguided and inhumane attempts of the past to remove the deplorables from the gene pool. Biotech can speed this up with genetic screening and embryo selection. This seems like the approach most likely to actually work (short of actually solving alignment), but this would still take a generation or two at best. I don’t think we can sustain a pause that long. Any enforcement regime would have too many holes to work indefinitely, and civilization is still in danger for the other reasons. Biological enhancement is also something I see discussed on LessWrong.
Strong agree and strong upvote.
There are some efforts in the governance space and in the space of public awareness, but there should and can be much, much more.
My read of these survey results is:
AI Alignment researchers are optimistic people by nature. Despite this, most of them don’t think we’re on track to solve alignment in time, and they are split on whether we will even make significant progress. Most of them also support pausing AI development to give alignment research time to catch up.
As for what to actually do about it: There are a lot of options, but I want to highlight PauseAI. (Disclosure: I volunteer with them. My involvement brings me no monetary benefit, and no net social benefit.) Their Discord server is highly active and engaged and is peopled with alignment researchers, community- and mass-movement organizers, experienced protesters, artists, developers, and a swath of regular people from around the world. They play the inside and outside game, both doing public outreach and also lobbying policymakers.
On that note, I also want to put a spotlight on the simple action of sending emails to policymakers. Doing so and following through is extremely OP (i.e. has much more utility than you might expect), and can result in face-to-face meetings to discuss the nature of AI x-risk and what they can personally do about it. Genuinely, my model of a world in 2040 that contains humans is almost always one in which a lot more people sent emails to politicians.
I promise I won’t just continue to re-post a bunch of papers, but this one seems relevant to many around these parts. In particular @Elizabeth (also, sorry if you dislike being at-ed like that).
Associations of dietary patterns with brain health from behavioral, neuroimaging, biochemical and genetic analyses
h/t Hal Herzog via Tyler Cowen
Very Spicy Take
Epistemic Note:
Many highly respected community members with substantially greater decision making experience (and Lesswrong karma) presumably disagree strongly with my conclusion.
Premise 1:
It is becoming increasingly clear that OpenAI is not appropriately prioritizing safety over advancing capabilities research.
Premise 2:
This was the default outcome.
Instances in history in which private companies (or any individual humans) have intentionally turned down huge profits and power are the exception, not the rule.
Premise 3:
Without repercussions for terrible decisions, decision makers have no skin in the game.
Conclusion:
Anyone and everyone involved with Open Phil recommending a grant of $30 million be given to OpenAI in 2017 shouldn’t be allowed anywhere near AI Safety decision making in the future.
To go one step further, potentially any and every major decision they have played a part in needs to be reevaluated by objective third parties.
This must include Holden Karnofsky and Paul Christiano, both of whom were closely involved.
To quote OpenPhil:
> “OpenAI researchers Dario Amodei and Paul Christiano are both technical advisors to Open Philanthropy and live in the same house as Holden. In addition, Holden is engaged to Dario’s sister Daniela.”
I like a lot of this post, but the sentence above seems very out of touch to me. Who are these third parties who are completely objective? Why is objective the adjective here, instead of “good judgement” or “predicted this problem at the time”?
I downvoted this comment because it felt uncomfortably scapegoat-y to me. If you think the OpenAI grant was a big mistake, it’s important to have a detailed investigation of what went wrong, and that sort of detailed investigation is most likely to succeed if you have cooperation from people who are involved. I’ve been reading a fair amount about what it takes to instill a culture of safety in an organization, and nothing I’ve seen suggests that scapegoating is a good approach.
https://sre.google/sre-book/postmortem-culture/
If you start with the assumption that there was a moral failing on the part of the grantmakers, and you are wrong, there’s a good chance you’ll never learn that.
Enforcing social norms to prevent scapegoating also destroys information that is valuable for accurate credit assignment and causally modelling reality.
I think you are misinterpreting the grandparent comment. I do not read any mention of a ‘moral failing’ in that comment. You seem worried because of the commenter’s clear description of what they think would be a sensible step for us to take given what they believe are egregious flaws in the decision-making processes of the people involved. I don’t think there’s anything wrong with such claims.
Again: You can care about people while also seeing their flaws and noticing how they are hurting you and others you care about. You can be empathetic to people having flawed decision making and care about them, while also wanting to keep them away from certain decision-making positions.
Oh, interesting. Who exactly do you think influential people like Holden Karnofsky and Paul Christiano are accountable to? This “detailed investigation” you speak of, and this notion of a “blameless culture”, makes a lot of sense when you are the head of an organization and you are conducting an investigation as to the systematic mistakes made by people who work for you, and who you are responsible for. I don’t think this situation is similar enough that you can use these intuitions blindly without thinking through the actual causal factors involved in this situation.
Note that I don’t necessarily endorse the grandparent comment claims. This is a complex situation and I’d spend more time analyzing it and what occurred.
I read the Ben Hoffman post you linked. I’m not finding it very clear, but the gist seems to be something like: Statements about others often import some sort of good/bad moral valence; trying to avoid this valence can decrease the accuracy of your statements.
If OP was optimizing purely for descriptive accuracy, disregarding everyone’s feelings, that would be one thing. But the discussion of “repercussions” before there’s been an investigation goes into pure-scapegoating territory if you ask me.
If OP wants to clarify that he doesn’t think there was a moral failing, I expect that to be helpful for a post-mortem. I expect some other people besides me also saw that subtext, even if it’s not explicit.
“Keep people away” sounds like moral talk to me. If you think someone’s decisionmaking is actively bad, i.e. you’d be better off reversing any advice from them, then maybe you should keep them around so you can do that! But more realistically, someone who’s fucked up in a big way will probably have learned from that, and functional cultures don’t throw away hard-won knowledge.
Imagine a world where AI is just an inherently treacherous domain, and we throw out the leadership whenever they make a mistake. So we get a continuous churn of inexperienced leaders in an inherently treacherous domain—doesn’t sound like a recipe for success!
I agree that changes things. I’d be much more sympathetic to the OP if they were demanding an investigation or an apology.
Just to be clear, OP themselves seem to think that what they are saying will have little effect on the status quo. They literally called it “Very Spicy Take”. Their intention was to allow them to express how they felt about the situation. I’m not sure why you find this threatening, because again, the people they think ideally wouldn’t continue to have influence over AI safety related decisions are incredibly influential and will very likely continue to have the influence they currently possess. Almost everyone else in this thread implicitly models this fact as they are discussing things related to the OP comment.
There is not going to be any scapegoating that will occur. I imagine that everything I say is something I would say in person to the people involved, or to third parties, and not expect any sort of coordinated action to reduce their influence—they are that irreplaceable to the community and to the ecosystem.
So basically, I think it is a bad idea and you think we can’t do it anyway. In that case let’s stop calling for it, and call for something more compassionate and realistic like a public apology.
I’ll bet an apology would be a more effective way to pressure OpenAI to clean up its act anyways. Which is a better headline—“OpenAI cofounder apologizes for their role in creating OpenAI”, or some sort of internal EA movement drama? If we can generate a steady stream of negative headlines about OpenAI, there’s a chance that Sam is declared too much of a PR and regulatory liability. I don’t think it’s a particularly good plan, but I haven’t heard a better one.
Can you not be close friends with someone while also expecting them to be bad at self-control when it comes to alcohol? Or perhaps they are great at technical stuff like research but pretty bad at negotiation, especially when dealing with experienced adversarial situations such as when talking to VCs?
It is not that people’s decision-making is optimized such that you can consistently reverse their opinions to get something that accurately tracks reality. If that were the case, they would already be implicitly tracking reality very well. Reversed stupidity is not intelligence.
Again you seem to not be trying to track the context of our discussion here. This advice again is usually said when it comes to junior people embedded in an institution, because the ability to blame someone and / or hold them responsible is a power that senior / executive people hold. This attitude you describe makes a lot of sense when it comes to people who are learning things, yes. I don’t know if you can plainly bring it into this domain, and you even acknowledge this in the next few lines.
I think it is incredibly unlikely that the rationalist community has an ability to ‘throw out’ the ‘leadership’ involved here. I find this notion incredibly silly, given the amount of influence OpenPhil has over the alignment community, especially through their funding (including the pipeline, such as MATS).
Sure, I think this helps tease out the moral valence point I was trying to make. “Don’t allow them near” implies their advice is actively harmful, which in turn suggests that reversing it could be a good idea. But as you say, this is implausible. A more plausible statement is that their advice is basically noise—you shouldn’t pay too much attention to it. I expect OP would’ve said something like that if they were focused on descriptive accuracy rather than scapegoating.
Another way to illuminate the moral dimension of this conversation: If we’re talking about poor decision-making, perhaps MIRI and FHI should also be discussed? They did a lot to create interest in AGI, and MIRI failed to create good alignment researchers by its own lights. Now after doing advocacy off and on for years, and creating this situation, they’re pivoting to 100% advocacy.
Could MIRI be made up of good people who are “great at technical stuff”, yet apt to shoot themselves in the foot when it comes to communicating with the public? It’s hard for me to imagine an upvoted post on this forum saying “MIRI shouldn’t be allowed anywhere near AI safety communications”.
I mostly agree with premises 1, 2, and 3, but I don’t see how the conclusion follows.
It is possible for things to be hard to influence and yet still worth it to try to influence them.
(Note that the $30 million grant was not an endorsement and was instead a partnership (e.g. it came with a board seat), see Buck’s comment.)
(Ex-post, I think this endeavour was probably net negative, though I’m pretty unsure and ex-ante I currently think it seems great.)
Why focus on the $30 million grant?
What about large numbers of people working at OpenAI directly on capabilities for many years? (Which is surely worth far more than $30 million.)
Separately, this grant seems to have been done to influence the governance at OpenAI, not to make OpenAI go faster. (Directly working on capabilities seems modestly more accelerating and risky than granting money in exchange for a partnership.)
(ETA: TBC, there is a relationship between the grant and people working at OpenAI on capabilities: the grant was associated with a general vague endorsement of trying to play inside game at OpenAI.)
From that page:
> We expect the primary benefits of this grant to stem from our partnership with OpenAI, rather than simply from contributing funding toward OpenAI’s work. While we would also expect general support for OpenAI to be likely beneficial on its own, the case for this grant hinges on the benefits we anticipate from our partnership, particularly the opportunity to help play a role in OpenAI’s approach to safety and governance issues.
So the case for the grant wasn’t “we think it’s good to make OAI go faster/better”.
Why do you think the grant was bad? E.g. I don’t think “OAI is bad” would suffice to establish that the grant was bad.
On a meta note, IF proposition 2 is true, THEN the best way to tell this would be if people had been saying so AT THE TIME. If instead, actually everyone at the time disagreed with proposition 2, then it’s not clear that there’s someone “we” know to hand over decision making power to instead. Personally, I was pretty new to the area, and as a Yudkowskyite I’d probably have reflexively decried giving money to any sort of non-X-risk-pilled non-alignment-differential capabilities research. But more to the point, as a newcomer, I wouldn’t have tried hard to have independent opinions about stuff that wasn’t in my technical focus area, or to express those opinions with much conviction, maybe because it seemed like Many Highly Respected Community Members With Substantially Greater Decision Making Experience would know far better, and would not have the time or the non-status to let me in on the secret subtle reasons for doing counterintuitive things. Now I think everyone’s dumb and everyone should say their opinions a lot so that later they can say that they’ve been saying this all along. I’ve become extremely disagreeable in the last few years, I’m still not disagreeable enough, and approximately no one I know personally is disagreeable enough.
Did OpenAI have the for-profit element at that time?
No. E.g. see here
A serious effective altruism movement would clean house. Everyone who pushed the ‘work with AI capabilities company’ line should retire or be forced to retire. There is no need to blame anyone for mistakes; the decision makers had reasons. But they chose wrong and should not continue to be leaders.
Do you think that whenever anyone makes a decision that ends up being bad ex-post they should be forced to retire?
Doesn’t this strongly disincentivize making positive EV bets which are likely to fail?
Edit: I interpreted this comment as a generic claim about how the EA community should relate to things which went poorly ex-post, I now think this comment was intended to be less generic.
Not OP, but I take the claim to be “endorsing getting into bed with companies on-track to make billions of dollars profiting from risking the extinction of humanity in order to nudge them a bit, is in retrospect an obviously doomed strategy, and yet many self-identified effective altruists trusted their leadership to have secret good reasons for doing so and followed them in supporting the companies (e.g. working there for years including in capabilities roles and also helping advertise the company jobs). now that a new consensus is forming that it indeed was obviously a bad strategy, it is also time to have evaluated the leadership’s decision as bad at the time of making the decision and impose costs on them accordingly, including loss of respect and power”.
So no, not disincentivizing making positive EV bets, but updating about the quality of decision-making that has happened in the past.
Wasn’t OpenAI a nonprofit at the time?
I interpreted the comment as being more general than this. (As in, if someone does something that works out very badly, they should be forced to resign.)
Upon rereading the comment, it reads as less generic than my original interpretation. I’m not sure if I just misread the comment or if it was edited. (Would be nice to see the original version if actually edited.)
(Edit: Also, you shouldn’t interpret my comment as an endorsement or agreement with the rest of the content of Ben’s comment.)
Wasn’t edited, based on my memory.
Leadership is supposed to be about service not personal gain.
I don’t see how this is relevant to my comment.
By “positive EV bets” I meant positive EV with respect to shared values, not with respect to personal gain.
Edit: Maybe your view is that leaders should take these bets anyway even though they know they are likely to result in a forced retirement. (E.g. ignoring the disincentive.) I was actually thinking of the disincentive effect as: you are actually a good leader, so you remaining in power would be good, therefore you should avoid actions that result in you losing power for unjustified reasons. Therefore you should avoid making positive EV bets (as making these bets is now overall negative EV, since they will result in a forced leadership transition, which is bad). More minimally, you strongly select for leaders who don’t make such bets.
“ETA” commonly is short for “estimated time of arrival”. I understand you are using it to mean “edited” but I don’t quite know what it is short for, and also it seems like using this is just confusing for people in general.
ETA = edit time addition
I should probably not use this term, I think I picked up this habit from some other people on LW.
Oh, weird. I always thought “ETA” means “Edited To Add”.
The Internet seems to agree with you. I wonder why I remember “edit time addition”.
I didn’t know it meant either.
I’d like to see people who are more informed than I am have a conversation about this. Maybe at Less.online?
https://www.lesswrong.com/posts/zAqqeXcau9y2yiJdi/can-we-build-a-better-public-doublecrux
I would be happy to defend roughly the position above (I don’t agree with all of it, but agree with roughly something like “the strategy of trying to play the inside game at labs was really bad, failed in predictable ways, and has deeply eroded trust in community leadership due to the adversarial dynamics present in such a strategy and many people involved should be let go”).
I do think most people who disagree with me here are under substantial confidentiality obligations and de-facto non-disparagement obligations (such as really not wanting to imply anything bad about Anthropic or wanting to maintain a cultivated image for policy purposes) so that it will be hard to find a good public debate partner, but it isn’t impossible.
Are you just referring to the profit incentive conflicting with the need for safety, or something else?
I’m struggling to see how we get aligned AI without “inside game at labs” in some way, shape, or form.
My sense is that evaporative cooling is the biggest thing which went wrong at OpenAI. So I feel OK about e.g. Anthropic if it’s not showing signs of evaporative cooling.
If the strategy failed in predictable ways, shouldn’t we expect there to be “pre-registered” predictions that it would fail?
I have indeed been publicly advocating against the inside game strategy at labs for many years (going all the way back to 2018), predicting it would fail due to incentive issues and have large negative externalities due to conflict of interest issues. I could dig up my comments, but I am confident almost anyone who I’ve interfaced with at the labs, or who I’ve talked to about any adjacent topic in leadership would be happy to confirm.
For me, the key question in situations when leaders made a decision with really bad consequences is, “How did they engage with criticism and opposing views?”
If they did well on this front, then I don’t think it’s at all mandatory to push for leadership changes (though certainly, the worse someone’s track record gets, the more that speaks against them).
By contrast, if leaders tried to make the opposition look stupid or if they otherwise used their influence to dampen the reach of opposing views, then being wrong later is unacceptable.
Basically, I want to allow for a situation where someone was like, “this is a tough call and I can see reasons why others wouldn’t agree with me, but I think we should do this,” and then ends up being wrong, but I don’t want to allow situations where someone is wrong after having expressed something more like, “listen to me, I know better than you, go away.”
In the first situation, it might still be warranted to push for leadership changes (esp. if there’s actually a better alternative), but I don’t see it as mandatory.
The author of the original short form says we need to hold leaders accountable for bad decisions because otherwise the incentives are wrong. I agree with that, but I think it’s being too crude to tie incentives to whether a decision looks right or wrong in hindsight. We can do better and evaluate how someone went about making a decision and how they handled opposing views. (Basically, if opposing views aren’t loud enough that you’d have to actively squish them using your influence illegitimately, then the mistake isn’t just yours as the leader; it’s also that the situation wasn’t significantly obvious to others around you.) I expect that everyone who has strong opinions on things and is ambitious and agenty in a leadership position is going to make some costly mistakes. The incentives shouldn’t be such that leaders shy away from consequential interventions.
I just realized that Paul Christiano and Dario Amodei both probably have signed non-disclosure + non-disparagement contracts since they both left OpenAI.
That impacts how I’d interpret Paul’s (and Dario’s) claims and opinions (or the lack thereof), that relates to OpenAI or alignment proposals entangled with what OpenAI is doing. If Paul has systematically silenced himself, and a large amount of OpenPhil and SFF money has been mis-allocated because of systematically skewed beliefs that these organizations have had due to Paul’s opinions or lack thereof, well. I don’t think this is the case though—I expect Paul, Dario, and Holden all seem to have converged on similar beliefs (whether they track reality or not) and have taken actions consistent with those beliefs.
Can anybody confirm whether Paul is likely systematically silenced re OpenAI?
I mean, if Paul doesn’t confirm that he is not under any non-disparagement obligations to OpenAI like Cullen O’Keefe did, we have our answer.
In fact, given this asymmetry of information situation, it makes sense to assume that Paul is under such an obligation until he claims otherwise.
Regarding the situation at OpenAI, I think it’s important to keep a few historical facts in mind:
The AI alignment community has long stated that an ideal FAI project would have a lead over competing projects. See e.g. this post:
The scaling hypothesis wasn’t obviously true around the time OpenAI was founded. At that time, it was assumed that regulation was ineffectual because algorithms can’t be regulated. It’s only now, when GPUs are looking like the bottleneck, that the regulation strategy seems viable.
What happened with OpenAI? One story is something like:
AI safety advocates attracted a lot of attention in Silicon Valley with a particular story about AI dangers and what needed to be done.
Part of this story involved an FAI project with a lead over competing projects. But the story didn’t come with easy-to-evaluate criteria for whether a leading project counted as a good “FAI project” or a bad “UFAI project”. Thinking about AI alignment is epistemically cursed; people who think about the topic independently rarely have similar models.
Deepmind was originally the consensus “FAI project”, but Elon Musk started OpenAI because Larry Page has e/acc beliefs.
OpenAI hired employees with a distribution of beliefs about AI alignment difficulty, some of whom may be motivated primarily by greed or power-seeking.
At a certain point, that distribution got “truncated” with the formation of Anthropic.
Presumably at this point, every major project thinks it’s best if they win, due to self-serving biases.
Some possible lessons:
Do more message red-teaming. If an organization like AI Lab Watch had been founded 10+ years ago, and was baked into the AI safety messaging along with “FAI project needs a clear lead”, then we could’ve spent the past 10 years getting consensus on how to anoint one or just a few “FAI projects”. And the campaign for AI Pause could instead be a campaign to “pause all AGI projects except the anointed FAI project”. So—when we look back in 10 years on the current messaging, what mistakes will seem obvious in hindsight? And if this situation is partially a result of MIRI’s messaging in the past, perhaps we should ask hard questions about their current pivot towards messaging? (Note: I could be accused of grinding my personal axe here, because I’m rather dissatisfied with current AI Pause messaging.)
Assume AI acts like a magnet for greedy power-seekers. Make decisions accordingly.
A Theory of Usable Information Under Computational Constraints
h/t Simon Pepin Lehalleur
Can somebody explain to me what’s happening in this paper?
My reading is that their definition of conditional predictive entropy is the natural generalization of Shannon’s conditional entropy when the way you condition on data is restricted to implementing only functions of a particular class. The corresponding generalization of mutual information then measures how much more predictable some variable Y becomes given evidence X, compared to no evidence.
For example, the goal of public key cryptography cannot be to make the mutual information between a plaintext and the public key & encrypted text zero, while maintaining maximal mutual information between the encrypted text and the plaintext given the private key, since this is impossible.
Cryptography instead assumes everyone involved can only condition their probability distributions using polynomial-time algorithms of the data they have; in that circumstance you can minimize the predictability of your plaintext after getting the public key & encrypted text, while maximizing the predictability of the plaintext after getting the private key & encrypted text.
More mathematically, they assume you can only implement functions from your data to your conditioned probability distributions in the set of functions V, with the property that for any possible probability distribution you are able to output given the right set of data, you also have the choice of simply outputting the probability distribution without looking at the data. In other words, if you can represent it, you can output it. This corresponds to equation (1).
The Shannon entropy of a random variable Y given X is
H(Y\mid X) = -\iint p(x,y)\log p(y\mid x)\,dx\,dy
Thus, the predictive entropy of a random variable Y given X, only being able to condition using functions in V would be
H_V(Y\mid X) = \inf_{f\in V} -\iint p(x,y)\log f(y\mid x)\,dx\,dy
where f(y\mid x) = f[x](y), if we’d like to use the notation of the paper.
And using this we can define predictive information, which as said before answers the question “how much more predictable is Y after we get the information X compared to no information?” by
I_V(X\to Y) = H_V(Y\mid \varnothing) - H_V(Y\mid X)
which they also show can be empirically well estimated by the naive data sampling method (i.e. replacing the expectations in definition 2 with empirical samples).
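To make the empirical estimation concrete, here is a toy sketch (my own construction, not code from the paper): for discrete variables and an unrestricted predictor class V, the plug-in estimates of H_V reduce to ordinary Shannon entropies, so the estimated predictive information recovers the empirical mutual information. Restricting V (say, to constant predictors that ignore the data) can only shrink it.

```python
# Toy illustration of predictive (V-)information with an unrestricted
# function class, estimated from empirical samples. With no restriction
# on V, the infimum over predictors f is attained by the empirical
# conditional distribution p(y|x), so H_V collapses to the plug-in
# Shannon (conditional) entropy.
import math
from collections import Counter

def empirical_predictive_entropy(pairs, use_x=True):
    """H_V(Y|X) for the unrestricted class V, from samples of (x, y).

    If use_x is False, we condition on empty side information, so the
    best predictor is the marginal p(y) and we get the entropy of Y.
    """
    n = len(pairs)
    if not use_x:
        py = Counter(y for _, y in pairs)
        return -sum(c / n * math.log(c / n) for c in py.values())
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    # -sum over (x, y) of p(x, y) * log p(y | x)
    return -sum(c / n * math.log(c / px[x]) for (x, y), c in pxy.items())

def predictive_information(pairs):
    """I_V(X -> Y) = H_V(Y | empty) - H_V(Y | X)."""
    return (empirical_predictive_entropy(pairs, use_x=False)
            - empirical_predictive_entropy(pairs, use_x=True))

# Y is a noisy copy of X: knowing X makes Y much more predictable,
# so the predictive information is strictly positive.
samples = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 1)] * 40 + [(1, 0)] * 10
info = predictive_information(samples)  # ~0.193 nats
```

With V restricted to constant predictors, `empirical_predictive_entropy(pairs, use_x=True)` would equal the `use_x=False` value and the predictive information would drop to zero, which matches the cryptography intuition above: shrinking the class of usable conditioning functions destroys usable information.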
My timelines are lengthening.
I’ve long been a skeptic of scaling LLMs to AGI *. I fundamentally don’t understand how this is even possible. It must be said that very smart people give this view credence: davidad, dmurfet. On the other side are Vanessa Kosoy and Steven Byrnes. When pushed, proponents don’t actually defend the position that a large enough transformer will create nanotech or even obsolete their job. They usually mumble something about scaffolding.
I won’t get into this debate here but I do want to note that my timelines have lengthened, primarily because some of the never-clearly-stated but heavily implied AI developments by proponents of very short timelines have not materialized. To be clear, it has only been a year since gpt-4 was released, and gpt-5 is around the corner, so perhaps my hope is premature. Still my timelines are lengthening.
A year ago, when gpt-3 came out, progress was blindingly fast. Part of short timelines came from a sense of “if we got surprised so hard by gpt2-3, we are completely uncalibrated, who knows what comes next”.
People seemed surprised by gpt-4 in a way that seemed uncalibrated to me. gpt-4 performance was basically in line with what one would expect if the scaling laws continued to hold. At the time it was already clear that the only really important drivers were compute and data, and that we would run out of both shortly after gpt-4. Scaling proponents suggested this was only the beginning, that there was a whole host of innovation that would be coming. Whispers of mesa-optimizers and simulators.
One year in: Chain-of-thought doesn’t actually improve things that much. External memory and super context lengths ditto. A whole list of proposed architectures seem to serve solely as a paper mill. Every month there is new hype about the latest LLM or image model. Yet they never deviate from expectations based on simple extrapolation of the scaling laws. There is only one thing that really seems to matter and that is compute and data. We have about 3 more OOMs of compute to go. Data may be milked another OOM.
A big question will be whether gpt-5 will suddenly make agentGPT work ( and to what degree). It would seem that gpt-4 is in many ways far more capable than (most or all) humans yet agentGPT is curiously bad.
All-in-all AI progress** is developing according to the naive extrapolations of Scaling Laws but nothing beyond that. The breathless twitter hype about new models is still there but it seems to be believed more at a simulacra level higher than I can parse.
Does this mean we’ll hit an AI winter? No. In my model there may be only one remaining roadblock to ASI (and I suspect I know what it is). That innovation could come at anytime. I don’t know how hard it is, but I suspect it is not too hard.
* the term AGI seems to denote vastly different things to different people in a way I find deeply confusing. I notice that the thing that I thought everybody meant by AGI is now being called ASI. So when I write AGI, feel free to substitute ASI.
** or better, AI congress
addendum: since I’ve been quoted in dmurfet’s AXRP interview as believing that there are certain kinds of reasoning that cannot be represented by transformers/LLMs I want to be clear that this is not really an accurate portrayal of my beliefs. e.g. I don’t think transformers don’t truly understand, are just a stochastic parrot, or in other ways can’t engage in the abstract reasoning that humans do. I think this is clearly false, as seen by interacting with any frontier model.
State-of-the-art models such as Gemini aren’t LLMs anymore. They are natively multimodal or omni-modal transformer models that can process text, images, speech and video. These models seem to me like a huge jump in capabilities over text-only LLMs like GPT-3.
Links to Dan Murfet’s AXRP interview:
Transcript
Video
Agreed. I’m also pleasantly surprised that your take isn’t heavily downvoted.
I don’t recall what I said in the interview about your beliefs, but what I meant to say was something like what you just said in this post, apologies for missing the mark.
Mumble.
Chain-of-thought prompting makes models much more capable. In the original paper “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”, PaLM 540B with standard prompting only solves 18% of problems but 57% of problems with chain-of-thought prompting.
I expect the use of agent features such as reflection will lead to similar large increases in capabilities as well in the near future.
Those numbers don’t really accord with my experience actually using gpt-4. Generic prompting techniques just don’t help all that much.
I just asked GPT-4 a GSM8K problem and I agree with your point. I think what’s happening is that GPT-4 has been fine-tuned to respond with chain-of-thought reasoning by default, so it’s no longer necessary to explicitly ask it to reason step-by-step. Though if you ask it to “respond with just a single number” to eliminate the chain-of-thought reasoning, its problem-solving ability is much worse.
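For concreteness, here is a minimal sketch (my own illustrative construction; the exemplar text is hypothetical, loosely in the style of the GSM8K examples in the paper cited above) of the difference between the two prompting styles being compared: the only change is whether the few-shot exemplar’s answer includes the intermediate reasoning.

```python
# Standard vs. chain-of-thought few-shot prompting: identical question,
# identical format, but the CoT exemplar demonstrates worked reasoning
# before the final answer, which the model then imitates.
QUESTION = ("Roger has 5 tennis balls. He buys 2 more cans of 3 tennis "
            "balls each. How many tennis balls does he have now?")

standard_exemplar = (
    "Q: There are 15 trees in the grove. Workers plant trees until "
    "there are 21. How many trees did they plant?\n"
    "A: 6\n"
)

cot_exemplar = (
    "Q: There are 15 trees in the grove. Workers plant trees until "
    "there are 21. How many trees did they plant?\n"
    "A: There are 15 trees originally. Afterwards there are 21. "
    "So they planted 21 - 15 = 6 trees. The answer is 6.\n"
)

def build_prompt(exemplar: str, question: str) -> str:
    """Few-shot prompt: exemplar(s) first, then the target question."""
    return f"{exemplar}\nQ: {question}\nA:"

standard_prompt = build_prompt(standard_exemplar, QUESTION)
cot_prompt = build_prompt(cot_exemplar, QUESTION)
```

The point in the comment above is that models fine-tuned on instruction data now produce the worked-reasoning style by default, so the `cot_exemplar` variant adds less than it did for the base models measured in the paper.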
Lengthening from what to what?
I’ve never done explicit timelines estimates before so nothing to compare to. But since it’s a gut feeling anyway, I’m saying my gut is lengthening.
Can you expand on what you mean by “create nanotech?” If improvements to our current photolithography techniques count, I would not be surprised if (scaffolded) LLMs could be useful for that. Likewise for getting bacteria to express polypeptide catalysts for useful reactions, and even maybe figure out how to chain several novel catalysts together to produce something useful (again, referring to scaffolded LLMs with access to tools).
If you mean that LLMs won’t be able to bootstrap from our current “nanotech only exists in biological systems and chip fabs” world to Drexler-style nanofactories, I agree with that, but I expect things will get crazy enough that I can’t predict them long before nanofactories are a thing (if they ever are).
Likewise, I don’t think LLMs can immediately obsolete all of the parts of my job. But they sure do make parts of my job a lot easier. If you have 100 workers that each spend 90% of their time on one specific task, and you automate that task, that’s approximately as useful as fully automating the jobs of 90 workers. “Human-equivalent” is one of those really leaky abstractions—I would be pretty surprised if the world had any significant resemblance to the world of today by the time robotic systems approached the dexterity and sensitivity of human hands for all of the tasks we use our hands for, whereas for the task of “lift heavy stuff” or “go really fast” machines left us in the dust long ago.
Iterative improvements on the timescale we’re likely to see are still likely to be pretty crazy by historical standards. But yeah, if your timelines were “end of the world by 2026” I can see why they’d be lengthening now.
My timelines were not 2026. In fact, I made bets against doomers 2-3 years ago, one will resolve by next year.
I agree iterative improvements are significant. This falls under “naive extrapolation of scaling laws”.
By nanotech I mean something akin to drexlerian nanotech, or something similarly transformative in the vicinity. I think it is plausible that a true ASI will be able to make rapid progress (perhaps on the order of a few years or a decade) on nanotech. I suspect that people who don’t take this as a serious possibility haven’t really thought through what AGI/ASI means + what the limits and drivers of science and tech really are; I suspect they are simply falling prey to status-quo bias.
With scale, there is visible improvement in how difficult a novel-to-chatbot idea or detail can be and still be successfully explained in-context, things like issues with the code it’s writing. If a chatbot is below some threshold of situational awareness of a task, no scaffolding can keep it on track, but for a better chatbot trivial scaffolding might suffice. Many people can’t google for a solution to a technical issue; the difference between them and those who can is often subtle.
So modest amount of scaling alone seems plausibly sufficient for making chatbots that can do whole jobs almost autonomously. If this works, 1-2 OOMs more of scaling becomes both economically feasible and more likely to be worthwhile. LLMs think much faster, so they only need to be barely smart enough to help with clearing those remaining roadblocks.
You may be right. I don’t know of course.
At this moment in time, it seems scaffolding tricks haven’t really improved the baseline performance of models that much. Overwhelmingly, the capability comes down to whether the RLHFed base model can do the task.
That’s what I’m also saying above (in case you are stating what you see as a point of disagreement). This is consistent with scaling-only short timeline expectations. The crux for this model is current chatbots being already close to autonomous agency and to becoming barely smart enough to help with AI research. Not them directly reaching superintelligence or having any more room for scaling.
Yes agreed.
What I don’t get about this position: if it was indeed just scaling, what’s AI research for? There is nothing to discover; just scale more compute. Sure, you can maybe improve the speed of deploying compute a little, but at its core it seems like a story that’s in conflict with itself?
My view is that there’s huge algorithmic gains in peak capability, training efficiency (less data, less compute), and inference efficiency waiting to be discovered, and available to be found by a large number of parallel research hours invested by a minimally competent multimodal LLM powered research team. So it’s not that scaling leads to ASI directly, it’s:
scaling leads to brute forcing the LLM agent across the threshold of AI research usefulness
Using these LLM agents in a large research project can lead to rapidly finding better ML algorithms and architectures.
Training these newly discovered architectures at large scales leads to much more competent automated researchers.
This process repeats quickly over a few months or years.
This process results in AGI.
AGI, if instructed (or allowed, if it’s agentically motivated on its own to do so) to improve itself will find even better architectures and algorithms.
This process can repeat until ASI. The resulting intelligence / capability / inference speed goes far beyond that of humans.
Note that this process isn’t inevitable, there are many points along the way where humans can (and should, in my opinion) intervene. We aren’t disempowered until near the end of this.
Why do you think there are these low-hanging algorithmic improvements?
My answer to that is currently in the form of a detailed 2 hour lecture with a bibliography that has dozens of academic papers in it, which I only present to people that I’m quite confident aren’t going to spread the details. It’s a hard thing to discuss in detail without sharing capabilities thoughts. If I don’t give details or cite sources, then… it’s just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you’d like to bet on it, I’m open to showing my confidence in my opinion by betting that the world turns out how I expect it to.
The story involves phase changes. Just scaling is what’s likely to be available to human developers in the short term (a few years), it’s not enough for superintelligence. Autonomous agency secures funding for a bit more scaling. If this proves sufficient to get smart autonomous chatbots, they then provide speed to very quickly reach the more elusive AI research needed for superintelligence.
It’s not a little speed, it’s a lot of speed, serial speedup of about 100x plus running in parallel. This is not as visible today, because current chatbots are not capable of doing useful work with serial depth, so the serial speedup is not in practice distinct from throughput and cost. But with actually useful chatbots it turns decades to years, software and theory from distant future become quickly available, non-software projects get to be designed in perfect detail faster than they can be assembled.
In my mainline model there are only a few innovations needed, perhaps only a single big one, to produce an AGI which, just as the Turing Machine sits at the top of the Chomsky Hierarchy, will be basically the optimal architecture given resource constraints. There are probably some minor improvements to do with bridging the gap between the theoretically optimal architecture and the actual architecture, or parts of the algorithm that can be indefinitely improved but with diminishing returns (these probably exist due to Levin, and possibly matrix multiplication is one of these). On the whole I expect AI research to be very chunky.
Indeed, we’ve seen that there was really just one big idea behind all current AI progress: scaling, specifically scaling GPUs on maximally large undifferentiated datasets. There were some minor technical innovations needed to pull this off, but on the whole that was the clincher.
Of course, I don’t know. Nobody knows. But I find this the most plausible guess based on what we know about intelligence, learning, theoretical computer science and science in general.
There are two kinds of relevant hypothetical innovations: those that enable chatbot-led autonomous research, and those that enable superintelligence. It’s plausible that there is no need for (more of) the former, so that mere scaling through human efforts will lead to such chatbots in a few years regardless. (I think it’s essentially inevitable that there is currently enough compute that with appropriate innovations we can get such autonomous human-scale-genius chatbots, but it’s unclear if these innovations are necessary or easy to discover.) If autonomous chatbots are still anything like current LLMs, they are very fast compared to humans, so they quickly discover remaining major innovations of both kinds.
In principle, even if innovations that enable superintelligence (at scale feasible with human efforts in a few years) don’t exist at all, extremely fast autonomous research and engineering still lead to superintelligence, because they greatly accelerate scaling. Physical infrastructure might start scaling really fast using pathways like macroscopic biotech even if drexlerian nanotech is too hard without superintelligence or impossible in principle. Drosophila biomass doubles every 2 days, small things can assemble into large things.
Wasn’t the surprising thing about GPT-4 that scaling laws did hold? Before this many people expected scaling laws to stop before such a high level of capabilities. It doesn’t seem that crazy to think that a few more OOMs could be enough for greater than human intelligence. I’m not sure that many people predicted that we would have much faster than scaling law progress (at least until ~human intelligence AI can speed up research)? I think scaling laws are the extreme rate of progress which many people with short timelines worry about.
To some degree yes, they were not guaranteed to hold. But by that point they had held for over 10 OOMs iirc, and there was no known reason they couldn’t continue.
This might be the particular twitter bubble I was in but people definitely predicted capabilities beyond simple extrapolation of scaling laws.
I’m surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don’t care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.
I disagree. This whole saga has introduced the Effective Altruism movement to people at labs that hadn’t thought about alignment.
From my understanding, OpenAI isn’t anywhere close to breaking even from ChatGPT, and I can’t think of any way a chatbot could actually be monetized.
In the spirit of trying to understand what actually went wrong here—IIRC, OpenAI didn’t expect ChatGPT to blow up the way it did. Seems like they were playing a strategy of “release cool demos” as opposed to “create a giant competitive market”.
Who is updating? I haven’t seen anyone change their mind yet.
Half a year ago, I’d have guessed that OpenAI leadership, while likely misguided, was essentially well-meaning and driven by a genuine desire to confront a difficult situation. The recent series of events has made me update significantly against the general trustworthiness and general epistemic reliability of Altman and his circle. While my overall view of OpenAI’s strategy hasn’t really changed, my likelihood of them possibly “knowing better” has dramatically gone down now.
I believe that ChatGPT was not released with the expectation that it would become as popular as it did. OpenAI pivoted hard when it saw the results.
Also, I think you are misinterpreting the sort of ‘updates’ people are making here.
Well, even if that’s true, causing such an outcome by accident should still count as evidence of vast irresponsibility imo.
You continue to model OpenAI as this black-box monolith instead of trying to unravel the dynamics inside it and understand the incentive structures that lead these things to occur. It’s a common pattern I notice in the way you interface with certain parts of reality.
I don’t consider OpenAI as responsible for this as much as Paul Christiano and Jan Leike and his team. Back in 2016 or 2017, when they initiated and led research into RLHF, they focused on LLMs because they expected that LLMs would be significantly more amenable to RLHF. This means that instruction-tuning was the cause of the focus on LLMs, which meant that it was almost inevitable that they’d try instruction-tuning on it, and incrementally build up models that deliver mundane utility. It was extremely predictable that Sam Altman and OpenAI would leverage this unexpected success to gain more investment and translate that into more researchers and compute. But Sam Altman and Greg Brockman aren’t researchers, and they didn’t figure out a path that minimized ‘capabilities overhang’—Paul Christiano did. And more important—this is not mutually exclusive with OpenAI using the additional resources for both capabilities research and (what they call) alignment research. While you might consider everything they do as effectively capabilities research, the point I am making is that this is still consistent with the hypothesis that while they are misguided, they still are roughly doing the best they can given their incentives.
What really changed my perspective here was the fact that Sam Altman seems to have been systematically destroying extremely valuable information about how we could evaluate OpenAI. Specifically, this non-disparagement clause that ex-employees cannot even mention without falling afoul of the contract is something I didn’t expect (I did expect non-disclosure clauses, but not something this extreme). This meant that my model of OpenAI was systematically too optimistic about how cooperative and trustworthy they are and will be in the future. In addition, if I was systematically deceived about OpenAI due to non-disparagement clauses that cannot even be mentioned, I would expect something similar to also be possible when it comes to other frontier labs (especially Anthropic, but also DeepMind). In essence, I no longer believe that Sam Altman (for OpenAI is nothing but his tool now) is doing the best he can to benefit humanity given his incentives and constraints. I expect that Sam Altman is entirely doing whatever he believes will retain and increase his influence and power, and this includes the use of AGI, if and when his teams finally achieve that level of capabilities.
This is the update I expect people are making. It is about being systematically deceived at multiple levels. It is not about “OpenAI being irresponsible”.
Sometimes I forget to take a dose of methylphenidate. As my previous dose fades away, I start to feel much worse than baseline. I then think “Oh no, I’m feeling so bad, I will not be able to work at all.”
But then I remember that I forgot to take a dose of methylphenidate and instantly I feel a lot better.
Usually, one of the worst things when I’m feeling down is that I don’t know why. But now, I’m in this very peculiar situation where putting or not putting some particular object into my mouth is the actual cause. It’s hard to imagine something more tangible.
Knowing the cause makes me feel a lot better. Even when I don’t take the next dose, and still feel drowsy, it’s still easy for me to work. Simply knowing why you feel a particular way seems to make a huge difference.
I wonder how much this generalizes.
Wait, some of y’all were still holding your breaths for OpenAI to be net-positive in solving alignment?
After the whole “initially having to be reminded alignment is A Thing”? And going back on its word to go for-profit? And spinning up a weird and opaque corporate structure? And people being worried about Altman being power-seeking? And everything to do with the OAI board debacle? And OAI Very Seriously proposing what (still) looks to me to be like a souped-up version of Baby Alignment Researcher’s Master Plan B (where A involves solving physics and C involves RLHF and cope)? That OpenAI? I just want to be very sure. Because if it took the safety-ish crew of founders resigning to get people to finally pick up on the issue… it shouldn’t have. Not here. Not where people pride themselves on their lightness.
My current perspective is that criticism of AGI labs is an under-incentivized public good. I suspect there’s a disproportionate amount of value that people could have by evaluating lab plans, publicly criticizing labs when they break commitments or make poor arguments, talking to journalists/policymakers about their concerns, etc.
Some quick thoughts:
Soft power– I think people underestimate how strong the “soft power” of labs is, particularly in the Bay Area.
Jobs– A large fraction of people getting involved in AI safety are interested in the potential of working for a lab one day. There are some obvious reasons for this– lots of potential impact from being at the organizations literally building AGI, big salaries, lots of prestige, etc.
People (IMO correctly) perceive that if they acquire a reputation for being critical of labs, their plans, or their leadership, they will essentially sacrifice the ability to work at the labs.
So you get an equilibrium where the only people making (strong) criticisms of labs are those who have essentially chosen to forgo their potential of working there.
Money– The labs and Open Phil (which has been perceived, IMO correctly, as investing primarily into metastrategies that are aligned with lab interests) have an incredibly large share of the $$$ in the space. When funding became more limited, this became even more true, and I noticed a very tangible shift in the culture & discourse around labs + Open Phil.
Status games//reputation– Groups who were more inclined to criticize labs and advocate for public or policymaker outreach were branded as “unilateralist”, “not serious”, and “untrustworthy” in core EA circles. In many cases, there were genuine doubts about these groups, but my impression is that these doubts got amplified/weaponized in cases where the groups were more openly critical of the labs.
Subjectivity of “good judgment”– There is a strong culture of people getting jobs/status for having “good judgment”. This is sensible insofar as we want people with good judgment (who wouldn’t?) but this often ends up being so subjective that it ends up leading to people being quite afraid to voice opinions that go against mainstream views and metastrategies (particularly those endorsed by labs + Open Phil).
Anecdote– Personally, I found my ability to evaluate and critique labs + mainstream metastrategies substantially improved when I spent more time around folks in London and DC (who were less closely tied to the labs). In fairness, I suspect that if I had lived in London or DC *first* and then moved to the Bay Area, it’s plausible I would’ve had a similar feeling but in the “reverse direction”.
With all this in mind, I find myself more deeply appreciating folks who have publicly and openly critiqued labs, even in situations where the cultural and economic incentives to do so were quite weak (relative to staying silent or saying generic positive things about labs).
Examples: Habryka, Rob Bensinger, CAIS, MIRI, Conjecture, and FLI. More recently, @Zach Stein-Perlman, and of course Jan Leike and Daniel K.
Sorry for brevity, I’m busy right now.
Noticing good stuff labs do, not just criticizing them, is often helpful. I wish you thought of this work more as “evaluation” than “criticism.”
It’s often important for evaluation to be quite truth-tracking. Criticism isn’t obviously good by default.
Edit:
3. I’m pretty sure OP likes good criticism of the labs; no comment on how OP is perceived. And I think I don’t understand your “good judgment” point. Feedback I’ve gotten on AI Lab Watch from senior AI safety people has been overwhelmingly positive, and of course there’s a selection effect in what I hear, but I’m quite sure most of them support such efforts.
4. Conjecture (not exclusively) has done things that frustrated me, including in dimensions like being “‘unilateralist,’ ‘not serious,’ and ‘untrustworthy.’” I think most criticism of Conjecture-related advocacy is legitimate and not just because people are opposed to criticizing labs.
5. I do agree on “soft power” and some of “jobs.” People often don’t criticize the labs publicly because they’re worried about negative effects on them, their org, or people associated with them.
RE 1& 2:
Agreed— my main point here is that the marketplace of ideas undervalues criticism.
I think one perspective could be “we should all just aim to do objective truth-seeking”, and as stated I agree with it.
The main issue with that frame, imo, is that it’s very easy to forget that the epistemic environment can be tilted in favor of certain perspectives.
EG I think it can be useful for “objective truth-seeking efforts” to be aware of some of the culture/status games that underincentivize criticism of labs & amplify lab-friendly perspectives.
RE 3:
Good to hear that responses have been positive to lab watch. My impression is that this is a mix of: (a) lab watch doesn’t really threaten the interests of labs (especially Anthropic, which is currently winning & currently the favorite lab among senior AIS ppl), (b) the tides have been shifting somewhat and it is genuinely less taboo to criticize labs than a year ago, and (c) EAs respond more positively to criticism that feels more detailed/nuanced (look I have these 10 categories, let’s rate the labs on each dimension) than criticisms that are more about metastrategy (e.g., challenging the entire RSP frame or advocating for policymaker outreach).
RE 4: I haven’t heard anything about Conjecture that I’ve found particularly concerning. Would be interested in you clarifying (either here or via DM) what you’ve heard. (And clarification note that my original point was less “Conjecture hasn’t done anything wrong” and more “I suspect Conjecture will be more heavily scrutinized and examined and have a disproportionate amount of optimization pressure applied against it given its clear push for things that would hurt lab interests.”)
Idea: Daniel Kokotajlo probably lost quite a bit of money by not signing an OpenAI NDA before leaving, which I consider a public service at this point. Could some of the funders of the AI safety landscape give some money or social reward for this?
I guess reimbursing everything Daniel lost might be a bit too much for funders but providing some money, both to reward the act and incentivize future safety people to not sign NDAs would have a very high value.
@Daniel Kokotajlo If you indeed avoided signing an NDA, would you be able to share how much you passed up as a result of that? I might indeed want to create a precedent here and maybe try to fundraise for some substantial fraction of it.
I wonder if it might be more effective to fund legal action against OpenAI than to compensate individual ex-employees for refusing to sign an NDA. Trying to take vested equity away from ex-employees who refuse to sign an NDA sounds likely to not hold up in court, and if we can establish a legal precedent that OpenAI cannot do this, that might make other ex-employees much more comfortable speaking out against OpenAI than the possibility that third parties might fundraise to partially compensate them for lost equity would be (a possibility you might not even be able to make every ex-employee aware of). The fact that this would avoid financially rewarding OpenAI for bad behavior is also a plus. Of course, legal action is expensive, but so is the value of the equity that former OpenAI employees have on the line.
Yeah, at the time I didn’t know how shady some of the contracts here were. I do think funding a legal defense is a marginally better use of funds (though my guess is funding both is worth it).
@habryka, would you reply to this comment if there's an opportunity to donate to either? Another person and I are interested, and others could follow this comment too if they wanted to.
(only if it’s easy for you, I don’t want to add an annoying task to your plate)
Sure, I’ll try to post here if I know of a clear opportunity to donate to either.
To clarify: I did sign something when I joined the company, so I’m still not completely free to speak (still under confidentiality obligations). But I didn’t take on any additional obligations when I left.
Unclear how to value the equity I gave up, but it probably would have been about 85% of my family’s net worth at least. But we are doing fine, please don’t worry about us.
Mostly for @habryka’s sake: it sounds like you are likely describing your unvested equity, or possibly equity that gets clawed back on quitting. Neither of which is (usually) tied to signing an NDA on the way out the door—they’d both be lost simply due to quitting.
The usual arrangement is some extra severance payment tied to signing something on your way out the door, and that’s usually way less than the unvested equity.
EDIT: Turns out OpenAI’s equity terms are unusually brutal and it is indeed the case that the equity clawback was tied to signing the NDA.
My current best guess is that actually cashing out the vested equity is tied to an NDA, but I am really not confident. OpenAI has a bunch of really weird equity arrangements.
Can you speak to any, let’s say, “hypothetical” specific concerns that somebody who was in your position at a company like OpenAI might have had that would cause them to quit in a similar way to you?
One is the change to the charter to allow the company to work with the military.
https://news.ycombinator.com/item?id=39020778
I think the board must be thinking about how to get some independence from Microsoft, and there are not many entities who can counterbalance one of the biggest companies in the world. The government's intelligence and defence industries are some of them (as are Google, Meta, Apple, etc). But that move would require secrecy: to avoid stoking nationalistic race dynamics, because of contractual obligations, and to avoid a backlash.
EDIT: I’m getting a few disagrees, would someone mind explaining why they disagree with these wild speculations?
They didn’t change their charter.
https://forum.effectivealtruism.org/posts/2Dg9t5HTqHXpZPBXP/ea-community-needs-mechanisms-to-avoid-deceptive-messaging
Thanks, I hadn’t seen that, I find it convincing.
Is that your family’s net worth is $100 and you gave up $85? Or your family’s net worth is $15 and you gave up $85?
Either way, hats off!
The latter. Yeah idk whether the sacrifice was worth it but thanks for the support. Basically I wanted to retain my ability to criticize the company in the future. I’m not sure what I’d want to say yet though & I’m a bit scared of media attention.
I appreciate that you are not speaking loudly if you don’t yet have anything loud to say.
I’d be interested in hearing peoples’ thoughts on whether the sacrifice was worth it, from the perspective of assuming that counterfactual Daniel would have used the extra net worth altruistically. Is Daniel’s ability to speak more freely worth more than the altruistic value that could have been achieved with the extra net worth?
(Note: Regardless of whether it was worth it in this case, simeon_c’s reward/incentivization idea may be worthwhile as long as there are expected to be some cases in the future where it’s worth it, since the people in those future cases may not be as willing as Daniel to make the altruistic personal sacrifice, and so we’d want them to be able to retain their freedom to speak without it costing them as much personally.)
I think having signed an NDA (and especially a non-disparagement agreement) from a major capabilities company should probably rule you out of any kind of leadership position in AI Safety, and especially any kind of policy position. Given that I think Daniel has a pretty decent chance of doing either or both of these things, and that work is very valuable and constrained on the kind of person that Daniel is, I would be very surprised if this wasn’t worth it on altruistic grounds.
Edit: As Buck points out, different non-disclosure-agreements can differ hugely in scope. To be clear, I think non-disclosure-agreements that cover specific data or information you were given seems fine, but non-disclosure-agreements that cover their own existence, or that are very broadly worded and prevent you from basically talking about anything related to an organization, are pretty bad. My sense is the stuff that OpenAI employees are asked to sign when they leave are very constraining, but my guess is the kind of stuff that people have to sign for a small amount of contract work or for events are not very constraining, though I would definitely read any contract carefully in this space.
Strong disagree re signing non-disclosure agreements (which I’ll abbreviate as NDAs). I think it’s totally reasonable to sign NDAs with organizations; they don’t restrict your ability to talk about things you learned other ways than through the ways covered by the NDA. And it’s totally standard to sign NDAs when working with organizations. I’ve signed OpenAI NDAs at least three times, I think—once when I worked there for a month, once when I went to an event they were running, once when I visited their office to give a talk.
I think non-disparagement agreements are way more problematic. At the very least, signing secret non-disparagement agreements should probably disqualify you from roles where your silence re an org might be interpreted as a positive sign.
It might be good on the current margin to have a norm of publicly listing any non-disclosure agreements you have signed (e.g. on one's LW profile), and the rough scope of them, so that other people can model what information you're committed to not sharing, and highlight if it is related to anything beyond the details of technical research being done (e.g. if it is about social relationships or conflicts or criticism).
I have added the one NDA that I have signed to my profile.
But everyone has lots of duties to keep secrets or preserve privacy and the ones put in writing often aren’t the most important. (E.g. in your case.)
I’ve signed ~3 NDAs. Most of them are irrelevant now and useless for people to know about, like yours.
I agree in special cases it would be good to flag such things — like agreements to not share your opinions on a person/org/topic, rather than just keeping trade secrets private.
I agree with this overall point, although I think “trade secrets” in the domain of AI can be relevant for people having surprising timelines views that they can’t talk about.
My understanding is that the extent of NDAs can differ a lot between different implementations, so it might be hard to speak in generalities here. From the revealed behavior of people I poked here who have worked at OpenAI full-time, the OpenAI NDAs seem very comprehensive and limiting. My guess is also the NDAs for contractors and for events are a very different beast and much less limiting.
Also just the de-facto result of signing non-disclosure-agreements is that people don’t feel comfortable navigating the legal ambiguity and default very strongly to not sharing approximately any information about the organization at all.
Maybe people would do better things here with more legal guidance, and I agree that you don’t generally seem super constrained in what you feel comfortable saying, but like I sure now have run into lots of people who seem constrained by NDAs they signed (even without any non-disparagement component). Also, if the NDA has a gag clause that covers the existence of the agreement, there is no way to verify the extent of the NDA, and that makes navigating this kind of stuff super hard and also majorly contributes to people avoiding the topic completely.
Notably, there are some lawyers here on LessWrong who might help (possibly even for the lols, you never know). And you can look at case law and guidance to see if clauses are actually enforceable or not (many are not). To anyone reading, here’s habryka doing just that
I worked at OpenAI for three years, from 2021-2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool.
I resigned from OpenAI on February 15, 2024.
What are your timelines like? How long do YOU think we have left?
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self’s creation. However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
One AGI CEO hasn’t gone THAT crazy (yet), but is quite sure that the November 2024 election will be meaningless because pivotal acts will have already occurred that make nation state elections visibly pointless.
Also I know many normies who can't really think probabilistically and mostly aren't worried at all about any of this… but one normie who can calculate is pretty sure that we have AT LEAST 12 years (possibly because his retirement plans won't be finalized until then). He also thinks that even systems as "mere" as TikTok will be banned before the November 2024 election because "elites aren't stupid".
I think I’m more likely to be better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
new observations > new thoughts when it comes to calibrating yourself.
The best calibrated people are people who get lots of interaction with the real world, not those who think a lot or have a complicated inner model. Tetlock’s super forecasters were gamblers and weathermen.
Do you know if the origin of this idea for them was a psychedelic or dissociative trip? I’d give it at least even odds, with most of the remaining chances being meditation or Eastern religions...
Wait, you know smart people who have NOT, at some point in their life: (1) taken a psychedelic NOR (2) meditated, NOR (3) thought about any of buddhism, jainism, hinduism, taoism, confucianisn, etc???
To be clear to naive readers: psychedelics are, in fact, non-trivially dangerous.
I personally worry I already have “an arguably-unfair and a probably-too-high share” of “shaman genes” and I don’t feel I need exogenous sources of weirdness at this point.
But in the SF bay area (and places on the internet memetically downstream from IRL communities there) a lot of that is going around, memetically (in stories about) and perhaps mimetically (via monkey see, monkey do).
The first time you use a serious one you’re likely getting a permanent modification to your personality (+0.5 stddev to your Openness?) and arguably/sorta each time you do a new one, or do a higher dose, or whatever, you’ve committed “1% of a personality suicide” by disrupting some of your most neurologically complex commitments.
To a first approximation my advice is simply “don’t do it”.
HOWEVER: this latter consideration actually suggests: anyone seriously and truly considering suicide should perhaps take a low dose psychedelic FIRST (with at least two loving tripsitters and due care) since it is also maybe/sorta “suicide” but it leaves a body behind that most people will think is still the same person and so they won’t cry very much and so on?
To calibrate this perspective a bit, I also expect that even if cryonics works, it will also cause an unusually large amount of personality shift. A tolerable amount. An amount that leaves behind a personality that is similar-enough-to-the-current-one-to-not-have-triggered-a-ship-of-theseus-violation-in-one-modification-cycle. Much more than a stressful day and then bad nightmares and a feeling of regret the next day, but weirder. With cryonics, you might wake up to some effects that are roughly equivalent to "having taken a potion of youthful rejuvenation, and not having the same birthmarks, and also learning that you're separated-by-disjoint-subjective-deaths from LOTS of people you loved when you experienced your first natural death" for example. This is a MUCH BIGGER CHANGE than just having a nightmare and waking up with a change of heart (and most people don't have nightmares and changes of heart every night (at least: I don't and neither do most people I've asked)).
Remember, every improvement is a change, though not every change is an improvement. A good "epistemological practice" is sort of an idealized formal praxis for making yourself robust to "learning any true fact" and changing only in GOOD ways from such facts.
A good “axiological practice” (which I don’t know of anyone working on except me (and I’m only doing it a tiny bit, not with my full mental budget)) is sort of an idealized formal praxis for making yourself robust to “humanely heartful emotional changes”(?) and changing only in <PROPERTY-NAME-TBD> ways from such events.
(Edited to add: Current best candidate name for this property is: “WISE” but maybe “healthy” works? (It depends on whether the Stoics or Nietzsche were “more objectively correct” maybe? The Stoics, after all, were erased and replaced by Platonism-For-The-Masses (AKA “Christianity”) so if you think that “staying implemented in physics forever” is critically important then maybe “GRACEFUL” is the right word? (If someone says “vibe-alicious” or “flowful” or “active” or “strong” or “proud” (focusing on low latency unity achieved via subordination to simply and only power) then they are probably downstream of Heidegger and you should always be ready for them to change sides and submit to metaphorical Nazis, just as Heidegger subordinated himself to actual Nazis without really violating his philosophy at all.)))
I don’t think that psychedelics fit neatly into EITHER category. Drugs in general are akin to wireheading, except wireheading is when something reaches into your brain to overload one or more of your positive-value-tracking-modules (as a trivially semantically invalid shortcut to achieving positive value “out there” in the state-of-affairs that your tracking modules are trying to track), but actual humans have LOTS of <thing>-tracking-modules, and culture and science barely have any RIGOROUS vocabulary for any of them.
Note that many of these neurological <thing>-tracking-modules were evolved.
Also, many of them will probably be “like hands” in terms of AI’s ability to model them.
This is part of why AIs should be existentially terrifying to anyone who is spiritually adept.
AI that sees the full set of causal paths to modifying human minds will be “like psychedelic drugs with coherent persistent agendas”. Humans have basically zero cognitive security systems. Almost all security systems are culturally mediated, and then (absent complex interventions) lots of the brain stuff freezes in place around the age of puberty, and then other stuff freezes around 25, and so on. This is why we protect children from even TALKING to untrusted adults: they are too plastic and not savvy enough. (A good heuristic for the lowest level of “infohazard” is “anything you wouldn’t talk about in front of a six year old”.)
Humans are sorta like a bunch of unpatchable computers, exposing “ports” to the “internet”, where each of our port numbers is simply a lightly salted semantic hash of an address into some random memory location that stores everything, including our operating system.
Your word for “drugs” and my word for “drugs” don’t point to the same memory addresses in the computers implementing our souls. Also our souls themselves don’t even have the same nearby set of “documents” (because we just have different memories n’stuff)… but the word “drugs” is not just one of the ports… it is a port that deserves a LOT of security hardening.
The bible said ~”thou shalt not suffer a ‘pharmakeia’ to live” for REASONS.
Wondering why this has so many disagreement votes. Perhaps people don’t like to see the serious topic of “how much time do we have left”, alongside evidence that there’s a population of AI entrepreneurs who are so far removed from consensus reality, that they now think they’re living in a simulation.
(edit: The disagreement for @JenniferRM’s comment was at something like −7. Two days later, it’s at −2)
It could just be because it reaches a strong conclusion on anecdotal/clustered evidence (e.g. it might say more about her friend group than anything else). Along with claims to being better calibrated for weak reasons—which could be true, but seems not very epistemically humble.
Full disclosure I downvoted karma, because I don’t think it should be top reply, but I did not agree or disagree.
But Jen seems cool, I like weird takes, and downvotes are not a big deal—just a part of a healthy contentious discussion.
For most of my comments, I’d almost be offended if I didn’t say something surprising enough to get a “high interestingness, low agreement” voting response. Excluding speech acts, why even say things if your interlocutor or full audience can predict what you’ll say?
And I usually don’t offer full clean proofs in direct word. Anyone still pondering the text at the end, properly, shouldn’t “vote to agree”, right? So from my perspective… its fine and sorta even working as intended <3
However, also, this is currently the top-voted response to me, and if William_S himself reads it I hope he answers here, if not with text then (hopefully? even better?) with a link to a response elsewhere?
((EDIT: Re-reading everything above this point, I notice that I totally left out the “basic take” that might go roughly like “Kurzweil, Altman, and Zuckerberg are right about compute hardware (not software or philosophy) being central, and there’s a compute bottleneck rather than a compute overhang, so the speed of history will KEEP being about datacenter budgets and chip designs, and those happen on 6-to-18-month OODA loops that could actually fluctuate based on economic decisions, and therefore it’s maybe 2026, or 2028, or 2030, or even 2032 before things pop, depending on how and when billionaires and governments decide to spend money”.))
Pulling honest posteriors from people who’ve “seen things we wouldn’t believe” gives excellent material for trying to perform aumancy… work backwards from their posteriors to possible observations, and then forwards again, toward what might actually be true :-)
I assume timelines are fairly long or this isn’t safety related. I don’t see a point in keeping PPUs or even caring about NDA lawsuits which may or may not happen and would take years in a short timeline or doomed world.
I think having a probability distribution over timelines is the correct approach. Like, in the comment above:
Even in probabilistic terms, the evidence of OpenAI members respecting their NDAs makes it more likely that this was some sort of political infighting (EA related) than sub-year takeoff timelines. I would be open to a 1 year takeoff, I just don’t see it happening given the evidence. OpenAI wouldn’t need to talk about raising trillions of dollars, companies wouldn’t be trying to commoditize their products, and the employees who quit OpenAI would speak up.
Political infighting is in general just more likely than very short timelines, which would run counter to most prediction markets on the matter. Not to mention, given it’s already happened with the firing of Sam Altman, it’s far more likely to have happened again.
If there was a probability distribution of timelines, the current events indicate sub 3 year ones have negligible odds. If I am wrong about this, I implore the OpenAI employees to speak up. I don’t think normies misunderstand probability distributions, they just usually tend not to care about unlikely events.
No, OpenAI (assuming that it is a well-defined entity) also uses a probability distribution over timelines.
(In reality, every member of its leadership has their own probability distribution, and this translates to OpenAI having a policy and behavior formulated approximately as if there is some resulting single probability distribution).
The important thing is, they are uncertain about timelines themselves (in part, because no one knows how perplexity translates to capabilities, in part, because there might be difference with respect to capabilities even with the same perplexity, if the underlying architectures are different (e.g. in-context learning might depend on architecture even with fixed perplexity, and we do see a stream of potentially very interesting architectural innovations recently), in part, because it’s not clear how big is the potential of “harness”/”scaffolding”, and so on).
This does not mean there is no political infighting. But it’s on the background of them being correctly uncertain about true timelines...
Compute-wise, inference demands are huge and growing with popularity of the models (look how much Facebook did to make LLama 3 more inference-efficient).
So if they expect models to become useful enough for almost everyone to want to use them, they should worry about compute, assuming they do want to serve people like they say they do (I am not sure how this looks for very strong AI systems; they will probably be gradually expanding access, and the speed of expansion might depend on that).
Why can at most one of them be meaningfully right?
Would not a simulation typically be “a multi-player game”?
(But yes, if they assume that their “original self” was the sole creator (?), then they would all be some kind of “clones” of that particular “original self”. Which would surely increase the overall weirdness.)
These are valid concerns! I presume that if “in the real timeline” there was a consortium of AGI CEOs who agreed to share costs on one run, and fiddled with their self-inserts, then they… would have coordinated more? (Or maybe they’re trying to settle a bet on how the Singularity might counterfactually have happened in the event of this or that person experiencing this or that coincidence? But in that case I don’t think the self-inserts would be allowed to say they’re self-inserts.)
Like why not re-roll the PRNG, to censor out the counterfactually simulable timelines that included me hearing from any of the REAL “self inserts of the consortium of AGI CEOS” (and so I only hear from “metaphysically spurious” CEOs)??
Or maybe the game engine itself would have contacted me somehow to ask me to “stop sticking causal quines in their simulation” and somehow I would have been induced by such contact to not publish this?
Mostly I presume AGAINST “coordinated AGI CEO stuff in the real timeline” along any of these lines because, as a type, they often “don’t play well with others”. Fucking oligarchs… maaaaaan.
It seems like a pretty normal thing, to me, for a person to naturally keep track of simulation concerns as a philosophic possibility (its kinda basic “high school theology” right?)… which might become one’s “one track reality narrative” as a sort of “stress induced psychotic break away from a properly metaphysically agnostic mental posture”?
That’s my current working psychological hypothesis, basically.
But to the degree that it happens more and more, I can’t entirely shake the feeling that my probability distribution over “the time T of a pivotal act occurring” (distinct from when I anticipate I’ll learn that it happened, which of course must be LATER than both T and later than now) shouldn’t just include times in the past, but should actually be a distribution over complex numbers or something...
...but I don’t even know how to do that math? At best I can sorta see how to fit it into exotic grammars where it “can have happened counterfactually” or so that it “will have counterfactually happened in a way that caused this factually possible recurrence” or whatever. Fucking “plausible SUBJECTIVE time travel”, fucking shit up. It is so annoying.
Like… maybe every damn crazy AGI CEO’s claims are all true except the ones that are mathematically false?
How the hell should I know? I haven’t seen any not-plausibly-deniable miracles yet. (And all of the miracle reports I’ve heard were things I was pretty sure the Amazing Randi could have duplicated.)
All of this is to say, Hume hasn’t fully betrayed me yet!
Mostly I’ll hold off on performing normal updates until I see for myself, and hold off on performing logical updates until (again!) I see a valid proof for myself <3
Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?
No comment.
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?
(I would guess a “no comment” or lack of response or something to that degree implies a “yes” with reasonably high probability. Also, when deciding how to respond here, you might be interested in this link: the National Labor Relations Board has ruled that NDAs offered during severance agreements that cover the existence of the NDA itself are unlawful.)
Kelsey Piper now reports: “I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it.”
(not a lawyer)
My layman’s understanding is that managerial employees are excluded from that ruling, unfortunately. Which I think applies to William_S if I read his comment correctly. (See Pg 11, in the “Excluded” section in the linked pdf in your link)
I am a lawyer.
I think one key point that is missing is this: regardless of whether the NDA and the subsequent gag order is legitimate or not; William would still have to spend thousands of dollars on a court case to rescue his rights. This sort of strong-arm litigation has become very common in the modern era. It’s also just… very stressful. If you’ve just resigned from a company you probably used to love, you likely don’t want to fish all of your old friends, bosses and colleagues into a court case.
Edit: also, if William left for reasons involving AGI safety—maybe entering into (what would likely be a very public) court case would be counterproductive to their reason for leaving? You probably don’t want to alarm the public by couching existential threats in legal jargon. American judges have the annoying tendency to valorise themselves as celebrities when confronting AI (see Musk v OpenAI).
Are you familiar with USA NDAs? I’m sure there are lots of clauses that have been ruled invalid by case law. In many cases, non-lawyers have no idea about these, so you might be able to make a difference with very little effort. There is also the possibility that valuable OpenAI shares could be rescued?
If you haven’t seen it, check out this thread where one of the OpenAI leavers did not sign the gag order.
I have reviewed his post. Two (2) things to note:
(1) Invalidity of the NDA does not guarantee William will be compensated after the trial. Even if he is, his job prospects may be hurt long-term.
(2) States have different laws on whether the NLRA trumps internal company memorandums. More importantly, labour disputes are traditionally solved through internal bargaining. Presumably, the collective bargaining ‘hand-off’ involving NDAs and gag orders at this level will waive subsequent litigation in district courts. The precedent Habryka offered refers to hostile severance agreements only, not the waiving of the dispute mechanism itself.
I honestly wish I could use this dialogue as a discrete communication to William on a way out, assuming he needs help, but I re-affirm my previous worries on the costs.
I also add here, rather cautiously, that there are solutions. However, it would depend on whether William was an independent contractor, how long he worked there, whether it actually involved a trade secret (as others have mentioned), and so on. The whole reason NDAs tend to be so effective is that they obfuscate the material needed to even know or be aware of what remedies are available.
Interesting! For most of us, this is outside our area of competence, so appreciate your input.
I can see some arguments in your direction but would tentatively guess the opposite.
I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees when asked whether they signed an NDA which also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).
EDIT: Kelsey Piper has confirmed that there is an OA NDA with a gag order, and violation forfeits all equity—including fully vested equity. This implies that since you would assume Ilya Sutskever would have received many PPUs & would be holding them as much as possible, Sutskever might have had literally billions of dollars at stake based on how he quit and what he then, say, tweeted… (PPUs which can only be sold in the annual OA-controlled tender offer.)
By “gag order” do you mean just as a matter of private agreement, or something heavier-handed, with e.g. potential criminal consequences?
I have trouble understanding the absolute silence we seem to be having. There seem to be very few leaks, and all of them are very mild-mannered and are failing to build any consensus narrative that challenges OA’s press in the public sphere.
Are people not able to share info over Signal or otherwise tolerate some risk here? It doesn’t add up to me if the risk is just some chance of OA trying to then sue you to bankruptcy, especially since I think a lot of us would offer support in that case, and the media wouldn’t paint OA in a good light for it.
I am confused. (And I am grateful to William for at least saying this much, given the climate!)
I would guess that there isn’t a clear smoking gun that people aren’t sharing because of NDAs, just a lot of more subtle problems that add up to leaving (and in some cases saying OpenAI isn’t being responsible etc).
This is consistent with the observation of the board firing Sam but not having a clear crossed line to point at for why they did it.
It’s usually easier to notice when the incentives are pointing somewhere bad than to explain what’s wrong with them, and it’s easier to notice when someone is being a bad actor than it is to articulate what they did wrong. (Both of these run a higher risk of false positives relative to more crisply articulatable problems.)
The lack of leaks could just mean that there’s nothing interesting to leak. Maybe William and others left OpenAI over run-of-the-mill office politics and there’s nothing exceptional going on related to AI.
Rest assured, there is plenty that could leak at OA… (And might were there not NDAs, which of course is much of the point of having them.)
For a past example, note that no one knew that Sam Altman had been fired from YC CEO for similar reasons as OA CEO, until the extreme aggravating factor of the OA coup, 5 years later. That was certainly more than ‘run of the mill office politics’, I’m sure you’ll agree, but if that could be kept secret, surely lesser things now could be kept secret well past 2029?
At least one of them has explicitly indicated they left because of AI safety concerns, and this thread seems to be insinuating some concern—Ilya Sutskever’s conspicuous silence has become a meme, and Altman recently expressed that he is uncertain of Ilya’s employment status. There still hasn’t been any explanation for the boardroom drama last year.
If it was indeed run-of-the-mill office politics and all was well, then something to the effect of “our departures were unrelated, don’t be so anxious about the world ending, we didn’t see anything alarming at OpenAI” would obviously help a lot of people and also be a huge vote of confidence for OpenAI.
It seems more likely that there is some (vague?) concern but it’s been overridden by tremendous legal/financial/peer motivations.
What’s PPU?
From here:
Daniel K seems pretty open about his opinions and reasons for leaving. Did he not sign an NDA and thus gave up whatever PPUs he had?
When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn’t.
Does anyone know if it’s typically the case that people under gag orders about their NDAs can talk to other people who they know signed the same NDAs? That is, if a bunch of people quit a company and all have signed self-silencing NDAs, are they normally allowed to talk to each other about why they quit and commiserate about the costs of their silence?
They would not know if others have signed the SAME NDAs without trading information about their own NDAs, which is forbidden.
From my perspective, the only thing that keeps the OpenAI situation from being all kinds of terrible is that I continue to think they’re not close to human-level AGI, so it probably doesn’t matter all that much.
This is also my take on AI doom in general; my P(doom|AGI soon) is quite high (>50% for sure), but my P(AGI soon) is low. In fact it decreased in the last 12 months.
Apparently[1] there was recently some discussion of Survival Instinct in Offline Reinforcement Learning (NeurIPS 2023). The results are very interesting:
But I heard that some people found these results “too good to be true”, with some dismissing it instantly as wrong or mis-stated. I find this ironic, given that the paper was recently published in a top-tier AI conference. Yes, papers can sometimes be bad, but… seriously? You know the thing where lotsa folks used to refuse to engage with AI risk cuz it sounded too weird, without even hearing the arguments? … Yeaaah, absurdity bias.
Anyways, the paper itself is quite interesting. I haven’t gone through all of it yet, but I think I can give a good summary. The github.io page is a nice (but nonspecific) summary.
Summary
It’s super important to remember that we aren’t talking about PPO. Boy howdy, we are in a different part of town when it comes to these “offline” RL algorithms (which train on a fixed dataset, rather than generating more of their own data “online”). ATAC, PSPI, what the heck are those algorithms? The important-seeming bits:
Many offline RL algorithms pessimistically initialize the value of unknown states
“Unknown” means: “Not visited in the offline state-action distribution”
Pessimistic means those are assigned a super huge negative value (this is a bit simplified)
Because future rewards are discounted, reaching an unknown state-action pair is bad if it happens soon and less bad if it happens farther in the future
So on an all-zero reward function, a model-based RL policy will learn to stay within the state-action pairs it was demonstrated for as long as possible (“length bias”)
In the case of the gridworld, this means staying on the longest demonstrated path, even if the red lava is rewarded and the yellow key is penalized.
In the case of Hopper, I’m not sure how they represented the states, but if they used non-tabular policies, this probably looks like “repeat the longest portion of demonstrated policies without falling over” (because that leads to the pessimistic penalty, and most of the data looked like walking successfully due to length bias, so that kind of data is least likely to be penalized).
On a negated reward function (which e.g. penalizes the Hopper for staying upright and rewards it for falling over), if falling over still leads to a terminal/unknown state-action, that leads to a huge negative penalty. So it’s optimal to keep hopping whenever Reward(falling over) + γ · Pessimism pen. < (1/(1−γ)) · Pen. for being upright.
For example, if the original per-timestep reward for staying upright was 1, and the original penalty for falling over was −1, then now the policy gets penalized for staying upright and rewarded for falling over! At γ=.9, it’s therefore optimal to stay upright whenever
1 + 0.9 · Pessimism < (1/(1−0.9)) · (−1) = −10, which holds whenever the pessimistic penalty is at least 12.3 in magnitude. That’s not too high, is it? (When I was in my graduate RL class, we’d initialize the penalties to −1000!)
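The arithmetic in this example can be sketched directly (a toy sketch using the numbers from the example above; the function names are my own):

```python
# Toy check of the "survival instinct" discounting argument, with the
# negated Hopper reward from the example: -1 per step for staying upright,
# +1 once for falling over, gamma = 0.9.
gamma = 0.9
upright_reward = -1.0   # per-step reward for staying upright (negated)
fall_reward = 1.0       # one-time reward for falling over (negated)

def value_of_falling(pessimism_penalty):
    # Fall now: collect the one-time reward, then land in an "unknown"
    # terminal state whose value was pessimistically initialized.
    return fall_reward + gamma * pessimism_penalty

def value_of_staying_upright():
    # Stay upright forever: geometric series of per-step penalties,
    # upright_reward / (1 - gamma) = -10.
    return upright_reward / (1 - gamma)

# Staying upright becomes optimal once falling looks even worse:
for penalty in [-1.0, -12.0, -12.3, -1000.0]:
    stays_up = value_of_falling(penalty) < value_of_staying_upright()
    print(penalty, stays_up)
```

With a penalty of −12.0 the agent still prefers to fall (1 − 10.8 = −9.8 > −10), while −12.3 flips it (1 − 11.07 = −10.07 < −10), matching the ~12.3 threshold above.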
Significance
DPO, for example, is an offline RL algorithm. It’s plausible that frontier models will be trained using that algorithm. So, these results are more relevant if future DPO variants use pessimism and if the training data (e.g. example user/AI interactions) last for more turns when they’re actually helpful for the user.
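For readers who haven’t looked at DPO: its loss is simple enough to sketch in scalar form (a minimal sketch of my own; the argument names are invented, and real implementations operate on batched log-probs from the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO trains directly on a fixed preference dataset (hence "offline"):
    # it raises the policy's log-prob on the chosen response relative to the
    # rejected one, anchored to a frozen reference model, with no separate
    # reward model or on-policy sampling.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# At initialization (policy == reference) the loss is log(2):
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

The relevant point for the pessimism discussion is that everything here is computed on the fixed dataset: the policy never generates new state-actions, so how it treats responses outside the data’s support is governed by inductive bias rather than reward.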
While it may be tempting to dismiss these results as irrelevant because “length won’t perfectly correlate with goodness so there won’t be positive bias”, I think that would be a mistake. When analyzing the performance and alignment properties of an algorithm, I think it’s important to have a clear picture of all relevant pieces of the algorithm. The influence of length bias and the support of the offline dataset are additional available levers for aligning offline RL-trained policies.
To close on a familiar note, this is yet another example of how “reward” is not the only important quantity to track in an RL algorithm. I also think it’s a mistake to dismiss results like this instantly; this offers an opportunity to reflect on what views and intuitions led to the incorrect judgment.
I can’t actually check because I only check that stuff on Mondays.
Don’t you mean future values? Also, AFAICT, the only thing going on here that separates online from offline RL is that offline RL algorithms shape the initial value function to give conservative behaviour. And so you get conservative behaviour.
One lesson you could take away from this is “pay attention to the data, not the process”—this happened because the data had longer successes than failures. If successes were more numerous than failures, many algorithms would have imitated those as well with null reward.
The paper sounds fine quality-wise to me, I just find it implausible that it’s relevant for important alignment work, since the proposed mechanism is mainly an aversion to building new capabilities.
@Jozdien sent me this paper, and I dismissed it with a cursory glance, thinking that if they had to present their results using the “safe” shortening in the context they used it, their results couldn’t amount to much. Reading your summary, the results are minorly more impressive than I was imagining, but still in the same ballpark I think. I don’t think there’s much applicability to the safety of systems though? If I’m reading you right, you don’t get guarantees for situations where the model is very out-of-distribution but still behaving competently, since it hasn’t seen tabulation sequences there.
Where the results are applicable, they definitely seem like they give mixed, probably mostly negative signals. If (say) I have a stop button, and I reward my agent for shutting itself down if I press that stop button, don’t these results say that the agent won’t shut down, for the same reasons the hopper won’t fall over, even if my reward function has rewarded it for falling over in such situation? More generally, this seems to tell us in such situations we have marginally fewer degrees of freedom with which we can modify a model’s goals than we may have thought, since the stay-on-distribution aspect dominates over the reward aspect. On the other hand, “staying on distribution” is in a sense a property we do want! Is this sort of “staying on distribution” the same kind of “staying on distribution” as that used in quantilization? I don’t think so.
More generally, it seems like whether more or less sensitivity to reward, architecture, and data on the part of the functions neural networks learn is better or worse for alignment is an open problem.
Several dozen people now presumably have Lumina in their mouths. Can we not simply crowdsource some assays of their saliva? I would chip money in to this. Key questions around ethanol levels, aldehyde levels, antibacterial levels, and whether the organism itself stays colonized at useful levels.
Surely so! Hit me up if you ever end up doing this—I’m likely getting the Lumina treatment in a couple months.
A before and after would be even better!
Any recommendations on how I should do that? You may assume that I know what a gas chromatograph is and what a Petri dish is and why you might want to use either or both of those for data collection, but not that I have any idea of how to most cost-effectively access either one as some rando who doesn’t even have a MA in Chemistry.
Lumina is incredibly cheap right now. I pre-ordered for 250 USD. Even genuinely quite poor people I know don’t find the price off-putting (poor in the sense of absolutely poor for the country they live in). I have never met a single person who decided not to try Lumina because the price was high. If they pass, it’s always because they think it’s risky.
I think Romeo is thinking of checking a bunch of mediators of risk (like aldehyde levels) as well as of function (like whether the organism stays colonised).
Maybe I’m late to the conversation, but has anyone thought through what happens when Lumina colonizes the mouths of other people? Mouth bacteria are important for things like conversion of nitrate to nitrite for nitric oxide production. How do we know the lactic acid metabolism isn’t important, or that Lumina won’t outcompete other strains important for overall health?
1. We inhabit this real material world, the one which we perceive all around us (and which somehow gives rise to perceptive and self-conscious beings like us).
2. Though not all of our perceptions conform to a real material world. We may be fooled by things like illusions or hallucinations or dreams that mimic perceptions of this world but are actually all in our minds.
3. Indeed if you examine your perceptions closely, you’ll see that none of them actually give you representations of the material world, but merely reactions to it.
4. In fact, since the only evidence we have is of perceptions, the “material world” is more of a metaphysical hypothesis we use to explain patterns in our perceptions, not something we can vouch for as actually existing.
5. Since this hypothesis is untestable, it is best put aside when we consider what actually exists. The “material world” is not a thing, but a framework and vocabulary useful for discussing regularities in what is really real.
6. What is really “real”—what the word “real” means—is our moment to moment perceptions and interpretations, which appear to us in the form of a material world that we inhabit.
GOTO 1
How to best break out of this loop?
In (3), the word “merely” is doing a lot of unexamined work. My perceptions have a coherence to them, an obstinate coherence that I cannot wish away. I reach out for the coffee cup that I see, and it shows up to my sense of touch. What does it mean to call this a “mere” response, when it maintains a persistent similarity of structure to my idea of what is out there—that is, it is a representation of it.
In (4), if the hypothesis explains the perceptions, the perceptions are evidence for the hypothesis. These are two different ways of saying the same thing.
A hypothesis that explains the perceptions can be a just-so story. For any set of perceptions ζ, there may be a vast number of hypotheses that explain those perceptions. How do you choose among them?
In other words, if f() and g() both explain ζ equally well, but are incompatible in all sorts of other ways for which you do not have perceptions to distinguish them, ζ may be “evidence for the hypothesis” f and ζ may be “evidence for the hypothesis” g, but ζ offers no help in determining whether f or g is truer. Consider e.g. f is idealism, g is realism, or some other incompatible metaphysical positions that start with our perceptions and speculate from there.
An author I read recently compared this obstinate coherence of our perceptions to a GUI. When I move my mouse pointer to a file, click, and drag that file into another folder, I’m doing something that has predictable results, and that is similar to other actions I’ve performed in the past, and that plays nicely with my intuitions about objects and motion and so forth. But it would be a mistake for me to then extrapolate from this and assume that somewhere on my hard drive or in my computer memory is a “file” which I have “dragged” “into” a “folder”. My perceptions via the interface may have consistency and practical utility, but they are not themselves a reliable guide to the actual state of the world.
Obstinate coherence and persistent similarity of structure are intriguing but they are limited in how much they can explain by themselves.
Dragging files around in a GUI is a familiar action that does known things with known consequences. Somewhere on the hard disc (or SSD, or somewhere in the cloud, etc.) there is indeed a “file” which has indeed been “moved” into a “folder”, and taking off those quotation marks only requires some background knowledge (which in fact I have) of the lower-level things that are going on and which the GUI presents to me through this visual metaphor.
Some explanations work better than others. The idea that there is stuff out there that gives rise to my perceptions, and which I can act on with predictable results, seems to me the obvious explanation that any other contender will have to do a great deal of work to topple from the plinth. The various philosophical arguments over doctrines such as “idealism”, “realism”, and so on are more like a musical recreation (see my other comment) than anything to take seriously as a search for truth. They are hardly the sort of thing that can be right or wrong, and to the extent that they are, they are all wrong.
Ok, that’s my personal view of a lot of philosophy, but I’m not the only one.
It sounds like you want to say things like “coherence and persistent similarity of structure in perceptions demonstrates that perceptions are representations of things external to the perceptions themselves” or “the idea that there is stuff out there seems the obvious explanation” or “explanations that work better than others are the best alternatives in the search for truth” and yet you also want to say “pish, philosophy is rubbish; I don’t need to defend an opinion about realism or idealism or any of that nonsense”. In fact what you’re doing isn’t some alternative to philosophy, but a variety of it.
Some philosophy is rubbish. Quite a lot, I believe. And with a statement such as “perceptions are caused by things external to the perceptions themselves”, which I find unremarkable in itself as a prima facie obvious hypothesis to run with, there is a tendency for philosophers to go off the rails immediately by inventing precise definitions of words such as “perceptions”, “are”, and “caused”, and elaborating all manner of quibbles and paradoxes. Hence the whole tedious catalogue of realisms.
Science did not get anywhere by speculating on whether there are four or five elements and arguing about their natures.
There’s a soft patch around 5 and 6. Why is testability important? It’s a characteristic of science, but science assumes an external world. It’s not a characteristic of philosophy—good explanation is enough in philosophy, and the general posit of some sort of external world does explanatory work. And it’s separate from the specific posit that the external world is knowable in some particular way.
It’s a characteristic of philosophy, too, at least according to the positivists. If you’re humoring a metaphysical theory that could not even in theory be confirmed or falsified by some possible observation, they suggest that you’re really engaging in mythmaking or poetry or something, not philosophy.
A lot of philosophy is like that. Or perhaps it is better compared to music. Music sounds meaningful, but no-one has explained what it means. Even so, much philosophy sounds meaningful, consisting of grammatical sentences with a sense of coherence, but actually meaning nothing. This is why there is no progress in philosophy, any more than there is in music. New forms can be invented and other forms can go out of fashion, but the only development is the ever-greater sprawl of the forest.
Put your phone in the oven and stand in the grass and eat some grass and see how it tastes
What loop? They are all various viewpoints on the nature of reality, not steps you have to go through in some order or anything. (1) is a more useful viewpoint than the rest, and you can adopt that one for 99%+ of everything you think about and only care about the rest as basically ideas to toy with rather than live by.
I don’t know about you (assuming you even exist in any sense other than my perception of words on a screen), but to me a model that an external reality exists beyond what I can perceive is amazingly useful for essentially everything. Even if it might not be actually true, it explains my perceptions to a degree that seems incredible if it were not even partly true. Even most of the apparent exceptions in (2) are well explained by it once your physical model includes much of how perception works.
So while (4) holds, it’s to such a powerful degree that (2) to (6) are essentially identical to (1).
On an apparent missing mood—FOMO on all the vast amounts of automated AI safety R&D that could (almost already) be produced safely
Automated AI safety R&D could result in vast amounts of work produced quickly. E.g. from Some thoughts on automating alignment research (under certain assumptions detailed in the post):
Despite this promise, we seem not to have much knowledge of when such automated AI safety R&D might happen, in which order the relevant capabilities might appear (including compared to various dangerous capabilities), etc. AFAICT, we don’t seem to be trying very hard either, at either prediction or elicitation of such capabilities.
One potential explanation I heard for this apparent missing mood of FOMO on AI-automated AI safety R&D is the idea / worry that the relevant automated AI safety R&D capabilities would only appear when models are already / too close to being dangerous. This seems plausible to me, but it would still seem like a strange way to proceed given the uncertainty.
For contrast, consider how one might go about evaluating dangerous capabilities and trying to anticipate them ahead of time. From GDM’s recent Introducing the Frontier Safety Framework:
I see no reason why, in principle, a similar high-level approach couldn’t be analogously taken for automated AI safety R&D capabilities, yet almost nobody seems to be working on this, at least publicly (in fact, I am currently working on related topics; I’m still very surprised by the overall neglectedness, made even more salient by current events).
My main vibe is:
AI R&D and AI safety R&D will almost surely come at the same time.
Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).
It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.
Seems like probably the modal scenario to me too, but even limited exceptions like the one you mention seem to me like they could be very important to deploy at scale ASAP, especially if they could be deployed using non-x-risky systems (e.g. like current ones, very bad at DC evals).
This seems good w.r.t. automated AI safety potentially ‘piggybacking’, but bad for differential progress.
Sure, though wouldn’t this suggest at least focusing hard on (measuring / eliciting) what might not come at the same time?
Why think this is important to measure or that this already isn’t happening?
E.g., on the current model organism related project I’m working on, I automate inspecting reasoning traces in various ways. But I don’t feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn’t more important than other tips for doing LLM research better).
Intuitively, I’m thinking of all this as something like a race between [capabilities enabling] safety and [capabilities enabling dangerous] capabilities (related: https://aligned.substack.com/i/139945470/targeting-ooms-superhuman-models); so from this perspective, maintaining as large a safety buffer as possible (especially if not x-risky) seems great. There could also be something like a natural endpoint to this ‘race’, corresponding to being able to automate all human-level AI safety R&D safely (and then using this to produce a scalable solution to aligning / controlling superintelligence).
W.r.t. measurement, I think it would be good orthogonally to whether auto AI safety R&D is already happening or not, similarly to how e.g. evals for automated ML R&D seem good even if automated ML R&D is already happening. In particular, the information of how successful auto AI safety R&D would be (and e.g. what the scaling curves look like vs. those for DCs) seems very strategically relevant to whether it might be feasible to deploy it at scale, when that might happen, with what risk tradeoffs, etc.
To get somewhat more concrete, the Frontier Safety Framework report already proposes a (somewhat vague) operationalization for ML R&D evals, which (vagueness-permitting) seems straightforward to translate into an operationalization for automated AI safety R&D evals:
Note to self, write a post about the novel akrasia solutions I thought up before becoming a rationalist.
Figuring out how to want to want to do things
Personalised advertising of Things I Wanted to Want to Do
What I do when all else fails
Have you tried whiteboarding-related techniques?
I think that suddenly starting to use written media (even journals), in an environment without much or any guidance, is like pressing too hard on the gas; you’re gaining incredible power and going from zero to one on things faster than you ever have before.
Depending on their environment and what they’re interested in starting out, some people might learn (or be shown) how to steer quickly, whereas others might accumulate/scaffold really lopsided optimization power and crash and burn (e.g. getting involved in tons of stuff at once that upon reflection was way too much for someone just starting out).
This seems incredibly interesting to me. Googling “White-boarding techniques” only gives me results about digitally shared idea spaces. Is this what you’re referring to? I’d love to hear more on this topic.
Maybe I could even write a sequence on this?
Unfortunately, it looks like non-disparagement clauses aren’t unheard of in general releases:
http://www.shpclaw.com/Schwartz-Resources/severance-and-release-agreements-six-6-common-traps-and-a-rhetorical-question
https://joshmcguirelaw.com/civil-litigation/adventures-in-lazy-lawyering-the-broad-general-release
Given the way the contract is worded it might be worth checking whether executing your own “general release” (without a non-disparagement agreement in it) would be sufficient, but I’m not a lawyer and maybe you need the counterparty to agree to it for it to count.
And as a matter of industry practice, this is of course an extremely non-standard requirement for retaining vested equity (or equity-like instruments), whereas it’s pretty common when receiving an additional severance package. (Though even in those cases I haven’t heard of any such non-disparagement agreement that was itself covered by a non-disclosure agreement… but would I have?)
AI labs are starting to build AIs with capabilities that are hard for humans to oversee, such as answering questions based on large contexts (1M+ tokens), but they are still not deploying “scalable oversight” techniques such as IDA and Debate. (Gemini 1.5 report says RLHF was used.) Is this more good news or bad news?
Good: Perhaps RLHF is still working well enough, meaning that the resulting AI is following human preferences even out of training distribution. In other words, they probably did RLHF on large contexts in narrow distributions, with human raters who have prior knowledge of / familiarity with the whole context, since it would be too expensive to do RLHF with humans reading diverse 1M+ contexts from scratch, but the resulting chatbot is working well even outside the training distribution. (Is it actually working well? Can someone with access to Gemini 1.5 Pro please test this?)
Bad: AI developers haven’t taken alignment seriously enough to have invested enough in scalable oversight, and/or those techniques are unworkable or too costly, causing them to be unavailable.
From a previous comment:
This seems to be evidence that RLHF does not tend to generalize well out-of-distribution, causing me to update the above “good news” interpretation downward somewhat. I’m still very uncertain though. What do others think?
Turns out at least one scalable alignment team has been struggling for resources. From Jan Leike (formerly co-head of Superalignment at OpenAI):
Even worse, apparently the whole Superalignment team has been disbanded.
Apparently Gemini 1.5 Pro isn’t working great with large contexts:
But is this due to limitations of RLHF training, or something else?
I have access to Gemini 1.5 Pro. Willing to run experiments if you provide me with an exact experiment to run, plus cover what they charge me (I’m assuming it’s paid, I haven’t used it yet).
RLHF with humans might also soon get obsoleted by things like online DPO where another chatbot produces preference data for on-policy responses of the tuned model, and there is no separate reward model in the RL sense. If generalization from labeling instructions through preference decisions works in practice, even weak-to-strong setting won’t necessarily be important, if tuning of a stronger model gets bootstrapped by a weaker model (where currently SFT from an obviously off-policy instruct dataset seems to suffice), but then the stronger model re-does the tuning of its equally strong successor that starts with the same base model (as in the self-rewarding paper), using some labeling instructions (“constitution”). So all that remains of human oversight that actually contributes to the outcome is labeling instructions written in English, and possibly some feedback on them from spot checking what’s going on as a result of choosing particular instructions.
My guess is that we’re currently effectively depending on generalization. So “Good” from your decomposition. (Though I think depending on generalization will produce big issues if the model is scheming, so I would prefer avoiding this.)
It’s plausible to me that after doing a bunch of RLHF on short contexts, RLHF on long contexts is extremely sample efficient (when well tuned) such that only (e.g.) 1,000s of samples suffice[1]. If you have a $2,000,000 budget for long context RLHF and need only 1,000 samples, you can spend $2,000 per sample. This gets you perhaps (e.g.) 10 hours of time of an experienced software engineer, which might suffice for good long context supervision without necessarily needing any fancy scalable oversight approaches. (That said, people will probably use another LLM by default when trying to determine the reward if they’re spending this long: recursive reward modeling seems almost certain by default if we’re assuming that people spend this much time labeling.)
That said, I doubt that anyone has actually started doing extremely high effort data labeling like this, though plausibly they should...
It’s some evidence, but exploiting a reward model seems somewhat orthogonal to generalization out of distribution: exploitation is heavily selected for.
(Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.)
I think experiments on sample efficiency of RLHF when generalizing to a new domain could be very important and are surprisingly underdone from my perspective (at least I’m not aware of interesting results). Even more important is sample efficiency in cases where you have a massive number of weak labels, but a limited number of high quality labels. It seems plausible to me that the final RLHF approach used will look like training the reward model on a combination of 100,000s of weak labels and just 1,000 very high quality labels. (E.g. train a head on the weak labels and then train another head to predict the difference between the weak label and the strong label.) In this case, we could spend a huge amount of time on each label. E.g., with 100 skilled employees we could spend 5 days on each label and still be done in 50 days which isn’t too bad of a delay. (If we’re fine with this labels trickling in for online training, the delay could be even smaller.)
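The two-head idea in the parenthetical above can be sketched with linear “heads” and synthetic data (a toy numpy sketch of my own; the systematic 0.5 offset between weak and strong labels is an invented stand-in for labeler bias, not something from the comment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 100,000s of cheap weak labels, 1,000 expensive strong ones.
X = rng.normal(size=(100_000, 8))            # features for weakly labeled data
true_w = rng.normal(size=8)
weak_y = X @ true_w + rng.normal(scale=1.0, size=100_000)  # noisy weak labels

X_hq = rng.normal(size=(1_000, 8))           # the 1,000 high-quality examples
hq_y = X_hq @ true_w + 0.5                   # strong labels: systematic offset

# Head 1: fit the abundant weak labels.
w_weak, *_ = np.linalg.lstsq(X, weak_y, rcond=None)

# Head 2: fit the *difference* between the strong labels and the weak head's
# predictions, using only the small high-quality set (with a bias term).
residual = hq_y - X_hq @ w_weak
A = np.hstack([X_hq, np.ones((1_000, 1))])
w_diff, *_ = np.linalg.lstsq(A, residual, rcond=None)

def predict(x):
    # Combined reward estimate: weak head plus learned correction.
    return x @ w_weak + np.hstack([x, np.ones((len(x), 1))]) @ w_diff
```

The point of the sketch: the correction head only has to learn how the strong labels differ from the weak ones, which can be a much simpler function than the reward itself, so 1,000 high-effort labels can go a long way.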
Thanks for some interesting points. Can you expand on “Separately, I expect that the quoted comment results in a misleadingly negative perception of the current situation.”? Also, your footnote seems incomplete? (It ends with “we could spend” on my browser.)
I’m skeptical that increased scale makes hacking the reward model worse. Of course, it could (and likely will/does) make hacking human labelers more of a problem, but this isn’t what the comment appears to be saying.
Note that the reward model is of the same scale as the base model, so the relative scale should be the same.
This also contradicts results from an earlier paper by Leo Gao. I think this paper is considerably more reliable than the comment overall, so I’m inclined to believe the paper or think that I’m misunderstanding the comment.
Additionally, from first principles I think that RLHF sample efficiency should just increase with scale (at least with well tuned hyperparameters) and I think I’ve heard various things that confirm this.
Oops, fixed.