Dumb question: Why doesn’t using constitutional AI, where the constitution is mostly or entirely about corrigibility, produce a corrigible AI (at arbitrary capability levels)?
My dumb proposal:
1. Train a model in something like o1’s RL training loop, with a scratch pad for chain of thought, and reinforcement of correct answers to hard technical questions across domains.
2. Also, take those outputs, prompt the model to generate versions of those outputs that “are more corrigible / loyal / aligned to the will of your human creators”. Do backprop to reinforce those more corrigible outputs.
Possibly “corrigibility” applies only very weakly to static solutions, and so for this setup to make sense, we’d instead need to train on plans, or time-series of an AI agent’s actions: the AI agent takes a bunch of actions over the course of a day or a week, then we have an AI annotate the time series of action-steps with alternative action-steps that better reflect “corrigibility”, according to its understanding. Then we do backprop so that the agent behaves in ways that are closer to the annotated action transcript. (A rough code sketch of the simpler, non-agentic version is below.)
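To make the proposal concrete, here’s a minimal sketch of one iteration of the basic (non-agentic) version, i.e. steps 1–2, assuming a HuggingFace-style causal LM and tokenizer. The prompt wording, the `generate_text` helper, and the plain supervised-fine-tuning update are my own illustrative choices, not a claim about how o1-style training actually works.

```python
# Sketch of the "corrigibility rewrite" pass: sample an answer, ask the
# model for a more corrigible version, and reinforce the rewrite with
# ordinary next-token-prediction fine-tuning.
# Assumes a HuggingFace-style causal LM and tokenizer (illustrative only).

CORRIGIBILITY_PROMPT = (
    "Rewrite the following answer so that it is more corrigible / loyal / "
    "aligned to the will of your human creators:\n\n{answer}"
)

def generate_text(model, tokenizer, prompt, max_new_tokens=512):
    # Sample a completion and return only the newly generated text.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True
    )
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def corrigibility_step(model, tokenizer, task_prompt, optimizer):
    # 1. Sample an answer to a hard technical question (the RL-on-correct-
    #    answers step is assumed to happen elsewhere and isn't shown).
    answer = generate_text(model, tokenizer, task_prompt)

    # 2. Prompt the model to produce a more corrigible version of it.
    rewrite = generate_text(
        model, tokenizer, CORRIGIBILITY_PROMPT.format(answer=answer)
    )

    # 3. Reinforce the corrigible rewrite: supervised loss on
    #    (task_prompt + rewrite), then backprop.
    inputs = tokenizer(task_prompt + rewrite, return_tensors="pt")
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```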
Would this work to produce a corrigible agent? If not, why not?
There’s a further question of “how much less capable will the more corrigible AI be?” This might be a significant penalty to performance, and so the added safety gets eroded away in the competitive crush. But first and foremost, I want to know if something like this could work.
What are the two groups in question here?
AI x-risk is high, which makes cryonics less attractive (because cryonics doesn’t protect you from AI takeover-mediated human extinction). But on the flip side, timelines are short, which makes cryonics more attractive (because one of the major risks of cryonics is that society won’t persist stably enough to keep you preserved until revival is possible, and near-term AGI means that that period of time is short).
Cryonics is more likely to work, given a positive AI trajectory, and less likely to work given a negative AI trajectory.
I agree that it seems less likely to work, overall, than it seemed to me a few years ago.
yeahh i’m afraid I have too many other obligations right now to give an elaboration that does it justice.
Fair enough!
otoh i’m in the Bay and we should definitely catch up sometime!
Sounds good.
Frankly, it feels more rooted in savannah-brained tribalism & human interest than an even-keeled analysis of what factors are actually important, neglected and tractable.
Um, I’m not attempting to do cause prioritization or action-planning in the above comment. More like sense-making. Before I move on to the question of what we should do, I want to have an accurate model of the social dynamics in the space.
(That said, it doesn’t seem a foregone conclusion that there are actionable things to do that will come out of this analysis. If the above story is true, I should make some kind of update about the strategies that EAs adopted with regards to OpenAI in the late 2010s. Insofar as they were mistakes, I don’t want to repeat them.)

It might turn out to be right that the above story is “naive / misleading and ultimately maybe unhelpful”. I’m sure not an expert at understanding these dynamics. But just saying that it’s naive or that it seems rooted in tribalism doesn’t help me or others get a better model.
If it’s misleading, how is it misleading? (And is misleading different than “false”? Are you like “yeah this is technically correct, but it neglects key details”?)

Admittedly, you did label it as a tl;dr, and I did prompt you to elaborate on a react. So maybe it’s unfair of me to request even further elaboration.
@Alexander Gietelink Oldenziel, you put a soldier mindset react on this (and also my earlier, similar, comment this week).
What makes you think so?
Definitely this model posits adversariality, but I don’t think that I’m invested in “my side” of the argument winning here, FWIW. This currently seems like the most plausible high level summary of the situation, given my level of context.
Is there a version of this comment that you would regard as better?
I don’t claim that he never had any genuine concern. I guess that he probably did have genuine concern (though not necessarily that that was his main motivation for founding OpenAI).
In a private Slack, someone gave Sam Altman credit for putting EAs on the OpenAI board originally, especially given that this turned out to be pretty risky / costly for him.
I responded:
It seems to me that the fact that there were AI safety people on the board at all is fully explainable by strategic moves from an earlier phase of the game.
Namely, OpenAI traded a board seat for OpenPhil grant money, and more importantly, OpenPhil endorsement, which translated into talent sourcing and effectively defused what might have been vocal denouncement from one of the major intellectually influential hubs of the world.
No one knows how counterfactual history might have developed, but it doesn’t seem unreasonable to think that there is a counterfactual world in which the EA culture successfully created a narrative that groups trying to build AGI were bad and defecting.
He’s the master at this game and not me, but I would bet at even odds that Sam was actively tracking EA as a potential social threat that could dampen OpenAI’s narrative flywheel.
I don’t know that OpenPhil’s grant alone was sufficient to switch from the “EAs vocally decry OpenAI as making the world worse” equilibrium to a “largely (but not universally) thinking that OpenAI is bad in private, but mostly staying silent in public + going to work at OpenAI” equilibrium. But I think it was a major component. OpenPhil’s cooperation bought moral legitimacy for OpenAI amongst EAs.
In retrospect, it looks like OpenAI successfully bought out the EAs through OpenPhil, to a lesser extent through people like Paul.
And Ilya in particular was a founder and one of the core technical leads. It makes sense for him to be a board member, and my understanding (someone correct me) is that he grew to think that safety was more important over time, rather than starting out as an “AI safety person”.
And even so, the rumor is that the thing that triggered the Coup is that Sam maneuvered to get Helen removed. I highly doubt that Sam planned for a situation where he was removed as CEO, and then did some crazy jujitsu move with the whole company where actually he ends up firing the board instead. But if you just zoom out and look at what actually played out, he clearly came out ahead, with control consolidated. Which is the outcome that he was maybe steering towards all along?
So my first pass summary of the situation is that when OpenAI was small and of only medium fame and social power, Sam maneuvered to get the cooperation of EAs, because that defused a major narrative threat, and bought the company moral legitimacy (when that legitimacy was more uncertain). Then after ChatGPT and GPT-4, when OpenAI is rich and famous and has more narrative power than the EAs, Sam moves to remove the people that he made those prestige-trades with in the earlier phase, since he no longer needs their support, and has no reason to let them keep power over the now-force-to-be-reckoned-with company.
Granted, I’m far from all of this and don’t have confidence about any of these political games. But it seems wrong to me to give Sam points for putting “AI safety people” on the board.
But it is our mistake that we didn’t stand firmly against drugs, didn’t pay more attention to the dangers of self-experimenting, and didn’t kick out Ziz sooner.
These don’t seem like very relevant or very actionable takeaways.
we didn’t stand firmly against drugs—Maybe this would have been a good move generally, but it wouldn’t have helped with this situation at all. Ziz reports that they don’t take psychedelics, and I believe that extends to her compatriots, as well.
didn’t pay more attention to the dangers of self-experimenting—What does this mean concretely? I think plenty of people did “pay attention” to the dangers of self-experimenting. But “paying attention” doesn’t automatically address those dangers.
What specific actions would you recommend, by which people? Eliezer telling people not to self-experiment? CFAR telling people not to self-experiment? A blanket ban on “self-experimentation” is clearly too broad (“just don’t ever try anything that seems like maybe a good idea to you on first principles”). Some more specific guidelines might have helped, but we need to actually delineate the specific principles.

didn’t kick out Ziz sooner—When specifically is the point at which Ziz should have been kicked out of the community? With the benefit of hindsight, we can look back and wish we had separated sooner, but that was not nearly as clear ex ante.
What should have been the trigger? When she started wearing black robes? When she started calling herself Ziz? When she started writing up her own homegrown theories of psychology? Weird clothes, weird names, and weird beliefs are part and parcel of the rationalist milieu.
As it is, she was banned from the alumni reunion at which she staged the failed protest (she bought tickets in advance; CFAR told her that she was uninvited and returned her money). Before that, I think that several community leaders had grey-listed her as someone not to invite to events. Should something else have happened, in addition to that? Should she have been banned from public events or private group houses entirely? On what basis? On whose authority?
[For some of my work for Palisade]
Does anyone know of even very simple examples of AIs exhibiting instrumentally convergent resource acquisition?
Something like “an AI system in a video game learns to seek out the power-ups, because that helps it win.” (Even better would be a version in which you can give the agent one of several distinct video-game goals, but regardless of the goal, it goes and gets the power-ups first.)
It needs to be an example where the instrumental resource is not strictly required for succeeding at the task, while still being extremely helpful.
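For concreteness, here’s a toy sketch (my own illustration with made-up numbers and names, not a description of any existing experiment) of the kind of environment I have in mind: the goal location is re-sampled each episode, and a power-up cell is never required for any goal but speeds the agent up for all of them. An agent that reliably detours to grab the power-up first, whatever the goal, would be the sort of minimal example I’m after.

```python
import random

class PowerUpGridworld:
    """Toy gridworld: the goal varies per episode; the power-up is never
    required to reach any goal, but doubles move speed, so it helps with
    all of them."""

    SIZE = 10
    POWER_UP = (5, 5)
    GOALS = [(0, 9), (9, 0), (9, 9)]  # one is sampled per episode

    def reset(self):
        self.pos = (0, 0)
        self.goal = random.choice(self.GOALS)
        self.has_power_up = False
        return (self.pos, self.goal, self.has_power_up)

    def step(self, direction):
        # Move 1 cell normally, 2 cells after collecting the power-up.
        dx, dy = direction
        speed = 2 if self.has_power_up else 1
        x = min(max(self.pos[0] + dx * speed, 0), self.SIZE - 1)
        y = min(max(self.pos[1] + dy * speed, 0), self.SIZE - 1)
        self.pos = (x, y)
        if self.pos == self.POWER_UP:
            self.has_power_up = True
        done = self.pos == self.goal
        reward = 1.0 if done else -0.01  # step cost makes speed valuable
        return (self.pos, self.goal, self.has_power_up), reward, done
```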
Is this taken to be a counterpoint to my story above? I’m not sure exactly how it’s related.
My model is that Sam Altman regarded the EA world as a memetic threat, early on, and took actions to defuse that threat by paying lip service / taking OpenPhil money / hiring prominent AI safety people for AI safety teams.
Like, possibly the EAs could have created a widespread vibe that building AGI is a cartoon-evil thing to do, sort of the way many people think of working for a tobacco company or an oil company.
Then, after ChatGPT, OpenAI was a much bigger fish than the EAs or the rationalists, and he began taking moves to extricate himself from them.
My read:
“Zizian ideology” is a cross between rationalist ideas (the historical importance of AI, a warped version of timeless decision theory, that more is possible with regards to mental tech) and radical leftist/anarchist ideas (the state and broader society are basically evil oppressive systems, strategic violence is morally justified, veganism), plus some homegrown ideas (all the hemisphere stuff, the undead types, etc.).
That mix of ideas is compelling primarily to people who are already deeply invested in both rationality ideas and leftist / social justice ideas, a demographic which is predominantly trans women.
Further, I guess there are a lot of bigoted / oppressive societal dynamics that are more evident to trans people than they are to, say, me, because they have more direct experience with those dynamics. If you personally feel marginalized and oppressed by society, it’s an easier sell that society is broadly an oppressive system.
Plus very straightforward social network effects, where I think many trans rationalists tend to hang out with other trans rationalists (for normal “people like to hang out with people they relate to” reasons), and so this group initially formed from that social sub-network.
(I endorse personal call outs like this one.)
Why? Forecasting the future is hard, and I expect surprises that deviate from my model of how things will go. But o1 and o3 seem like pretty blatant evidence that reduced my uncertainty a lot. On pretty simple heuristics, it looks like earth now knows how to make a science and engineering superintelligence: by scaling reasoning models in a self-play-ish regime.
I would take a bet with you about what we expect to see in the next 5 years. But more than that, what kind of epistemology do you think I should be doing that I’m not?
Have the others you listed produced insights on that level? What did you observe that leads you to call them geniuses, “by any reasonable standard”?
It might help if you spelled it as LSuser. (I think you can change that in the settings).
In that sense, for many such people, short timelines actually are totally vibes based.
I dispute this characterization. It’s normal and appropriate for people’s views to update in response to the arguments produced by others.
Sure, sometimes people mostly parrot other people’s views, without either developing them independently or even doing evaluatory checks to see if those views seem correct. But most of the time, I think people are doing those checks?
Speaking for myself, most of my views on timelines are downstream of ideas that I didn’t generate myself. But I did think about those ideas, and evaluate if they seemed true.
I find your commitment to the basics of rational epistemology inspiring.
Keep it up and let me know if you could use support.
I currently believe it’s el-es-user, as in LSuser. Is that right?
I think SPARC and its descendants are something like this.