Eli Tyre

Karma: 7,518

Eli Tyre Apr 19, 2025, 5:21 PM
2 points
0
in reply to: TsviBT’s comment on: TsviBT’s Shortform
@Valentine comes to mind as a person who was raised lifeist and is now still lifeist, but I think has more complicated feelings/views about the situation related to enlightenment and metaphysics that make death an illusion, or something.

Eli Tyre Apr 18, 2025, 6:09 AM
2 points
0
in reply to: habryka’s comment on: Eli’s shortform feed
Of course the default outcome of doing finetuning on any subset of data with easy-to-predict biases will be that you aren’t shifting the inductive biases of the model on the vast majority of the distribution. This isn’t because of an analogy with evolution, it’s a necessity of how we train big transformers. In this case, the AI will likely just learn how to speak the “corrigible language” the same way it learned to speak french, and this will make approximately zero difference to any of its internal cognition, unless you are doing transformations to its internal chain of thought that substantially change its performance on actual tasks that you are trying to optimize for.
This is a pretty helpful answer.

(Though you keep referencing the AI’s chain of thought. I wasn’t imagining training over the chain of thought. I was imagining training over the AI’s outputs, whatever those are in the relevant domain.)

Eli Tyre Apr 18, 2025, 6:06 AM
2 points
0
in reply to: habryka’s comment on: Eli’s shortform feed
Would you expect that if you trained an AI system on translating its internal chain of thought into a different language, that this would make it substantially harder for it to perform tasks in the language in which it was originally trained in?
I would guess that if you finetuned a model so that it always responded in French, regardless of the languge you prompt it with, it would persistently respond in French (absent various jailbreaks which would almost definitely exist).

Eli Tyre Apr 17, 2025, 8:32 PM
2 points
0
in reply to: Kaj_Sotala’s comment on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
I’m not sure that I share that intuition, I think because my background model of humans has them as much less general than I imagine yours does.

Eli Tyre Apr 17, 2025, 7:49 PM
5 points
0
on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Fascinating and useful post.

Thank you for writing it.

Eli Tyre Apr 17, 2025, 7:23 PM
13 points
1
on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
In my experience, this is a common kind of failure with LLMs—that if asked directly about how to best a solve problem, they do know the answer. But if they aren’t given that slight scaffolding, they totally fail to apply it.
Notably, this is also true of almost all humans, at least of content that they’ve learned in school. The literature on transfer learning is pretty dismal in this respect. Almost all students will fail to apply their knowledge to new domains without very explicit prompting.

Eli Tyre Apr 17, 2025, 7:15 PM
4 points
0
on: Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
implies that they would also be inable to deal with the kind of novelty that an AGI would by definition need to deal with.
I guess this is technically true, because of the “General” in “AGI”. But I this doesn’t imply as much about how dangerous future LLM-based AI systems will be.

The first Strategically Superhuman AI systems might be importantly less general than humans, but still shockingly competent in the many specific domains on which they’ve been trained. An AI might make many basic reasoning failures in domains that are not represented in the training setup, but still be superhumanly persuasive and superhumanly strategic in comettive contexts.

This is especially the case if one of the specialized domains in which the AI is competent, is identifying domains in which it is weak, and then constructing datasets / training environments to learn those domains.

Eli Tyre Apr 15, 2025, 5:14 AM
3 points
0
in reply to: Lucius Bushnaq’s comment on: Eli’s shortform feed
For the same reasons ‘training an agent on a constitution that says to care about $x$ ’ does not, at arbitrary capability levels, produce an agent that cares about $x$
Ok, but I’m trying to ask why not.

Here’s the argument that I would make for why not, followed by why I’m skeptical of it right now.

New options for the AI will open up at high capability levels that were not available at lower capability levels. This could in principle lead to undefined behavior that deviates from what we intended.
More specifically, if it’s the case that if...
1. The best / easiest-for-SGD-to-find way to compute corrigible outputs (as evaluated by the AI) is to reinforce an internal proxy measure that is correlated with corrigibility (as evaluated by the AI) in distribution, instead of to reinforce circuits that implement corrigibility more-or-less directly.
2. When the AI gains new options unlocked by new advanced capabilities, that proxy measure comes apart from corrigibility (as evaluated by the AI), in the limit of capabilities, so that the poxy measure is almost uncorrelated with corrigibility
...then the resulting system will not end up corrigible.

(Is this the argument that you would give, or is there another reason why you expect that “training an agent on a constitution that says to care about $x$ ′ does not, at arbitrary capability levels, produce an agent that cares about $x$ ”?)

But, at the moment, I’m skeptical of the above line of argument for several reasons.
- I’m skeptical of the first premise, that the best way that SGD can find to produce corrigible (as evaluated by the AI) is to reinforce a proxy measure.
  - I understand that natural selection, when shaping humans for inclusive genetic fitness, instilled in them a bunch of proxy-drives. But I think this analogy is misleading in several ways.
  - Most relevantly, there’s a genetic bottleneck, so evolution could only shape human behavior by selecting over genomes, and genomes don’t encode that much knowledge about the world. If humans were born into the world with detailed world models, that included the concept of inclusive genetic fitness baked in, evolution would absolutely shaped humans to be inclusive fitness maximizers. AIs are “born into the world” with expansive world models that already include concepts like corrigibility (indeed, if they didn’t, Constitutional AI wouldn’t work at all). So it would be surprising if SGD opted to reinforce proxy measures instead of relying on the concepts directly.
- We would run the constitutional AI reinforcement process continuously, in parallel with the capability improvements from the RL training.
  - AI’s capabilities increase, it will gain new options. If the AI is steering based on proxy measures, some of those options will involved the proxy coming apart from the target of the proxy. But when that starts to happen, the constitutional AI loop will exert an optimization pressure on the AI’s internals to hit the target, not just the proxies.
Is this the main argument? What are other reasons to think that ‘training an agent on a constitution that says to care about $x$ ’ does not, at arbitrary capability levels, produce an agent that cares about $x$ ?

Eli Tyre Apr 14, 2025, 5:03 AM
2 points
0
in reply to: Chris_Leong’s comment on: Chris_Leong’s Shortform
Nonetheless, it does seem as though there should be at least one program that aims to find the best talent (even if they aren’t immediately useful) and which provides them with the freedom to explore and the intellectual environment in which to do so.
I think SPARC and its decedents are something like this.

Eli Tyre Apr 14, 2025, 4:54 AM
25 points
4
on: Eli’s shortform feed
Dumb question: Why doesn’t using constitutional AI, where the constitution is mostly or entirely corrigibility produce a corrigible AI (at arbitrary capability levels)?

My dumb proposal:

1. Train a model in something like o1′s RL training loop, with a scratch pad for chain of thought, and reinforcement of correct answers to hard technical questions across domains.

2. Also, take those outputs, prompt the model to generate versions of those outputs that “are more corrigible / loyal / aligned to the will of your human creators”. Do backprop to reinforce those more corrigible outputs.

Possibly “corrigibility” applies only very weakly to static solutions, and so for this setup to make sense, we’d instead need to train on plans, or time-series of an AI agent’s actions: The AI agent takes a bunch of actions over the course of a day or a week, then we have an AI annotate the time series of action-steps with alternative action-steps that better reflect “corrigibility”, according to its understanding. Then we do backprop to so that the Agent behaves more in ways that are closer to the annotated action transcript.

Would this work to produce a corrigible agent? If not, why not?

There’s a further question of “how much less capable will the more corrigible AI be?” This might be a significant penalty to performance, and so the added safety gets eroded away in the competitive crush. But first and foremost, I want to know if something like this could work.
What links here?
- A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives by Knight Lee (Apr 14, 2025, 10:27 AM; -3 points)

Eli Tyre Mar 2, 2025, 9:28 PM
8 points
16
in reply to: No77e’s comment on: No77e’s Shortform
What are the two groups in question here?

Eli Tyre Feb 18, 2025, 5:46 PM
9 points
2
in reply to: Martin Randall’s comment on: Martin Randall’s Shortform
AI x-risk is high, which makes cryonics less attractive (because cryonics doesn’t protect you from AI takeover-mediated human extinction). But on the flip side, timelines are short, which makes cryonics more attractive (because one of the major risks of cryonics is society persisting stably enough to keep you preserved until revival is possible, and near term AGI means that that period of time is short).

Cryonics is more likely to work, given a positive AI trajectory, and less likely to work given a negative AI trajectory.

I agree that it seems less likely to work, overall, than it seemed to me a few years ago.

Eli Tyre Feb 14, 2025, 2:07 AM
2 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Eli’s shortform feed
yeahh i’m afraid I have too many other obligations right now to give a elaboration that does it justice.
Fair enough!
otoh i’m in the Bay and we should definitely catch up sometime!
Sounds good.

Eli Tyre Feb 13, 2025, 8:48 PM
11 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Eli’s shortform feed
Frankly, it feels more rooted in savannah-brained tribalism & human interest than a evenkeeled analysis of what factors are actually important, neglected and tractable.

Um, I’m not attempting to do cause prioritization or action-planning in the above comment. More like sense-making. Before I move on to the question of what should we do, I want to have an accurate model of the social dynamics in the space.

(That said, it doesn’t seem a foregone conclusion that there are actionable things to do, that will come out of this analysis. If the above story is true, I should make some kind of update about the strategies that EAs adopted with regards to OpenAI in the late 2010s. Insofar as they were mistakes, I don’t want to repeat them.)
It might turn out to be right that the above story is “naive /misleading and ultimately maybe unhelpful”. I’m sure not an expert at understanding these dynamics. But just saying that it’s naive or that it seems rooted in tribalism doesn’t help me or others get a better model.

If it’s misleading, how is it misleading? (And is misleading different than “false”? Are you like “yeah this is technically correct, but it neglects key details”?)
Admittedly, you did label it as a tl;dr, and I did prompt you to elaborate on a react. So maybe it’s unfair of me to request even further elaboration.

Eli Tyre Feb 12, 2025, 4:01 AM
9 points
2
in reply to: Eli Tyre’s comment on: Eli’s shortform feed
@Alexander Gietelink Oldenziel, you put a soldier mindset react on this (and also my earlier, similar, comment this week).

What makes you think so?

Definitely this model posits that adversariality, but I don’t think that I’m invested in “my side” of the argument winning here, FWTIW. This currently seems like the most plausible high level summary of the situation, given my level of context.

Is there a version of this comment that would regard as better?

Eli Tyre Feb 12, 2025, 3:49 AM
2 points
0
in reply to: RobertM’s comment on: sjadler’s Shortform
I don’t dispute that he never had any genuine concern. I guess that he probably did have genuine concern (though not necessarily that that was his main motivation for founding OpenAI).

Eli Tyre Feb 10, 2025, 6:41 PM
67 points
35
on: Eli’s shortform feed
In a private slack someone extended credit to Sam Altman for putting EAs on the on the OpenAI board originally, especially that this turned out to be pretty risky / costly for him.

I responded:

It seems to me that there were AI safety people on the board at all is fully explainable by strategic moves from an earlier phase of the game.

Namely, OpenAI traded a boardseat for OpenPhil grant money, and more importantly, OpenPhil endorsement, which translated into talent sourcing and effectively defused what might have been vocal denouncement from one of the major intellectually influential hubs of the world.

No one knows how counterfactual history might have developed, but it doesn’t seem unreasonable to think that there is an external world in which the EA culture successfully created a narrative that groups trying to build AGI were bad and defecting.

He’s the master at this game and not me, but I would bet at even odds that Sam was actively tracking EA as a potential social threat that could dampen OpenAI’s narrative flywheel.

I don’t know that OpenPhil’s grant alone was sufficient to switch from the “EAs vocally decry OpenAI as making the world worse” equilibrium to a “largely (but not universally) thinking that OpenAI is bad in private, but mostly staying silent in public + going to work at OpenAI” equilibrium. But I think it was a major component. OpenPhil’s cooperation bought moral legitimacy for OpenAI amongst EAs.

In retrospect, it looks like OpenAI successfully bought out the EAs through OpenPhil, to a lesser extent through people like Paul.

And Ilya in particular was a founder and one of the core technical leads. It makes sense for him to be a board member, and my understanding (someone correct me) is that he grew to think that safety was more important over time, rather than starting out as an “AI safety person”.

And even so, the rumor is that the thing that triggered the Coup is that Sam maneuvered to get Helen removed. I highly doubt that Sam planned for a situation where he was removed as CEO, and then did some crazy jujitsu move with the whole company where actually he ends up firing the board instead. But if you just zoom out and look at what actually played out, he clearly came out ahead, with control consolidated. Which is the outcome that he was maybe steering towards all along?

So my first pass summary of the situation is that when OpenAI was small and of only medium fame and social power, Sam maneuvered to get the cooperation of EAs, because that defused a major narrative threat, and bought the company moral legitimacy (when that legitimacy was more uncertain). Then after ChatGPT and GPT-4, when OpenAI is rich and famous and has more narrative power than the EAs, Sam moves to remove the people that he made those prestige-trades with in the earlier phase, since he no longer needs their support, and doesn’t have any reason for them to have power over the now-force-to-be-reckoned-with company.

Granted, I’m far from all of this and don’t have confidence about any of these political games. But it seems wrong to me to give Sam points for putting “AI safety people” on the board.

Eli Tyre Feb 10, 2025, 6:24 PM
9 points
2
in reply to: Viliam’s comment on: Thread for Sense-Making on Recent Murders and How to Sanely Respond
But it is our mistake that we didn’t stand firmly against drugs, didn’t pay more attention to the dangers of self-experimenting, and didn’t kick out Ziz sooner.
These don’t seem like very relevant or very actionable takeways.
1. we didn’t stand firmly against drugs—Maybe this would have been a good move generally, but it wouldn’t have helped with this situation at all. Ziz reports that they don’t take psychedelics, and I believe that extends to her compatriots, as well.
2. didn’t pay more attention to the dangers of self-experimenting—What does this mean concretely? I think plenty of people did “pay attention” to the dangers of self experimenting. But “paying attention” doesn’t automatically address those dangers.
  
  What specific actions would you recommend by which people? Eliezer telling people not to self experiment? CFAR telling people not to self experiment? A blanket ban on “self experimentation” is clearly too broad (“just don’t ever try anything that seems like maybe a good idea to you on first principles”). Some more specific guidelines might have helped, but we need to actually delineate the specific principles.
3. didn’t kick out Ziz sooner—When specifically is the point when Ziz should have been kicked out of the community? With the benefit of hindsight bias, we can look back and wish we had separated sooner, but that was not nearly as clear ex ante.
  
  What should have been the trigger? When she started wearing black robes? When she started calling herself Ziz? When she started writing up her own homegrown theories of psychology? Weird clothes, weird names, and weird beliefs are part and parcel of the rationalist milieu.
  
  As it is, she was banned from the alumni reunion at which she staged the failed protest (she bought tickets in advance, CFAR told her that she was uninvited, and returned her money). Before that, I think that several community leaders had grey-listed her as someone not to invite to events. Should something else have happened, in addition to that? Should she have been banned from public events or private group houses entirely? On what basis? On who’s authority?

Eli Tyre Feb 10, 2025, 6:02 PM
10 points
0
on: Eli’s shortform feed
[For some of my work for Palisade]

Does anyone know of even very simple examples of AIs exhibiting instrumentally convergent resource aquisition?

Something like “an AI system in a video game learns to seek out the power ups, because that helps it win.” (Even better would be a version in which, you can give the agent one of several distinct-video game goals, but regardless of the goal, it goes and gets the powerups first).

It needs to be an example where the instrumental resource is not strictly required for succeeding at the task, while still being extremely helpful.

Eli Tyre Feb 10, 2025, 5:57 PM
3 points
1
in reply to: RobertM’s comment on: sjadler’s Shortform
Is this taken to be a counterpoint to my story above? I’m not sure exactly how it’s related.