Born too late to explore Earth; born too early to explore the galaxy; born at just the right time to save humanity.
Ulisse Mini
Thanks for the insightful response! Agree it’s just suggestive for now, though more so than with image models (where I’d expect lenses to transfer really badly, but I don’t know). Perhaps it being a residual network is the key thing: since effective path lengths are low, most of the information is “carried along” unchanged, meaning the same probe continues working for other layers. Idk
Don’t we have some evidence GPTs are doing iterative prediction updating from the logit lens and later tuned lens? Not that that’s all they’re doing of course.
Strong upvoted and agreed. I don’t think the public has opinions on AI X-Risk yet, so any attempt to elicit them will entirely depend on framing.
Strong upvoted to counter some of the downvotes.
I’ll note (because some commenters seem to miss this) that Eliezer is writing in a convincing style for a non-technical audience. Obviously the debates he would have with technical AI safety people are different from what is most useful to say to the general population.
EDIT: I think the effects were significantly worse than this and caused a ton of burnout and emotional trauma. Turns out thinking the world will end with 100% probability if you don’t save it, plus having heroic responsibility, can be a little bit tough sometimes...
I worry most people will ignore the warnings around willful inconsistency, so let me self-report that I did this and it was a bad idea. Central problem: It’s hard to rationally update off new evidence when your System 1 is utterly convinced of something. And I think this screwed with my epistemics around Shard Theory while making communication with people about x-risk much harder, since I’d often typical-mind and skip straight to the paperclipper—the extreme scenario I was (and to some extent still am) trying to avoid using as my main case.
When my rationality level is higher and my takes have solidified some more I might try this again, but right now it’s counterproductive. System 2 rationality is hard when you have to constantly correct for false System 1 beliefs!
I feel there’s often a wrong assumption in probabilistic reasoning, something like assigning moderate probabilities to everything by default. After all, if you say you’re 70/30, nobody who disagrees will ostracize you the way they would if you said 99/1.
“If alignment is easy I want to believe alignment is easy. If alignment is hard I want to believe alignment is hard. I will work to form accurate beliefs”
Petition to rename “noticing confusion” to “acting on confusion” or “acting to resolve confusion”. I find myself quite good at the former but bad at the latter—and I expect other rationalists are the same.
For example: I remember having the insight that led to lsusr’s post on how self-reference breaks the orthogonality thesis, but I never pursued the line of questioning, since it would have required sitting down and questioning my beliefs on paper for a few minutes, which was inconvenient and would have interrupted my coding.
Strongly agree. Rationalist culture is instrumentally irrational here. It’s well known how important self-belief and a growth mindset are for success, and rationalists’ obsession with natural intelligence is quite bad imo, to the point where I want to limit my interaction with the community so I don’t pick up bad patterns.
I do wonder if you’re strawmanning the advice a little; in my friend circles dropping out is seen as reasonable, though this could just be because a lot of my high-school friends already have some legible accomplishments and skills.
Each non-waluigi step increases the probability of never observing a transition to a waluigi a little bit.
Each non-Waluigi step increases the probability of never observing a transition to Waluigi a little bit, but not unboundedly so. As a toy example, we could start with P(Waluigi) = P(Luigi) = 0.5. Even if P(Luigi) monotonically increases, finding novel evidence that Luigi isn’t a deceptive Waluigi becomes progressively harder. Therefore, P(Luigi) could converge to, say, 0.8.
However, once Luigi says something Waluigi-like, we immediately jump to a world where P(Waluigi) = 0.95, since this trope is very common. To get back to Luigi, we would have to rely on a trope where a character goes from good to bad to good. These tropes exist, but they are less common. Obviously, this assumes that the context window is large enough to “remember” when Luigi turned bad. After the model forgets, we need a “bad to good” trope to get back to Luigi, and these are more common.
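To make the asymmetry concrete, here’s a minimal numerical sketch of the toy example above. The likelihood ratios are made up, chosen only so the numbers land near the 0.8 and 0.95 mentioned; nothing here comes from the original post.

```python
import math

def update(p_w, lr):
    """Bayes-update P(Waluigi) given a likelihood ratio
    lr = P(observation | Waluigi) / P(observation | Luigi)."""
    odds = p_w / (1 - p_w) * lr
    return odds / (1 + odds)

p_w = 0.5  # start with P(Waluigi) = P(Luigi) = 0.5

# Each Luigi-like step is evidence against Waluigi, but the evidence gets
# weaker over time (novel evidence that Luigi isn't a deceptive Waluigi is
# progressively harder to find), so P(Waluigi) plateaus instead of going to 0.
for t in range(100):
    lr = math.exp(-0.3 * 0.8 ** t)  # made-up, decaying evidence strength
    p_w = update(p_w, lr)
print(f"after many Luigi-like steps: P(Waluigi) ≈ {p_w:.2f}")  # ≈ 0.18

# A single Waluigi-like utterance is strong evidence for Waluigi (the trope
# is common, and a genuine Luigi rarely says it), so the posterior jumps.
p_w = update(p_w, lr=100.0)  # made-up, strong likelihood ratio
print(f"after one Waluigi-like step: P(Waluigi) ≈ {p_w:.2f}")  # ≈ 0.96
```

The point is just that evidence for Luigi gets weaker over time while a single Waluigi-like observation stays strong, so the posterior is much easier to push toward Waluigi than away from it.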
I’d be happy to talk to [redacted] and put them in touch with other smart young people. I know a lot from Atlas, ESPR and related networks. You can pass my contact info on to them.
Exercise: What mistake is the following sentiment making?
If there’s only a one in a million chance someone can save the world, then there’d better be well more than a million people trying.
Answer:
The whole framing of “having a one in a million chance of saving the world” is wrong; the real challenge is having a positive impact in the first place (for example, by not destroying the world or making things worse, e.g. via s-risks). You could think of this as a “setting the zero point” issue, though I like to think of it in terms of Bayes and Pascal’s wagers:
In terms of Bayes: You’re fixating on the expected value contributed by a single hypothesis $h$ (“my efforts save the world”) and ignoring the rest of the hypothesis space. In most cases, there are corresponding low-probability events which “cancel out” the EV contributed by $h$’s direct reasoning.
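To spell out a toy version of this (made-up numbers, just to show the shape of the argument): letting $U$ be the value of a saved world,

$$\mathbb{E}[\Delta U] \;\approx\; 10^{-6}\cdot U \;+\; 10^{-6}\cdot(-U) \;+\; \dots$$

where the first term is “my efforts save the world” and the second is “my efforts make things worse.” Fixating on the first term makes the wager look obviously worth taking; once comparable-magnitude negative terms are included, the sign of the overall expected impact is no longer obvious.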
(I will also note that, empirically, it could be argued Eliezer was massively net-negative from a capabilities-advancement perspective, given his causal links to the founding of DeepMind & OpenAI. I bring this up to point out how nontrivial having a positive impact at all is, in a domain like ours.)
Isn’t this only S-risk in the weak sense of “there’s a lot of suffering”—not the strong sense of “literally maximizing suffering”? E.g. it seems plausible to me that mistakes like “not letting someone die when they’re suffering” still give you a net-positive universe.
Also, insofar as shard theory is a good description of humans, would you say random-human-god-emperor is an S-risk? and if so, with what probability?
The enlightened have awakened from the dream and no longer mistake it for reality. Naturally, they are no longer able to attach importance to anything. To the awakened mind the end of the world is no more or less momentous than the snapping of a twig.
Looks like I’ll have to avoid enlightenment, at least until the work is done.
Take the example of the Laplace approximation. If there’s a local continuous symmetry in weight space, i.e., some direction you can walk that doesn’t affect the probability density, then your density isn’t locally Gaussian.
Haven’t finished the post, but doesn’t this assume the requirement that $p(w_1) = p(w_2)$ whenever $w_1$ and $w_2$ induce the same function? This isn’t obvious to me, e.g. under the prior induced by weight decay / L2 regularization we often have $p(w_1) \neq p(w_2)$ for weights that induce the same function.
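(For concreteness, here’s the Laplace-approximation point in my own notation; this is my gloss, not the post’s exact setup. The approximation is

$$p(w) \;\approx\; p(w^*)\,\exp\!\Big(-\tfrac{1}{2}(w - w^*)^\top H\,(w - w^*)\Big), \qquad H = -\nabla^2 \log p(w)\big|_{w=w^*}.$$

If there’s a direction $v$ along which the density is locally constant, then $Hv = 0$, the quadratic form is degenerate, and the right-hand side isn’t a normalizable Gaussian in that direction. So the argument needs the density itself, not just the induced function, to be flat along $v$, which is where the prior question above comes in.)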
Seems tangentially related to the “train a sequence of reporters” strategy for ELK. They don’t phrase it in terms of basins and path dependence, but those are a great frame to look at it with.
Personally, I think supervised learning has low path-dependence because of exact gradients plus always being able to find a direction to escape basins in high dimensions, while reinforcement learning has high path-dependence because updates influence future training data, causing attractors/equilibria (more uncertain about the latter, but that’s my feeling).
So the really out there take: We want to give the LLM influence over its future training data in order to increase path-dependence, and get the attractors we want ;)
I was more thinking along the lines of “you’re the average of the five people you spend the most time with” or something. I’m against external motivation too.
Edited
Character.ai seems to have a lot more personality than ChatGPT. I feel bad for not thanking you earlier (as I was in disbelief), but everything here is valuable safety information. Thank you for sharing, despite the potential embarrassment :)
Downvoted because I view some of the suggested strategies as counterproductive. Specifically, I’m afraid of people flailing. I’d be much more comfortable if there was a bolded paragraph saying something like the following:
To give specific examples illustrating this (which may also be worth including in the post, and/or editing the post to reflect):
I believe tweets like this are much better (and net positive) than the tweet you give as an example. Sharing anything less than the strongest argument can be actively bad, to the extent it immunizes people against the actually good reasons to be concerned.
Most forms of civil disobedience seem actively harmful to me. Activating the tribal instincts of more mainstream ML researchers, causing them to hate the alignment community, would be pretty bad in my opinion. Protesting in the streets seems fine; protesting outside OpenAI HQ does not.
Don’t have time to write more; for more info see this Twitter exchange I had with the author. Though I could share more thoughts and models, my main point is: be careful. Taking action is fine, and don’t fall into the analysis-paralysis of some rationalists, but don’t make everything worse.