Born too late to explore Earth; born too early to explore the galaxy; born just the right time to save humanity.
Ulisse Mini
Understanding and controlling a maze-solving policy network
Each non-waluigi step increases the probability of never observing a transition to a waluigi a little bit.
Each non-Waluigi step increases the probability of never observing a transition to Waluigi a little bit, but not unboundedly so. As a toy example, we could start with P(Waluigi) = P(Luigi) = 0.5. Even if P(Luigi) monotonically increases, finding novel evidence that Luigi isn’t a deceptive Waluigi becomes progressively harder. Therefore, P(Luigi) could converge to, say, 0.8.
However, once Luigi says something Waluigi-like, we immediately jump to a world where P(Waluigi) = 0.95, since this trope is very common. To get back to Luigi, we would have to rely on a trope where a character goes from good to bad to good. These tropes exist, but they are less common. Obviously, this assumes that the context window is large enough to “remember” when Luigi turned bad. After the model forgets, we need a “bad to good” trope to get back to Luigi, and these are more common.
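Here's a minimal sketch of the toy dynamic I have in mind. All likelihood ratios below are made-up numbers for illustration, not claims about real model behaviour:

```python
# Toy sketch of the update dynamic described above. All likelihood ratios are
# made-up numbers for illustration, not claims about real model behaviour.

def update(p_luigi: float, likelihood_ratio: float) -> float:
    """Bayes-update P(Luigi) given evidence with likelihood ratio
    P(evidence | Luigi) / P(evidence | Waluigi)."""
    odds = p_luigi / (1 - p_luigi) * likelihood_ratio
    return odds / (1 + odds)

p = 0.5  # start with P(Luigi) = P(Waluigi) = 0.5
for step in range(1, 51):
    # Each additional benign step is weaker evidence than the last, so the
    # likelihood ratio decays toward 1 and P(Luigi) plateaus below 1.
    p = update(p, 1 + 0.7 / step**1.5)
print(f"P(Luigi) after 50 benign steps: {p:.2f}")   # ~0.8 with these numbers

# A single Waluigi-like utterance is strong evidence the other way
# ("secretly evil all along" is a very common trope), so P(Waluigi) jumps.
p = update(p, 1 / 80)
print(f"P(Luigi) after one Waluigi-like step: {p:.2f}")  # ~0.05, i.e. P(Waluigi) ~ 0.95
```

The point is just that diminishing evidence gives a plateau below 1, while a single strong "reveal" can move the posterior much further.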
I’d be happy to talk to [redacted] and put them in touch with other smart young people. I know a lot of them from Atlas, ESPR, and related networks. You can pass my contact info on to them.
Predictions for shard theory mechanistic interpretability results
Exercise: What mistake is the following sentiment making?
If there’s only a one in a million chance someone can save the world, then there’d better be well more than a million people trying.
Answer:
The whole “having a one in a million chance of saving the world” challenge is the wrong framing; the real challenge is having a positive impact in the first place (for example: not destroying the world or making things worse, e.g. via s-risks). You could think of this as a “setting the zero point” issue, though I like to think of it in terms of Bayes and Pascal’s wagers:
In terms of Bayes: You’re fixating on the expected value contributed by a single low-probability hypothesis and ignoring the rest of the hypothesis space. In most cases, there are corresponding low-probability events which “cancel out” the EV contributed by the hypothesis you’re directly reasoning about.
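To spell out the cancellation in symbols (my notation, a sketch rather than anything from the quoted argument):

```latex
% Sketch in my own notation (not from the original comment): total expected
% value sums over the whole hypothesis space, and the Pascal's-wager move is
% to keep only one tiny-probability, huge-payoff term.
% (Assumes amsmath is loaded for \text and \underbrace labels.)
\[
  E[V] = \sum_i P(h_i)\, V(h_i)
       = \underbrace{P(h_+)\, V(h_+)}_{\text{the term you fixate on}}
       + \underbrace{P(h_-)\, V(h_-)}_{\text{comparable probability, opposite sign}}
       + \cdots
\]
```

If P(h_-) is roughly P(h_+) and V(h_-) is roughly -V(h_+), the first two terms roughly cancel and the ordinary, higher-probability hypotheses dominate the decision.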
(I will also note that, empirically, it could be argued Eliezer was massively net-negative from a capabilities-advancement perspective, given his causal links to the founding of DeepMind and OpenAI. I bring this up to point out how nontrivial having a positive impact at all is in a domain like ours.)
[ASoT] Policy Trajectory Visualization
Isn’t this only an S-risk in the weak sense of “there’s a lot of suffering”, not the strong sense of “literally maximizing suffering”? E.g. it seems plausible to me that mistakes like “not letting someone die if they’re suffering” still give you a net-positive universe.
Also, insofar as shard theory is a good description of humans, would you say a random-human-god-emperor is an S-risk? And if so, with what probability?
The enlightened have awakened from the dream and no longer mistake it for reality. Naturally, they are no longer able to attach importance to anything. To the awakened mind the end of the world is no more or less momentous than the snapping of a twig.
Looks like I’ll have to avoid enlightenment, at least until the work is done.
Take the example of the Laplace approximation. If there’s a local continuous symmetry in weight space, i.e., some direction you can walk that doesn’t affect the probability density, then your density isn’t locally Gaussian.
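As a concrete toy illustration (my own example, not from the post): a loss with a continuous symmetry has a singular Hessian at its minima, which is exactly what breaks the Gaussian/Laplace picture.

```python
# Minimal sketch (my own toy example): a loss with a continuous symmetry has a
# degenerate Hessian, so the Laplace/Gaussian approximation of the posterior
# breaks down along the flat direction.
import numpy as np

# L(a, b) = (a*b - 1)^2 is minimized on the whole curve a*b = 1:
# rescaling (a, b) -> (t*a, b/t) leaves the loss unchanged.
def loss(w):
    a, b = w
    return (a * b - 1.0) ** 2

def hessian(w, eps=1e-4):
    """Finite-difference Hessian of the loss at w."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                       - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4 * eps**2)
    return H

H = hessian([1.0, 1.0])            # a minimum lying on the symmetric valley
print(np.linalg.eigvalsh(H))       # ~[0, 4]: one zero eigenvalue along the valley

# The Laplace approximation models exp(-L) as a Gaussian with covariance H^{-1},
# but H is singular here: the "Gaussian" is flat (non-normalizable) along the
# symmetry direction, so the density isn't locally Gaussian.
```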
Haven’t finished the post, but doesn’t this assume the requirement that P(w) = P(w′) when w and w′ induce the same function? This isn’t obvious to me, e.g. under the induced prior from weight decay / L2 regularization we often have P(w) ≠ P(w′) for weights that induce the same function.
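A minimal sketch of the kind of counterexample I have in mind (my toy example; the network and the λ value are made up): the ReLU rescaling symmetry preserves the function but not the L2 norm, so an L2-induced prior scores the two weight settings differently.

```python
# Toy illustration (my example, not from the post): two weight settings that
# implement the same function but get different probability under an
# L2/weight-decay-induced prior p(w) ∝ exp(-λ ||w||^2).
import numpy as np

def f(x, w1, w2):
    # Tiny ReLU network: f(x) = w2 * relu(w1 * x)
    return w2 * np.maximum(w1 * x, 0.0)

x = np.linspace(-2, 2, 5)
# Rescaling (w1, w2) -> (t*w1, w2/t) with t > 0 leaves the function unchanged...
print(f(x, 1.0, 1.0))      # same outputs
print(f(x, 10.0, 0.1))     # same outputs

# ...but not the L2 norm, so the two points get different prior density.
lam = 0.01  # made-up weight-decay coefficient
for w1, w2 in [(1.0, 1.0), (10.0, 0.1)]:
    log_prior = -lam * (w1**2 + w2**2)   # unnormalized log prior
    print(f"||w||^2 = {w1**2 + w2**2:6.2f}, log prior (unnorm.) = {log_prior:.3f}")
```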
Seems tangentially related to the “train a sequence of reporters” strategy for ELK. They don’t phrase it in terms of basins and path dependence, but those are a great frame to look at it with.
Personally, I think supervised learning has low path-dependence because of exact gradients plus always being able to find a direction to escape basins in high dimensions, while reinforcement learning has high path-dependence because updates influence future training data, causing attractors/equilibria (I’m more uncertain about the latter, but that’s my feeling).
So the really out-there take: We want to give the LLM influence over its future training data in order to increase path-dependence, and get the attractors we want ;)
I was more thinking along the lines of “you’re the average of the five people you spend the most time with” or something. I’m against external motivation too.
Incentives considered harmful
Edited
Character.ai seems to have a lot more personality than ChatGPT. I feel bad for not thanking you earlier (as I was in disbelief), but everything here is valuable safety information. Thank you for sharing, despite potential embarrassment :)
[Question] Where do you find people who actually do things?
That link isn’t working for me; can you send screenshots or something? When I try to load it I get an infinite loading screen.
Re(prompt ChatGPT): I’d already tried what you did and some (imo) better prompt engineering, and kept getting a character I thought was overly wordy/helpful (constantly asking me what it could do to help vs. just doing it). A better prompt engineer might be able to get something working though.
Can you give specific examples/screenshots of prompts and outputs? I know you said reading the chat logs wouldn’t be the same as experiencing it in real time, but some specific claims, like the prompt
The following is a conversation with Charlotte, an AGI designed to provide the ultimate GFE
resulting in a conversation like that, seem highly implausible.[1] At a minimum you’d need to do some prompt engineering, and even with that, some of this is implausible with ChatGPT, which typically acts very unnaturally after all the RLHF OAI did.
1. Source: I tried it, and tried some basic prompt engineering; it still resulted in bad outputs.
Interesting, I didn’t know the history; maybe I’m insufficiently pessimistic about these things. Consider my query retracted.
Strongly agree. Rationalist culture is instrumentally irrational here. It’s very well known how important self-belief and a growth mindset are for success, and rationalists’ obsession with natural intelligence is quite bad imo, to the point where I want to limit my interaction with the community so I don’t pick up bad patterns.
I do wonder if you’re strawmanning the advice a little; in my friend circles dropping out is seen as reasonable, though this could just be because a lot of my high-school friends already have some legible accomplishments and skills.