Hello, last time I taught a class on the basics of Bayesian epistemology. This time I will teach a class that goes a bit further. I will explain what a proper scoring rule is, and we will also do some calibration training. In particular, we will play a calibration training game called two lies, a truth, and a probability. I will do this at 7:30 in the same place as last time. Come by to check it out.
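If you want a preview of the scoring rule part: a scoring rule is proper when honestly reporting your probability maximizes your expected score. For example, under the log score you get log(p) if the event happens and log(1 − p) if it doesn’t, so if your actual credence is q your expected score is q·log(p) + (1 − q)·log(1 − p), which is maximized exactly at p = q. That property is what makes it meaningful to score people’s stated probabilities in a calibration game.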
Ronny Fernandez
Hello! Please note that I will be giving a class called the Bayesics in Eigen hall at 7:30. Heard of Bayes’s theorem but don’t fully understand what the fuss is about? Want to have an intuitive as well as formal understanding of what the Bayesian framework is? Want to learn how to do Bayesian updates in your head? Come and learn the Bayesics.
Also, please note that I will be giving a class at 7:30 after the reading group called “The Bayesics” where I will teach you the basics of intuitive Bayesian epistemology and how to do Bayesian updates irl on the fly as a human. All attending the reading group are welcome to join for that as well.
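As a small taste of the kind of on-the-fly updating I mean: work in odds. Posterior odds equal prior odds times the likelihood ratio. Say you start out at 1:4 against a claim and you run into evidence that is 8 times as likely if the claim is true as if it is false; your new odds are 8:4, i.e. 2:1 in favor, which is about 67%. That one multiplication is the whole move, and it is quick enough to do in your head.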
I think you should still write it. I’d be happy to post it instead or bet with you on whether it ends up negative karma if you let me read it first.
AN APOLOGY ON BEHALF OF FOOLS FOR THE DETAIL ORIENTED
Misfits, hooligans, and rabble rousers.
Provocateurs and folk who don’t wear trousers.
These are my allies and my constituents.
Weak in number yet suffused with arcane power.
I would never condone bullying in my administration.
It is true we are at times moved by unkind motivations.
But without us the pearl clutchers, hard asses, and busy bees would overrun you.
You would lose an inch of slack per generation.
Many among us appreciate your precision.
I admit there are also those who look upon it with derision.
Remember though that there are worse fates than being pranked.
You might instead have to watch your friend be “reeducated”, degraded, and spanked
On high broadband public broadcast television.
We’re not so different really.
We often share your indignation
With those who despise copulation.
Although our alliance might be uneasy
We both oppose the soul’s ablation.
So let us join as cats and dogs, paw in paw
You will persistently catalog
And we will joyously gnaw.
Hey, I’m just some guy but I’ve been around for a while. I want to give you a piece of feedback that I got way back in 2009, which I am worried no one has given you. In 2009 I found LessWrong, and I really liked it, but I got downvoted a lot and people were like “hey, your comments and posts kinda suck”. They said, although not in so many words, that basically I should try reading the Sequences closely with some fair amount of reverence or something.
I did that, and it basically worked, in that I think I really did internalize a lot of the values/tastes/habits that I cared about learning from LessWrong, and learned much more so how to live in accordance with them. Now I think there were some sad things about this, in that I sort of accidentally killed some parts of the animal that I am, and it made me a bit less kind in some ways to people who were very different from me, but I am overall glad I did it. So, maybe you want to try that? Totally fair if you don’t, definitely not costless, but I am glad that I did it to myself overall.
I didn’t figure out that the “bow” in “rainbow” referred to a bow as in bow and arrow, and not a bow like the one on a frilly dress, until five minutes ago. I was really pretty confused about this since I was like 8. Somebody could’ve explained, but nobody did.
I want to note for posterity that I tried to write this reading list somewhat impartially. That is, I have a lot of takes about a lot of this stuff, and I tried to include a lot of material that I disagree with but which I have found helpful in some way or other. I also included things that people I trust have found helpful even if I personally never found them helpful.
I believe there isn’t really a deadline! You just buy tickets and then you can come. The limiting factor is that tickets might sell out.
In retrospect I think the above was insufficiently cooperative. Sorry.
To be clear, I did not think we were discussing the AI optimist post. I don’t think Nate thought that. I thought we were discussing reasons I changed my mind a fair bit after talking to Quintin.
I meant the reasonable thing other people knew I meant and not the deranged thing you thought I might’ve meant.
Yeah, I’m totally with you that it definitely isn’t actually next-token prediction; it’s some totally other goal drawn from the distribution of goals you get when you run SGD to minimize next-token prediction surprise.
I suppose I’m trying to make a hypothetical AI that would frustrate any sense of “real self” and therefore disprove the claim “all LLMs have a coherent goal that is consistent across characters”. In this case, the AI could play the “benevolent sovereign” character or the “paperclip maximizer” character, so if one claimed there was a coherent underlying goal I think the best you could say about it is “it is trying to either be a benevolent sovereign or maximize paperclips”. But if your underlying goal can cross such a wide range of behaviors it is practically meaningless! (I suppose these two characters do share some goals like gaining power, but we could always add more modes to the AI like “immediately delete itself” which shrinks the intersection of all the characters’ goals.)
Oh I see! Yeah I think we’re thinking about this really differently. Imagine there was an agent whose goal was to make little balls move according to some really diverse and universal laws of physics; for the sake of simplicity let’s imagine Newtonian mechanics. So ok, there’s this agent that loves making these balls act as if they follow this physics. (Maybe they’re fake balls in a simulated 3D world; it doesn’t matter, as long as they don’t have to follow the physics. They only follow the physics because the agent makes them; otherwise they would do some other thing.)
Now one day we notice that we can arrange these balls in a starting condition where they emulate an agent that has the goal of taking over ball world. Another day we notice that by just barely tweaking the starting condition we can make these balls simulate an agent that wants one pint of chocolate ice cream and nothing else. So ok, does this system really have one coherent goal? Well, the two systems that the balls could simulate are really different, but the underlying intelligence making the balls act according to the physics has one coherent goal: make the balls act according to the physics.
The underlying LLM has something like a goal. It is probably something like “predict the next token as well as possible”, although definitely not actually that because of inner/outer alignment stuff. Maybe current LLMs just aren’t mind-like enough to decompose into goals and beliefs; that’s actually what I think. But some program that you found with SGD to minimize surprise on tokens totally would be mind-like enough, and its goal would be some sort of thing that you find when you use SGD to find programs that minimize surprise on token prediction, and idk, that could be like pretty much anything. But if you then made an agent by feeding this super LLM a prompt that sets it up to simulate an agent, well, that agent might have some totally different goal, and it’s gonna be totally unrelated to the goals of the underlying LLM that does the token prediction in which the other agent lives.
So the shoggoth here is the actual process that gets low loss on token prediction. Part of the reason that it is a shoggoth is that it is not the thing that does the talking. Seems like we are on board here.
The shoggoth is not an average over masks. If you want to see the shoggoth, stop looking at the text on the screen and look at the input token sequence and then the logits that the model spits out. That’s what I mean by the behavior of the shoggoth.

On the question of whether it’s really a mind, I’m not sure how to tell. I know it gets really low loss on this really weird and hard task and does it better than I do. I also know the task is fairly universal in the sense that we could represent just about any task in terms of the task it is good at. Is that an intelligence? Idk, maybe not? I’m not worried about current LLMs doing planning. It’s more like I have a human connectome and I can do one forward pass through it with an input set of nerve activations. Is that an intelligence? Idk, maybe not?
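To make that concrete, here’s a minimal sketch of the level of description I mean, assuming a Hugging Face-style causal LM (the model name and prompt are just illustrative): the shoggoth’s behavior is the map from an input token sequence to logits over the next token, not the English text that eventually gets rendered on the screen.

```python
# Illustrative sketch: the "shoggoth-level" view of an LLM is
# (input token ids) -> (logits over the next token), nothing more.
# Model name and prompt are arbitrary choices for the example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The rain in Spain falls mainly on the", return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, sequence_length, vocab_size)

# The distribution over the next token, conditioned on the whole prefix:
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([token_id.item()]):>12}  {prob.item():.3f}")
```

Nothing in that snippet is a character talking; it’s just a conditional distribution over a vocabulary of tokens, and the characters are patterns in long histories of samples from it.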
I think I don’t understand your last question. The shoggoth would be the thing that gets low loss on this really weird task where you predict sequences of characters from an alphabet with 50,000 characters that have really weird inscrutable dependencies between them. Maybe it’s not intelligent, but if it’s really good at the task then, since the task is fairly universal, I expect it to be really intelligent. I further expect it to have some sort of goals that are in some way related to predicting these tokens well.
The shoggoth is supposed to be of a different type than the characters. The shoggoth, for instance, does not speak English; it only knows tokens. There could be a shoggoth character but it would not be the real shoggoth. The shoggoth is the thing that gets low loss on the task of predicting the next token. The characters are patterns that emerge in the history of that behavior.
Yeah I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations, one of them is aligned, and they are all equally capable; then the unaligned ones will maybe be slightly slower in coming up with aligned behavior, or might have some other small disadvantage.
However, in the challenge described in the post it’s going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.
I think the advantage of the aligned superintelligence will only be slight because finding the action that maximizes utility function u is just as computationally difficult whether you yourself value u or not. It may not be equally hard for humans regardless of whether the human really values u, but I don’t expect that to generalize across all possible minds.
This inspired a full length post.
Quick submission:
The first two prongs of OAI’s approach seem to be aiming to get a training signal aligned with human values. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function, both of which I think are charitable assumptions for OAI. Even if we could search the space of all models and find one that in simulations does great on maximizing the correct utility function, which we found by using ML to amplify human evaluations of behavior, that is no guarantee that the model we find in that search is aligned. It is not even, on my current view, great evidence that the model is aligned. Most intelligent agents that know that they are being optimized for some goal will behave as if they are trying to optimize that goal if they think that is the only way to be released into physics, which they will think, because it is and they are intelligent. So P(they behave aligned | aligned, intelligent) ~= P(they behave aligned | unaligned, intelligent). P(aligned and intelligent) is very low, since most possible intelligent models are not aligned with this very particular set of values we care about. So the chances of this working out are very low.
The basic problem is that we can only select models by looking at their behavior. It is possible to fake intelligent behavior that is aligned with any particular set of values, but it is not possible to fake behavior that is intelligent. So we can select for intelligence using incentives, but cannot select for being aligned with those incentives, because it is both possible and beneficial to fake behaviors that are aligned with the incentives you are being selected for.
The third prong of OAI’s strategy seems doomed to me, but I can’t really say why in a way I think would convince anybody that doesn’t already agree. It’s totally possible that I and all the people who agree with me here are wrong about this, but you have to hope that there is some model such that that model combined with human alignment researchers is enough to solve the problem I outlined above, without the model itself being an intelligent agent that can pretend to be trying to solve the problem while secretly biding its time until it can take over the world. The above problem seems AGI-complete to me. It seems so because there are some AGIs around that cannot solve it, namely humans. Maybe you only need to add some non-AGI-complete capabilities to humans, like being able to do really hard proofs or something, but if you need more than that, and I think you will, then we have to solve the alignment problem in order to solve the alignment problem this way, and that isn’t going to work for obvious reasons.
I think the whole thing fails way before this, but I’m happy to spot OAI those failures in order to focus on the real problem. Again the real problem is that we can select for intelligent behavior, but after we select to a certain level of intelligence, we cannot select for alignment with any set of values whatsoever. Like not even one bit of selection. The likelihood ratio is one. The real problem is that we are trying to select for certain kinds of values/cognition using only selection on behavior, and that is fundamentally impossible past a certain level of capability.
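To put made-up numbers on it: suppose the prior probability that a model drawn from this search process is aligned is one in a million. If P(behaves aligned in testing | aligned, intelligent) ≈ P(behaves aligned in testing | unaligned, intelligent), then the likelihood ratio is about one, and by Bayes the posterior after watching the model pass every behavioral test is still about one in a million. All of that observed good behavior bought you essentially nothing.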
There is! It is now posted! Sorry about the delay.