918,367 kg
An average chimp is 45 kg
918,367 kg / 45 (kg / chimp)
= 20,408 chimps
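A quick check of the division (the 45 kg average chimp is the only assumed input):

```python
total_mass_kg = 918_367        # mass to convert into chimp-equivalents
avg_chimp_mass_kg = 45         # assumed average chimp mass

n_chimps = total_mass_kg / avg_chimp_mass_kg
print(round(n_chimps))         # 20408
```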
Based on my limited experience with lucid dreaming, my impression is that, roughly, whatever you expect to happen in a dream will happen. This includes things like the “lessons” in this post. As far as I know, there’s no particular reason that lucid dreams have to be easier for women or people ages 20-30, or that you can’t transform into someone else in a lucid dream (it’s happened to me). But if you convince yourself of these lessons, then they’ll be true for you, and you’ll be limiting yourself for no reason.
Your example wouldn’t be true, but “Dragons attack Paris” would be, if interpreted as a statement about actual dragons’ habits.
Yep, or use method #2 on my list to make the paraphraser remove as much information as possible
Yeah, this is actually very similar to an earlier version of incremental steering, before I thought of making the “frozen planner” jump in and finish the reasoning process.
The problem is, even if we suppose that the Shoggoth won’t be at all motivated to hide its reasoning from the Face (i.e. ignoring what @ryan_greenblatt brought up in his comment), the Shoggoth still might use steganography in order to make its thoughts more compact (or just because it’s not particularly motivated to make everything human-understandable).
Even if we don’t reward the planner directly for compactness, the reality is that we will have to choose some cutoff point during training so its thoughts don’t run literally forever, which gives it some reason to be succinct.
I figured out a workaround to edit my post—it’s up to date now
I just realized this post is actually one of my earlier drafts, and a bug is keeping me from editing the post.
The final version of my post is here: https://docs.google.com/document/d/1gHaCHhNOUBqxcXxLeTQiqRjG_y7ELkBdM2kcySKDi_Q/edit
I just wrote a response to this post listing 5 specific ways we could improve CoT faithfulness: https://www.lesswrong.com/posts/TecsCZ7w8s4e2umm4/5-ways-to-improve-cot-faithfulness
> seems likely that o1 was trained with supervision on the individual CoT steps
OpenAI directly says that they didn’t do that
This is also how I interpreted them at first, but after talking to others about it, I noticed that they only say they “cannot train any policy compliance or user preferences onto the chain of thought”. Arguably, this doesn’t eliminate the possibility that they train a PRM using human raters’ opinions about whether the chain of thought is on the right track to solve the problem, independently of whether it’s following a policy. Or even if they don’t directly use human preferences to train the PRM, they could have some other automated reward signal on the level of individual steps.
I agree that reading the CoT could be very useful, and is a very promising area of research. In fact, I think reading CoTs could be a much more surefire interpretability method than mechanistic interpretability, although the latter is also quite important.
I feel like research showing that CoTs aren’t faithful isn’t meant to say “we should throw out the CoT.” It’s more like “naively, you’d think the CoT is faithful, but look, it sometimes isn’t. We shouldn’t take the CoT at face value, and we should develop methods that ensure that it is faithful.”
Personally, what I want most out of a chain of thought is that its literal, human-interpretable meaning contains almost all the value the LLM gets out of the CoT (vs. immediately writing its answer). Secondarily, it would be nice if the CoT didn’t include a lot of distracting junk that doesn’t really help the LLM (I suspect this was largely solved by o1, since it was trained to generate helpful CoTs).
I don’t actually care much about the LLM explaining why it believes things that it can determine in a single forward pass, such as “French is spoken in Paris.” It wouldn’t be practically useful for the LLM to think these things through explicitly, and these thoughts are likely too simple to be helpful to us.
If we get to the point that LLMs can frequently make huge, accurate logical leaps in a single forward pass that humans can’t follow at all, I’d argue that at that point, we should just make our LLMs smaller and focus on improving their explicit CoT reasoning ability, for the sake of maintaining interpretability.
Yeah, you kind of have to expect from the beginning that there’s some trick, since taken literally the title can’t actually be true. So I think it’s fine
I don’t think so; just say that as prediction accuracy approaches 100%, the likelihood that the mind will use the natural abstraction increases, or something like that.
If you read the post I linked, it probably explains it better than I do—I’m just going off of my memory of the natural abstractions agenda. I think another aspect of it is that all sophisticated-enough minds will come up with the same natural abstractions, insofar as they’re natural.
In your example, you could get evidence that 0 and 1 voltages are natural abstractions in a toy setting by:

1. Training 100 neural networks to take the input voltages to a program and return the resulting output
2. Doing some mechanistic interpretability on them
3. Demonstrating that in every network, values below 2.5V are separated from values above 2.5V in some sense
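For concreteness, here’s a minimal sketch of what that toy experiment might look like. Everything specific below is my own filler rather than part of the proposal: the “program” is a NAND-style gate on two 0–5V inputs, the “mechanistic interpretability” is just a linear probe on the hidden layer, and it trains 5 networks instead of 100 to keep it quick.

```python
# Toy sketch: do independently trained nets all separate voltages
# below 2.5 V from voltages above 2.5 V in their hidden layer?
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n=2000):
    x = torch.rand(n, 2) * 5.0                      # two input voltages in [0, 5]
    y = 5.0 * (~(x > 2.5).all(dim=1)).float()       # NAND on "above 2.5 V"
    return x, y.unsqueeze(1)

def train_net():
    net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    x, y = make_data()
    for _ in range(500):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    return net

def hidden_separation(net):
    # Crude "interpretability" stand-in: can a linear probe on the hidden
    # layer recover whether the first input voltage is above 2.5 V?
    x, _ = make_data(500)
    with torch.no_grad():
        h = torch.relu(net[0](x))                   # hidden activations
    labels = (x[:, 0] > 2.5).float()
    probe = nn.Linear(16, 1)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(300):
        opt.zero_grad()
        loss = nn.functional.binary_cross_entropy_with_logits(
            probe(h).squeeze(1), labels)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = ((probe(h).squeeze(1) > 0) == labels.bool()).float().mean()
    return acc.item()

# Train a handful of networks (100 in the comment; fewer here for speed)
accs = [hidden_separation(train_net()) for _ in range(5)]
print(accs)
```

If the probe accuracy is consistently high across the independently trained networks, that’s (weak) evidence that the below-2.5V / above-2.5V split is a natural abstraction for this task.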
See “natural abstractions,” summarized here: https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1
In your example, it makes more sense to treat voltages <2.5 and >2.5 as different things, rather than <5.0 and >5.0, because the former helps you predict things about how the computer will behave. That is, those two ranges of voltage are natural abstractions.
It can also be fun to include prizes that are extremely low-commitment or obviously jokes/unlikely to ever be followed up on. Like “a place in my court when I ascend to kinghood” from Alexander Wales’ Patreon
This is great! Maybe you’d get better results if you “distill” GPT2-LN into GPT2-noLN by fine-tuning on the entire token probability distribution on OpenWebText.
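To spell out the “fine-tune on the entire token probability distribution” idea, here’s a hedged sketch of one distillation step. `teacher` (standard GPT-2, with LayerNorm) and `student` (the no-LN variant) are assumed to be already-loaded causal LMs exposing a `.logits` output, and the dataloader/optimizer setup is omitted; none of these names come from the post.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, input_ids, optimizer, temperature=1.0):
    """One step of matching the student's next-token distribution
    to the teacher's, at every position, over the whole vocabulary."""
    with torch.no_grad():
        t_logits = teacher(input_ids).logits / temperature
    s_logits = student(input_ids).logits / temperature

    # KL(teacher || student) on full distributions, rather than
    # cross-entropy on the single observed next token.
    loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.log_softmax(t_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Matching the full distribution (rather than just the sampled token) gives the student a much richer training signal per token of OpenWebText.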
Just curious, why do you spell “useful” as “usefwl?” I googled the word to see if it means something special, and all of the top hits were your comments on LessWrong or the EA Forum
If I understand correctly, you’re basically saying:

1. We can’t know how long it will take for the machine to finish its task. In fact, it might take an infinite amount of time, due to the halting problem, which says that we can’t know in advance whether a program will run forever.
2. If our machine took an infinite amount of time, it might do something catastrophic in that infinite amount of time, and we could never prove that it doesn’t.
3. Since we can’t prove that the machine won’t do something catastrophic, the alignment problem is impossible.
The halting problem doesn’t say that we can’t know whether any program will halt, just that we can’t determine the halting status of every single program. It’s easy to “prove” that a program that runs an LLM will halt. Just program it to “run the LLM until it decides to stop; but if it doesn’t stop itself after 1 million tokens, cut it off.” This is what ChatGPT or any other AI product does in practice.
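In code, that cutoff argument is just a bounded loop; `generate_next_token` and `EOS_ID` below are hypothetical stand-ins for whatever the real sampling loop uses:

```python
MAX_TOKENS = 1_000_000
EOS_ID = 0  # hypothetical end-of-sequence token id

def run_llm(prompt_ids, generate_next_token):
    output = list(prompt_ids)
    for _ in range(MAX_TOKENS):      # hard upper bound, so this always halts
        token = generate_next_token(output)
        if token == EOS_ID:          # the model decided to stop on its own
            break
        output.append(token)
    return output                    # at most MAX_TOKENS iterations, guaranteed
```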
Also, the alignment problem isn’t necessarily about proving that an AI will never do something catastrophic. It’s enough to have good informal arguments that it won’t do something bad with (say) 99.99% probability over the length of its deployment.
I believe o1-type models that are trained to effectively reason out loud may actually be better for AI safety than the alternative. However, this is conditional on their CoTs being faithful, even after optimization by RL. I believe that scenario is entirely possible, though not necessarily something that will happen by default. See “The case for CoT unfaithfulness is overstated,” my response with specific ideas for ensuring faithfulness, and Daniel Kokotajlo’s doc along similar lines.
There seems to be quite a lot of low-hanging fruit here! I’m optimistic that highly-faithful CoTs can demonstrate enough momentum and short-term benefits to win out over uninterpretable neuralese, but it could easily go the other way. I think way more people should be working on methods to make CoT faithfulness robust to optimization (and how to minimize the safety tax of such methods).