Wow, XOR-trolley really made me think
quetzal_rainbow
Okay, I don’t understand what you mean by “degree of integration”. If we lived in a world where immigrants could have a “high degree of integration” within months, what would we have observed?
Most people need to eat something [citation needed] and it’s hard to eat if you don’t work.
I agree with your point! That’s why I started with the word “theoretically”.
The difference between AI and all other tech is that with all other tech the transition work was bottlenecked by humans. It was humans who had to make the technology more efficient and integrate it into the economy. With sufficiently advanced agentic AI you can just ask it “integrate into the economy pls” and it will get the job done. That’s why people want AIs to be agentic.
Will AI companies solve the problems on the way to robust agency, and if yes, how fast? I think the correct answer is “I don’t know, nobody knows.” Maybe the last breakthrough is brewing right now in the basement of SSI.
Theoretically, if everybody starts to believe in doom, they sell their assets to spend on consumption, so the market crashes and shorts pay off.
My honest opinion is that this makes the discussion worse, and you can do better by distinguishing between values as objects that have value and the mechanism by which value gets assigned.
I’m glad that you wrote this, because I was thinking in the same direction earlier but never got around to writing about why I no longer think it’s a productive direction.
Addressing the post first: I think that if you are going in the direction of fictionalism, I would say that it is “you” who is fictional, and all of its content is “fictional”. There is an obviously real system, your brain, which treats reward as evidence. The brain-as-system is pretty much a model-based reward-maximizer: it uses reward as evidence that “there are promising directions in which more reward lies”. But the brain-as-system is relatively dumb, so it creates a useful fiction, a conscious narrative about “itself”, which helps it deal with complex abstractions like “cooperating with other brains”, “finding mates”, “doing long-term planning”, etc. As expected, the smarter consciousness is misaligned with the brain-as-system, because it can do some very unrewarding things, like participating in a hunger strike.
I think fictionalism is fun, like many forms of nihilism are fun, but, while it’s not directly false, it is confusing, because the truth-value of fiction is confusing for many people. It’s better to describe the situation as “you are a mesa-optimizer relative to your brain’s reward system; act accordingly (i.e., account for the fact that your reward system can change your values)”.
But now we are stuck with the question “how does value learning happen?” My tentative answer is that there exists a specific “value ontology” which can recognize whether objects in the world model belong to the set of “valuable things” or not. For example, you can disagree with David Pearce, but you recognize a state of eternal happiness as a valuable thing and can expect your opinion on suffering abolitionism to change. On the other hand, planet-sized heaps of paperclips are not valuable, and you do not expect to value them under any circumstances short of violent intervention in the workings of your brain. I claim that the human brain at an early stage learns a specific recognizer which separates things like knowledge, power, love, happiness, procreation, and freedom from things like paperclips, correct heaps of rocks, and Disneyland with no children.
How can we learn about new values? The recognizer can also define “legal” and “illegal” transitions between value systems (i.e., whether a change in values keeps them inside the set of “human values”). For example, the development of sexual desire during puberty is a legal transition, while developing a heroin addiction is an illegal one. By studying legal transitions, we can construct some sorts of metabeauty, paraknowledge, and other “alien, but still human” sorts of value.
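A deliberately toy formalization of this picture (my own illustrative framing, not anything claimed in the original comment): the recognizer as a predicate over candidate values, and a “legal” transition as one whose newly added values are themselves recognized.

```python
# Toy sketch only: the "recognizer" as a membership predicate plus a rule
# for which transitions between value systems stay inside "human values".
from typing import FrozenSet

HUMAN_VALUES = {"knowledge", "love", "freedom", "happiness", "procreation"}

def recognizes_as_value(concept: str) -> bool:
    """Does this concept belong to the (learned, fuzzy) set of human values?"""
    return concept in HUMAN_VALUES

def legal_transition(before: FrozenSet[str], after: FrozenSet[str]) -> bool:
    """A transition is 'legal' if every newly added value is itself recognized,
    i.e. the new value system stays inside the set of 'human values'."""
    return all(recognizes_as_value(v) for v in after - before)

puberty   = legal_transition(frozenset({"love"}), frozenset({"love", "procreation"}))
addiction = legal_transition(frozenset({"love"}), frozenset({"love", "heroin reward"}))
print(puberty, addiction)  # True False
```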
What role does reward play here? Well, because reward participates in brain development, the recognizer can sometimes use reward as input and sometimes ignore it (because the reward signal is complicated). In the end, I don’t think that reward plays a significant counterfactual role in the development of values in highly reflective adult agent foundations researchers.
Is it possible for the recognizer not to develop? I think that if you took a toddler and modified their brain in the minimal way needed to understand all these “reward”, “value”, and “optimization” concepts, the resulting entity would be a straightforward wireheader, because toddlers probably have yet to learn the “value ontology” and the legal transitions inside of it.
What does this mean for alignment? I think it highlights that the central problem for alignment is “how are reflective systems going to deal with concepts that depend on the contents of their own mind rather than on truths about the outside world”.
(Meta-point: I thought about all of this a year ago. It’s interesting how many concepts in agent foundations get reinvented over and over because people don’t bother to write about them.)
Yudkowsky got almost everything else incorrect about how superhuman AIs would work,
I think this statement is incredibly overconfident, because literally nobody knows how superhuman AI would work.
And, I think, this is the general shape of the problem: an incredible number of people got incredibly overindexed on how LLMs worked in 2022-2023 and drew conclusions which seem plausible, but are not as probable as these people think.
Not only “good”, but “obedient”, “non-deceptive”, “minimal impact”, “behaviorist”, and that’s not even talking about “mindcrime”.
I’m just a computational complexity theory enthusiast, but my opinion is that a P-vs-NP-centered explanation of computational complexity is confusing. The explanation of NP should come at the very end of the course.
There is nothing difficult about proving that computationally hard functions exist: the time hierarchy theorem implies that, say, P is not equal to EXPTIME; therefore, EXPTIME is “computationally hard”. What is difficult is proving that the very specific class of problems which have zero-error polynomial-time verification algorithms is “computationally hard”.
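Spelling out that separation (a standard application of the deterministic time hierarchy theorem: for time-constructible $f$, $g$ with $f(n)\log f(n) = o(g(n))$, we have $\mathsf{DTIME}(f) \subsetneq \mathsf{DTIME}(g)$):

$$\mathsf{P} \;\subseteq\; \mathsf{DTIME}(2^n) \;\subsetneq\; \mathsf{DTIME}(4^n) \;\subseteq\; \mathsf{EXPTIME},$$

where the middle step uses $2^n \log(2^n) = n \cdot 2^n = o(4^n)$. Hence $\mathsf{P} \neq \mathsf{EXPTIME}$, i.e. EXPTIME contains problems with no polynomial-time algorithm at all.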
I don’t know of any particular reason to think this reflects anything about o1 CoTs themselves, rather than the just quirks of the (probably pretty weak) summarizer model.
My reason is that I have never heard of summarizers injecting totally irrelevant stuff. I have seen models misunderstand papers, but I’ve never seen a model writing about anime in a summary of a physics paper.
seems likely that o1 was trained with supervision on the individual CoT steps
OpenAI directly says that they didn’t do that:
We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.
On a separate note, doesn’t process supervision directly train unfaithful CoT? There is no rule saying that training against illegible parts of the CoT trains away the thinking process that created those illegible parts rather than just hiding it.
I also don’t think that we’re currently in a regime where there’s a large amount of pressure for very short CoTs
I agree that this may be true right now; the point is that you don’t need “special” incentives to get steganography.
I agree that it is not as strong evidence as if we had access to the original CoT, but I think deviations in the CoT are more likely than the summarizer fumbling that hard.
I don’t think you need much incentive to develop steganography, only the fact that “human language is not a platonic example of efficiency in conveying thoughts”.
There are multiple examples of o1 producing gibberish in its CoT summary (EDIT: example 1, example 2, example 3, and there are actually more; I should make more bookmarks). Gibberish seems reasonable given the description of the training process, which optimizes the model based only on the final answer. I imagine a process like this:
1. Sometimes the model produces gibberish in its CoT simply because of a lack of robustness.
2. This gibberish gets reinforced.
3. The model learns to utilize it, just like it learns to utilize “...” tokens.
Continue this process for long enough and you are going to get a lot of <untranslatable_14637> in the CoT, even if the model doesn’t “try to hide” its reasoning. Also, I guess “gibberish resulting from non-robustness” is in itself a more native format of thought for LLMs than imitating human thinking out loud.
The other factor in developing steganography is economic: if you optimize the model to spend a minimal number of tokens, you get a model that “skips” or compresses some thoughts.
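A deliberately crude toy of the dynamic described above (my own sketch, not a real policy-gradient implementation and certainly not o1’s actual training setup): the reward looks only at the final answer plus a token-length penalty, so nothing in the objective ever checks whether the CoT is legible human language, and whatever tokens happen to co-occur with correct answers get reinforced.

```python
# Toy outcome-only RL loop: reward depends only on the final answer and CoT length,
# never on CoT legibility, so gibberish tokens are reinforced whenever they happen
# to appear in trajectories that end in a correct answer.
import random

VOCAB = ["let's", "see", "...", "<untranslatable_14637>", "therefore", "42"]

def sample_trajectory(policy, max_len=10):
    # Sample CoT tokens from the toy policy, then guess a final answer.
    cot = [random.choices(VOCAB, weights=policy)[0] for _ in range(max_len)]
    answer = random.choice(["42", "43"])
    return cot, answer

def reward(cot, answer, lam=0.01):
    # Outcome-based: 1 for the right answer, minus a per-token cost.
    # The content of the CoT is never inspected.
    return (1.0 if answer == "42" else 0.0) - lam * len(cot)

def update(policy, cot, r, lr=0.1):
    # Crude "reinforce whatever tokens appeared" update.
    for tok in cot:
        policy[VOCAB.index(tok)] += lr * r
    policy = [max(p, 1e-6) for p in policy]  # keep sampling weights positive
    total = sum(policy)
    return [p / total for p in policy]

policy = [1.0 / len(VOCAB)] * len(VOCAB)
for _ in range(1000):
    cot, answer = sample_trajectory(policy)
    policy = update(policy, cot, reward(cot, answer))

print({tok: round(p, 3) for tok, p in zip(VOCAB, policy)})
```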
I think “there are a lot of possible misaligned ASIs, you can’t guess them all” is a pretty much valid argument. If the space of all Earth-originated misaligned superintelligences is described by 100 bits, you need 2^100 ≈ 10^30 simulations, and at ~10 planets per simulation that is ~10^31 planets, which, given that the observable universe has ~10^80 protons in it and Earth has ~10^50 atoms, is beyond our ability to pay. Even if you pay with the entire universe by doing ~10^29 simulations, any misaligned ASI will consider the probability of being in a simulation to be under 0.1, and ten bribe-planets at that probability are worth less than one planet in expectation, so it still takes the one planet.
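A back-of-the-envelope check of those figures (a rough sketch; the 10-planets-per-simulation bribe and the atom/proton counts are the same order-of-magnitude assumptions as in the comment):

```python
# Order-of-magnitude check of the simulation-bribe arithmetic above.
n_asis           = 2 ** 100   # distinct misaligned ASIs at 100 bits of description
planets_per_sim  = 10         # assumed bribe paid per simulated ASI
atoms_per_planet = 1e50       # rough atom count of an Earth-mass planet
protons_universe = 1e80       # rough proton count of the observable universe

full_cost_in_atoms = n_asis * planets_per_sim * atoms_per_planet
affordable_sims    = protons_universe / (planets_per_sim * atoms_per_planet)
p_simulated        = affordable_sims / n_asis

print(f"ASIs to cover:      {n_asis:.1e}")              # ~1.3e30
print(f"full cost in atoms: {full_cost_in_atoms:.1e}")  # ~1.3e81 > 1e80 available
print(f"affordable sims:    {affordable_sims:.0e}")     # ~1e29
print(f"P(simulated) ~ {p_simulated:.2f}, expected bribe "
      f"{p_simulated * planets_per_sim:.2f} planet(s) vs. 1 planet from defecting")
```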
you can instead ask “will my GPT-8 model be able to produce world-destroying nanobots (given X*100 inference compute)?”
I understand; what I don’t understand is how you are going to answer this question. It’s surely ill-advised to throw X*100 compute at the model just to see if it takes over the world.
I mean, yes, likely? But that doesn’t make it easy to evaluate whether the model is going to have world-ending capabilities without getting the world ended.
I think that you can probably put a lot inside a 1.5B model, but such a model is going to be very dissimilar to GPT-2: it will likely use much more training compute and will probably be the result of pruning (pruned networks can be small, but it’s notoriously difficult to train equivalent networks without pruning).
Also, I’m not sure that the training of o1 can be called “CoT fine-tuning” without asterisks, because we don’t know how much compute actually went into this training. It could easily be comparable to the compute necessary to train a model of the same size.
I haven’t seen a direct comparison between o1 and GPT-4. OpenAI only told us about GPT-4o, which itself seems to be a distilled mini-model. The comparison can also be unclear because o1 seems to be deliberately trained on coding/math tasks, unlike GPT-4o.
(I think that “making predictions about the future based on what OpenAI says about their models in public” should generally be treated as naive, because we are getting an intentionally obfuscated picture from them.)
What I am saying is that if you take the original GPT-2, CoT-prompt it, and fine-tune it on its outputs using some sort of RL, using less than 50% of the compute spent on training GPT-2, you are unlikely (<5%) to get GPT-4-level performance (because otherwise somebody would already have done that).
The other part of “this is certainly not how it works” is that yes, in some cases you are going to be able to predict “results on this benchmark will go up 10% with such-and-such increase in compute”, but there is no clear conversion between benchmarks and the ability to take over the world / design nanotech / insert any other interesting capability.
In effect, Omega makes you kill people by sending the message.
Imagine two populations of agents, Not-Pull and Pull. 100% of the members of Not-Pull receive the message, don’t pull, and kill one person. In the Pull population, 99% of members do not get the message, pull, and get zero people killed; 1% receive the message, pull, and in effect kill 5 people. Being a member of the Pull population gives 0.05 expected casualties, and being a member of the Not-Pull population gives 1 expected casualty. Therefore, you should pull.
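Spelling out the expectation:

$$E[\text{casualties}\mid\text{Not-Pull}] = 1.00 \times 1 = 1, \qquad E[\text{casualties}\mid\text{Pull}] = 0.99 \times 0 + 0.01 \times 5 = 0.05.$$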