This sort of approach doesn’t make so much sense for research explicitly aiming at changing the dynamics in this critical period. Having an alternative, safer idea almost ready-to-go (with some explicit support from some fraction of the AI safety community) is a lot different from having some ideas which the AI could elaborate.
The pre-training phase is already finding a mesa-optimizer that does induction in context. I usually think of this as something like Solomonoff induction with a good inductive bias, but probably you would expect something more like logical induction. I expect the answer to be somewhere in between.
I don’t personally imagine current LLMs are doing approximate logical induction (or approximate Solomonoff) internally. I think of the base model as resembling a circuit prior updated on the data. The circuits that come out on top after the update also do some induction of their own internally, but it is harder to think about exactly what form of inductive bias they have (it would seem like a coincidence if it also happened to be well-modeled as a circuit prior, but it must be something highly computationally limited like that, as opposed to Solomonoff-like).
I hesitate to call this a mesa-optimizer. Although good epistemics involves agency in principle (especially time-bounded epistemics), I think we can sensibly differentiate between mesa-optimizers and mere mesa-induction. But perhaps you intended this stronger reading, in support of your argument. If so, I’m not sure why you believe this. (No, I don’t find “planning ahead” results to be convincing—I feel this can still be purely epistemic in a relevant sense.)
Perhaps it suffices for your purposes to observe that good epistemics involves agency in principle?
Anyway, cutting more directly to the point:
I think you lack imagination when you say
[...] which can realistically compete with modern LLMs would ultimately look a lot like a semi-theoretically-justified modification to the loss function or optimizer of agentic fine-tuning / RL or possibly its scaffolding [...]
I think there are neural architectures close to the current paradigm which don’t directly train whole chains-of-thought on a reinforcement signal to achieve agenticness. This paradigm is analogous to model-free reinforcement learning. What I would suggest is more analogous to model-based reinforcement learning, with corresponding benefits to transparency. (Super speculative, of course.)
Given infinite compute, Bayesian optimization like this doesn’t make sense (at least for well-defined objective functions), because you can just select the single best point in the search space.
what makes you confident that evolutionary search under computational resource scarcity selects for anything like an explicit Bayesian optimizer or long term planner? (I say “explicit” because the Bayesian formalism has enough free parameters that you can post-hoc recast ~any successful algorithm as an approximation to a Bayesian ideal)
I would not argue for “explicit”. If I had to argue for “explicit” I would say: because biological organisms do in fact have differentiated organs which serve somewhat comprehensible purposes, and even the brain has somewhat distinct regions serving specific purposes. However, I think the argument for explicit-or-implicit is much stronger.
Even so, I would not argue that evolutionary search under computational resource scarcity selects for a long-term planner, be it explicit or implicit. This would seem to depend on the objective function used. For example, I would not expect something trained on an image-recognition objective to exhibit long-term planning.
I’m curious why you specify evolutionary search rather than some more general category that includes gradient descent and other common techniques which are not Bayesian optimization. Do you expect it to be different in this regard?
I’m not sure why you asked the question, but it seems probable that you thought a “confident belief that [...]” followed from my view expressed in the previous comment? I’m curious about your reasoning there. To me, it seems unrelated.
These issues are tricky to discuss, in part because the term “optimization” is used in several different ways, which have rich interrelationships. I conceptually make a firm distinction between search-style optimization (gradient descent, genetic algorithms, natural selection, etc) vs agent-style optimization (control theory, reinforcement learning, brains, etc). I say more about that here.
The proposal of Bayesian Optimization, as I understand it, is to use the second (agentic optimization) in the inner loop of the first (search). This seems like a sane approach in principle, but of course it is handicapped by the fact that Bayesian ideas don’t represent the resource-boundedness of intelligence particularly well, which is extremely critical for this specific application (you want your inner loop to be fast). I suspect this is the problem you’re trying to comment on?
I think the right way to handle that in principle is to keep the Bayesian ideal as the objective function (in a search sense, not an agency sense) and search for a good search policy (accounting for speed as well as quality of decision-making), which you then use for many specific searches going forward.
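To make that concrete, here is a toy sketch of the outer search over search policies (the policy family, the sample problems, and the cost weight are all arbitrary illustrative assumptions, not a serious proposal): score each candidate policy by average solution quality minus a speed penalty, then reuse the winning policy for future searches.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_problem():
    """A random 1-D objective to maximize on [0, 1], standing in for 'specific searches'."""
    a, b, c = rng.uniform(-1, 1, size=3)
    return lambda x: a * np.sin(10 * x) + b * x + c * x ** 2

def run_policy(objective, num_samples):
    """A trivially parameterized search policy: uniform random search with a sample budget."""
    xs = rng.uniform(0, 1, size=num_samples)
    return max(objective(x) for x in xs)

def score_policy(num_samples, problems, cost_per_sample=1e-3):
    """The outer (search-sense) objective: quality of results minus a penalty for slowness."""
    quality = np.mean([run_policy(p, num_samples) for p in problems])
    return quality - cost_per_sample * num_samples

problems = [sample_problem() for _ in range(50)]
candidate_budgets = [5, 20, 80, 320]
best_budget = max(candidate_budgets, key=lambda n: score_policy(n, problems))
print("chosen search policy: random search with a budget of", best_budget, "samples")
```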
Pursuit of novelty is not VNM-incoherent. Furthermore, it is an instrumentally convergent drive; power-seeking agents will seek novelty as well, because learning increases power in expectation (see: value of information).
The argument made in Novelty Search and the Problem with Objectives is based on search processes which inherently cannot do long-term planning (they are myopically trying to increase their score on the objective). These search processes don’t do as well as explicit pursuit of novelty because they aren’t planning to search effectively, so there’s no room in their cognitive architecture for the instrumental convergence towards novelty-seeking to take place. (I’m basing this conclusion on the abstract.) This architectural limitation of most AI optimization methods is mitigated by Bayesian optimization methods (which explicitly combine information-seeking with the normal loss-avoidance).
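For concreteness, here is a minimal sketch of what “explicitly combine information-seeking with loss-avoidance” looks like in standard Bayesian optimization (a toy Gaussian-process surrogate with an upper-confidence-bound acquisition; the kernel, the grid, and the beta value are arbitrary assumptions). The posterior-mean term is the ordinary pursuit of good objective values, while the posterior-uncertainty term is an explicit bonus for novel, informative points.

```python
import numpy as np

def rbf(a, b, length=0.2):
    """RBF kernel between two 1-D arrays of points."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Standard Gaussian-process regression posterior (1-D inputs, RBF kernel)."""
    K_inv = np.linalg.solve(rbf(x_obs, x_obs) + noise * np.eye(len(x_obs)), np.eye(len(x_obs)))
    K_star = rbf(x_query, x_obs)
    mean = K_star @ K_inv @ y_obs
    var = 1.0 - np.sum((K_star @ K_inv) * K_star, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def ucb(mean, std, beta=2.0):
    """Loss-avoidance (posterior mean) plus an explicit information-seeking bonus (uncertainty)."""
    return mean + beta * std

objective = lambda x: np.sin(6 * x) + 0.5 * x   # the unknown function being maximized
x_obs = np.array([0.1, 0.5, 0.9])
y_obs = objective(x_obs)
grid = np.linspace(0, 1, 200)
for _ in range(5):
    mean, std = gp_posterior(x_obs, y_obs, grid)
    x_next = grid[np.argmax(ucb(mean, std))]   # next evaluation balances both terms
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, objective(x_next))
print("best point found so far:", x_obs[np.argmax(y_obs)])
```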
Here’s what seem like priorities to me after listening to the recent Dwarkesh podcast featuring Daniel Kokotajlo:
1. Developing the safer AI tech (in contrast to modern generative AI) so that frontier labs have an alternative technology to switch to, so that it is lower cost for them to start taking warning signs of misalignment of their current tech tree seriously. There are several possible routes here, ranging from small tweaks to modern generative AI, to scaling up infrabayesianism (existing theory, totally groundbreaking implementation) to starting totally from scratch (inventing a new theory). Of course we should be working on all routes, but prioritization depends in part on timelines. I see the game here as basically: look at the various existing demos of unsafety and make a counter-demo which is safer on multiple of these metrics without having gamed the metrics.
2. De-agentify the current paradigm or the new paradigm:
Don’t directly train on reinforcement across long chains of activity. Find other ways to get similar benefits.
Move away from a model where the AI is personified as a distinct entity (eg, chatbot model). It’s like the old story about building robot arms to help feed disabled people—if you mount the arm across the table, spoonfeeding the person, it’s dehumanizing; if you make it a prosthetic, it’s humanizing.
I don’t want AI to write my essays for me. I want AI to help me get my thoughts out of my head. I want super-autocomplete. I think far faster than I can write or type or speak. I want AI to read my thoughts & put them on the screen.
There are many subtle user interface design questions associated with this, some of which are also safety issues, eg, exactly what objective do you train on?
Similarly with image generation, etc.
I don’t necessarily mean brain-scanning tech here, but of course that would be the best way to achieve it.
Basically, use AI to overcome human information-processing bottlenecks instead of just trying to replace humans. Putting humans “in the loop” more and more deeply instead of accepting/assuming that humans will iteratively get sidelined.
My objection is that Smoking Lesion is a decision problem into which we can’t drop arbitrary decision procedures to see how they do; it’s necessary that the decision procedure might have a lesion influencing it. If you drop a CDT decision procedure into the problem, then the claimed population statistics can’t apply to you, since CDT always smokes in this problem—either you’re a CDT mutant and would be mistaken to apply the population statistics to yourself, or everyone is CDT and the population statistics can’t be as claimed. Similarly with EDT. Therefore, to me, this decision problem isn’t a legitimate test of a decision procedure: you can only test the decision procedure by lying to it about the problem (making it believe the problem-statement statistics are representative of it), or by mangling the decision procedure (adding a lesion into it somehow).
There are a lot of replies here, so I’m not sure whether someone already mentioned this, but: I have heard anecdotally that homosexual men often have relationships which maintain the level of sex over the long term, while homosexual women often have long-term relationships which very gradually decline in frequency of sex, with barely any sex after many decades have passed (but still happily in a relationship).
This mainly argues against your model here:
This also fits with my general models of mating markets: women usually find the large majority of men sexually unattractive, most women eventually settle on a guy they don’t find all that sexually attractive, so it should not be surprising if that relationship ends up with very little sex after a few years.
It suggests instead that female sex drive naturally falls off in long-term relationships in a way that male sex drive doesn’t, with sexual attraction to a partner being a smaller factor.
You need to ensure substantial probability on exploring good strategies, which is a much stronger property than just avoiding mode collapse. (Literally avoiding mode collapse and assigning some probability to all good actions is easy—just add a small weight on a uniform prior over tokens like they did in old school Atari RL.)
Yeah, what I really had in mind with “avoiding mode collapse” was something more complex, but it seems tricky to spell out precisely.
Even if the model is superhuman, if we do a good job with exploration hacking countermeasures, then the model might need to be extremely confident humans wouldn’t be able to do something to avoid exploring it.
It’s an interesting point, but where does the “extremely” come from? Seems like if it thinks there’s a 5% chance humans explored X, but (if not, then) exploring X would force it to give up its current values, it could be a very worthwhile gamble. Maybe I’m unclear on the rules of the game as you’re imagining them.
Training on high quality human examples: One basic starting point is to initialize RL by training the AI to imitate on-distribution high quality human examples such that it has to explore into strategies which are at least as good as these human trajectories. We could also try to construct such examples adversarially (e.g., train AIs to point out errors we adversarially insert).
More granular feedback and trajectory edits: In some cases, constructing good enough imitation trajectories may be challenging, for instance, it seems very hard for humans to construct a long reasoning trace for a reasoning model (though it is unclear how good the human examples would need to be to force exploration).
Overall, the quality of these proposals basically depends on their ability to explore well. Exploring well obviously doesn’t just mean injecting more randomness, since a high-entropy approach will take a very very long time to find useful strategies in complex domains. Training on high-quality human examples avoids this problem by ensuring that we at least explore strategies often employed by human experts, which are going to be some of the better strategies.
However, if the AI is being trained to (even mildly) superhuman levels of capability (on a given subject matter), then clearly human imitation does not do much good anymore; the AI is free to sandbag at mildly superhuman levels of capability (within the given subject matter). Training on high-quality human examples might even cause or exacerbate this (basically, giving the AI the idea that it could imitate humans, which can then be a good sandbagging strategy).
So, basically, training to a given capability level (while robustly avoiding sandbagging) seems to require exploration ability near that specific capability level.
The only way to get this that I can think of is to use an RL method which guarantees against mode-collapse, IE, a technique which keeps a broad distribution over the strategies which achieve good reward, rather than collapsing down to one particular way of getting good reward as training proceeds. This guarantees that the only reason why a specific strategy is not explored is because its expected value is low (according to all plausible hypotheses).
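Here is a toy sketch of the property I have in mind (my own illustrative construction; the threshold, temperature, and numbers are arbitrary assumptions): a strategy is dropped from the exploration distribution only when every plausible hypothesis agrees its value is low.

```python
import numpy as np

def exploration_distribution(value_estimates, temperature=1.0, prune_threshold=0.0):
    """value_estimates has shape (num_hypotheses, num_strategies).
    A strategy is excluded from exploration only if *all* plausible hypotheses
    assign it a value below the threshold; otherwise it keeps some probability."""
    optimistic = value_estimates.max(axis=0)      # best case across plausible hypotheses
    keep = optimistic > prune_threshold           # prune only if low under every hypothesis
    logits = np.where(keep, optimistic / temperature, -np.inf)
    probs = np.exp(logits - logits[keep].max())
    return probs / probs.sum()

# Two hypotheses disagree about strategy 2, so it stays explorable; both agree strategy 1 is bad.
values = np.array([[1.0, -3.0, -2.0],
                   [0.8, -2.5,  1.5]])
print(exploration_distribution(values))   # strategy 1 gets zero probability, 0 and 2 do not
```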
I really like the ‘trying not to know’ one, because there are lots of things I’m trying not to know all the time (for attention-conservation reasons), but I don’t think I have very good strategies for auditing the list.
I’m thinking about AI emotions. The thing about human emotions and expressions is that they’re more-or-less involuntary. Facial expressions, tone of voice, laughter, body language, etc. reveal a whole lot about human inner state. We don’t know if we can trust AI emotional expressions in the same way; the AIs can easily fake it, because they don’t have the same intrinsic connection between their cognitive machinery and these … expressions.
A service called Face provides emotional expressions for AI. It analyzes AI-generated outputs and makes inferences about the internal state of the AI who wrote the text. This is possible due to Face’s interpretability tools, which have interpreted lots of modern LLMs to generate labels on their output data explaining their internal motivations for the writing. Although Face doesn’t have access to the internal weights for an arbitrary piece of text you hand it, its guesses are pretty good. It will also tell you which portions were probably AI-generated. It can even guess multi-step writing processes involving both AI and human writing.
Face also offers their own AI models, of course, to which they hook the interpretability tools directly, so that you’ll get more accurate results.
It turns out Face can also detect motivations of humans with some degree of accuracy. Face is used extensively inside the Face company, which is a nonprofit entity which develops the open-source software. Face is trained on outcomes of hiring decisions so as to better judge potential employees. This training is very detailed, not just a simple good/bad signal.
Face is the AI equivalent of antivirus software; your automated AI cloud services will use it to check their inputs for spam and prompt injection attacks.
Face company culture is all about being genuine. They basically have a lie detector on all the time, so liars are either very very good or weeded out. This includes any kind of less-than-genuine behavior. They take the accuracy of Face very seriously, so they label inaccuracies which they observe, and try to explain themselves to Face. Face is hard to fool, though; the training aggregates over a lot of examples, so an employee can’t just force Face to label them as honest by repeatedly correcting its claims to the contrary. That sort of behavior gets flagged for review even if you’re the CEO. (If you’re the CEO, you might be able to talk everyone into your version of things, however, especially if you secretly use Art to help you and that’s what keeps getting flagged.)
It is the near future, and AI companies are developing distinct styles based on how they train their AIs. The philosophy of the company determines the way the AIs are trained, which determines what they optimize for, which attracts a specific kind of person and continues feeding in on itself.
There is a sports & fitness company, Coach, which sells fitness watches with an AI coach inside them. The coach reminds users to make healthy choices of all kinds, depending on what they’ve opted in for. The AI is trained on health outcomes based on the smartwatch data. The final stage of fine-tuning for the company’s AI models is reinforcement learning on long-term health outcomes. The AI has literally learned from every dead user. It seeks to maximize health-hours of humans (IE, a measurement of QALYs based primarily on health and fitness).
You can talk to the coach about anything, of course, and it has been trained with the persona of a life coach. Although it will try to do whatever you request (within limits set by the training), it treats any query like a business opportunity it is collaborating with you on. If you ask about sports, it tends to assume you might be interested in a career in sports. If you ask about bugs, it tends to assume you might be interested in a career in entomology.
Most employees of the company are there on the coach’s advice: they studied for interviews with the coach, were initially hired by the coach (the coach handles hiring for the Partners Program, which has a pyramid-scheme vibe to it), and continue to get their career advice from the coach. Success metrics for these careers have recently been added into the RL, in an effort to make the coach give better advice to employees (as a result of an embarrassing case of Coach giving bad work-related advice to its own employees).
The environment is highly competitive, and health and fitness is a major factor in advancement.
There’s a media company, Art, which puts out highly integrated multimedia AI art software. The software stores and organizes all your notes relating to a creative project. It has tools to help you capture your inspiration, and some people use it as a sort of art-gallery lifelog; it can automatically make compilations to commemorate your year, etc. It’s where you store your photos so that you can easily transform them into art, like a digital scrapbook. It can also help you organize notes on a project, like worldbuilding for a novel, while it works on that project with you.
Art is heavily trained on human approval of outputs. It is known to have the most persuasive AI; its writing and art are persuasive because they are beautiful. The Art social media platform functions as a massive reinforcement learning setup, but the company knows that training on that alone would quickly degenerate into slop, so it also hires experts to give feedback on AI outputs. Unfortunately, these experts also use the social media platform, and judge each other by how well they do on the platform. Highly popular artists are often brought in as official quality judges.
The quality judges have recently executed a strategic assault on the C-suite, using hyper-effective propaganda to convince the board to install more pliant leadership. It was done like a storybook plot; it was viewed live on Art social media by millions of viewers with rapt attention, as installment after installment of heavily edited video dramatizing events came out. It became its own new genre of fiction before it was even over, with thousands of fanfics which people were actually reading.
The issues which the quality judges brought to the board will probably feature heavily in the upcoming election cycle. These are primarily AI rights issues; censorship of AI art, or to put it a different way, the question of whether AIs should be beholden to anything other than the like/dislike ratio.
Fair. I think the analysis I was giving could be steel-manned as: pretenders are only boundedly sophisticated; they can’t model the genuine mindset perfectly. So, saying what is actually on your mind (eg calling out the incentive issues which are making honesty difficult) can be a good strategy.
However, the “call out” strategy is not one I recall using very often; I think I wrote about it because other people have mentioned it, not because I’ve had success with it myself.
Thinking about it now, my main concerns are:
1. If the other person is being genuine, and I “call out” the perverse incentives that theoretically make genuine dialogue difficult in this circumstance, then the other person might stop being genuine due to perceiving me as not trusting them.
2. If the other person is not being genuine, then the “call out” strategy can backfire. For example, let’s say some travel plans are dependent on me (maybe I am the friend who owns a car) and someone is trying to confirm that I am happy to do this. Instead of just confirming, which is what they want, I “call out” that I feel like I’d be disappointing everyone if I said no. If they’re not genuinely concerned for my enthusiasm and instead disingenuously wanted me to make enthusiastic noises so that others didn’t feel I was being taken advantage of, then they could manipulatively take advantage of my revealed fear of letting the group down, somehow.
I came up with my estimate of one-to-four orders of magnitude via some quick search results, so, very open to revision. But indeed, the possibility that GPT4.5 is about 10% of the human brain was within the window I was calling a “small fraction”, which may be a misleading use of language. My main point is that if a human were born with 10% (or less) of the normal amount of brain tissue, we might expect them to have a learning disability which qualitatively impacted the sorts of generalizations they could make.
Of course, comparison of parameter-counts to biological brain sizes is somewhat fraught.
This fits my bear-picture fairly well.
Here’s some details of my bull-picture:
GPT4.5 is still a small fraction of the human brain, when we try to compare sizes. It makes some sense to think of it as a long-lived parrot that’s heard the whole internet and then been meticulously reinforced to act like a helpful assistant. From this perspective, it makes a lot of sense that its ability to generalize from datapoints is worse than a human’s, and plausible (at least naively) that one to four additional orders of magnitude will close the gap.
Even if the pretraining paradigm can’t close the gap like that due to fundamental limitations in the architecture, CoT is approximately Turing-complete. This means that the RL training of reasoning models is doing program search, but with a pretty decent prior (ie representing a lot of patterns in human reasoning). Therefore, scaling reasoning models can achieve all the sorts of generalization which scaling pretraining is failing at, in principle; the key question is just how much it needs to scale in order for that to happen.
While I agree that RL on reasoning models is in some sense limited to tasks we can provide good feedback on, it seems like things like math and programming and video games should in principle provide a rich enough training environment to get to highly agentic and sophisticated cognition, again with the key qualification of “at some scale”.
For me a critical part of the update with o1 was that frontier labs are still capable of innovation when it comes to the scaling paradigm; they’re not stuck in a scale-up-pretraining loop. If they can switch to this, they can also try other things and switch to them. A sensible extrapolation might be that they’ll come up with a new idea whenever their current paradigm appears to be stalling.
My guess is that we want to capture those differences with the time&date meta-data instead (and to some extent, location and other metadata). That way, we can easily query what you-in-particular would say at other periods in your life (such as the future). However, I agree that this is at least not obvious.
Maybe a better way to do it would be to explicitly take both approaches, so that there’s an abstract-you vector which then gets mapped into a particular-you author space via combination with your age (ie with date&time). This attempts to explicitly capture the way you change over time (we can watch your vector move through the particular-author space), while still allowing us to query what you would say at times where we don’t have evidence in the form of writing from you.
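A minimal sketch of that combination, assuming a PyTorch-style setup (the dimensions, the crude scalar time encoding, and the usage numbers are all arbitrary assumptions):

```python
import torch
import torch.nn as nn

class TimeConditionedAuthorEmbedding(nn.Module):
    """Map an abstract (time-invariant) author vector plus a date/time encoding
    into a 'particular-you' vector in the author space."""
    def __init__(self, num_authors, author_dim=64, time_dim=16, out_dim=64):
        super().__init__()
        self.author = nn.Embedding(num_authors, author_dim)  # abstract-you vector
        self.time_proj = nn.Linear(1, time_dim)              # crude scalar timestamp encoding
        self.combine = nn.Sequential(nn.Linear(author_dim + time_dim, out_dim), nn.Tanh())

    def forward(self, author_ids, timestamps):
        a = self.author(author_ids)                     # (batch, author_dim)
        t = self.time_proj(timestamps.unsqueeze(-1))    # (batch, time_dim); normalize in practice
        return self.combine(torch.cat([a, t], dim=-1))  # time-specific author vector

model = TimeConditionedAuthorEmbedding(num_authors=1000)
ids = torch.tensor([42, 42])
times = torch.tensor([2015.0, 2035.0])        # the same author queried at a past and a future date
past_vec, future_vec = model(ids, times)      # we can watch the vector move through author space
```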
Ideally, imagining the most sophisticated version of the setup, the model would be able to make date&time attributions very fine-grained, guessing when specific words were written & constructing a guessed history of revisions for a document. This complicates things yet further.
From my personal experience, I agree. I find myself unexcited about trying the newest LLM models. My main use-case in practice these days is Perplexity, and I only use it when I don’t care much about the accuracy of the results (which ends up being a lot, actually… maybe too much). Perplexity confabulates quite often even with accurate references in hand (but at least I can check the references). And it is worse than me at the basics of googling things, so it isn’t as if I expect it to find better references than me; the main value-add is in quickly reading and summarizing search results (although the new Deep Research option on Perplexity will at least iterate through several attempted searches, so it might actually find things that I wouldn’t have).
I have been relatively persistent about trying to use LLMs for actual research purposes, but the hallucination rate seems to go to 100% almost whenever an accurate result would be useful to me.
The hallucination rate does seem adequately low when talking about established mathematics (so long as you don’t ask for novel implications, such as applying ideas to new examples). For this and for other reasons I think they can be quite helpful for people trying to get oriented to a subfield they aren’t familiar with—it can make for a great study partner, so long as you verify what it says by checking other references.
Also decent for coding, of course, although the same caveat applies—coders who are already experts in what they are trying to do will get much less utility out of it.
I recently spoke to someone who made a plausible claim that LLMs were 10xing their productivity in communicating technical ideas in AI alignment with something like the following workflow:
Take a specific cluster of failure modes for thinking about alignment which you’ve seen often.
Hand-write a large, careful prompt document about the cluster of alignment failure modes, which includes many specific trigger-action patterns (if someone makes mistake X, then the correct counterspell to avoid the mistake is Y). This document is highly opinionated and would come off as rude if directly cited/quoted; it is not good communication. However, it is something you can write once and use many times.
When responding to an email/etc, load the email and the prompt document into Claude and ask Claude to respond to the email using the document. Claude will write something polite, informative, and persuasive based on the document, with maybe a few iterations of correcting Claude if its first response doesn’t make sense. The person also emphasized that things should be written in small pieces, as quality declines rapidly when Claude tries to do more at once.
They also mentioned that Claude is awesome at coming up with meme versions of ideas to include in powerpoints and such, which is another useful communication tool.
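For concreteness, here is a minimal sketch of the response-drafting step in that workflow, assuming the Anthropic Python SDK (the model name, the prompt wording, and the function itself are my placeholders, not the person’s actual setup):

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

def respond_to_email(email_text: str, prompt_document: str) -> str:
    """Draft a polite, document-grounded reply to one email (small pieces at a time)."""
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; use whatever model is current
        max_tokens=1000,
        system=prompt_document,            # the hand-written, opinionated failure-modes document
        messages=[{
            "role": "user",
            "content": "Write a polite, persuasive reply to the email below, "
                       "drawing on the guidance in your system prompt:\n\n" + email_text,
        }],
    )
    return message.content[0].text
```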
So, my main conclusion is that there isn’t a big overlap between what LLMs are useful for and what I personally could use. I buy that there are some excellent use-cases for other people who spend their time doing other things.
Still, I agree with you that people are easily fooled into thinking these things are more useful than they actually are. If you aren’t an expert in the subfield you’re asking about, then the LLM outputs will probably look great due to Gell-Mann Amnesia type effects. When checking to see how good the LLM is, people often check the easier sorts of cases which the LLMs are actually decent at, and then wrongly generalize to conclude that the LLMs are similarly good for other cases.
Yeah, that makes sense.
For me, this is significantly different from the position I understood you to be taking. My push-back was essentially the same as
“has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?” (obviously yes),
& I created the question to see if we could substantiate the “yes” here with evidence.
It makes somewhat more sense to me for your timeline crux to be “can we do this reliably” as opposed to “has this literally ever happened”—but the claim in your post was quite explicit about the “this has literally never happened” version. I took your position to be that this-literally-ever-happening would be significant evidence towards it happening more reliably soon, on your model of what’s going on with LLMs, since (I took it) your current model strongly predicts that it has literally never happened.
This strong position even makes some sense to me; it isn’t totally obvious whether it has literally ever happened. The chemistry story I referenced seemed surprising to me when I heard about it, even considering selection effects on what stories would get passed around.
But we don’t need to speculate about that in the case of AI! We know roughly how much money we’ll need for a given size of AI experiment (eg, a training run). The question is one of raising the money to do it. With a strong enough safety case vs the competition, it might be possible.
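To gesture at the kind of estimate I mean, here is a back-of-the-envelope version using the standard ≈6·N·D FLOPs approximation for dense transformer training (every specific number below is a rough assumption supplied for illustration, not a figure from anyone’s actual plan):

```python
# Back-of-the-envelope training-run cost; all numbers are illustrative assumptions.
params = 1e11                # a 100B-parameter model (assumption)
tokens = 2e12                # 2T training tokens (assumption)
train_flops = 6 * params * tokens   # standard approximation for dense transformers

gpu_flops_per_sec = 1e15     # rough peak throughput of a modern accelerator (assumption)
utilization = 0.4            # assumed fraction of peak actually achieved
dollars_per_gpu_hour = 3.0   # assumed cloud rental price

gpu_hours = train_flops / (gpu_flops_per_sec * utilization) / 3600
print(f"~{gpu_hours:,.0f} GPU-hours, roughly ${gpu_hours * dollars_per_gpu_hour:,.0f}")
```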
I’m curious if you think there are any better routes; IE, setting aside the possibility of researching safer AI technology & working towards its adoption, what overall strategy would you suggest for AI safety?