| Model/Finetune | Global Mean (Cosine) Similarity |
|---|---|
| Gemma-2b/Gemma-2b-Python-codes | 0.6691 |
| Mistral-7b/Mistral-7b-MetaMath | 0.9648 |
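For context, here is a minimal sketch of one way such a number could be computed. It assumes the metric is the cosine between the two models' residual-stream activations at a chosen layer, averaged over token positions on some evaluation text; the post's actual procedure may differ, and the function name, layer index, and model IDs below are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_activation_cosine(base_name: str, tuned_name: str, text: str, layer: int = 6) -> float:
    """Mean cosine similarity between the two models' hidden states at `layer`.

    Sketch only: assumes both models share a tokenizer and architecture,
    and that 'global mean cosine' refers to activation similarity.
    """
    tok = AutoTokenizer.from_pretrained(base_name)
    base = AutoModelForCausalLM.from_pretrained(base_name)
    tuned = AutoModelForCausalLM.from_pretrained(tuned_name)

    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        h_base = base(**ids, output_hidden_states=True).hidden_states[layer]
        h_tuned = tuned(**ids, output_hidden_states=True).hidden_states[layer]

    # Cosine per token position, then averaged over the sequence.
    cos = torch.nn.functional.cosine_similarity(h_base, h_tuned, dim=-1)
    return cos.mean().item()

# Illustrative usage (model IDs are examples, not taken from the post):
# mean_activation_cosine("mistralai/Mistral-7B-v0.1", "meta-math/MetaMath-Mistral-7B",
#                        "Solve for x: 2x + 3 = 11.")
```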
As an ex-Googler, my informed guess is that Gemma 2B was distilled (or perhaps both pruned and distilled) from a larger teacher model, presumably some larger Gemini model, and that Gemma-2b and Gemma-2b-Python-codes may well have been distilled separately, as students of two similar-but-not-identical teacher models, using different teaching datasets. The fact that the global mean cosine you find here isn't ~0 suggests that, if so, the separate distillation processes were either warm-started from similar models (presumably a weak 2B model, a sensible way to save some distillation expense), or at least shared the same initial token embeddings/unembeddings.
Regardless of how they were created, these two Gemma models clearly differ pretty significantly, so I’m unsurprised by your subsequent discovery that the SAE basically doesn’t transfer between them.
For Mistral 7B, I would be astonished if any distillation was involved; I would expect just the standard combination of fine-tuning followed by either RLHF or something along the lines of DPO. In very high dimensional space, a cosine of 0.96 means "almost identical", so clearly the instruct training here consists of fairly small, targeted changes, and I'm unsurprised that as a result the SAE transfers quite well.
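To give a sense of scale (a toy illustration, not numbers from the post): two unrelated random vectors in a few thousand dimensions have cosine very close to 0, while a vector plus a modest random perturbation still has cosine around 0.96 with the original, so 0.96 really does mean "pointing almost exactly the same way".

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096  # roughly the residual-stream width of a 7B model

# Two unrelated random directions are nearly orthogonal: cosine ~ 0 +/- 1/sqrt(d).
a, b = rng.standard_normal(d), rng.standard_normal(d)
cos_random = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine of two random {d}-dim vectors: {cos_random:+.4f}")

# A relatively small perturbation barely changes the direction: cosine ~ 0.96.
b2 = a + 0.3 * rng.standard_normal(d)
cos_perturbed = a @ b2 / (np.linalg.norm(a) * np.linalg.norm(b2))
print(f"cosine after a small perturbation: {cos_perturbed:.4f}")
```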
However, if we found that the classifier was using a feature for “How smart is the human asking the question?” to decide what answer to give (as opposed to how to then phrase it), that would be a red flag.