One sort of answer is that we often want the posterior, and we often have the likelihood. Slightly more refined: we often find the likelihood easier to estimate than the posterior, so Bayes’ Rule is useful.
Why so?
I think one reason is that we make it the “responsibility” of hypotheses to give their likelihood functions. After all, what is a hypothesis? It’s just a probability distribution (not a probability distribution that we necessarily endorse, but one which we are considering as a candidate). As a probability distribution, its job is to make predictions; that is, to give us probabilities for possible observations. These are the likelihoods.
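For a toy illustration (my example, not anything from the original discussion): take the hypothesis “this coin lands heads with probability 0.7.” Given the observation sequence HHT, that hypothesis assigns

$$P(\text{HHT} \mid \theta = 0.7) = 0.7 \times 0.7 \times 0.3 = 0.147.$$

That number is the likelihood, and the hypothesis hands it to us just by being a probability distribution over observations.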
We want the posterior because it tells us how much faith to place in the various hypotheses; that is, whether (and to what degree) we should trust the various probability distributions we were considering.
So, in some sense, we use Bayes’ Rule because we aren’t sure how to assign probabilities, but we can come up with several candidate options.
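Spelling out the rule itself (standard notation, nothing specific to this discussion): for candidate hypotheses $H_i$ and data $D$,

$$P(H_i \mid D) = \frac{P(D \mid H_i)\,P(H_i)}{\sum_j P(D \mid H_j)\,P(H_j)}.$$

Each candidate $H_i$ supplies its own likelihood $P(D \mid H_i)$; the prior $P(H_i)$ and the resulting posterior $P(H_i \mid D)$ are where our uncertainty about which candidate to trust lives.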
One weak counterexample to this story is regression, IE, curve-fitting. We can interpret regression in a Bayesian way easily enough. However, the curves don’t come with likelihoods baked in. They only tell us how to interpolate/extrapolate with point-estimates; they don’t give a full probability distribution. We’ve got to “soften” these predictions, layering probabilities on top, in order to apply the Bayesian way of thinking.
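Here’s a rough sketch of that “softening” move in Python. The curve fit itself only returns point estimates; the Gaussian noise model layered on top (and the particular noise estimate) are assumptions I’m adding, not something the regression supplies:

```python
import numpy as np

# Toy data: a noisy line.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.shape)

# Ordinary least-squares fit: the "bare" curve, point estimates only.
slope, intercept = np.polyfit(x, y, deg=1)

def point_prediction(x_new):
    """What regression gives us directly: a single number, no probabilities."""
    return slope * x_new + intercept

# "Softening": assume Gaussian observation noise around the curve, with the
# noise scale estimated from the residuals. Now the fitted curve yields an
# actual likelihood for any observation, so Bayesian machinery can apply.
residuals = y - point_prediction(x)
sigma = residuals.std(ddof=2)  # rough noise estimate; an added assumption

def log_likelihood(x_new, y_new):
    """log P(y_new | x_new, fitted curve, Gaussian noise assumption)."""
    mean = point_prediction(x_new)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y_new - mean) ** 2 / (2 * sigma**2)

print(point_prediction(5.0))      # point estimate
print(log_likelihood(5.0, 11.0))  # a log-density, only after softening
```

Which softening to use is a further modeling choice; different noise models turn the same curve into different hypotheses.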
Is the idea to train with high beta and then use lower beta post-training?
If so, how does this relate to reward hacking and value preservation? IE, where do V1 and V2 come from, if they aren’t the result of a further training step? If high beta is used during training (to achieve beta-coherence) and then low beta is used in production, then the choice between V1 and V2 must be made in production (since it is made with low beta), but then it seems like V1=V2.
If not, then when does the proposal suggest using high beta vs. low beta? If low beta is used during training, then how is it that V is coherent with respect to high beta instead?
Another concern I have is that if both beta values are within a range that can yield useful capabilities, it seems like the difference cannot be too great. IIUC, the postulated planning failure can only manifest if the reward-hacking relies heavily on a long string of near-optimal actions, which becomes improbable under increased temperature. But any capabilities which likewise rely on long strings of near-optimal actions will be hurt in the same way. (However, this concern is secondary to my main confusion.)
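To put rough numbers on that worry, under my own reading of beta (an inverse temperature in a Boltzmann/softmax policy, pi(a) proportional to exp(beta * Q(a)); the post may intend something else):

```python
import numpy as np

def prob_best_action(q_values, beta):
    """Probability of picking the argmax action under a Boltzmann policy
    pi(a) proportional to exp(beta * Q(a)); beta is inverse temperature
    (my assumption about what beta means here)."""
    q = np.asarray(q_values, dtype=float)
    probs = np.exp(beta * (q - q.max()))  # subtract max for numerical stability
    probs /= probs.sum()
    return probs[q.argmax()]

# Toy setup: 4 actions per step, the best beating the rest by a small margin,
# and a plan that needs 20 near-optimal steps in a row. All numbers illustrative.
q_per_step = [1.0, 0.8, 0.8, 0.8]
steps = 20

for beta in (10.0, 2.0):  # stand-ins for "high" and "low" beta
    p_step = prob_best_action(q_per_step, beta)
    print(f"beta={beta:>4}: per-step p={p_step:.3f}, "
          f"P(whole {steps}-step chain)={p_step ** steps:.2e}")
```

With these made-up numbers the 20-step chain goes from roughly 10^-3 at beta=10 to roughly 10^-10 at beta=2. My point is that a gap like that hits long careful plans in general, not just reward-hacking ones, so if both betas must support that kind of planning they can’t be far apart.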
Trained with what procedure, exactly?
(These parts made sense to me modulo my other questions/concerns/confusions.)