Alignment Forum recent comments

LawrenceC 4 May 2024 6:18 UTC
LW: 6 AF: 3
1
AF
in reply to: O O’s comment on: William_S’s Shortform
When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn’t.

O O 4 May 2024 0:48 UTC
LW: 12 AF: 6
0
AF
in reply to: gwern’s comment on: William_S’s Shortform
Daniel K seems pretty open about his opinions and reasons for leaving. Did he not sign an NDA and thus give up whatever PPUs he had?

ryan_greenblatt 3 May 2024 22:43 UTC
LW: 7 AF: 6
2
AF
in reply to: Matthew Barnett’s comment on: Buck’s Shortform
One operationalization is “these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs”.

As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.

With these caveats:
- The speed up is relative to the current status quo as of GPT-4.
- The speed up is ignoring the “speed up” of “having better experiments to do due to access to better models” (so e.g., they would complete a fixed research task faster).
- By “capable” of speeding things up this much, I mean that if AIs “wanted” to speed up this task and if we didn’t have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
- The 2x increase in marginal cost is ignoring potential inflation in the cost of compute (FLOP/$) and inflation in the cost of wages of ML researchers. Otherwise, I’m uncertain how exactly to model the situation. Maybe increase in wages and decrease in FLOP/$ cancel out? Idk.
- It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.
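To make the headline numbers concrete, here is a minimal arithmetic sketch. The function names and the 12-month example task are illustrative assumptions, not part of the operationalization itself; it only shows what "30x faster at under 2x marginal cost" implies for a fixed research task.

```python
def output_per_dollar_gain(speedup: float, cost_multiplier: float) -> float:
    """How much more research a team gets per unit of spending with AIs than without."""
    return speedup / cost_multiplier


def time_to_finish(baseline_months: float, speedup: float) -> float:
    """Calendar time for a fixed research task once the team is sped up."""
    return baseline_months / speedup


# Hypothetical inputs: the 30x and 2x figures from the operationalization,
# applied to a task that would take 12 months at the GPT-4-era status quo.
gain = output_per_dollar_gain(speedup=30.0, cost_multiplier=2.0)
months = time_to_finish(baseline_months=12.0, speedup=30.0)
print(f"research per dollar: {gain:.1f}x, fixed task finished in {months:.2f} months")
```

Since the cost increase is stated as an upper bound (<2x), the per-dollar gain at a full 30x speedup is a lower bound.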
I’m uncertain what the economic impact of such systems will look like. I could imagine either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven’t yet been that widely deployed due to inference availability issues, so actual production hasn’t increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal).

So, it’s hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.

Matthew Barnett 3 May 2024 22:10 UTC
LW: 6 AF: 3
0
AF
in reply to: Buck’s comment on: Buck’s Shortform
Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
Can you be clearer about this point? To operationalize this, I propose the following question: what is the fraction of world GDP you expect will be attributable to AI at the time we have these risky AIs that you are interested in?
For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?

gwern 3 May 2024 21:24 UTC
LW: 45 AF: 23
36
AF
in reply to: habryka’s comment on: William_S’s Shortform
I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees when asked whether they signed an NDA that also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).

habryka 3 May 2024 18:22 UTC
LW: 29 AF: 16
4
AF
in reply to: William_S’s comment on: William_S’s Shortform
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?
(I would guess a “no comment” or lack of response or something to that degree implies a “yes” with reasonably high probability. Also, when deciding how to respond here, you might be interested in this link about the National Labor Relations Board ruling that NDAs offered in severance agreements that cover the existence of the NDA itself are unlawful.)

William_S 3 May 2024 18:18 UTC
LW: 28 AF: 15
0
AF
in reply to: habryka’s comment on: William_S’s Shortform
No comment.

habryka 3 May 2024 18:16 UTC
LW: 14 AF: 10
3
AF
in reply to: William_S’s comment on: William_S’s Shortform
Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?

William_S 3 May 2024 18:14 UTC
LW: 103 AF: 53
7
AF
on: William_S’s Shortform
I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.

Buck 3 May 2024 16:01 UTC
LW: 6 AF: 4
2
AF
in reply to: Akash’s comment on: Buck’s Shortform
When I said “AI control is easy”, I meant “AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes”; I wasn’t trying to comment more generally. I agree with your concern.

Akash 3 May 2024 15:57 UTC
LW: 5 AF: 2
2
AF
in reply to: Buck’s comment on: Buck’s Shortform
It is pretty plausible to me that AI control is quite easy
I think it depends on how you’re defining an “AI control success”. If success is defined as “we have an early transformative system that does not instantly kill us– we are able to get some value out of it”, then I agree that this seems relatively easy under the assumptions you articulated.
If success is defined as “we have an early transformative system that does not instantly kill us and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk period”, then this seems much harder.
The classic race dynamic threat model seems relevant here: Suppose Lab A implements good control techniques on GPT-8, and then it’s trying very hard to get good alignment techniques out of GPT-8 to align a successor GPT-9. However, Lab B was only ~2 months behind, so Lab A feels like it needs to figure all of this out within 2 months. Lab B– either because it’s less cautious or because it feels like it needs to cut corners to catch up– either doesn’t want to implement the control techniques or it’s fine implementing the control techniques but it plans to be less cautious around when we’re ready to scale up to GPT-9.
I think it’s fine to say “the control agenda is valuable even if it doesn’t solve the whole problem, and yes other things will be needed to address race dynamics otherwise you will only be able to control GPT-8 for a small window of time before you are forced to scale up prematurely or hope that your competitor doesn’t cause a catastrophe.” But this has a different vibe than “AI control is quite easy”, even if that statement is technically correct.
(Also, please do point out if there’s some way in which the control agenda “solves” or circumvents this threat model– apologies if you or Ryan has written/spoken about it somewhere that I missed.)

Andrew Mack 3 May 2024 4:30 UTC
LW: 1 AF: 1
0
AF
in reply to: tailcalled’s comment on: Mechanistically Eliciting Latent Behaviors in Language Models
Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.

Andrew Mack 3 May 2024 4:25 UTC
LW: 2 AF: 2
1
AF
in reply to: Bogdan Ionut Cirstea’s comment on: Mechanistically Eliciting Latent Behaviors in Language Models
Thanks for pointing me to these references, particularly on NoiseCLR! (I was unaware of it previously). I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger data-set of prompts. In particular, skimming that paper, it looks like the numerator of equation (5) (defining their contrastive learning objective) basically captures what I meant above when I suggested “one could maximize the cosine similarity between the differences in target activations across multiple prompts”. The fact that it seems to work so well in diffusion models gives me hope that it will also work in LLMs! My guess is that ultimately you get the most mileage out of combining the two objectives.
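As a rough illustration of the quoted idea (“maximize the cosine similarity between the differences in target activations across multiple prompts”), here is a minimal sketch in plain Python. The function names and toy activations are hypothetical; this is not the NoiseCLR objective from equation (5), and not the unsupervised steering objective from the post, just the consistency score that phrase describes.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(steered, unsteered):
    """Average cosine similarity between per-prompt activation differences.

    steered[i] / unsteered[i] are target-layer activations for prompt i
    with and without the steering vector (toy stand-ins here).
    """
    diffs = [[s - u for s, u in zip(sv, uv)] for sv, uv in zip(steered, unsteered)]
    pairs = list(combinations(diffs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Toy data: the steering vector moves every prompt along the same axis,
# so the score is at its maximum of 1.
unsteered = [[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]]
steered = [[1.0, 0.0], [3.0, 1.0], [2.5, -1.0]]
print(mean_pairwise_cosine(steered, unsteered))
```

A vector scoring high here changes downstream activations in a consistent direction across prompts, which is the property one would maximize over a larger prompt set.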

ryan_greenblatt 3 May 2024 3:36 UTC
LW: 6 AF: 4
4
AF
in reply to: Raemon’s comment on: Buck’s Shortform
The claim is that most applications aren’t internal usage of AI for AI development and thus can be made trivially safe.

Not that most applications of AI for AI development can be made trivially safe.

Raemon 3 May 2024 2:14 UTC
LW: 6 AF: 3
1
AF
in reply to: Buck’s comment on: Buck’s Shortform
It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.
I didn’t get this from the premises fwiw. Are you saying it’s trivial because “just don’t use your AI to help you design AI” (seems organizationally hard to me), or did you have particular tricks in mind?

Buck 3 May 2024 1:35 UTC
LW: 43 AF: 31
8
AF
on: Buck’s Shortform
[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I’m much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]
I’m interested in the following subset of risk from AI:
- Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
- Scheming: Risk associated with loss of control to AIs that arises from AIs scheming
  - So e.g. I exclude state actors stealing weights in ways that aren’t enabled by the AIs scheming, and I also exclude non-scheming failure modes. IMO, state actors stealing weights is a serious threat, but non-scheming failure modes aren’t (at this level of capability and dignity).
- Medium dignity: that is, developers of these AIs are putting a reasonable amount of effort into preventing catastrophic outcomes from their AIs (perhaps they’re spending the equivalent of 10% of their budget on cost-effective measures to prevent catastrophes).
- Nearcasted: no substantial fundamental progress on AI safety techniques, no substantial changes in how AI works.
This subset of risk is interesting because I think it’s a natural scenario at which to target technical work on AI safety. (E.g. it’s the main scenario we’re targeting with our AI control agenda.)
I claim that the majority of this risk comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored.
Using AIs for AI development looks uniquely risky to me among applications of early-transformative AIs, because unlike all other applications I know about:
- It’s very expensive to refrain from using AIs for this application.
- There’s no simple way to remove affordances from the AI such that it’s very hard for the AI to take a small sequence of actions which plausibly lead quickly to loss of control. In contrast, most other applications of AI probably can be controlled just by restricting their affordances.
If I’m right that the risk from scheming early-transformative models is concentrated onto this pretty specific scenario, it implies a bunch of things:
- It implies that work on mitigating these risks should focus on this very specific setting.
- It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.
- It is pretty plausible to me that AI control is quite easy, because you actually can remove affordances from the AIs that are doing AI R&D such that it’s hard for them to cause problems.

LawrenceC 2 May 2024 23:10 UTC
LW: 6 AF: 5
2
AF
in reply to: Neel Nanda’s comment on: Refusal in LLMs is mediated by a single direction
I don’t know what the “real story” is, but let me point at some areas where I think we were confused. At the time, we had some sort of hand-wavy result in our appendix saying “something something weight norm ergo generalizing”. Similarly, concurrent work from Ziming Liu and others (Omnigrok) had another claim based on the norm of generalizing and memorizing solutions, as well as a claim that representation is important.
One issue is that our picture doesn’t consider learning dynamics that seem actually important here. For example, it seems that one of the mechanisms that may explain why weight decay seems to matter so much in the Omnigrok paper is that fixing the norm to be large leads to an effectively tiny learning rate when you use Adam (which normalizes the gradients to be of fixed scale), especially when there’s a substantial radial component (which there is, when the init is too small or too big). This probably explains both why they found that training error was high when they constrained the weights to be sufficiently large in all their non-toy cases (see e.g. the mod add landscape below) and why we had difficulty using SGD+momentum (which, given our bad initialization, led to gradients that were way too big at some parts of the model, especially since we didn’t sweep the learning rate very hard). ^[1]
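A toy sketch of the Adam point, under simplifying assumptions: it uses only the first Adam step (where, after bias correction, the update is roughly lr * sign(g)), a 2D weight vector, and a purely tangential gradient. The helper names are hypothetical and this is not the Omnigrok setup; it only shows that the step has a fixed absolute size, so the change in direction shrinks as the weight norm grows.

```python
import math


def adam_first_step(weights, grads, lr=1e-3, eps=1e-8):
    """First Adam update: after bias correction it is lr * g / (|g| + eps)."""
    return [w - lr * g / (abs(g) + eps) for w, g in zip(weights, grads)]


def angle_2d(u, v):
    """Unsigned angle in radians between two 2D vectors."""
    cross = u[0] * v[1] - u[1] * v[0]
    dot = u[0] * v[0] + u[1] * v[1]
    return abs(math.atan2(cross, dot))


def angular_step(weight_norm, lr=1e-3, grad_scale=1.0):
    """How far one Adam step rotates a weight vector of the given norm."""
    w = [weight_norm, 0.0]
    g = [0.0, grad_scale]  # tangential gradient; Adam ignores its scale
    return angle_2d(w, adam_first_step(w, g, lr))


small, large = angular_step(1.0), angular_step(100.0)
print(f"rotation at norm 1: {small:.2e} rad, at norm 100: {large:.2e} rad")
print(f"ratio: {small / large:.1f}")
```

In this sketch, a 100x larger norm gives a roughly 100x smaller rotation per step at the same nominal learning rate, while rescaling the gradient changes almost nothing.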
There are also some theoretical results from SLT-related folk about how generalizing circuits achieve lower train loss per parameter (i.e. have higher circuit efficiency) than memorizing circuits (at least for large p), which seems to be a part of the puzzle that neither our work nor the Omnigrok paper touched on: why is it that generalizing solutions have lower norm? IIRC one of our explanations was that weight decay “favored more distributed solutions” (somewhat false) and “it sure seems empirically true”, but we didn’t have anything better than that.
There was also the really basic idea of how a relu/gelu network may do multiplication (by piecewise linear approximations of x^2, or by using the quadratic region of the gelu for x^2), which (I think) was first described in late 2022 in Ekin Akyürek’s “Transformers can implement Sherman-Morrison for closed-form ridge regression” paper? (That’s not the name, just the headline result.)
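A minimal sketch of the relu route, assuming the network gets from squares to products via the identity x*y = ((x+y)^2 - (x-y)^2)/4 (that step is my assumption; the comment only mentions approximating x^2). The knot spacing, knot count, and function names are illustrative choices, not taken from any paper.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def square_via_relus(x: float, knot_spacing: float = 0.1, num_knots: int = 100) -> float:
    """Piecewise linear interpolation of x^2 built only from ReLUs.

    Valid for |x| <= knot_spacing * num_knots; the interpolation error
    is at most knot_spacing**2 / 4.
    """
    t = relu(x) + relu(-x)  # |x| as a sum of two ReLUs
    total = knot_spacing * relu(t)  # slope of the first segment
    for k in range(1, num_knots):
        # each knot raises the slope by 2 * knot_spacing
        total += 2 * knot_spacing * relu(t - k * knot_spacing)
    return total


def multiply_via_relus(x: float, y: float) -> float:
    """Approximate x * y from two approximate squares."""
    return (square_via_relus(x + y) - square_via_relus(x - y)) / 4


print(multiply_via_relus(1.5, -2.0), 1.5 * -2.0)
print(multiply_via_relus(0.33, 0.71), 0.33 * 0.71)
```

With these defaults the product error is bounded by knot_spacing**2 / 8, so finer knots (more hidden units) buy accuracy quadratically.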
Part of the story for grokking in general may also be related to the Tensor Program results that claim the gradient on the embedding is too small relative to the gradient on other parts of the model, with standard init. (Also the embed at init is too small relative to the unembed.) Because the embed is both too small and receives too small a gradient, there’s no representation learning going on, as opposed to just random feature regression (which overfits in the same way that regression on random features overfits absent regularization).
In our case, it turns out not to be true (because our network is tiny? because our weight decay is set aggressively at lambda=1?), since the weights that directly contribute to logits (W_E, W_U, W_O, W_V, W_in, W_out) all quickly converge to the same size (weight decay encourages spreading out weight norm between things you multiply together), while the weights that do not all converge to zero.
Bringing it back to the topic at hand: There’s often a lot more “small” confusions that remain, even after doing good toy models work. It’s not clear how much any of these confusions matter (and do any of the grokking results our paper, Ziming Liu et al, or the GDM grokking paper found matter?).
^[1] Haven’t checked, might do this later this week.

Andy Arditi 2 May 2024 23:05 UTC
LW: 1 AF: 1
0
AF
in reply to: TurnTrout’s comment on: Refusal in LLMs is mediated by a single direction
Was it substantially less effective to instead use $a'_{harmless} \leftarrow a_{harmless} + (\text{avg\_proj}_{harmful}) \hat{r}$ ?

It’s about the same. And there’s a nice reason why: $a_{harmless} \cdot \hat{r} \approx 0$ . I.e. for most harmless prompts, the projection onto the refusal direction is approximately zero (while it’s very positive for harmful prompts). We don’t display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the “refusal direction”). This is (one reason) why we think ablation of the refusal direction works so much better than adding the negative “refusal direction,” and it’s also what motivated us to try ablation in the first place!
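A small sketch of why the two variants come out about the same, using toy 3-dimensional vectors in place of real activations. The function names and numbers are hypothetical, and "replace the projection" is my reading of the alternative being compared against, not a formula stated in this thread.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def ablate_direction(activation, direction):
    """Remove the component of the activation along the given direction."""
    r = unit(direction)
    proj = dot(activation, r)
    return [a - proj * ri for a, ri in zip(activation, r)]


def add_direction(activation, direction, coefficient):
    """Add a fixed multiple of the unit direction (the variant asked about)."""
    r = unit(direction)
    return [a + coefficient * ri for a, ri in zip(activation, r)]


def set_projection(activation, direction, target):
    """Ablate first, then add, so the final projection equals the target."""
    return add_direction(ablate_direction(activation, direction), direction, target)


refusal_dir = [1.0, 0.0, 0.0]    # toy stand-in for the refusal direction
a_harmless = [0.01, 2.0, -1.0]   # projection onto the direction is ~0
avg_proj_harmful = 5.0           # made-up average projection on harmful prompts

added = add_direction(a_harmless, refusal_dir, avg_proj_harmful)
replaced = set_projection(a_harmless, refusal_dir, avg_proj_harmful)
print(added, replaced)  # these differ only by the original near-zero projection
```

The same near-zero projection means ablation leaves harmless activations essentially untouched, whereas subtracting a fixed multiple of the direction would shift them.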
I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.
Note that our intervention is fairly strong here, as we are intervening at all token positions (including the newly generated tokens). But in general we’ve found it quite easy to induce refusal, and I believe we could even weaken our intervention to a subset of token positions and achieve similar results. We’ve previously reported the ease by which we can induce refusal (patching just 6 attention heads at a single token position in Llama-2-7B-chat).
Burns et al. do activation engineering? I thought the CCS paper didn’t involve that.
You’re right, thanks for the catch! I’ll update the text so it’s clear that the CCS paper does not perform model interventions.

Oliver Daniels-Koch 2 May 2024 20:41 UTC
1 point
0
AF
in reply to: Fabien Roger’s comment on: Benchmarks for Detecting Measurement Tampering [Redwood Research]
oh I see, by all(sensor_preds) I meant sum([logit_i] for i in n_sensors) (the probability that all sensors are activated). Makes sense, thanks!

Thomas Kwa 2 May 2024 19:21 UTC
LW: 2 AF: 1
1
AF
in reply to: TurnTrout’s comment on: TurnTrout’s shortform feed
I’m not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:
- A shardful agent can be incoherent due to valuing different things from different states
- A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather than their ultimate consequences
- A shardful agent saves compute by not evaluating the whole utility function
The first two are behavioral. We can say an agent is likely to be shardful if it displays these types of incoherence but not others. Suppose an agent is dynamically inconsistent and we can identify features in the environment like cheese presence that cause its preferences to change, but mostly does not suffer from the Allais paradox, tends to spend resources on actions proportional to their importance for reaching a goal, and otherwise generally behaves rationally. Then we can hypothesize that the agent has some internal motivational structure which can be decomposed into shards. But exactly what motivational structure is very uncertain for humans and future agents. My guess is researchers need to observe models and form good definitions as they go along, and defining a shard agent as having compositionally represented motivators is premature. For now the most important thing is how steerable agents will be, and it is very plausible that we can manipulate motivational features without the features being anything like compositional.