I’m a researcher on the technical governance team at MIRI.
Views expressed are my own and should not be taken to represent official MIRI positions; views within the technical governance team also vary.
Previously:
Helped with MATS, running the technical side of the London extension (pre-LISA).
Worked for a while on Debate (this kind of thing).
Quick takes on the above:
I think MATS is great-for-what-it-is. My misgivings relate to high-level direction.
Worth noting that PIBBSS exists, and is philosophically closer to my ideal.
The technical AISF course doesn’t have the emphasis I’d choose (which would be closer to Key Phenomena in AI Risk). It’s a decent survey of current activity, but only implicitly gets at fundamentals—mostly through a [notice what current approaches miss, and will continue to miss] mechanism.
I don’t expect research on Debate, or scalable oversight more generally, to help significantly in reducing AI x-risk. (I may be wrong! Some elaboration in this comment thread.)
The main case for optimism on human-human alignment under extreme optimization seems to be indirection: not that [what I want] and [what you want] happen to be sufficiently similar, but that there’s a [what you want] pointer within [what I want].
Value fragility doesn’t argue strongly against the pointer-based version. The tails don’t come apart when they’re tied together.
It’s not obvious that the values-on-reflection of an individual human would robustly maintain the necessary pointers (to other humans, to past selves, to alternative selves/others...), but it is at least plausible—if you pick the right human.
More generally, an argument along the lines of [the default outcome with AI doesn’t look too different from the default outcome without AI, for most people] suggests that we need to do better than the default, with or without AI. (I’m not particularly optimistic about human-human alignment without serious, principled efforts.)