First some points of agreement:
I like that you’re focusing on neglected approaches. Not much on the technical side seems promising to me, so I like to see exploration.
Skimming through your suggestions, I think I’m most keen on human augmentation related approaches—hopefully the kind that focuses on higher quality decision-making and direction finding, rather than simply faster throughput.
I think outreach to Republicans / conservatives, and working across political lines is important, and I’m glad that people are actively thinking about this.
I do buy the [Trump’s high variance is helpful here] argument. It’s far from a principled analysis, but I can more easily imagine [Trump does correct thing] than [Harris does correct thing]. (mainly since I expect the bar on “correct thing” to be high so that it needs variance)
I’m certainly making no implicit “...but the Democrats would have been great...” claim below.
That said, various of the ideas you outline above seem to be founded on likely-to-be-false assumptions. Insofar as you’re aiming for a strategy that provides broadly correct information to policymakers, this seems undesirable—particularly where you may be setting up unrealistic expectations.
Highlights of the below:
Telling policymakers that we don’t need to slow down seems negative.
I don’t think you’ve made any valid argument that not needing to slow down is likely. (of course it’d be convenient)
A negative [alignment-in-the-required-sense tax] seems implausible. (see below)
(I don’t think it even makes sense in the sense that “alignment tax” was originally meant[1], but if “negative tax” gets conservatives listening, I’m all for it!)
I think it’s great for people to consider convenient possibilities (e.g. those where economic incentives work for us) in some detail, even where they’re highly unlikely. Whether they’re actually 0.25% or 25% likely isn’t too important here.
Once we’re talking about policy advocacy, their probability is important.
More details:
“A conservative approach to AI alignment doesn’t require slowing progress, avoiding open sourcing etc. Alignment and innovation are mutually necessary, not mutually exclusive: if alignment R&D indeed makes systems more useful and capable, then investing in alignment is investing in US tech leadership.”
Here and in the case for a negative alignment tax, I think you’re:
Using a too-low-resolution picture of “alignment” and “alignment research”.
This makes it too easy to slip between ideas like:
(i) Some alignment research has property x
(ii) All alignment research has property x
(iii) A [sufficient for scalable alignment solution] subset of alignment research has property x
(iv) A [sufficient for scalable alignment solution] subset of alignment research that we’re likely to complete has property x
An argument that requires (iv) but only justifies (i) doesn’t accomplish much. (we need something like (iv) for alignment tax arguments; a formal sketch of (i)-(iv) follows after the next list)
Failing to distinguish between:
Alignment := Behaves acceptably for now, as far as we can see.
Alignment := [some mildly stronger version of ‘alignment’]
Alignment := notkilleveryoneism
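To make the slippage explicit, here’s one way to write out (i)-(iv) (my own illustrative notation, not anything from the post): take R to be the set of alignment research directions, x(r) the property in question, Suff(S) to mean “completing S suffices for a scalable alignment solution”, and Done(S) to mean “we’re likely to actually complete S”. Then:

\begin{align*}
\text{(i)}\quad & \exists r \in R.\; x(r) \\
\text{(ii)}\quad & \forall r \in R.\; x(r) \\
\text{(iii)}\quad & \exists S \subseteq R.\; \mathrm{Suff}(S) \,\wedge\, \forall r \in S.\; x(r) \\
\text{(iv)}\quad & \exists S \subseteq R.\; \mathrm{Suff}(S) \,\wedge\, \mathrm{Done}(S) \,\wedge\, \forall r \in S.\; x(r)
\end{align*}

Evidence for (i) is cheap; (ii) is a different and much stronger claim; and it’s (iv) that the policy argument actually needs.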
In particular, there’ll naturally be some crossover between [set of research that’s helpful for alignment] and [set of research that leads to innovation and capability advances] - but alone this says very little.
What we’d need is something like:
Optimizing efficiently for innovation in a way that incorporates various alignment-flavored lines of research gets us sufficient notkilleveryoneism progress before any unrecoverable catastrophe with high probability.
It’d be lovely if something like this were true—it’d be great if we could leverage economic incentives to push towards sufficient-for-long-term-safety research progress. However, the above statement seems near-certainly false to me. I’d be (genuinely!) interested in a version of that statement you’d endorse at >5% probability.
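For concreteness, here’s a toy formalization of that statement (my notation, nothing from the post): writing T_safe for the time at which sufficient notkilleveryoneism progress exists, and T_cat for the time of the first unrecoverable catastrophe, the claim is roughly

$$\Pr\big[\, T_{\mathrm{safe}} < T_{\mathrm{cat}} \;\big|\; \text{we optimize efficiently for innovation, incorporating alignment-flavored research} \,\big] \;\ge\; 1 - \epsilon \quad \text{for small } \epsilon.$$

What I’m calling near-certainly false is the claim that this conditional probability is in fact high; the >5% request is about your credence in some version of this statement.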
The rest of that paragraph seems broadly reasonable, but I don’t see how you get to “doesn’t require slowing progress”.
On “negative alignment taxes”:
First, a point that relates to the ‘alignment’ disambiguation above. In the case for a negative alignment tax, you offer the following quote as support for alignment/capability synergy:
However, the capability is [ability to behave in an aligned fashion], and not [tendency to actually behave in an aligned fashion] (granted, Anthropic didn’t word things precisely here). The latter is a propensity, not a capability.
What we need for scalable alignment is the propensity part: no-one sensible is suggesting that superintelligences wouldn’t have the ability to behave in an aligned fashion. The [behavior-consistent-with-alignment]-capability synergy exists while a major challenge is for systems to be able to behave desirably.
Once capabilities are autonomous-x-risk-level, the major challenge will be to get them to actually exhibit robustly aligned behavior. At that point there’ll be no reason to expect the synergy—and so no basis to expect a negative or low alignment tax where it matters.
On things like “Cooperative/prosocial AI systems”, I’d note that hits-based exploration is great—but please don’t expect it to work (and that “if implemented into AI systems in the right ways” is almost all of the problem).
On this basis, it seems to me that the conservative-friendly case you’ve presented doesn’t stand up at all (to be clear, I’m not critiquing the broader claim that outreach and cooperation are desirable):
We don’t have a basis to expect negative (or even low) alignment tax.
(unclear so far that we’ll achieve non-infinite alignment tax for autonomous x-risk relevant cases)
It’s highly likely that we do need to slow advancement, and will need serious regulation.
Given our lack of precise understanding of the risks, we’ll likely have to choose between [overly restrictive regulation] and [dangerously lax regulation] - we don’t have the understanding to draw the line in precisely the right place. (completely agree that for non-frontier systems, it’s best to go with little regulation)
I’d prefer a strategy that includes [policymakers are made aware of hard truths] somewhere. I don’t think we’re in a world where sufficient measures are convenient.
It’s unsurprising that conservatives are receptive to quite a bit of this “when coupled with ideas around negative alignment taxes and increased economic competitiveness”—but this just seems like wishful thinking and poor expectation management to me.
Similarly, I don’t see a compelling case for:
“that is, where alignment techniques are discovered that render systems more capable by virtue of their alignment properties. It seems quite safe to bet that significant positive alignment taxes simply will not be tolerated by the incoming federal Republican-led government—the attractor state of more capable AI will simply be too strong.”
Of course this is true by default—in worlds where decision-makers continue not to appreciate the scale of the problem, they’ll stick to their standard approaches. However, conditional on their understanding the situation, and understanding that at least so far we have not discovered techniques through which some alignment/capability synergy keeps us safe, this is much less obvious.
I have to imagine that there is some level of perceived x-risk that snaps politicians out of their default mode. I’d bet on [Republicans tolerate significant positive alignment taxes] over [alignment research leads to a negative alignment tax on autonomous-x-risk-capable systems] at at least ten to one odds (though I’m not clear how to operationalize the latter). Republicans are more flexible than reality :).
As I understand the term, alignment tax compares [lowest cost for us to train a system with some capability level] against [lowest cost for us to train an aligned system with some capability level]. Systems in the second category are also in the first category, so zero tax is the lower bound.
This seems a better definition, since it focuses on the outputs, and there’s no need to handwave about what counts as an alignment-flavored training technique: it’s just [...any system...] vs [...aligned system...].
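A minimal way to write this down (my notation, purely illustrative): let Π(c) be the set of feasible ways to train a system at capability level c, and Π_al(c) ⊆ Π(c) the subset that yields aligned systems. Then

$$\mathrm{tax}(c) \;=\; \min_{\pi \in \Pi_{\mathrm{al}}(c)} \mathrm{cost}(\pi) \;-\; \min_{\pi \in \Pi(c)} \mathrm{cost}(\pi) \;\ge\; 0.$$

The inequality holds because every aligned system is in particular a system, so the first minimum can’t be lower than the second; the natural convention is tax(c) = ∞ whenever Π_al(c) is empty, which is the “from infinity” picture below.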
Separately, I’m not crazy about the term: it can suggest to new people that we know how to scalably align systems at all. Talking about “lowering the alignment tax” from infinity strikes me as an odd picture.