I’m a researcher on the technical governance team at MIRI.
Views expressed are my own and should not be taken to represent official MIRI positions; views within the technical governance team also vary.
Previously:
Helped with MATS, running the technical side of the London extension (pre-LISA).
Worked for a while on Debate (this kind of thing).
Quick takes on the above:
I think MATS is great-for-what-it-is. My misgivings relate to high-level direction.
Worth noting that PIBBSS exists, and is philosophically closer to my ideal.
The technical AISF course doesn’t have the emphasis I’d choose (which would be closer to Key Phenomena in AI Risk). It’s a decent survey of current activity, but only implicitly gets at fundamentals—mostly through a [notice what current approaches miss, and will continue to miss] mechanism.
I don’t expect research on Debate, or scalable oversight more generally, to help significantly in reducing AI x-risk. (I may be wrong! Some elaboration in this comment thread.)
This seems a helpful model—so long as it’s borne in mind that [most paths to catastrophe without rogue deployment require many actions] isn’t a guarantee.
Thoughts:
It’s not clear to me whether the following counts as a rogue deployment (I’m assuming so):
[un-noticed failure of one safety measure, in a context where all other safety measures are operational]
For this kind of case:
The name “rogue deployment” doesn’t seem a great fit.
In general, it’s not clear to me how to draw the line between:
Safety measure x didn’t achieve what we wanted, because it wasn’t specified/implemented sufficiently well. (not a rogue deployment)
Safety measure x was subverted. (rogue deployment)
For example, I think it’d be reasonable to think of [amazing, long-term jailbreaks] as rogue deployments on this basis: the jailbreak subverts a safety measure, so that “the safety measures are absent” is true in some sense.
It seems important to distinguish things like:
This safety measure appears to be in effect.
This safety measure is running as designed.
We’re getting the safety-improving-property we wanted from this safety measure.
When considering the [Employees of the AI company might run the model in an unauthorized way] case,
I think one central example to consider is of an employee who:
Thinks this is a good idea for the world.
Can make a pretty compelling case to others that it’s a good idea.
The examples in the post seem to focus on [bad intent and/or incompetence]; that focus seems important, but too limited.