I think your description of vision 1 is likely to give people misleading impressions of what this could plausibly look like, or of what the people you cited as pursuing vision 1 think will happen. You disclaim this by noting the doc is oversimplified, but I think various clarifications are quite important in practice.
(It’s possible that you think these misleading impressions aren’t that important because, from your perspective, the main cruxes are in “What does it take to defend the world from out-of-control AI?” (But presumably you don’t place total confidence in your views there?))
[Edit: I think this first paragraph originally came across as more aggressive than I was intending. Sorry. I’ve edited it a bit to tone it down.]
It seems important to note that the total amount of autonomy in vision 1 might be extremely large in practice. E.g., AIs might conduct autonomous R&D where some AI instance works on a project for the equivalent of many months without any interaction with a human. (That said, I think this system is very likely to be monitored by other AI systems, and some actions might be monitored by humans, though it’s plausible that the fraction monitored by humans is very low (e.g. 1%) and long contiguous sequences of actions won’t see any human monitoring.) Levels of autonomy this high might be required for speeding up R&D by large factors (e.g. 30x), due to a combination of serial bottlenecks (meaning that AIs need to serially outspeed humans in many cases) and the obvious argument that a 30x speedup requires AI to automate at least 97% of tasks. (To be clear, I think sometimes when people are imagining vision 1, they aren’t thinking about situations this crazy, but I think they should.)
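To spell out the arithmetic behind the 97% figure, here is a minimal Amdahl's-law-style sketch (assuming the automated fraction of tasks runs arbitrarily fast while the remaining tasks stay at human speed):

$$\text{speedup} \;\le\; \frac{1}{1-f} \quad\Longrightarrow\quad 30 \le \frac{1}{1-f} \;\iff\; f \ge \frac{29}{30} \approx 96.7\%,$$

where $f$ is the fraction of tasks the AIs automate. Larger target speedups push the required automated fraction even closer to 100%.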
In fact, I think the levels of autonomy in Visions 1 and 2 might actually be similar in practice (because even wild AIs in Vision 2 might want to utilize human labor for some tasks during some transitional period).
The main differences between vision 1 and vision 2 (assuming vision 1 is working):
The weights are still on our server and we could turn off the server.
We can monitor all inputs and outputs from the AI.
We can continue training the AI, and might do so if we observe undesirable behavior.
There’s no sharp line between the helper AIs of Vision 1 and the truly-autonomous AIs of Vision 2. For example, to what extent do the human supervisors really understand what their AI helpers are doing and how? The less the humans understand, the less we can say that the humans are really in control.
There is also the failure mode of deceptive alignment, where these AIs are lying in wait for a good opportunity for a treacherous turn. This is a problem even if humans have understood everything they’ve seen thus far.
One issue here is race-to-the-bottom competitive dynamics: if some humans entrust their AIs with more authority to make fast autonomous decisions for complex inscrutable reasons, then those humans will have a competitive advantage over the humans who don’t. Thus they will wind up in control of more resources, and in this way, the typical level of human control and supervision may very rapidly drop to zero.
Seems like a complicated empirical question. Note that adequately supervising 1% of all queries suffices to rule out a bunch of specific threat models. See auditing failures vs concentrated failures. Of course, adequate supervision is hard and might be much harder if competitive AIs must perform inscrutable actions which could contain inscrutable danger.
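As a rough illustration of why supervising even 1% of queries rules out some threat models, here is a minimal toy sketch (assuming audits are sampled independently per action and that a single audited bad action is enough to defeat the plan; the function and numbers below are illustrative, not from the linked post):

```python
# Toy model: a threat model requires the AI to take n_bad separate bad actions,
# and each action is independently audited with probability audit_rate.
# If one audited bad action is enough to catch the plan, the catch probability is:
def p_caught(n_bad: int, audit_rate: float = 0.01) -> float:
    return 1 - (1 - audit_rate) ** n_bad

for n in (1, 10, 100, 1000):
    print(f"{n:4d} bad actions -> caught with probability {p_caught(n):.3f}")
# ~0.010, ~0.096, ~0.634, ~1.000 respectively
```

On this toy picture, sampling-based auditing mainly addresses threat models that require many bad actions; failures concentrated in just one or a few actions remain the hard case, which is roughly the distinction the linked post is drawing.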
“By and large, people in this camp have an assumption that TAI will look, and act, and be trained, much like LLMs, but they’ll work better.”
FWIW, I think Paul in particular puts less than 50% on “TAI looks like LLMs”, if by that you mean “most of the capabilities come from generative pretraining basically like what we have right now”. Though presumably short timelines are more likely to look like this.
That’s a very helpful comment, thanks!

Yeah, Vision 1 versus Vision 2 are two caricatures, and as such, they differ along a bunch of axes at once. And I think you’re emphasizing different axes than the ones that seem most salient to me. (Which is fine!)
In particular, maybe I should have focused more on the part where I wrote: “In that case, an important conceptual distinction (as compared to Vision 1) is related to AI goals: In Vision 1, there’s a pretty straightforward answer of what the AI is supposed to be trying to do… By contrast, in Vision 2, it’s head-scratching to even say what the AI is supposed to be doing…”
Along this axis-of-variation:
“An AI that can invent a better solar cell, via doing the same sorts of typical human R&D stuff that a human solar cell research team would do” is pretty close to the Vision 1 end of the spectrum, despite the fact that (in a different sense) this AI has massive amounts of “autonomy”: all on its own, the AI may rent a lab space, apply for permits, order parts, run experiments using robots, etc.
The scenario “A bunch of religious fundamentalists build an AI, and the AI notices the error in its programmers’ beliefs, and successfully de-converts them” would be much more towards the Vision 2 end of the spectrum—despite the fact that this AI is not very “autonomous” in the going-out-and-doing-things sense. All the AI is doing is thinking, and chatting with its creators. It doesn’t have direct physical control of its off-switch, etc.
Why am I emphasizing this axis in particular?

For one thing, I think this axis has practical importance for current research; on the narrow value learning vs ambitious value learning dichotomy, “narrow” is enough to execute Vision 1, but you need “ambitious” for Vision 2.

For example, if we move from “training by human approval” to “training by human approval after the human has had extensive time to reflect, with weak-AI brainstorming help”, then that’s a step from Vision 1 towards Vision 2 (i.e. a step from narrow value learning towards ambitious value learning). But my guess is that it’s a pretty small step towards Vision 2. I don’t think it gets us all the way to the AI I mentioned above, the one that will proactively deconvert a religious fundamentalist supervisor who currently has no interest whatsoever in questioning his faith.
For another thing, I think this axis is important for strategy and scenario-planning. For example, if we do Vision 2 really well, it changes the story in regards to “solution to global wisdom and coordination” mentioned in Section 3.2 of my “what does it take” post.
In other words, I think there are a lot of people (maybe including me) who are wrong about important things, and also not very scout-mindset about those things, such that “AI helpers” wouldn’t particularly help, because the person is not asking the AI for its opinion, and would ignore the opinion anyway, or even delete that AI in favor of a more sycophantic one. This is a societal problem, and always has been. One possible view of that problem is: “well, that’s fine, we’ve always muddled through”. But if you think there are upcoming VWH-type stuff where we won’t muddle through (as I tentatively do in regards to ruthlessly-power-seeking AGI), then maybe the only option is a (possibly aggressive) shift in the balance of power towards a scout-mindset-y subpopulation (or at least, a group with more correct beliefs about the relevant topics). That subpopulation could be composed of either humans (cf. “pivotal act”), or of Vision 2 AIs.
Here’s another way to say it, maybe. I think you’re maybe imagining a dichotomy where either the AI is doing what we want it to do (which is normal human stuff like scientific R&D), or the AI is plotting to take over. I’m suggesting that there’s a third murky domain where the person wants something that he maybe wouldn’t want upon reflection, but where “upon reflection” is kinda indeterminate, because he could be manipulated into wanting different things depending on how they’re framed. This third domain is important because it contains decisions about politics and society and institutions and ethics and so on. I have concerns that getting an AI to “perform well” in this murky domain is not feasible via a bootstrap thing that starts from the approval of random people; rather, I think a good solution would have to look more like an AI which is internally able to do the kinds of reflection and thinking that humans do (but where the AI has the benefit of more knowledge, insight, time, etc.). And that requires that the AI have a certain kind of “autonomy” to reflect on the big picture of what it’s doing and why. I think that kind of “autonomy” is different from how you’re using the term, but if done well (a big “if”!), it would open up a lot of options.
Thanks for the response! I agree that the difference is a difference in emphasis.

I agree that there isn’t a sharp line between helper AIs and autonomous AIs. I think it’s also important that autonomous AIs won’t necessarily outcompete helper AIs.
If we use DWIM as our alignment target, you could see a “helper AI” that’s autonomous enough to “create a plan to solve cancer”. The human just told it to do that, and will need to check the plan and ask the AI to actually carry it out if it seems safe.
If you only need a human in the loop at key points in big plans, fully autonomous AGI has no real competitive advantage.
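A minimal sketch of the checkpoint pattern being described, assuming a hypothetical `ai` agent with `propose_plan`/`execute` methods and a `human` reviewer (all names here are illustrative, not an existing API):

```python
# Hypothetical "human in the loop at key points" pattern: the AI works
# autonomously over long stretches, but execution is gated on human review.
def run_with_checkpoint(ai, human, task):
    plan = ai.propose_plan(task)   # long autonomous stretch (e.g. "create a plan to solve cancer")
    if human.approves(plan):       # key-point oversight: the human checks the plan
        return ai.execute(plan)    # another long autonomous stretch
    return None                    # plan rejected; nothing is carried out
```

The point above is that gating only at these key points costs little relative to fully autonomous operation.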