Great post! I’m getting a lot out of this series.
Here are some of the paths that I think lead some people to think that a boundary-respecting post-AI future is unlikely or bad. (Note: at the end of the day I don’t have a strong position either way; I’m just trying to facilitate good discussion.)
Belief that “pure-consequentialist AI” is kinda the only way to build very powerful (tech-inventing, self-improving, reflectively-stable) AI, so we should expect that to happen sooner or later.
(By “pure-consequentialist AI”, I mean AI that has preferences about the state of the universe in the distant future, and those preferences inform its actions.)
My impression is that Eliezer & Nate believe something like this. See for example Nate’s post Deep Deceptiveness (see also my response comment).
Anyway, I don’t buy this belief, for reasons in my post Consequentialism & Corrigibility. In short, I think it’s possible to make AIs that are consequentialist enough to invent tech, self-improve, and so on, but that simultaneously have reflectively-stable preferences about respecting norms and so on. Humans are an example, I claim.
…But there’s a softer version of that:
Belief that “pure-consequentialist AI” will outcompete the “non-pure-consequentialist AI”, even if (per above) the latter is real and powerful.
It’s true that if Agent A is a pure consequentialist (has preferences about the state of the universe in the distant future), and Agent B is not (it has both preferences about the state of the universe in the distant future and preferences about other kinds of things, like following norms, respecting boundaries, etc.), then, other things equal, one should expect the state of the universe in the distant future to have more in common with Agent A’s preferences than Agent B’s. For example, insofar as it’s instrumentally useful to maintain a reputation for norm-following, well, Agent A can do that too. But Agent A can also do ruthless power-seeking when it can get away with it. (There’s an exception in principle if AIs can read each other’s source code, but I’m not sure whether that’s actually feasible in practice.)
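To make that contrast a bit more concrete, here is one toy formalization (my own illustrative notation, not something from the OP): suppose both agents evaluate a policy $\pi$ that produces actions $a_t$ and a distant-future state $s_T$, with

$$U_A(\pi) = \mathbb{E}_\pi\!\left[f(s_T)\right], \qquad U_B(\pi) = \mathbb{E}_\pi\!\left[f(s_T) - \lambda \sum_t c(a_t)\right],$$

where $f$ scores outcomes, $c$ penalizes norm-violating actions, and $\lambda > 0$. Whenever some norm-violating action raises expected $f(s_T)$, Agent A takes it while Agent B may balk, so, other things equal, the realized $s_T$ ends up tracking A’s outcome preferences more closely.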
Anyway, I see this as an important dynamic to keep in mind, but I’m not sure how decisive it will be.
Belief that good enforceable boundaries are a temporary luxury of our technological immaturity, i.e. offense-defense balance will change in the future
For example, plausibly it’s much easier to make boundary-ignoring nanobots than either boundary-respecting nanobots or nanobot defense systems. (Or substitute “invasive species from hell” if you’re not into nanobots.)
The OP already mentioned another important example: if it becomes possible to create sentient minds in the privacy of one’s own consumer GPU (as I expect eventually), that creates challenges to envisioning a liberal-genre good future.
Belief that the power of cooperation (Elua) is a temporary feature of our technological immaturity
For one thing: if a human wants allies, he can be cooperative, or charismatic, or distribute spoils, etc. If an AI wants allies, it can do any of those things, or it can simply create more copies of itself, which is a very different and potentially very effective strategy.
There’s a trope in zombie apocalypse movies where the zombies can turn people into more zombies, who then immediately join the zombie cause. In human-world, that’s fiction, but in AI-world, it will presumably be possible for AIs to take control of each other’s chips and use them to run more copies of themselves (either by cyber-attacking each other, or even by teleoperating a robot to get physical access to another AI’s chips and getting root access with a soldering iron or whatever). I don’t really know how this would play out, but it seems like it might importantly remove the strategic advantage of playing-well-with-others.
Another thing: Very few humans are liable to act sincere and cooperative for an extended period, and then stab their allies in the back as soon as the situation changes. Most people act sincere and cooperative because of their innate social drives; then there are a small number of smart sociopaths and so on, but by and large they tend to be impulsive and impatient rather than patient and strategic. But all bets are off with future AIs. So old-fashioned cooperation (through trust, reputation, etc.) might not be a stable equilibrium in the future. It could be replaced by some high-tech version of cooperation (reading each other’s source code?), but it’s unclear whether there’s anything feasible in that genre. (My perennial uncertainty is: AI 1 can straightforwardly send source code / model weights / whatever to AI 2, but how can AI 1 prove to AI 2 that this file is actually its real source code / model weights / whatever? There might be a good answer, I dunno.)
They can jointly and transparently construct an AI 3 from scratch that is motivated to further their deal, and then visibly hand over their physical resources to it, taking turns with small amounts in iterated fashion.
AI 3 can also be given access to secrets of AI 1 and AI 2 to verify their claims without handing over sensitive data.
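To gesture at how the iterated handover might work, here is a minimal toy sketch in Python (the agent names, tranche sizes, and resource numbers are all my own hypothetical illustration, not part of the original proposal):

```python
# Toy model of two AIs handing resources to a jointly constructed AI 3
# in small alternating tranches, so neither side ever has much at risk.

def iterated_handover(ai1_total: float, ai2_total: float,
                      tranche: float = 1.0) -> float:
    """AI 1 and AI 2 take turns transferring small tranches of resources
    to AI 3; either party can halt the moment the other stops matching."""
    given1 = given2 = 0.0  # resources each party has handed to AI 3 so far
    while given1 < ai1_total or given2 < ai2_total:
        if given1 < ai1_total:              # AI 1 hands over its next tranche
            given1 += min(tranche, ai1_total - given1)
        if given2 < ai2_total:              # AI 2 matches with its own tranche
            given2 += min(tranche, ai2_total - given2)
    return given1 + given2                  # total now controlled by AI 3

print(iterated_handover(10.0, 10.0))  # -> 20.0 once both have fully handed over
```

The point of the small tranches is that the most either party can lose to a defecting counterpart at any moment is roughly one tranche, which is what makes the handover tolerable even without being able to verify each other’s source code.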
I think this idea should be credited to Tim Freeman (who I quoted in this post), who AFAIK was the first person to talk about it (in response to a question very similar to Steven’s that I asked on SL4).