This post makes an important point: the words “artificial intelligence” don’t necessarily carve reality at the joints. The fact that something is true about a modern system we call AI doesn’t automatically imply anything about arbitrary future AI systems, any more than conclusions about e.g. Dendral or Deep Blue carry over to Gemini.
That said, IMO the author somewhat overstates their thesis. Specifically, I take issue with all the following claims:
LLMs have no chance of becoming AGI.
LLMs are automatically safe.
There is almost no empirical evidence from LLMs that is relevant to the alignment of future AI.
First, those points are somewhat vague because it’s not clear what counts as an “LLM”. The phrase “Large Language Model” is already obsolete, at least because modern AI is multimodal. It’s more appropriate to speak of “Foundation Models” (FM). More importantly, it’s not clear what kind of fine-tuning does or doesn’t count (RLHF? RL on CoT? …)
Second, how do we know FMs won’t become AGI? I imagine the argument is something like “an FM is primarily about prediction, so it doesn’t have agency”. However, when predicting data that contains or implies decisions by agents, it’s not crazy to imagine that agency can arise in the predictor.
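To make that last point concrete, here is a toy sketch (my own illustration, not anything from the post, and obviously not evidence about FMs): fit a model purely to predict the next action of a goal-seeking agent, then roll it out. The “mere predictor” ends up exhibiting the goal-directed behavior present in its training data.

```python
import random
from collections import Counter, defaultdict

GOAL = (4, 4)
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def expert_action(state):
    # The goal-seeking "agent" that generates the training data.
    x, y = state
    return "right" if x < GOAL[0] else "up"

def step(state, action):
    dx, dy = ACTIONS[action]
    return (min(max(state[0] + dx, 0), 4), min(max(state[1] + dy, 0), 4))

# Collect demonstrations: the predictor only ever sees (state, next action) pairs.
counts = defaultdict(Counter)
for _ in range(500):
    state = (random.randint(0, 4), random.randint(0, 4))
    while state != GOAL:
        action = expert_action(state)
        counts[state][action] += 1
        state = step(state, action)

def predictor(state):
    # Pure prediction: return the most likely next action seen in the data.
    if not counts[state]:
        return random.choice(list(ACTIONS))  # unseen state: fall back to a guess
    return counts[state].most_common(1)[0][0]

# Roll the predictor out from a fresh start: it walks to the goal, i.e. the
# "mere predictor" reproduces the goal-directed behavior present in its data.
state, path = (0, 0), [(0, 0)]
while state != GOAL:
    state = step(state, predictor(state))
    path.append(state)
print(path)  # e.g. [(0, 0), (1, 0), ..., (4, 4)]
```

This is of course just trivial behavioral cloning; the only point is that “it’s just prediction” doesn’t by itself rule out agent-like behavior emerging from the data being predicted.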
Third, how do we know that FMs are always going to be safe? By the same token that they can develop agency, they can develop dangerous properties.
Fourth, it seems really unfair to say existing AI provides no relevant evidence. The achievements of existing AI systems are such that it seems very likely they capture at least some of the key algorithmic capabilities of the human brain. The ability of relatively simple and generic algorithms to perform well on a large variety of different tasks is indicative of something in the system being quite “general”, even if not “general intelligence” in the full sense.
I think that we should definitely try learning from existing AI. However, this learning should be more sophisticated and theory-driven than superficial analogies or trend extrapolations. What we shouldn’t do is say “we succeeded at aligning existing AI, therefore AI alignment is easy/solved in general”. The same theories that predicted catastrophic AI risk also predict roughly the current level of alignment for current AI systems.
I will expand a little on this last point. The core of the catastrophic AI risk scenario is:
We are directing the AI towards a goal which is complex (so that correct specification/generalization is difficult)[1].
The AI needs to make decisions in situations which (i) cannot be imitated well in simulation, due to the complexity of the world, and (ii) admit catastrophic mistakes (otherwise any mistake can just be added to the training data)[2].
The capability required from the AI to succeed is such that it can plausibly make catastrophic mistakes (if succeeding at the task is easy but causing a catastrophe is really hard, then a weak AI would be safe and effective)[3].
The above scenario must be addressed eventually, if only to create an AI defense system against unaligned AI that irresponsible actors could create. However, no modern AI system operates in this scenario. This is the most basic reason why the relative ease of alignment in modern systems (although even modern systems have alignment issues) does little to dispel concerns about catastrophic AI risk in the future.
[1] Even for simple goals, inner alignment is a concern. However, it’s harder to say at which level of capability this concern arises.
[2] It’s also possible that mistakes are not catastrophic per se, but are simultaneously rare enough that it’s hard to get enough training data and frequent enough to be troublesome. This is related to the reliability problems in modern AI that we indeed observe. (See the back-of-the-envelope numbers after these notes.)
[3] But sometimes it might be tricky to hit the capability sweet spot where the AI is strong enough to be useful but weak enough to be safe, even if such a sweet spot exists in principle.
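To put rough numbers on footnote [2] (the figures are entirely made up, just to illustrate the shape of the problem): a mistake rare enough to be essentially absent from training can still be frequent enough to matter at deployment scale.

```python
# Back-of-the-envelope illustration with assumed numbers (not from the post).
failure_rate = 1e-6        # assumed probability of the mistake per episode
train_episodes = 1e5       # assumed number of training episodes
deployed_episodes = 1e9    # assumed number of deployed episodes

print(failure_rate * train_episodes)     # ~0.1 expected examples to learn from
print(failure_rate * deployed_episodes)  # ~1000 expected incidents in deployment
```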