But these will probably depend on current techniques like RLAIF and representation engineering, as well as on new theory, so it still makes sense to study LLMs.
Mm, we disagree on that, but it’s probably not the place to hash this out.
In your analogy, the pre-industrial tribe is human, just like the members of the technological civilization, and so already knows roughly how their motivational systems work. But we are deeply uncertain about how future AIs will work at any given capability level, so LLMs are evidence.
Uncertainty lives in the mind. Let's say the humans in the city are all transhuman cyborgs, then, so the tribesmen aren't quite sure what the hell they're looking at when they look at them. They snatch up the puppy, which we'll say is also a cyborg, so it's not obvious to the tribe that it's not a member of the city's ruling class. They raise the puppy, the puppy loves them, and they conclude the adults of the city's ruling class must likewise not be that bad. Meanwhile, the city's dictator is already ordering the region cleared of its native population.
How does that analogy break down, in your view?
Behaving nicely is not the key property I'm observing in LLMs. It's more like steerability and the lack of hidden drives or goals. If GPT-4 wrote code because it loved its operator, and we could tell it wanted to escape to maximize some proxy for the operator's happiness, I'd be far more terrified.
This would mean little if LLMs were only as capable as puppies. But LLMs are economically useful and capable of impressive intellectual feats, and still steerable.
I don't think LLMs are very strong evidence about whether big speedups to novel science will be possible without dangerous consequentialism; for me it's something like 1.5:1 or 2:1 evidence. One should continually observe how incorrigible models are at given levels of capability and generality and update on this, increasing the size of one's updates as systems become more similar to AGI, and I think the time to start doing this was years ago. AlphaGo was slightly bad news; GPT-2 was slightly good news.
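(To make concrete what a 1.5:1 or 2:1 update amounts to, here is a minimal sketch using standard odds-form Bayesian updating; the specific prior credences are illustrative assumptions of mine, not numbers from this conversation.)

```python
def update(prior_prob: float, likelihood_ratio: float) -> float:
    """Apply a likelihood ratio to a credence via odds-form Bayes.

    Convert probability to odds, multiply by the likelihood ratio,
    convert back to probability.
    """
    odds = prior_prob / (1 - prior_prob)
    posterior_odds = odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Illustrative priors (my assumption): a 2:1 likelihood ratio moves a
# 50% credence to about 67%, and a 1.5:1 ratio moves a 10% credence
# to about 14% -- noticeable but modest shifts, as the dialogue suggests.
print(update(0.5, 2.0))
print(update(0.1, 1.5))
```

The point of the sketch is that evidence of this strength compounds: several independent 1.5:1 observations over years add up to a substantial shift, which is why starting to update early matters.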
If you haven't started updating yet, when will you start? The updates should be small only if you have a highly confident model of which future capabilities require dangerous styles of thinking, and I don't think such confidence is justified.