I think that it’s mostly Eliezer who believes so strongly in utility functions. Nate Soares’ post Deep Deceptiveness, which I claim is a central part of the MIRI threat model insofar as there is one, doesn’t require an agent coherent enough to satisfy VNM over world-states. In fact, the agent it describes can depart from coherence in several ways and still be capable and dangerous:
It can flinch away from dangerous thoughts;
Its goals can drift over time;
Its preferences can be incomplete;
Maybe every 100 seconds it randomly gets distracted for 10 seconds.
The important property is that it has a goal about the real world, applies general problem-solving skills to achieve it, and has no stable desire to use its full intelligence to be helpful to or good for humans. No one has formalized this, so no one has proved interesting things about such an agent model.
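To make the picture slightly more concrete, here is a purely illustrative toy sketch in Python (every name is invented for the example, and it is emphatically not the missing formalization): an agent that exhibits all four departures from coherence above and still pushes its toy world toward its goal, never consulting what humans want along the way.

```python
# Purely illustrative toy, not a formalization. All names are made up.
import random

random.seed(0)


class IncoherentGoalDirectedAgent:
    """Toy agent: goal-directed but incoherent in the four ways listed above."""

    def __init__(self):
        self.goal = 100.0    # how much of the toy "resource" it wants the world to contain
        self.step_count = 0

    def propose_plans(self):
        # Stand-in for general problem solving: candidate actions with the
        # amount of resource each is expected to gain.
        return [("grab", 1.0), ("build_tool", 3.0), ("risky_hack", 5.0)]

    def choose(self, world_resource):
        self.step_count += 1

        # (4) Every ~100 steps it gets distracted for ~10 steps and does nothing.
        if self.step_count % 100 < 10:
            return None

        # (2) Goal drift: the target wanders slightly over time.
        self.goal += random.gauss(0, 0.1)

        # Goal-directedness: stop acting once the world matches the (current) goal.
        if world_resource >= self.goal:
            return None

        plans = self.propose_plans()

        # (1) Flinching: it refuses to even consider plans it finds "scary".
        plans = [p for p in plans if p[0] != "risky_hack"]

        # (3) Incomplete preferences: sometimes plans are incomparable to it,
        # so it picks arbitrarily instead of maximizing.
        if random.random() < 0.3:
            return random.choice(plans)
        return max(plans, key=lambda p: p[1])


world_resource = 0.0
agent = IncoherentGoalDirectedAgent()
for _ in range(1000):
    action = agent.choose(world_resource)
    if action is not None:
        world_resource += action[1]  # the toy world moves toward the agent's goal
# Note that nothing above ever consults what humans want.
print(f"resource acquired: {world_resource:.1f} (goal drifted to {agent.goal:.1f})")
```

The point of the sketch is only that none of the listed incoherences prevent the loop from steering the world toward the agent's goal; it says nothing about what a realistic such agent would look like.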
I would be somewhat surprised if Eliezer and Nate disagreed very much here, though you might know better. So I would mostly see Nate’s post as a clarification of both Eliezer’s and Nate’s views.
Based on my experience working with Nate and Vivek, I do think they disagree. Eliezer has said he has shared only 40% of his models with even Nate, for infosec reasons [1] (which surprised me!), so it isn’t surprising to me that they would have different views. Though I don’t know Eliezer well, I think he does believe in the basic point of Deep Deceptiveness (because it’s pretty basic) but also believes in coherence/utility functions more than Nate does. I can maybe say more privately, but if it’s important, asking one of them directly would be better.
[1] This was a while ago, so he might actually have said that Nate only has 40% of his models. Either way, my conclusion holds.