While 2024 SoTA models are not capable of autonomously optimizing the world, they are really smart, perhaps 1⁄2 or 2⁄3 of the way there, and are already beginning to have a big impact on the economy. As I said in response to your original post, because we don’t have 100% confidence in the coherence arguments, we should take observations about the coherence level of 2024 systems as evidence about how coherent the 203X autonomous corporations will need to be. Evidence that 2024 systems are not dangerous is both evidence that they are not AGI and evidence that AGI need not be dangerous.
I would agree with you if the coherence arguments were specifically about autonomously optimizing the world and not about autonomously optimizing a Go game or writing 100-line programs, but this doesn’t seem to be the case.
The agent-like structure conjecture is just that, a conjecture, and there has not really been significant progress on it. I don’t think it’s fair to say we’re making good progress on a proof.
This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior? In this world, what the believers in coherence really need to show is that almost all agents getting sufficiently high performance on sufficiently hard tasks score high on some metric of coherence. Then, for the argument to carry through, you need to show they also score high on some metric of incorrigibility, or are fragile to value misspecification. None of the classic coherence results quite hit this.
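To make that concrete, the needed claim would have roughly the following shape (this is only a sketch; the agent measure $\mu$, the coherence metric $C$, and the thresholds are placeholders I’m introducing, not anything from the classic results):

\[
\Pr_{\pi \sim \mu}\big[\, C(\pi) \ge c \;\big|\; \mathrm{Perf}(\pi, T) \ge p \,\big] \;\to\; 1 \quad \text{as the task difficulty } d(T) \to \infty,
\]

where $\mu$ is some natural distribution over agents (e.g. a simplicity prior), $C$ scores coherence, and $p$ is the performance bar. A second claim of the same shape, linking high $C(\pi)$ to some metric of incorrigibility or to fragility under value misspecification, would then be needed on top of this.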
However, AFAIK @Jeremy Gillen does not think we can get an argument with exactly this structure (the main argument in his writeup is a bit different), and Eliezer has argued, both historically and recently, that EU maximization is simple and natural. So maybe you do need the argument that an EU maximization algorithm is simpler than other algorithms, which seems to require some clever way to formalize it, because proving things directly about the space of all simple programs seems too hard.
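For what it’s worth, one placeholder way to cash out “simpler” (not something Eliezer or Jeremy has written down, just an illustration) is via description length: among programs that clear the performance bar, the shortest EU maximizer is not much longer than the shortest program overall,

\[
\min\{\, |q| : q \text{ implements EU maximization},\ \mathrm{Perf}(q, T) \ge p \,\}
\;\le\;
\min\{\, |q| : \mathrm{Perf}(q, T) \ge p \,\} + c
\]

for some small constant $c$. Even in this form, though, it still requires reasoning about the whole space of short high-performing programs, which is the part that seems too hard.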
I think you are mischaracterizing my beliefs here.
“almost all agents getting sufficiently high performance on sufficiently hard tasks score high on some metric of coherence.”
This seems right to me. Maybe see my comment further up; I think it’s relevant to arguments we’ve had before.
“This might be fine if proving things about the internal structure of an agent is overkill and we just care about behavior?”
We can’t say much about the detailed internal structure of an agent, because there are always a lot of ways to implement an algorithm. But we do only care about (generalizing) behavior, so we only need some very abstract properties relevant to that.