Daniel Kokotajlo comments on My disagreements with “AGI ruin: A List of Lethalities”

Daniel Kokotajlo 25 Sep 2024 14:39 UTC
2 points
0
because of synthetic data letting us control what the AI learns and what they value, and in particular we can place honeypots that are practically indistinguishable from the real world, such that if we detected an AI trying to deceive or gain power, the AI almost certainly doesn’t know whether we tested it or whether it’s in the the real world:
Much easier said than done my friend. In general there are a lot of alignment techniques which I think would plausibly work if only the big AI corporations invested sufficient time and money and caution in them. But instead they are racing each other and China. It seems that you think that our default trajectory includes making simulation-honeypots so realistic that even our AGIs doing AGI R&D on our datacenters and giving strategic advice to the President etc. will think that maybe they are in a sim-honeypot. I think this is unlikely; it sounds like a lot more work than AGI companies will actually do.
- Noosphere89 25 Sep 2024 15:07 UTC
  4 points
  0
  Parent
  Good point that some race dynamics will make the in practice outcome worse than my ideal.
  
  I agree that the problem is that the race dynamics will cause some labs to skip various precautions, but even in this world where we have 0 dignity, I have a non-trivial (though unacceptably low chance) of us succeeding at alignment, more like 20-50%, but even so, I agree that less racing would be good, because a 20-50% chance of success is very dangerous, it’s just that I believe we need 0 more insights into alignment, and there is a reasonably tractable way to get an AI to share our values in a way which requires lots of engineering to automate the data pipeline safely, but nothing in the need of new insights is required.
  
  This isn’t a post on “how we could be safe even under arbitrarily high pressure to race”, this is a post on how early Lesswrong and MIRI got a lot of things wrong such that we can assign much higher tractability to alignment happening, and thus good outcomes happening.
  
  You are correct that more safety culture needs to happen in labs, it’s just that we could have AI progress at some rate without getting ourselves into a catastrophe.
  
  So I agree with you for the most part that we need to slow down the race, I just think that we don’t need to go further and introduce outright stoppages because we don’t know in technical terms how to align AIs.
  
  Though see this post on the case for negative alignment taxes, which would really help the situation for us if we were in a race:
  
  https://www.lesswrong.com/posts/xhLopzaJHtdkz9siQ/the-case-for-a-negative-alignment-tax
  
  (We might disagree on how much time and money needs to be invested, but that’s a secondary crux.)