John_Maxwell comments on Getting from an unaligned AGI to an aligned AGI?

John_Maxwell 22 Jun 2022 10:39 UTC
7 points
I skimmed the post, here is some feedback (context):
- I’m probably not the best person to red team this since some of my own alignment ideas are along similar lines. I’m also a bit on the optimistic side about alignment more generally—it might be better to talk to a pessimist.
- This sounds a bit like the idea of a “low-bandwidth oracle”.
- I think the biggest difficulty is the one you explicitly acknowledged—boxing is hard.
- But there are also problems around ensuring that bandwidth is actually limited. If you have a human check to see that the AGI’s output conforms to the spec, the AGI could put an infohazard in the output, or write code which looks like it does one thing but actually does another. If you have a computer program or AI check the AI’s output, the AGI could exploit a flaw in that program/AI. I think this sort of thing basically gets you a probabilistic safety measure, because there’s always a risk that there’s a flaw that the superintelligent AGI sees (or can infer) that you don’t see. (I like this intuition pump for seeing why these sort of problems are plausible.) I think probabilistic safety measures can be good if we stack a lot of them together in the right way.
- The idea of emitting machine-checkable proofs is interesting. I’m not sure such proofs are very useful though. “Finding the right spec is one of the biggest challenges in formal methods.”—source. And finding the right spec seems more difficult to outsource to an unfriendly AI. In general, I think using AI to improve software reliability seems good, and tractable.
I think you’ll find it easier to get feedback if you keep your writing brief. Assume the reader’s time is valuable. Sentences like “I will mention some stuff later that maybe will make it more clear how I’d think about such a question.” should simply be deleted—make huge cuts. I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post. Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.

Similarly, I think it is worth investing in clarity. If a sentence is unclear, I have a tendency to just keep reading and not mention it unless I have a prior that the author knows what they’re talking about. (The older I get, the more I assume that unclear writing means the author is confused and ignorable.) I like writing advice from Paul Graham and Scott Adams.

Personally I’m more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading. I don’t have much time to do feedback right now unfortunately.
What links here?
- John_Maxwell's comment on Security Mindset: Lessons from 20+ years of Software Security Failures Relevant to AGI Alignment by elspood (22 Jun 2022 10:42 UTC; 4 points)
- Tor Økland Barstad's comment on Getting from an unaligned AGI to an aligned AGI? by Tor Økland Barstad (8 Jul 2022 6:57 UTC; 1 point)
- Tor Økland Barstad 22 Jun 2022 14:24 UTC
  5 points
  Parent
  This sounds a bit like the idea of a “low-bandwidth oracle”.
  Thanks, that’s interesting. Hadn’t seen that (insofar as I can remember). Definitely overlap there.
  
  I think probabilistic safety measures can be good if we stack a lot of them together in the right way.
  Same, and that’s a good/crisp way to put it.
  
  Maybe post a summary at the top, and say people are welcome to give feedback after just having read the summary.
  Will edit at some point so as to follow the first part of that suggestion. Thanks!
  
  I think you’ll find it easier to get feedback if you keep your writing brief. (..) I think I might have been able to generate the bullet points above based on a 2-paragraph executive summary of your post.
  Some things in that bullet-list addresses stuff I left out to cut length, and stuff I though I would address in future parts of the series. Found also those parts of bullet-list helpful, but still this exemplifies dilemmas/tradeoffs regarding length. Will try to make more effort to look for things to make shorter based on your advice. And I should have read through this one more before publishing.
- Tor Økland Barstad 7 Jul 2022 9:41 UTC
  1 point
  Parent
  Personally I’m more willing to give feedback on prepublication drafts because that gives me more influence on what people end up reading.
  I have a first draft ready for part 2 now: https://docs.google.com/document/d/1RGyvhALY5i98_ypJkFvtSFau3EmH8huO0v5QxZ5v_zQ/edit
  Will read it over more, but plan to post within maybe a few days.
  I have also made a few changes to part 1, and will probably make additional changes to part 1 over time.
  As you can see if you open the Google Doc, part 2 is not any shorter than part 1. You may or may not interpret that as an indication that I don’t make effective use of feedback.
  
  Part 3, which I have not finished, is the part that will focus more on proofs. (Edit: It does not. But maybe there will be a future post that focuses on proofs as planned. It is however quite very relevant to the topic of proofs the way I think of things.)
  Any help from anyone in reading over would be appreciated, but at the same time it is not expected :)
  - Tor Økland Barstad 10 Dec 2022 18:09 UTC
    1 point
    Parent
    A draft for part 3 is now ready: https://docs.google.com/document/d/1S0HsaImfJHtyz2wo59S9n2doefyKQfP9Yp6A1XxXEwM/edit#.
    
    Length-wise it is a bit of a monstrosity, despite your warnings.
    
    I do intend to later write a post that is quicker and easier to read (but gets less into detail).
    
    Feedback would be welcomed, but it’s definitely not expected (so I’m sharing it here just in case) 🙂