To me the security mindset seems inapplicable because in computer science, programs are rigid systems with narrow targets. AI is not very rigid, and the target, i.e. an aligned mind, is not necessarily narrow.
That rigidity is what makes computer security so easy.
...
Relative to AGI security.
No, the rigidity is what makes a system error-prone, i.e. brittle. If you don’t specify the solution exactly, the machine won’t solve the problem. Classic computer programs can’t generalize.
The OP makes the point that you can double a model’s size and it will still work well, but if you double a computer program’s binary size with unused code, you can get all sorts of weird errors, even if none of that extra code is ever executed.
An analogy is trying to write a symbolic logic program to emulate an LLM (i.e. with only if statements and for loops), or trying to make a self-driving car with Boolean logic.
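As a rough sketch of that analogy (a made-up toy, not any real system; the example prompts, the replies, and the function name rule_based_reply are invented purely for illustration), here is what an "LLM" built only out of if-statements looks like:

```python
# Toy rule-based "chatbot": a hand-written chain of exact-match rules.
# It handles only the inputs it was explicitly given; there is no notion
# of generalizing to nearby or rephrased inputs.

def rule_based_reply(prompt: str) -> str:
    prompt = prompt.strip().lower()
    if prompt == "hello":
        return "Hi there!"
    if prompt == "what is 2 + 2?":
        return "4"
    if prompt == "what is the capital of france?":
        return "Paris"
    return "no rule covers this input"

print(rule_based_reply("Hello"))             # covered (after lowercasing)
print(rule_based_reply("What is 2 + 2?"))    # covered by an exact rule
print(rule_based_reply("What's 2 plus 2?"))  # tiny rephrasing -> no rule covers it
```

Nothing here generalizes; every behaviour has to be spelled out case by case, which is the rigidity being described above.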
If I flip one single bit in a computer program, it will probably fail catastrophically and crash. However, removing random weights won’t do much to an LLM.
A little tangent on flipping a bit:
Flipping a bit in the actual binary itself (the thing the computer reads to run the program) will probably cause the program to access memory it wasn’t supposed to and immediately crash.
Changing a letter in the source code that humans write will almost certainly cause the program not to compile.
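To make the bit-flipping contrast concrete, here is a minimal, self-contained sketch in plain Python (a toy illustration rather than an experiment on a real binary or a real LLM: the sample source string and the weight value 0.7312 are invented, and it deliberately flips a low-order mantissa bit, since flipping an exponent bit could matter far more):

```python
import struct

# 1) Corrupt one byte of source code and try to compile it.
source = "def double(x):\n    return 2 * x\n"
corrupted = bytearray(source, "ascii")
corrupted[10] ^= 0x01  # flips '(' (0x28) into ')' (0x29)
try:
    compile(bytes(corrupted).decode("ascii"), "<corrupted>", "exec")
    print("corrupted source still compiles")
except SyntaxError as err:
    print("corrupted source fails to compile:", err.msg)

# 2) Flip the lowest mantissa bit of a float32 "weight".
weight = struct.unpack("<f", struct.pack("<f", 0.7312))[0]  # round to float32 first
packed = bytearray(struct.pack("<f", weight))
packed[0] ^= 0x01  # least-significant mantissa bit (little-endian float32)
perturbed = struct.unpack("<f", bytes(packed))[0]
print(f"original weight : {weight!r}")
print(f"perturbed weight: {perturbed!r}")
print(f"relative change : {abs(perturbed - weight) / abs(weight):.2e}")
```

This is only a loose analogy: the claim that an LLM shrugs off removed or perturbed weights is an empirical claim about trained networks, which this toy does not test.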
Yep, those two points about the binary and the source code are the important parts. Neural networks are much more robust than that, and that robustness is extreme compared to a lot of other fields, which is why I’m skeptical of applying the security mindset: it would predict false things.
The non-rigidity of ChatGPT and its ilk does not make them less error-prone. Indeed, ChatGPT text is usually full of errors. But the errors are just as non-rigid. So are the means, if they can be found, of fixing them. ChatGPT output has to be read with attention to see its emptiness.
None of this has anything to do with security mindset, as I understand the term.
The point is that if it were like computer security, or even computer engineering, those errors would completely destroy ChatGPT’s intelligence and make it as useless as a random computer. This is just one example of the kind of observation that makes me skeptical of applying the security mindset: ML/AI, and its subfield ML/AI alignment, is a strange enough field that I wouldn’t port over intuitions from other fields.
ML/AI alignment is like quantum mechanics: you need to leave your intuitions at the door, and unfortunately that makes public outreach likely net-negative.
At this point it is not clear to me what you mean by security mindset. I understand by it what Bruce Schneier described in the article I linked, and what Eliezer describes here (which cites and quotes from Bruce Schneier). You have cited QuintinPope, who also cites the Eliezer article, but gets from it this concept of “security mindset”: “The bundle of intuitions acquired from the field of computer security are good predictors for the difficulty / value of future alignment research directions”. From this and his further words about the concept, he seems to mean something like “programming mindset”, i.e. good practice in software engineering. Only if I read both you and him as using “security mindset” to mean that can I make sense of the way you both use the term.
But that is simply not what “security mindset” means. Recall that Schneier’s article began with the example of a company selling ant farms by mail order, nothing to do with software. After several more examples, only one of which concerns computers, he gives his own short characterisation of the concept that he is talking about:
the security mindset involves thinking about how things can be made to fail. It involves thinking like an attacker, an adversary or a criminal. You don’t have to exploit the vulnerabilities you find, but if you don’t see the world that way, you’ll never notice most security problems.
Later on he describes its opposite:
The designers are so busy making these systems work that they don’t stop to notice how they might fail or be made to fail, and then how those failures might be exploited.
That is what Eliezer is talking about, when he is talking about security mindset.
Yes, prompting ChatGPT is not like writing a software library like pytorch. That does not make getting ChatGPT to do what you want and only what you want any easier or safer. In fact, it is much more difficult. Look at all the jailbreaks for ChatGPT and other chatbots, where they have been made to say things they were intended not to say, and answer questions they were intended not to answer.
My issue with the security mindset is that there’s a selection effect/bias that causes people to notice the failures of security and not its successes, even if the true evidence for success is massively larger than the evidence for failure.
Here’s a quote from lc’s post POC or GTFO as a counter to alignment wordcelism, on why the security industry has massive issues with people claiming security failures even when they don’t or can’t happen:
Even if you’re right that an attack vector is unimportant and probably won’t lead to any real world consequences, in retrospect your position will be considered obvious. On the other hand, if you say that an attack vector is important, and you’re wrong, people will also forget about that in three years. So better list everything that could possibly go wrong[1], even if certain mishaps are much more likely than others, and collect oracle points when half of your failure scenarios are proven correct.
And this is why, in general, I dislike the security mindset: it creates incentives to report failures or bad events even when they aren’t much of a concern.
Also, most of what computer security people actually do doesn’t need to be done in ML/AI, which is another reason I’m skeptical of the security mindset.
These are parochial matters within the computer security community, and do not bear on the hazards of AGI.
They do matter, since they imply a sort of selection effect where people share the evidence for doom and don’t notice the evidence for not-doom. This matters because the real chance of doom may be much lower, in principle arbitrarily low, while LWers and AI safety/governance organizations have higher probabilities of doom.
Combined with the more standard bias toward negative news being selected for, this is one piece of why I think AI doom is very unlikely. It is just one piece, not my entire argument.
And I think this has already happened; cf. the inner misalignment/optimization daemon situation, where it was tested twice: once showing a confirmed break, and once by Ulisse Mini, where in a more realistic setting the optimization daemon/inner misalignment went away. Very little was shared about this second result, compared to the original, which almost certainly got more views.