Re jailbreaks: I think this is not an example of alignment being unsolved, but rather an example of how easy it is to misuse/control LLMs.
Also, a lot of jailbreak successes rely on the fact that the model has been trained to accept a very wide range of requests for deployment reasons, which suggests that narrowing the domain of acceptable questions for internal use could dramatically reduce the space of jailbreaks:
Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that’s just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface.
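As a minimal sketch of what such a narrowed monitor might look like (the question list, function names, and refusal behavior here are all illustrative assumptions, not a real system):

```python
# Illustrative sketch: a wrapper that only runs a fixed, preset list of
# code-analysis questions and refuses everything else. ALLOWED_QUESTIONS
# and analyze() are hypothetical names used purely for illustration.

ALLOWED_QUESTIONS = {
    "does_this_code_exfiltrate_data",
    "does_this_code_modify_permissions",
    "does_this_code_contact_external_hosts",
}

def analyze(question_id: str, code: str) -> str:
    """Run the monitor on one preset question; refuse anything else."""
    if question_id not in ALLOWED_QUESTIONS:
        # No roleplay, no open-ended requests: the attack surface is
        # limited to the preset question set.
        return "REFUSED: question not in preset list"
    # A real system would call the monitoring model here with a fixed
    # prompt template for this question; we stub that call out.
    return f"ANALYSIS({question_id}): {len(code)} bytes reviewed"

print(analyze("tell_me_a_story_as_grandma", "x = 1"))
print(analyze("does_this_code_exfiltrate_data", "x = 1"))
```

The point of the wrapper is that a "grandma story" jailbreak never reaches the model at all: anything outside the preset list is rejected before a prompt is even constructed.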
I have three other concrete concerns about this strategy. If I understand it correctly, the plan is for humans to align the first AGI, then for that AGI to align the next one, and so forth (until ASI).
What if the strategy breaks at the first step? What if the first AGI turns out to be deceptive (scheming) and only pretends to be aligned with humans? It seems that if we task such a deceptive AGI with aligning other AGIs, we will end up with a pyramid of misaligned AGIs.
What if the strategy breaks further down the line? What if AGI #21 accidentally aligns AGI #22 to be deceptive (scheming)? Would there be any fallback mechanisms we could rely on?
What is the end goal? Do we stop once we achieve ASI? Can we stop once we achieve ASI? What if the ASI doesn't agree and instead opts to continue self-improving? Will we be able to get to a point where the acceleration of ASI's intelligence plateaus, so we can recuperate and plan for the future?
We die (don't fuck this step up! :)
Unless we still have adequate mech interp or a natural-language train of thought to detect deceptive alignment.
We die (don't let your AGI fuck this step up! :)
22 chained independent alignment attempts does sound like too much. Hubinger specified that he wasn't thinking of daisy-chaining like that, but of one trusted agent that keeps itself aligned as it grows smarter.
The endgame is to use intent alignment as a stepping stone to value alignment and let something more competent and compassionate than us monkeys handle things from there on out.
The first concern is absolutely critical. One way to break the circularity is to rely on AI control; another is to set up incentives that favor alignment as an equilibrium and make dishonesty/misalignment unfavorable, in the sense that there is no continuously rewarding path to misalignment.
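As a toy illustration of the "no continuously rewarding path to misalignment" condition (the reward numbers and the check itself are made-up assumptions, purely to make the idea concrete):

```python
# Toy model: reward received at each training step along two behavioral
# paths. The incentive-design goal sketched above is that any deceptive
# path should contain at least one step that does strictly worse than
# staying honest, so greedy reward-following never drifts onto it.

honest_path    = [1.0, 1.1, 1.2, 1.3]   # hypothetical per-step rewards
deceptive_path = [1.0, 1.2, 0.4, 2.0]   # large payoff only after a dip

def continuously_rewarding(path, baseline):
    """True iff every step of `path` does at least as well as the
    honest baseline at the same step (i.e. there is no deterrent dip)."""
    return all(p >= b for p, b in zip(path, baseline))

print(continuously_rewarding(deceptive_path, honest_path))  # → False
```

In this toy framing, the designer's job is to shape rewards so that every deceptive path has a dip like the one at step 3, even if its eventual payoff is higher.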
The second issue is less critical: assuming AGI #21 hasn't itself become deceptively aligned, we can throw away #22 and restart from a fresh training run.
If that's no longer an option, we can go to war against the misaligned AGI with our own AGI forces.
In particular, you can still do a whole lot of automated research once labor bottlenecks are broken; this is a slowdown, but not a fatal one, so we can work around it.
On the third issue: if we have achieved aligned ASI, then we have achieved our goal. Once humans are obsolete at making alignment advances, that's when we can say the end goal has been reached.
I agree with the comments by both you and Seth. I guess that isn't really part of alignment as usually understood. However, I think it is part of a broader "prevent AI from killing humans" strategy, so it's still pretty important for our main goal.
I am not exactly sure I understand your proposal. Are you proposing that we radically gut our leading future model by restricting it severely? I don't think any AI lab would agree to do that, because such a future AI would probably be less useful than even current AIs.
Or are you proposing that we use AI monitors to watch our leading future AI models, and then heavily restrict only the monitors?
My proposal is to restrict only the AI monitor's domain.
I agree this is a reduction in capability relative to an unconstrained AI, but at least in the internal-use setting (rather than deployment), you probably don't need, and may not even want, the monitor to be able to write fiction or tell calming stories; you just want it to perform specific, narrow tasks.
That's pretty interesting. I do think that if the iterative alignment strategy ends up working, then this will probably work too (if nothing else, because it seems much easier).
I still have some concerns about the iterative alignment strategy in general, so I will try to write them down below.
EDIT: On second thought, I might create a separate question for this (and link it here), for the benefit of everyone who is concerned about the same (or similar) things.
That would be great. Do reference scalable oversight to show you’ve done some due diligence before asking to have it explained. If you do that, I think it would generate some good discussion.
Sure, I might as well ask my question directly about scalable oversight, since it seems like a leading strategy for iterative alignment anyway. I do have one preliminary question (which probably isn't worth including in that post, given that it doesn't ask about a specific issue or threat model, but rather about people's expectations).
I take it this strategy relies on evaluating research being easier than producing it? Do you expect that to be the case?
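For a concrete (if far simpler than research) instance of the generation/evaluation asymmetry this question gestures at, consider factoring: finding a factor takes a search, while checking a proposed factor is a single operation. The analogy to research evaluation is an assumption of the scalable-oversight framing, not a guarantee:

```python
# Toy illustration of the generation/evaluation gap: finding a factor of n
# requires trial division up to sqrt(n), while checking a claimed factor
# is one modulo operation.

def find_factor(n: int) -> int:
    """Generation: search for the smallest nontrivial factor (slow)."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n is prime

def check_factor(n: int, d: int) -> bool:
    """Evaluation: verify a claimed nontrivial factor (one operation)."""
    return 1 < d < n and n % d == 0

n = 101 * 103  # 10403
d = find_factor(n)
print(d, check_factor(n, d))  # → 101 True
```

Scalable oversight hopes that judging AI-produced alignment research is more like `check_factor` than `find_factor`; whether that holds for open-ended research is exactly the open question.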