The issue is that no matter how much good stuff they do, one can always call it marginal and not enough. There aren’t really any objective benchmarks for measuring it.
The justification for creating AI (at least in the heads of its developers) could look something like this (really simplified): our current models seem safe, we have a plan for how to proceed, and we believe that we will achieve utopia.
You can argue that they are wrong and nothing like that will work, but that’s their justification.
The issue is that no matter how much good stuff they do, one can always call it marginal and not enough.
I think this is unfairly glossing my position. Simply because someone could always ask for more work does not imply that this particular request for more work is necessarily unreasonable.
“I wish you would be a little less selfish”
“People always ask others to do more for them! When will this stop!”
“Well, in this case you have murdered many people for personal gratification and stolen the life savings of many people. Surely I get to ask it in this case.”
The basic standard I am proposing is that if you are getting billions-of-dollars rich by building machines you believe may lead to omnicide or omnicide-adjacent outcomes for literally everyone, you should have a debate with a serious & worthy critic (e.g. a Nobel Prize winner in your field who believes you shouldn’t be doing so, or perhaps a person who saw the alignment problem a ~decade ahead of everyone and worked on it as a top priority the entire time) about (a) why you are doing this, and (b) whether the outcome will be good or bad.
I have many other basic standards that are not being met, but on the question of whether omnicide or omnicide-adjacent things are the default outcome when you are unilaterally running humanity towards it, this is perhaps the lowest standard for public discourse and honest debate. Publicly show up to engage with your strongest critics, at least literally once (though you are correct to assume that actually I think it should happen more than once).
I do disagree with your proposed standard. It is good enough for me that the critic’s arguments are engaged by someone on your side. Going there personally seems unnecessary. After all, if the goal is to build safe AI, you personally knowing a niche technical solution isn’t necessary, as long as you have people on your team who are aware of publicly produced solutions as well as internal ones.
And there is engagement with people like Yudkowsky from the optimistic side. There are at least proposed solutions to the problems he is presenting. Just because Sam Altman personally didn’t invent or voice them doesn’t really make him unethical (I accept that he personally may be unethical for a different reason).
It is good enough for me that the critic’s arguments are engaged by someone on your side. Going there personally seems unnecessary.
What engagement are you referring to? If there is such a defense that is officially endorsed by one of the leading companies developing potential omnicide-machines (or endorsed by the CEO/cofounders), and that seriously engages with worthy critics, I don’t recall it at this moment.
After all, if the goal is to build safe AI, you personally knowing a niche technical solution isn’t necessary, if you have people on your team who are aware of publicly produced solutions as well as internal ones.
I believe that nobody on earth has a solution to the alignment problem; of course, this would all be quite different if I felt anyone credibly claimed to have a good such solution.
Edit: Pardon me, I hit cmd-enter a little too quickly, I have now (~15 mins later) slightly edited my comment to be less frantic and a little more substantive.
I similarly don’t see the need for any official endorsement of the arguments. For example, if a critic says that such-and-such technical obstacle will prevent us from building safe AI, and someone responds with reasons why that obstacle will not prevent us from building safe AI (maybe this particular one is unlikely by default for various reasons), then that obstacle will simply not prevent us from building safe AI. I don’t see a need for a CEO to officially endorse the response.
There is a different type of technical obstacle, which you actively need to work against. But even in this case, as long as someone has found a way to combat it, and the relevant people in your company are aware of the solution, it is fine by me.
Even if there are technical obstacles that can’t be solved in principle, they should be evaluated and discussed by technical people (as they are on LessWrong, for example).
I am definitely not saying that I can pinpoint an exact solution to AI alignment, but there have been attempts so promising that leading skeptics (like Yudkowsky) have said “Not obviously stupid on a very quick skim. I will have to actually read it to figure out where it’s stupid. (I rarely give any review this positive on a first skim. Congrats.)”
Whether companies actually follow promising alignment techniques is an entirely different question. But having CEOs officially endorse such solutions as opposed to relevant specialists evaluating and implementing them doesn’t seem strictly necessary to me.
I’m sorry, I’m confused about something here, I’ll back up briefly and then respond to your point.
My model is:
The vast majority of people who’ve seriously thought about it believe we don’t know how to solve the alignment problem.
More fundamentally, there’s a sense in which we “basically don’t know what we’re doing” with regards to AI. People talk about “agents” and “goals” and “intentions” but we’re kind of like at the phlogiston theory of heat or vitalism theory of life. We don’t get it. We have no equations, we have no theory, we’re just like “man these systems can really write and make pretty pictures” like we used to say “I don’t get it but some things are hot and some things are cold”. Science was tried, found hard, engineering was tried, found easy, and now we’re only doing that.
Many/most folks who’ve given it serious thought are pretty confident that the default outcome is doom (omnicide or permanent disempowerment), though it may be way kinda worse (e.g. eternal torture) or slightly better (e.g. we get to keep earth), due to intuitive arguments about instrumental goals and selecting on minds in the way machine learning works. (This framing is a bit local, in that not every scientist in the world would quite know what I’m referring to here.)
People are working hard and fast to build these AIs anyway because it’s a profitable industry.
This literally spells the end of humanity (barring the eternal torture option or the grounded on earth option).
Back to your comment: some people are building AGI and knowingly threatening all of our lives. I propose they should show up and explain themselves.
A natural question is “Why should they talk with you Ben? You’re just one of the 8 billion people whose lives they’re threatening.”
That is why I am further suggesting they talk with many of the great and worthy thinkers who are of the position this is clearly bad, like Hinton, Bengio, Russell, Yudkowsky, Bostrom, etc.
I am reading you say something like “But as long as someone is defending their behavior, they don’t need to show up to defend it themselves.”
This lands with me like we are two lowly peasants, who are talking about how the King has mistreated us due to how the royal guards often beat us up and rape the women. I’m saying “I would like the King to answer for himself” and I’m hearing you say “But I know a guy in the next pub who thinks the King is making good choices with his powers. If you can argue with him, I don’t see why the King needs to come down himself.” I would like to have the people who are wielding the power defend themselves.
Again, this is not me proposing business norms, it’s me saying “the people who are taking the action that looks like it kills us, I want those people in particular to show up and explain themselves”.
I will try to provide another similar analogy. Let’s say that a King got tired of his people dying from diseases, so he decided to try a novel method of vaccination.
However, some people were really concerned about that. As far as they were concerned, the default outcome of injecting viruses into the bodies of people is death. And the King wants to vaccinate everyone, so these people create a council of great and worthy thinkers who after thinking for a while come up with a list of reasons why vaccines are going to doom everyone.
However, some other great and worthy thinkers (let’s call them “hopefuls”) come to the council and give reasons to think that the aforementioned reasons are mistaken. Maybe they have done their own research, which seems to vindicate the King’s plan or at least undermine the council’s arguments.
And now imagine that the King comes down from the castle, points his finger at the hopefuls giving arguments for why the arguments proposed by the council are wrong, says “yeah, basically this”, and then turns around and goes back to the castle. To me it seems like the King’s official endorsement of the arguments proposed by the hopefuls doesn’t really change the ethicality of the situation, as long as the King is acting according to the hopefuls’ plan.
Furthermore, imagine if one of the hopefuls who came to argue with the council was actually the King, undercover, and he gave exactly the same arguments as the people before him. This still IMO doesn’t change the ethicality of the situation.
The central thing I am talking about is basic measures for accountability, of which I consider very high up to be engaging with criticism, dialogue, and argument (as is somewhat natural given my background philosophy from growing up on LessWrong).
The story of a King doing things for good reasons lacks any mechanism for accountability if the King is behaving badly. It is important to design systems of power that do not rely on the people in power being good and right, but instead make it so that if they behave badly, they are held to account. I don’t think I have to explain why incentives and accountability matter for how the powerful wield their powers.
My basic claim is that the plans for avoiding omnicide or omnicide-adjacent outcomes are not workable (slash there-are-no-plans), there is little-to-no responsibility being taken, and that there is no accountability for this illegitimate use of power.
If you believe that there is any accountability for the CEOs of the companies building potentially omnicidal machines and risking the lives of 8 billion people (such as my favorite mechanism: showing up and respectfully engaging with the people they have power over, but also any other mechanism you like; for instance there are not currently any criminal penalties for such behaviors, but that would be a good example if it did exist), I request you provide links, I would welcome specifics to talk about.
To modify my example to include an accountability mechanism that’s also similar to real life: the King takes exactly the same vaccines as everyone else. So if he messed up with the chemicals, he also dies.
I believe a similar accountability mechanism works in our real-world case. If CEOs build unsafe AI, they and everyone they value in this life die. This seems like a really good incentive for them not to build unsafe AI.
At the end of the day, voluntary commitments such as debating with critics are not as strong, in my opinion. Imagine that they agree with you and go to the debate. Without the incentive of “if I mess up, everyone dies”, the CEOs could just go right back to doing what they were doing. As far as I know, voluntary debates have few (if any) actual legal mechanisms to hold CEOs accountable.
If a medicine literally kills everyone who takes it within a week of taking it, sure, it will not get widespread adoption amongst thousands of people.
If the medicine has bad side effects for 1 in 10 people and no upsides, or it only kills people 10 years later, and at the same time there is some great financial deal the ruler can make for himself in accepting this trade with the neighboring nation who is offering the vaccines, then yes I think that could easily be enough pressure for a human being to rationalize that actually the vaccines are good.
The relevant question is rarely ‘how high stakes is the decision’. The question is what is in need of rationalizing, how hard is it to support the story, and how strong are the pressures on the person to do that. Typically when the stakes are higher, the pressures on people to rationalize are higher, not lower.
Politicians often enact policies that make the world worse for everyone (including themselves) while thinking they’re doing their job well, due to the various pressures and forces on them. The fact that it arguably increases their personal chance of death isn’t going to stop them, especially when they can easily rationalize it away because it’s abstract. In recent years politicians in many countries enacted terrible policies during a pandemic that extended the length of the pandemic (there were no challenge trials, there was inefficient handing out of vaccines, there were false claims about how long the pandemic would last, there were false claims about mask effectiveness, there were forced lockdown policies that made no sense, etc). These policies hurt people and messed up the lives of ~everyone in the country I live in (the US), which includes the politicians who enacted them and all of their families and friends. Yet this was not remotely sufficient to cause them to snap out of it.
What is needed to rationalize AI development when the default outcome is doom? Here’s a brief attempt:
A lot of people who write about AI are focused on current AI capabilities and have a hard time speculating about future AI capabilities. Talk with these people. This helps you keep the downsides in far mode and the upsides in near mode (which helps because current AI capabilities are ~all upside, and pose no existential threat to civilization). The downsides can be pushed further into far mode with phrases like ‘sci-fi’ and ‘unrealistic’.
Avoid arguing with people or talking with people who have thought a great deal about this and believe the default outcome is doom and the work should be stopped (e.g. Hinton, Bengio, Russell, Yudkowsky, etc). This would put pressure on you to keep those perspectives alive while making decisions, which would cause you to consider quitting.
Instead of acknowledging that we fundamentally don’t know what we’re doing, focus instead on the idea that other people are going to plough ahead. Then you can say that you have a better chance than them, rather than admitting that neither of you has a good chance.
This puts you into a mental world where you’re basically doing a good thing and you’re not personally responsible for much of the extinction-level outcomes.
Intentionally contributing to omnicide is not what I am describing. I am describing a bit of rationalization in order to receive immense glory, and that leading to omnicide-level outcomes 5-15 years down the line. This sort of rationalizing why you should take power and glory is frequent and natural amongst humans.
I don’t think it would necessarily be easy to rationalize that a vaccine that negatively affects 10% of the population and has no effect on 90% is good. It seems possible to rationalize that it is good for you (if you don’t care about other people), but quite hard to rationalize that it is good in general.
Given how few politicians die as the result of their policies (at least in the Western world), the increase in their chance of death does seem negligible (compared to something that presumably increases this risk by a lot). Most bad policies that I have in mind don’t seem to happen during human catastrophes (like pandemics).
However, I suspect that the main point of your last comment was that potential glory can, in principle, be one of the reasons for why people rationalize stuff, and if that is the case, then I can broadly agree with you!
I have found it fruitful to argue this case back and forth with you, thank you for explaining and defending your perspective.
I will restate my overall position, I invite you to do the same, and then I propose that we consider this 40+ comment thread concluded for now.
———
The comment of yours that (to me) started this thread was the following.
If the default path is AIs taking over control from humans, then what is the current plan in leading AI labs? Surely all the work they put into AI safety is done to prevent exactly such scenarios. I would find it quite hard to believe that a large group of people would vigorously do something if they believed that their efforts would be in vain.
I primarily wish to argue that, given the general lack of accountability for developing machine learning systems in worlds where indeed the default outcome is doom, it should not be surprising to find out that there is a large corporation (or multiple) doing so. One should not assume that the incentives are aligned – anyone who is risking omnicide-level outcomes via investing in novel tech development currently faces no criminal penalties, fines, or international sanctions.
Given the current intellectual elite scene where a substantial number of prestigious people care about extinction level outcomes, it is also not surprising that glory-seeking companies have large departments focused on ‘ethics’ and ‘safety’ in order to look respectable to such people. Separately from any intrinsic interest, it has been a useful political chip for enticing a great deal of talent from scientific communities and communities interested in ethics to work for them (not dissimilar to how Sam Bankman-Fried managed to cause a lot of card-carrying members of the Effective Altruist philosophy and scene to work very hard to help build his criminal empire by talking a good game about utilitarianism, veganism, and the rest).
Looking at a given company’s plan for preventing doom, and noticing that it does not check out, should not be followed by an assumption of adequacy and good incentives (“surely this company would not exist nor do work on AI safety if it did not have a strong plan; I must be mistaken”). I believe that there is no good plan and that these companies would exist regardless of whether a good plan existed or not. Given the lack of accountability, and my belief that alignment is clearly unsolved and we fundamentally do not know what we’re doing, I believe the people involved are getting rich risking all of our lives and there is (currently) no justice here.
We have agreed on many points, and from the outset I believe you felt my position had some truth to it (e.g. “I do get that point that you are making, but I think this is a little bit unfair to these organizations.”). I will leave you to outline whichever overall thoughts and remaining points of agreement or disagreement that you wish.
Sure, it sounds like a good idea! Below I will write my thoughts on your overall summarized position.
———
“I primarily wish to argue that, given the general lack of accountability for developing machine learning systems in worlds where indeed the default outcome is doom, it should not be surprising to find out that there is a large corporation (or multiple) doing so.”
I do think that I could maybe agree with this if it were one small corporation. In your previous comment you suggested that you are describing not an intentional contribution to omnicide, but a bit of rationalization. I don’t think I would agree that that many people working on AI are successfully engaged in that bit of rationalization, or that it would be enough to keep them doing it. The big factor is that in the case of their failure, they personally (and all of their loved ones) will suffer the consequences.
“It is also not surprising that glory-seeking companies have large departments focused on ‘ethics’ and ‘safety’ in order to look respectable to such people.”
I don’t disagree with this, because it seems plausible that one of the reasons for creating safety departments is ulterior. However, I believe that this reason is probably not the main one, and that AI safety labs are producing genuinely good research papers. To take Anthropic as an example, I’ve seen safety papers that got the LessWrong community excited (at least judging by upvotes). Like this one.
“I believe that there is no good plan and that these companies would exist regardless of whether a good plan existed or not… I believe that the people involved are getting rich risking all of our lives and there is (currently) no justice here”
For the reasons that I mentioned in my first paragraph, I would probably disagree with this. Relatedly, while I do think wealth in general can be somewhat motivating, I also think that AI developers are aware that all their wealth would mean nothing if AI kills everyone.
———
Overall, I am really happy with this discussion. Our disagreements came down to a few points, and we agree on quite a few issues. I am similarly happy to conclude this big comment thread.
Avoid arguing with people or talking with people who have thought a great deal about this and believe the default outcome is doom and the work should be stopped (e.g. Hinton, Bengio, Russell, Yudkowsky, etc). This would put pressure on you to keep those perspectives alive while making decisions, which would cause you to consider quitting.
Instead of acknowledging that we fundamentally don’t know what we’re doing, focus on the idea that other people are going to plough ahead. Then you can say that you have a better chance than them, rather than admitting that neither of you has a good chance.
This puts you into a mental world where you’re basically doing a good thing and you’re not personally responsible for much of the extinction-level outcomes.
Intentionally contributing to omnicide is not what I am describing. I am describing a bit of rationalization in order to receive immense glory, and that leading to omnicide-level outcomes 5-15 years down the line. This sort of rationalizing why you should take power and glory is frequent and natural amongst humans.
I don’t think it would necessarily be easy to rationalize that a vaccine that negatively affects 10% of the population and has no effect on the other 90% is good. It seems possible to rationalize that it is good for you (if you don’t care about other people), but quite hard to rationalize that it is good in general.
Given how few politicians die as the result of their policies (at least in the Western world), the increase in their chance of death does seem negligible (compared to something that presumably increases this risk by a lot). Most bad policies that I have in mind don’t seem to happen during human catastrophes (like pandemics).
However, I suspect that the main point of your last comment was that potential glory can, in principle, be one of the reasons for why people rationalize stuff, and if that is the case, then I can broadly agree with you!
I have found it fruitful to argue this case back and forth with you, thank you for explaining and defending your perspective.
I will restate my overall position, I invite you to do the same, and then I propose that we consider this 40+ comment thread concluded for now.
———
The comment of yours that (to me) started this thread was the following.
I primarily wish to argue that, given the general lack of accountability for developing machine learning systems in worlds where indeed the default outcome is doom, it should not be surprising to find out that there is a large corporation (or multiple) doing so. One should not assume that the incentives are aligned – anyone who is risking omnicide-level outcomes via investing in novel tech development currently faces no criminal penalties, fines, or international sanctions.
Given the current intellectual elite scene where a substantial number of prestigious people care about extinction level outcomes, it is also not surprising that glory-seeking companies have large departments focused on ‘ethics’ and ‘safety’ in order to look respectable to such people. Separately from any intrinsic interest, it has been a useful political chip for enticing a great deal of talent from scientific communities and communities interested in ethics to work for them (not dissimilar to how Sam Bankman-Fried managed to cause a lot of card-carrying members of the Effective Altruist philosophy and scene to work very hard to help build his criminal empire by talking a good game about utilitarianism, veganism, and the rest).
Looking at a given company’s plan for preventing doom, and noticing that it does not check out, should not be followed by an assumption of adequacy and good incentives (“surely this company would not exist, nor do work on AI safety, if it did not have a strong plan, so I must be mistaken”). I believe that there is no good plan and that these companies would exist regardless of whether a good plan existed or not. Given the lack of accountability, and my belief that alignment is clearly unsolved and that we fundamentally do not know what we’re doing, I believe the people involved are getting rich risking all of our lives, and there is (currently) no justice here.
We have agreed on many points, and from the outset I believe you felt my position had some truth to it (e.g. “I do get that point that you are making, but I think this is a little bit unfair to these organizations.”). I will leave you to outline whichever overall thoughts and remaining points of agreement or disagreement that you wish.
Sure, it sounds like a good idea! Below I will write my thoughts on your overall summarized position.
———
“I primarily wish to argue that, given the general lack of accountability for developing machine learning systems in worlds where indeed the default outcome is doom, it should not be surprising to find out that there is a large corporation (or multiple) doing so.”
I do think that I could maybe agree with this if it were one small corporation. In your previous comment you suggested that you are describing not an intentional contribution to omnicide, but a bit of rationalization. I don’t think I would agree that so many people working on AI are successfully engaged in that bit of rationalization, or that it would be enough to keep them doing it. The big factor is that in case of their failure, they personally (and all of their loved ones) will suffer the consequences.
“It is also not surprising that glory-seeking companies have large departments focused on ‘ethics’ and ‘safety’ in order to look respectable to such people.”
I don’t disagree with this, because it seems plausible that one of the reasons for creating safety departments is ulterior. However, I believe that this reason is probably not the main one, and that AI safety labs are producing genuinely good research papers. To take the example of Anthropic, I’ve seen safety papers that got the LessWrong community excited (at least judging by upvotes). Like this one.
“I believe that there is no good plan and that these companies would exist regardless of whether a good plan existed or not… I believe that the people involved are getting rich risking all of our lives and there is (currently) no justice here”
For the reasons that I mentioned in my first paragraph, I would probably disagree with this. Relatedly, while I do think wealth in general can be somewhat motivating, I also think that AI developers are aware that all their wealth would mean nothing if AI kills everyone.
———
Overall, I am really happy with this discussion. Our disagreements came down to a few points, and we agree on quite a few issues. I am similarly happy to conclude this big comment thread.