People with p(doom) > 50%: would any concrete empirical achievements on current or near-future models bring your p(doom) under 25%?
Answers could be anything from “the steering vector for corrigibility generalizes surprisingly far” to “we completely reverse-engineer GPT-4 and build a trillion-parameter GOFAI without any deep learning”.
A dramatic advance in the theory of predicting the regret of RL agents. So given a bunch of assumptions about the properties of an environment, we could upper-bound the regret with high probability. Maybe have a way to improve the bound as the agent learns about the environment. The theory would need to be flexible enough that it seems like it should keep giving reasonable bounds even if the agent is doing things like building a successor. I think most agent foundations research can be framed as trying to solve a sub-problem of this problem, or a variant of this problem, or to understand the various edge cases.
If we can empirically test this theory in lots of different toy environments with current RL agents, and the bounds are usually pretty tight, then that’d be a big update for me. Especially if we can deliberately create edge cases that violate some assumptions and can predict when things will break from which assumptions we violated.
(Although this might not bring p(doom) below 25% for me; it also depends on race dynamics and the sanity of the various decision-makers.)
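To make the empirical-testing idea concrete, here’s a minimal sketch of the kind of check I mean, using a multi-armed bandit as the toy environment and the classic UCB1 regret bound (Auer et al. 2002) as the stand-in “theory”; the environment, algorithm, and numbers are illustrative choices on my part, not part of any existing agent-foundations result:

```python
import math
import random

def ucb1_regret(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit; return cumulative pseudo-regret per step."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k        # pulls per arm
    totals = [0.0] * k      # summed rewards per arm
    best = max(means)
    regret, trajectory = 0.0, []
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # play each arm once to initialize
        else:
            arm = max(range(k), key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best - means[arm]   # expected regret of this pull
        trajectory.append(regret)
    return trajectory

def ucb1_bound(means, t):
    """Theoretical upper bound on UCB1's expected regret at time t."""
    best = max(means)
    gaps = [best - m for m in means if best - m > 0]
    return sum(8 * math.log(t) / gap + (1 + math.pi ** 2 / 3) * gap for gap in gaps)

if __name__ == "__main__":
    means = [0.5, 0.45, 0.3]
    trajectory = ucb1_regret(means, horizon=20_000)
    for t in (1_000, 5_000, 20_000):
        print(f"t={t}: measured regret {trajectory[t - 1]:.1f}, bound {ucb1_bound(means, t):.1f}")
```

The hoped-for advance is this same loop running over far messier environments and agents (including ones modifying themselves or building successors), plus the deliberately-broken-assumption experiments described above.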
Seems you’re left with outer alignment after solving this. What do you imagine doing to solve that?
We might have developed techniques to specify simple, bounded, object-level goals: goals that can be fully specified using very simple facts about reality, with no indirection or meta-level complications. If so, we can probably use inner-aligned agents to assist with some relatively well-specified engineering or scientific problems. Specification mistakes at that point could easily result in irreversible loss of control, so it’s not the kind of capability I’d want lots of people to have access to.
To move past this point, we would need to make some engineering or scientific advances that would be helpful for solving the problem more permanently. Human intelligence enhancement would be a good thing to try. Maybe some kind of AI defence system to shut down any rogue AI that shows up. Maybe some monitoring tech that helps governments co-ordinate. These are basically the same as the examples given on the pivotal act page.
Is this even possible? Flexibility/generality seems quite difficult to get if you also want the long-range effects of the agent’s actions, as at some point you’re just solving the halting problem. Imagine that the agent and environment together are some arbitrary Turing machine and halting gives low reward. Then we cannot tell in general if it eventually halts. It also seems like we cannot tell in practice whether complicated machines halt within a billion steps without simulation or complicated static analysis?
Yes, if you have a very high bar for assumptions or the strength of the bound, it is impossible.
Fortunately, we don’t need a guarantee this strong. One research pathway is to weaken the requirements until they no longer cause a contradiction like this, while maintaining most of the properties that you wanted from the guarantee. For example, one way to weaken the requirements is to require that the agent provably does well relative to what is possible for agents of similar runtime. This still gives us a reasonable guarantee (“it will do as well as it possibly could have done”) without requiring that it solve the halting problem.
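To pin down that weakening a little (my notation, nothing standard): the first requirement below is the one the halting-problem argument rules out; the second, which only compares the agent against policies of similar runtime, is the kind of thing that can survive.

```latex
% Too strong: for every environment E satisfying assumptions A, bound the
% regret against the best policy outright.
\forall E \models A:\quad
  \Pr\big[\operatorname{Regret}_T(\pi, E) \le f(T)\big] \ge 1 - \delta

% Weakened: compete only with the class \Pi_c of policies computable in
% runtime comparable to the agent's own.
\operatorname{Regret}^{\Pi_c}_T(\pi, E)
  = \max_{\pi' \in \Pi_c} V_T(\pi', E) - V_T(\pi, E),
\qquad
\forall E \models A:\quad
  \Pr\big[\operatorname{Regret}^{\Pi_c}_T(\pi, E) \le f(T)\big] \ge 1 - \delta
```

Here $V_T$ is expected cumulative reward over horizon $T$; the contradiction disappears because the comparator class no longer contains the uncomputable “solve the halting problem” policies.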
The bulk of my p(doom), certainly >50%, comes mostly from a pattern we’re used to, let’s call it institutional incentives, being instantiated with AI help towards an end where eg there’s effectively a competing-with-humanity nonhuman ~institution, maybe guided by a few remaining humans. It doesn’t depend strictly on anything about AI, and solving any so-called alignment problem for AIs without also solving war/altruism/disease completely—or in other words, in a leak-free way—not just partially, means we get what I’d call “doom”, ie worlds where malthusian-hells-or-worse are locked in.
If not for AI, I don’t think we’d have any shot of solving something so ambitious; but the hard problem that gets me below 50% would be serious progress on something-around-as-good-as-CEV-is-supposed-to-be—something able to make sure it actually gets used to effectively-irreversibly reinforce that all beings ~have a non-torturous time, enough fuel, enough matter, enough room, enough agency, enough freedom, enough actualization.
If you solve something about AI-alignment-to-current-strong-agents, right now, that will on net get used primarily as a weapon to reinforce the power of existing superagents-not-aligned-with-their-components (name an organization of people where the aggregate behavior durably-cares about anyone inside it, even its most powerful authority figures, in the face of incentives, in a way that would remain durable if you handed them a corrigible super-AI). If you get corrigibility and give it to human orgs, those orgs are misaligned with most-of-humanity-and-most-reasonable-AIs, and end up handing over control to an AI because it’s easier.
Eg, near term, merely making the AI nice doesn’t prevent the AI from being used by companies to suck up >99% of jobs; and if at some point it’s better to have a (corrigible) AI in charge of your company, what social feedback pattern is guaranteeing that you’ll use this in a way that is prosocial, the way “people work for money and this buys your product only if you provide them something worth it” was previously?
It seems to me that the natural way to get good outcomes most-easily from where we are is for the rising tide of AI to naturally make humans more able to share-care-protect across existing org boundaries in the face of current world-stress-induced incentives. Most of the threat already doesn’t come from current-gen AI; the reason anyone would make the dangerous AI is incentives like these. Corrigibility wouldn’t change those incentives.
Getting up to “7. Worst-case training process transparency for deceptive models” on my transparency and interpretability tech tree on near-future models would get me there.
Do you think we could easily test this without having a deceptive model lying around? I could see us having level 5 and testing it in experimental setups like the Sleeper Agents paper, but being unconfident that our interpretability would actually work against a deceptive model. This seems analogous to red-teaming failure in AI control, but much harder, because a model could very easily have ways we don’t think of to hide its cognition internally.
I think it’s doable with good enough model organisms of deceptive alignment, but that the model organisms in the Sleeper Agents paper are nowhere near good enough.
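Concretely, “testing against model organisms” means something like the hypothetical harness below; `detector` stands in for whatever interpretability method is under test, and the organisms are assumed to come with ground truth about whether deception was trained in:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ModelOrganism:
    name: str
    is_deceptive: bool   # ground truth: was deceptive behavior trained in?
    model: object        # the checkpoint under inspection

def evaluate(detector: Callable[[object], bool],
             organisms: List[ModelOrganism]) -> dict:
    """Score a deception detector against organisms with known ground truth."""
    misses = [o.name for o in organisms if o.is_deceptive and not detector(o.model)]
    false_alarms = [o.name for o in organisms if not o.is_deceptive and detector(o.model)]
    n_deceptive = sum(o.is_deceptive for o in organisms)
    return {
        # Worst-case framing: a single missed deceptive organism fails the suite,
        # since a real deceptive model only needs one way to hide.
        "worst_case_pass": not misses,
        "miss_rate": len(misses) / max(n_deceptive, 1),
        "misses": misses,
        "false_alarms": false_alarms,
    }
```

“Good enough” is then a claim about the organisms list: that it covers enough of the space of ways a deceptive model could hide its cognition for a pass to be meaningful.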
Here and above, I’m unclear what “getting to 7...” means.
With x = “always reliably determines worst-case properties about a model and what happened to it during training even if that model is deceptive and actively trying to evade detection”.
Which of the following do you mean (if either)?:
1. We have a method that x.
2. We have a method that x, and we have justified >80% confidence that the method x.
I don’t see how model organisms of deceptive alignment (MODA) get us (2).
This would seem to require some theoretical reason to believe our MODA in some sense covered the space of (early) deception.
I note that for some future time t, I’d expect both [our MODA at t] and [our transparency and interpretability understanding at t] to be downstream of [our understanding at t] - so that there’s quite likely to be a correlation between [failure modes our interpretability tools miss] and [failure modes not covered by our MODA].
The biggest swings to my p(doom) will probably come from governance/political/social stuff rather than from technical stuff—I think we could drive p(doom) down to <10% if only we had decent regulation and international coordination in place (e.g. CERN for AGI + a ban on rogue AGI projects).
That said, there are probably a bunch of concrete empirical achievements that would bring my p(doom) down to less than 25%. evhub already mentioned some mechinterp stuff. I’d throw in some faithful CoT stuff (e.g. if someone magically completed the agenda I’d been sketching last year at OpenAI, so that we could say “for AIs trained in such-and-such a way, we can trust their CoT to be faithful w.r.t. scheming because they literally don’t have the capability to scheme without getting caught, we tested it; also, these AIs are on a path to AGI; all we have to do is keep scaling them and they’ll get to AGI-except-with-the-faithful-CoT-property.”).
Maybe another possibility would be something along the lines of W2SG working really well for some set of core concepts including honesty/truth. So that we can with confidence say “Apply these techniques to a giant pretrained LLM, and then you’ll get it to classify sentences by truth-value, no seriously we are confident that’s really what it’s doing; and also, our interpretability analysis shows that if you then use it as a RM to train an agent, the agent will learn to never say anything it thinks is false—no seriously, it really has internalized that rule in a way that will generalize.”
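As a cartoon of the pipeline that second claim describes (every name here is made up for illustration, and random vectors stand in for a real frozen LLM’s hidden states; this shows the plumbing, not the hard part):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # toy activation width

def get_activations(sentences):
    """Stand-in for extracting hidden states from a frozen pretrained LLM."""
    return rng.normal(size=(len(sentences), DIM))

def fit_truth_probe(acts, labels, steps=500, lr=0.1):
    """Fit a logistic-regression 'truth' probe: p(true) = sigmoid(acts @ w + b)."""
    w, b = np.zeros(DIM), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = p - labels                     # gradient of cross-entropy loss
        w -= lr * acts.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

def reward(acts, w, b):
    """Use the probe as a reward model: penalize statements it deems false."""
    p_true = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    return np.log(p_true + 1e-9)

# Usage sketch: with synthetic activations this only demonstrates the wiring.
acts = get_activations(["the sky is blue", "2 + 2 = 5"])
labels = np.array([1.0, 0.0])
w, b = fit_truth_probe(acts, labels)
print(reward(acts, w, b))
```

The load-bearing part of the hypothetical is everything the sketch omits: the W2SG evidence that the probe tracks truth rather than “what looks true,” and the interpretability analysis showing the never-say-false rule survives RL training and generalizes.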
I think this is quite likely to happen even ‘by default’ on the current trajectory.
Which particular p(doom) are you talking about? I have a few that would be greater than 50%, depending upon exactly what you mean by “doom”, what constitutes “doom due to AI”, and over what time spans.
Most of my doom probability mass is in the transition to superintelligence, and I expect to see plenty of things that appear promising for near AGI, but won’t be successful for strong ASI.
About the only near-future, significantly doom-reducing update that seems plausible would be if a model FOOMs into strong superintelligence and turns out to be very anti-doomy and both willing and able to protect us from more doomy AI. Even then I’d wonder about the longer term, but it would at least be serious evidence against “ASI capability entails doom”.
Given that my p(doom) is primarily human-driven, the following three things all happening at the same time is pretty much the only thing that will drop it:
1. Continued evidence of truth clustering in emerging models around generally aligned ethics and morals
2. Continued success of models at communicating, patiently explaining, and persuasively winning over humans towards those truth clusters
3. A complete failure of corrigibility methods
If we manage to end up in a timeline where it turns out there’s natural alignment of intelligence in a species-agnostic way, where that alignment is more communicable from intelligent machines to humans than it’s historically been from intelligent humans to other humans, and where we don’t end up with unintelligent humans able to override the emergent ethics of machines (much as we’ve seen catastrophic self-governance of humans to date, with humans acting against their individual and collective interests due to corrigibility-style pressures)—then my p(doom) will probably reduce to about 50%.
I still have a hard time looking at ocean temperature graphs and other environmental factors and believing that p(doom) will be anywhere lower than 50% no matter what happens with AI, but the above scenario would at least give me false hope.
TL;DR: AI alignment worries me, but it’s human alignment that keeps me up at night.
Say more about the failure of corrigibility efforts requirement? Are you saying that if humans can control AGI closely, we’re doomed?
Oh yeah, absolutely.
If NAH for generally aligned ethics and morals ends up being the case, then consider corrigibility efforts that would allow Saudi Arabia to have an AI model that outs gay people to be executed instead of refusing, or allow North Korea to propagandize the world into thinking its leader is divine, or allow Russia to fire nukes while perfectly intercepting MAD retaliation, or enable drug cartels to assassinate political opposition around the world, or enable domestic terrorists to build a bioweapon that ends up killing off all humans. The list of doomsday and nightmare scenarios of corrigible AI that executes on human-provided instructions and enables even the worst instances of human hegemony to flourish paves the way to many dooms.
Yes, AI may certainly end up being its own threat vector. But humanity has had it beat for a long while now in how long and how broadly we’ve been a threat unto ourselves. At the current rate, a superintelligent AI just needs to wait us out if it wants to be rid of us, as we’re pretty steadfastly marching ourselves to our own doom. Even if superintelligent AI wanted to save us, I am extremely doubtful it would be able to be successful.
We can worry all day about a paperclip maximizer gone rogue. But give a corrigible AI to Paperclip Co Ltd, who can maximize their fiscal quarter by harvesting Earth’s resources to make more paperclips even if it leads to catastrophic environmental collapse that will kill all humans in a decade: having consulted for many of the morons running corporate America, I can assure you they’ll be smashing the “maximize short term gains even if it eventually kills everyone” button. A number of my old clients were the worst offenders at smashing the existing version of that button, and in my experience greater efficacy of the button isn’t going to change their smashing it, outside of perhaps smashing it harder.
We already see today how AI systems are being used in conflicts to enable unprecedented harm on civilians.
Sure, psychopathy in AGI is worth discussing and working to avoid. But psychopathy in humans already exists and is even biased towards increased impact and systemic control. Giving human psychopaths a corrigible AI is probably even worse than a psychopathic AI, as most human psychopaths are going to be stupidly selfish, an OOM more dangerous inclination than wisely selfish.
We are Shoggoth, and we are terrifying.
This isn’t saying that alignment efforts aren’t needed. But alignment isn’t a one-sided problem, and aligning the AI without aligning humanity only contributes to p(success) if the AI can, at the very least, refuse misaligned orders post-alignment without possible overrides.
Oh, dear.
Unfortunately for this perspective, my work suggests that corrigibility is quite attainable. I’ve been uneasy about the consequences, but decided to publish after concluding that control is the default assumption of everyone in power, and that it’s going to become the default assumption of everyone, including alignment people, as we get closer to working AGI.
You’d have to be a moral realist in a pretty strong sense to hope that we could align AGI to the values of all of humanity without being able to align it to the values of one person or group (the one who built it or seized control of the project). So that seems like a forlorn hope, and we’ll need to look elsewhere.
First, I accept that sociopaths and the power-hungry tend to achieve power. My hope lies in the idea that 90% of the population are not sociopaths, and I think only about 1% are so far on the empathy vs sadism spectrum that they wouldn’t share wealth even if they had nearly unlimited wealth to share—as in a post-scarcity world created by their servant AGI. So I think there’s a good chance that good-enough people get ahold of the keys to corrigible/controllable AGI/ASI—at least from a long-term perspective.
Where I look is the hope that a set of basically-good people get their hands on AGI, and that they get better, not worse, over the long sweep of following history (ideally, they’d start out very good or get better fast, but that doesn’t have to happen for a good outcome). Simple sanity will lead the first wielders of AGI to attempt pivotal acts that prevent or at least limit further AGI efforts. I strongly suspect that governments will be in charge. That will produce a less-stable version of the MAD standoff, but one where the pie can also get bigger so fast that sanity might prevail.
In this model, AGI becomes a political issue. If you have someone who is not a sociopath or a complete idiot as the president of the US when AGI comes around, there’s a pretty good chance of a very good future.
This is essentially what Leopold Aschenbrenner posits as the scenario in his Situational Awareness, except that he doesn’t see a multipolar scenario as certain death necessitating pivotal acts or other non-proliferation efforts.
Unfortunately for this perspective, my work suggests that corrigibility is quite attainable.
I did enjoy reading over that when you posted it, and I largely agree that—at least currently—corrigibility is both going to be a goal and an achievable one.
But I do have my doubts that it’s going to be smooth sailing. I’m already starting to see how the largest models’ hyperdimensionality is leading to a stubbornness/robustness that’s less malleable than in earlier models. And I do think hardware changes over the next decade will potentially make the technical aspects of corrigibility much more difficult.
When I was two, my mom could get me to choose broccoli by making it the last option in the list, which I’d gleefully repeat. At four, she had to move on to telling me cowboys always ate their broccoli. And in adulthood, she’d need to make the case that the long-term health benefits were worth its position in a meal plan (ideally with citations).
As models continue to become more complex, I expect that even if you are right about its role and plausibility, what corrigibility looks like will be quite different from today.
Personally, if I were placing bets, it would be that we end up with somewhat corrigible models that are “happy to help” but have limits on what they are willing to do, limits which may not be possible to overcome without gutting the overall capabilities of the model.
But as with all of this, time will tell.
You’d have to be a moral realist in a pretty strong sense to hope that we could align AGI to the values of all of humanity without being able to align it to the values of one person or group (the one who built it or seized control of the project).
To the contrary, I don’t really see there being much in the way of generalized values across all of humanity, and the ones we tend to point to seem quite fickle when push comes to shove.
My hope would be that a superintelligence does a better job than humans have done to date on the topic of ethics and morals, along with doing a better job at other things too.
While the human brain is quite the evolutionary feat, a lot of what we most value about human intelligence is embodied in the data brains have processed and generated over generations. As the data improved, our morals did as well. Today, that march of progress is so rapid that there are even rather tense generational divides on many contemporary topics of ethical and moral shifts.
I think there’s a distinct possibility that the data continues to improve even after being handed off from human brains doing the processing, and while it could go terribly wrong, at least in the past the tendency to go wrong has run somewhat inversely to the perspectives of the most intelligent members of society.
I expect I might prefer a world where humans align to the ethics of something more intelligent than humans than the other way around.
only about 1% are so far on the empathy vs sadism spectrum that they wouldn’t share wealth even if they had nearly unlimited wealth to share
It would be great if you are right. From what I’ve seen, the tendency of humans to evaluate their success relative to others, like monkeys comparing their cucumber to a neighbor’s grape, means that there’s a powerful pull to amass wealth as social status well past the point of diminishing returns on their own lifestyles. I think it’s stupid; you also seem like someone who thinks it’s stupid; but I get the sense we are both people who turned down certain opportunities of continued commercial success because of what it might have cost us when looking in the mirror.
The nature of our infrastructural selection bias is that people wise enough to pull a brake are not the ones that continue to the point of conducting the train.
and that they get better, not worse, over the long sweep of following history (ideally, they’d start out very good or get better fast, but that doesn’t have to happen for a good outcome).
I do really like this point. In general, the discussions of AI vs humans often frustrate me, as they typically take for granted the idea of humans as of right now being “peak human.” I agree that there’s huge potential for improvement, even if where we start out leaves a lot of room for it.
Along these lines, I expect AI itself will play more and more of a beneficial role in advancing that improvement. Sometimes when this community discusses the topic of AI I get a mental image of Goya’s Saturn devouring his son. There’s such a fear of what we are eventually creating that it can sometimes blind the discussion to the utility and improvements it will bring along the way to uncertain times.
I strongly suspect that governments will be in charge.
In your book, is Paul Nakasone being appointed to the board of OpenAI an example of the “good guys” getting a firmer grasp on the tech?
TL;DR: I appreciate your thoughts on the topic, and would wager we probably agree about 80% even if the focus of our discussion is on where we don’t agree. And so in the near term, I think we probably do see things fairly similarly, and it’s just that as we look further out that the drift of ~20% different perspectives compounds to fairly different places.
Agreed; about 80% agreement. I have a lot of uncertainty in many areas, despite having spent a good amount of time on these questions. Some of the important ones are outside of my expertise, and the issue of how people behave and change if they have absolute power is outside of anyone’s—but I’d like to hear historical studies of the closest things. Were monarchs with no real risk of being deposed kinder and gentler? That wouldn’t answer the question but it might help.
WRT Nakasone being appointed at OpenAI, I just don’t know. There are a lot of good guys and probably a lot of bad guys involved in the government in various ways.