What’s the most important technical question in AI safety right now?
In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:
How to evaluate whether models should be trusted or untrusted: currently I don’t have a good answer and this is bottlenecking the efforts to write concrete control proposals.
How AI control should interact with AI security tools inside labs.
More generally:
How can we get more evidence on whether scheming is plausible?
How scary is underelicitation? How much should the results about password-locked models or arguments about being able to generate small numbers of high-quality labels or demonstrations affect this?
“How can we get more evidence on whether scheming is plausible?”—What if we ran experiments where we included some pressure towards scheming (either RL or fine-tuning) and we attempted to determine the minimum such pressure required to cause scheming? We could further attempt to see how this interacts with scaling.
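A minimal sketch of what the "minimum pressure" sweep could look like (everything here is hypothetical: the names, the response curve, and the stubbed fine-tune-and-evaluate step stand in for a real training and probing pipeline):

```python
from dataclasses import dataclass

@dataclass
class PressureResult:
    pressure: float       # fraction of fine-tuning data nudging toward scheming
    scheming_rate: float  # fraction of held-out probes on which the model schemes

def finetune_and_eval(pressure: float) -> PressureResult:
    """Stub standing in for: fine-tune a model on a mixture containing
    `pressure` fraction of scheming-incentivizing examples, then measure
    scheming behavior on held-out honeypot probes."""
    # Toy response curve, purely for illustration of the analysis shape.
    rate = max(0.0, min(1.0, 2.0 * (pressure - 0.1)))
    return PressureResult(pressure, rate)

def minimum_scheming_pressure(threshold: float = 0.05, steps: int = 20) -> float:
    """Sweep pressure levels from 0 to 1 and return the smallest one whose
    measured scheming rate exceeds `threshold`."""
    for i in range(steps + 1):
        p = i / steps
        if finetune_and_eval(p).scheming_rate > threshold:
            return p
    return float("inf")  # no tested pressure level induced scheming

print(minimum_scheming_pressure())
```

Rerunning the same sweep at several model scales would then give the scaling interaction the comment asks about.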
I’d say one important question is whether the AI control strategy works out as its proponents hope.
I agree with Bogdan that making adequate safety cases for automated safety research is probably one of the most important technical problems to solve. Conditional on the automating-AI-safety direction working out, it could eclipse basically all safety research done prior to automation, and this might hold even if LWers had basically perfect epistemics (given what’s possible for humans) and picked close-to-optimal directions, since labor is a huge bottleneck and automation allows much tighter feedback loops of progress, for the reasons Tamay Besiroglu identified:
https://x.com/tamaybes/status/1851743632161935824
https://x.com/tamaybes/status/1848457491736133744
Here are some candidates:
1. Are we indeed (as I suspect) in a massive overhang of compute and data for powerful agentic AGI? (If so, then at any moment someone could stumble across an algorithmic improvement which would change everything overnight.)
2. Current frontier models seem much more powerful than mouse brains, yet mice seem conscious. This implies that either LLMs are already conscious, or could easily be made so with non-costly tweaks to their algorithm. How could we objectively tell if an AI were conscious?
3. Over the past year I’ve helped make both safe evals of danger-adjacent capabilities (e.g. WMDP.ai) and unpublished, infohazardous evals of actually-dangerous capabilities. One of the most common pieces of negative feedback I’ve heard on the safe evals is that they are only danger-adjacent, not measuring truly dangerous things. How could we safely show the correlation between high performance on danger-adjacent evals and high performance on actually-dangerous evals?
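As a sketch of the analysis the third candidate calls for (the per-model scores below are invented; only the shape of the analysis matters), one could compute a rank correlation between each model's score on the public danger-adjacent suite and its score on the held-back dangerous-capability suite, using nothing beyond the standard library:

```python
def ranks(xs):
    """1-based average ranks, with ties assigned their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson correlation on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented per-model scores: danger-adjacent eval vs. dangerous eval.
safe_scores = [0.31, 0.45, 0.52, 0.61, 0.74, 0.88]
dangerous_scores = [0.12, 0.20, 0.19, 0.35, 0.41, 0.66]

print(round(spearman(safe_scores, dangerous_scores), 3))
```

A high rank correlation across many models would be (partial) evidence that the safe evals track the dangerous capabilities, without the dangerous scores themselves ever being published.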
Why is this relevant for technical AI alignment (coming at this as someone skeptical about how relevant timeline considerations are more generally)?
If tomorrow anyone in the world could cheaply and easily create an AGI which could act as a coherent agent on their behalf, based on an architecture different from a standard transformer… it seems like this would change a lot of people’s priorities about which questions were most urgent to answer.
Fwiw, I basically think you are right about the agentic AI overhang, and obviously so. I do think it shapes how one thinks about what’s most valuable in AI alignment.
I kind of wish you both had given some reasoning as to why you believe the agentic-AI/algorithmic overhang is likely, and I also wish Nathan Helm-Burger and Vladimir Nesov would discuss this topic in a dialogue post.
Glib formality: current LLMs approximate something like a speed-prior Solomonoff inductor for internet data, but they do not approximate AIXI.
There is a whole class of domains that is not tractably accessible from next-token prediction on human-generated data. For instance, learning how to beat AlphaGo with only access to pre-2014 human Go games.
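For readers unfamiliar with the objects being contrasted, they can be written schematically (these are the standard textbook forms from Solomonoff and Hutter, not anything specific to this thread). The Solomonoff prior weights programs only by length, while AIXI wraps an expectimax planning loop around that mixture:

```latex
% Solomonoff prior over sequences x (U a universal monotone machine,
% \ell(p) the length of program p):
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}

% AIXI's action choice at step k, planning to horizon m, over
% observation-reward pairs o_i r_i:
a_k \;:=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m}
  \bigl[ r_k + \cdots + r_m \bigr]
  \sum_{q \,:\, U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}
```

A speed prior additionally discounts each program by its runtime; the glib claim above is that pretraining resembles the first (runtime-bounded) object, while the active planning in the second is exactly what next-token prediction lacks.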
IMO, AlphaGo’s success was orthogonal to AIXI, and more importantly, I expect AIXI to be very hard to approximate even as an approximable ideal, so what’s the use case for thinking future AIs will be AIXI-like?
I will also say that while I don’t think pure LLMs will just be scaled forward (there’s a use for inference-time compute scaling), conditional on AGI and ASI being achieved, I think the strategy will look more like using lots and lots of synthetic data to compensate for compute. Solomonoff induction has a halting oracle and unbounded compute, and can infer a lot from the minimum possible data, whereas we will rely on a data-rich, compute-poor strategy compared to approximate AIXI.
The important thing is that both do active learning, decision-making & search, i.e. RL.*
LLMs don’t do that. So the gain from doing that is huge.
Synthetic data is a bit of a weird term that gets thrown around a lot. There are fundamental limits on how much information resampling from the same data source will yield about completely different domains, so that seems a bit silly. Of course, sometimes with synthetic data people just mean doing rollouts, i.e. RL.
*The word RL sometimes gets mistaken for only a very specific reinforcement learning algorithm. I mean here a very general class of algorithms that solve MDPs.
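To make "a very general class of algorithms that solve MDPs" concrete, here is a minimal member of that class: value iteration on a made-up two-state MDP (the transition table and rewards are purely illustrative):

```python
# states: 0, 1; actions: 0 ("stay"), 1 ("go")
# P[s][a] = list of (probability, next_state, reward) transitions
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 1, 2.0)], 1: [(1.0, 0, 0.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Q-value of each action: expected reward plus discounted future value.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]]
            new_v = max(q)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

V = value_iteration(P)
```

Here the optimal policy is to "go" from state 0 and "stay" in state 1, so V[1] converges to 2/(1 − 0.9) = 20; Q-learning, policy iteration, or MCTS-style search would recover the same fixed point, which is the sense in which they form one class.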
The lack of a robust, highly general paradigm for reasoning about AGI models is the current greatest technical problem, although it is not what most people are working on.
What features of architecture of contemporary AI models will occur in future models that pose an existential risk?
What behavioral patterns of contemporary AI models will be shared with future models that pose an existential risk?
Is there a useful and general mathematical/physical framework that describes how agentic, macroscopic systems process information and interact with the environment?
Does terminology adopted by AI Safety researchers like “scheming”, “inner alignment” or “agent” carve nature at the joints?
Something like a safety case for automated safety research (but I’m biased)
Answering this from a legal perspective: What is the easiest and most practical way to translate legalese into scientifically accurate terms, thus bridging the gap between AI experts and lawyers? Stated differently, how do we move from localised papers that only work in law or AI fields respectively, to papers that work in both?
Are Eliezer and Nate right that continuing the AI program will almost certainly lead to extinction or something approximately as disastrous as extinction?