phd student in comp neuroscience @ mpi brain research frankfurt. https://twitter.com/janhkirchner and https://universalprior.substack.com/
Jan
Interesting, I added a note to the text highlighting this! I was not aware of that part of the story at all. That makes it more of a Moloch-example than a “mistaking adversarial for random”-example.
Yes, that’s a pretty fair interpretation! The macroscopic/folk psychology notion of “surprise” of course doesn’t map super cleanly onto the information-theoretic notion. But I tend to think of it as: there is a certain “expected surprise” about what future possible states might look like if everything evolves “as usual”. And then there is the (usually larger) “additional surprise” about the states that the AI might steer us into. The delta between those two is the “excess surprise” that the AI needs to be able to bring about.
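A hedged sketch of that delta in my own ad-hoc notation (not notation from the post): write $P$ for the distribution over future states if everything evolves “as usual”, and $x^{\ast}$ for a catastrophic state the AI might steer us into. Then, roughly,

$$\text{expected surprise} \approx H[P] = -\textstyle\sum_x P(x)\,\ln P(x), \qquad \text{additional surprise} \approx -\ln P(x^{\ast}),$$

and the excess surprise is the difference $-\ln P(x^{\ast}) - H[P]$, measured in nats.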
It’s tricky to come up with a straightforward setting where the actions of the AI can be measured in nats, but perhaps the following works as an intuition pump: “If we give the AI full, unrestricted access to a control panel that controls the universe, how many operations does it have to perform to bring about the catastrophic event?” That’s clearly still not well defined (there is no obvious/privileged way the panel should look), but it shows 1) that the “excess surprise” is a lower bound (we wouldn’t usually give the AI unrestricted access to that panel) and 2) that the minimum number of operations required to bring about a catastrophic event is probably still larger than 1.
Thank you for your comment! You are right, these things are not clear from this post at all and I did not do a good job at clarifying that. I’m a bit low on time atm, but hopefully, I’ll be able to make some edits to the post to set the expectations for the reader more carefully.
The short answer to your question is: Yep, X is the space of events. In Vanessa’s post it has to be compact and metric; I’m simplifying this to an interval in R. And the expression in my post can be derived from the one in Vanessa’s post by plugging in g=0 and replacing the measure with the Lebesgue integral. I have scattered notes where I derive the equations in this post. But it was clear to me that if I want to do this rigorously in the post, then I’d have to introduce an annoying amount of measure theory and the post would turn into a slog. So I decided to do things hand-wavy, but went a bit too hard in that direction.
Cool paper, great to see the project worked out! (:
One question: How do you know the contractors weren’t just answering randomly (or were confused about the task) in your “quality after filtering” experiments (Table 4)? Is there agreement across contractors about the quality of completions (in case they saw the same completions)?
Fascinating! Thanks for sharing!
Cool experiment! I could imagine that the tokenizer handicaps GPT’s performance here (reversing the characters leads to completely different tokens). With a character-level tokenizer GPT might be able to handle that task better!
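A minimal sketch of what I mean, using the open-source tiktoken library as a stand-in for GPT’s actual tokenizer:

```python
# Minimal sketch: a BPE tokenizer maps a word and its character-reversal
# to completely different token sequences, which is plausibly part of
# why GPT struggles with character-level reversal tasks.
# Assumes the open-source `tiktoken` library as a stand-in for GPT's tokenizer.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

word = "alignment"
reversed_word = word[::-1]  # "tnemngila"

print(enc.encode(word))           # a short sequence of multi-character tokens
print(enc.encode(reversed_word))  # a different (typically longer) token sequence
```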
Interesting, thank you! I guess I was thinking of deception as characterized by Evan Hubinger, with mesa-optimizers, bells, whistles, and all. But I can see how a sufficiently large competence-vs-performance gap could also count as deception.
Thanks for the comment! I’m curious about the Anthropic Codex code-vulnerability prompting, is this written up somewhere? The closest I could find is this, but I don’t think that’s what you’re referencing?
I was not aware of this, thanks for pointing this out! I made a note in the text. I guess this is not an example of “advanced AI with an unfortunately misspecified goal” but rather just an example of the much larger class of “system with an unfortunately misspecified goal”.
Thanks for the comment, I did not know this! I’ll put a note in the essay to highlight this comment.
Iiinteresting! Thanks for sharing! Yes, the choice of how to measure this affects the outcome a lot.
Hmm, fair, I think you might get along fine with my coworker from footnote 6 :) I’m not even sure there is a better way to write these titles—but they can still be very intimidating for an outsider.
Yes, I agree, a model can really push intuition to the next level! There is a failure mode where people just throw everything into a model and hope that the result will make sense. In my experience that just produces a mess, and you need some intuition for how to properly set up the model.
Hi! :) Thanks for the comment! Yes, that’s on purpose, the idea is that a lot of the shorthand in molecular neuroscience is very hard to digest. So since the exact letters don’t matter, I intentionally garbled them with a Glitch Text Generator. But perhaps that isn’t very clear without explanation, I’ll add something.
This word, Ǫ̵͎͊G̦̉̇l͉͇̝̽͆̚i̷͔̓̏͌c̷̱̙̍̂͜k̷̠͍͌l̷̢̍͗̃n̷̖͇̏̆å̴̤c̵̲̼̫͑̎̆, is e.g. a garbled version of “O-GLicklnac”, which in turn is the phonetic version of “O-GlcNAc”.
Theory #4 appears very natural to me, especially in light of papers like Chen et al 2006 or Cuntz et al 2012. Another supporting intuition from developmental neuroscience is that development is a huge mess and that figuring out where to put a long-range connection is really involved. And while there can be a bunch of circuit remodeling on a local scale, once you have established a long-range connection there is little hope of substantially rewiring it.
In case you want to dive deeper into this (and you don’t want to read all those papers), I’d be happy to chat more about this :)
I’ve been meaning to dive into this for-e-ver and only now find the time for it! This is really neat stuff, haven’t enjoyed a framework this much since logical induction. Thank you for writing this!
Yep, I agree, SLIDE is probably a dud. Thanks for the references! And my inside view is also that current trends will probably continue and most interesting stuff will happen on AI-specialized hardware.
Thank you for the comment! You are right, that should be a ReLU in the illustration, I’ll fix it :)
Great explanation, I feel substantially less confused now. And thank you for adding two new shoulder advisors to my repertoire :D
As part of the AI Safety Camp our team is preparing a research report on the state of AI safety! Should be online within a week or two :)