PhD student in computational neuroscience @ MPI for Brain Research, Frankfurt. https://twitter.com/janhkirchner and https://universalprior.substack.com/
Jan
Great points, thanks for the comment! :) I agree that there is potentially some very low-hanging fruit. I could even imagine that some of these methods work better in artificial networks than in biological networks (less noise, more controlled environment).
But I believe one of the major bottlenecks might be that the weights and activations of an artificial neural network are just so difficult to access. Putting the weights and activations of a large model like GPT-3 under the microscope requires impressive hardware (running forward passes, storing the activations, transforming everything into a useful form, …), and then there are so many parameters to look at.
Giving researchers structured access to the model via a research API could solve a lot of those difficulties and seems like something that should totally exist (although there is, of course, also the danger of accelerating progress on the capabilities side).
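For concreteness, this is roughly what "putting the activations under the microscope" looks like for a small open model (a sketch using Hugging Face transformers with GPT-2 small as a stand-in; for something GPT-3-sized you'd need exactly the kind of structured access I mean):

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

# GPT-2 small as a stand-in: same idea, roughly 1000x fewer parameters than GPT-3.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

activations = {}

def make_hook(name):
    # Record the hidden states a transformer block outputs during the forward pass.
    def hook(module, inputs, output):
        activations[name] = output[0].detach()
    return hook

# Attach a hook to every transformer block.
for i, block in enumerate(model.h):
    block.register_forward_hook(make_hook(f"block_{i}"))

inputs = tokenizer("Putting the weights under the microscope", return_tensors="pt")
with torch.no_grad():
    model(**inputs)

# Each entry has shape (batch, sequence_length, 768) for GPT-2 small.
print({name: act.shape for name, act in activations.items()})
```

Even this toy version quickly produces gigabytes of activations once you run it over a larger corpus, which is exactly the hardware problem above.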
Great point! And thanks for the references :)
I’ll change your background to Computational Cognitive Science in the table! (unless you object or think a different field is even more appropriate)
“Brain enthusiasts” in AI Safety
Thank you for the comment and the questions! :)
This is not clear from how we wrote the paper, but we actually do the clustering in the full 768-dimensional space! If you look closely at the clustering plot, you can see that the clusters are slightly overlapping; that would be impossible with k-means in 2D, since in that setting membership is determined by distance from the 2D centroid.
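In code, the pipeline is roughly this (a sketch with scikit-learn; `embeddings` is a random placeholder for the real 768-dimensional embeddings, and PCA stands in for whatever 2D projection the plot uses):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Random placeholder for the real 768-dimensional document embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))

# Step 1: cluster in the full 768-dimensional space.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)

# Step 2: project down to 2D only for plotting.
coords_2d = PCA(n_components=2).fit_transform(embeddings)

# Clusters that are well separated in 768 dimensions can overlap in the
# 2D scatter plot, because the projection throws away most of the geometry.
```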
A descriptive, not prescriptive, overview of current AI Alignment Research
Oh true, I completely overlooked that! (if I keep collecting mistakes like this I’ll soon have enough for a “My mistakes” page)
Yes, good point! I had that in an earlier draft and then removed it for simplicity and for the other argument you’re making!
The Brain That Builds Itself
This sounds right to me! In particular, I just (re-)discovered this old post by Yudkowsky and this newer post by Alex Flint that both go a lot deeper on the topic. I think the optimal control perspective is a nice complement to those posts, and if I find the time to look more into this, then that work is probably the right direction.
As part of the AI Safety Camp our team is preparing a research report on the state of AI safety! Should be online within a week or two :)
Interesting, I added a note to the text highlighting this! I was not aware of that part of the story at all. That makes it more of a Moloch-example than a “mistaking adversarial for random”-example.
Yes, that’s a pretty fair interpretation! The macroscopic/folk psychology notion of “surprise” of course doesn’t map super cleanly onto the information-theoretic notion. But I tend to think of it as: there is a certain “expected surprise” about what future possible states might look like if everything evolves “as usual”, $\mathbb{E}_{x \sim p}[-\log p(x)]$. And then there is the (usually larger) “additional surprise” about the states that the AI might steer us into, $-\log p(x_{\text{AI}})$. The delta between those two is the “excess surprise” that the AI needs to be able to bring about.
It’s tricky to come up with a straightforward setting where the actions of the AI can be measured in nats, but perhaps the following works as an intuition pump: “If we give the AI full, unrestricted access to a control panel that controls the universe, how many operations does it have to perform to bring about the catastrophic event?” That’s clearly still not well defined (there is no obvious/privileged way the panel should look), but it shows 1) that the “excess surprise” is a lower bound (we wouldn’t usually give the AI unrestricted access to that panel) and 2) that the minimum number of operations required to bring about a catastrophic event is probably still larger than 1.
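To put some (entirely made-up) numbers on it: if the “business as usual” distribution $p$ assigns the catastrophic outcome $x_{\text{cat}}$ a probability of $10^{-6}$, then the surprise the AI has to bring about is

$$-\ln p(x_{\text{cat}}) = -\ln\left(10^{-6}\right) \approx 13.8 \text{ nats},$$

and if the “expected surprise” of business-as-usual futures is, say, 5 nats, the AI only needs to supply the remaining ~8.8 nats of excess surprise.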
Adversarial attacks and optimal control
Thank you for your comment! You are right, these things are not clear from this post at all, and I did not do a good job of clarifying that. I’m a bit low on time atm, but hopefully I’ll be able to make some edits to the post to set expectations for the reader more carefully.
The short answer to your question is: Yep, X is the space of events. In Vanessa’s post it has to be compact and metric; I’m simplifying this to an interval in $\mathbb{R}$. And the expression here can be derived from the one in her post by plugging in $g=0$ and replacing the measure with the Lebesgue integral. I have scattered notes where I derive the equations in this post. But it was clear to me that if I wanted to do this rigorously in the post, I’d have to introduce an annoying amount of measure theory and the post would turn into a slog. So I decided to do things hand-wavy, but went a bit too hard in that direction.
Elementary Infra-Bayesianism
Cool paper, great to see the project worked out! (:
One question: How do you know the contractors weren’t just answering randomly (or were confused about the task) in your “quality after filtering” experiments (Table 4)? Was there agreement across contractors about the quality of completions (in cases where they saw the same completions)?
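For what it’s worth, the kind of sanity check I have in mind is something like this (a sketch with scikit-learn; the ratings are hypothetical, and it assumes pairs of contractors rated the same completions):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical quality labels (0 = bad, 1 = good) from two contractors
# who rated the same set of completions.
ratings_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
ratings_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Agreement corrected for chance; a value near 0 would suggest
# random or confused answering.
print(cohen_kappa_score(ratings_a, ratings_b))
```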
Continental Philosophy as Undergraduate Mathematics
Fascinating! Thanks for sharing!
Cool experiment! I could imagine that the tokenizer handicaps GPT’s performance here (reversing the characters leads to completely different tokens). With a character-level tokenizer, GPT might be able to handle that task better!
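Quick illustration of what I mean by “completely different tokens” (using the Hugging Face GPT-2 tokenizer as a stand-in for whatever tokenizer the actual model uses):

```python
from transformers import GPT2Tokenizer

# GPT-2's BPE tokenizer as a stand-in.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

word = "palindrome"
print(tokenizer.tokenize(word))        # sub-word tokens for the forward word
print(tokenizer.tokenize(word[::-1]))  # completely different tokens for the reversed word
```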
There’s an important caveat here:
I’d be willing to bet that if you give the macaque more than 100 ms, they’ll get it right; that’s at least how it is for humans!
(Not trying to shift the goalpost, it’s a cool result! Just pointing at the next step.)