Likewise, thanks for taking the time to write such a long comment! And hoping that’s a typo in the second sentence :)
You’re welcome. And yes, this was a typo that I corrected. ^^
Wrt the community though, I’d be especially curious to get more feedback on Motivation #2. Do people not agree that transparency is *necessary* for AI Safety? And if they do agree, then why aren’t more people working on it?
My take is that a lot of people around here agree that transparency is at least useful, and maybe necessary. And the main reason why people are not working on it is a mix of personal fit, and the fact that without research in AI Alignment proper, transparency doesn’t seem that useful (if we don’t know what to look for).
I agree, but I think that transparency is doing most of the work there (i.e. what you say sounds to me more like an application of transparency than like scaling up the way verification is used in current models). But this is just semantics.
Well, transparency is doing some work, but it’s totally unable to prove anything. Being able to prove things is a big part of the approach I’m proposing. That being said, I agree that this doesn’t look like scaling up the way verification is currently used.
Hm, I want to disagree, but this may just come down to a difference in what we mean by deployment. In the paragraph that you quoted, I was imagining the usual train/deploy split from ML where deployment means that we’ve frozen the weights of our AI and prohibit further learning from taking place. In that case, I’d like to emphasize that there’s a difference between intelligence as a meta-ability to acquire new capabilities and a system’s actual capabilities at a given time. Even if an AI is superintelligent, i.e. able to write new information into its weights extremely efficiently, once those weights are fixed, it can only reason and plan using whatever object-level knowledge was encoded in them up to that point. So if there was nothing about bio weapons in the weights when we froze them, then we wouldn’t expect the paperclip-maximizer to spontaneously make plans involving bio weapons when deployed.
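To be concrete about what I mean by freezing, here’s a minimal sketch (assuming a PyTorch-style model; the specific architecture and calls are just for illustration, not anything from the post):

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# ... training would happen here ...

# "Deployment" in the sense I mean: freeze every parameter and switch off
# training-time behaviour, so no further learning can take place.
for param in model.parameters():
    param.requires_grad = False
model.eval()

# From this point on, the model can only apply whatever object-level knowledge
# was already encoded in its fixed weights.
with torch.no_grad():
    output = model(torch.randn(1, 16))
```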
You’re right that I was thinking of a more online system that could update its weights during deployment. Yet even with frozen weights, I definitely expect the model to make plans involving things that were never explicitly represented during training. For example, it might not have a bio-weapon feature, but it could still combine the relevant subfeatures to build one through quite local rules that don’t look like a plan to build a bio-weapon.
Suppose an AI system was trained on a dataset of existing transparency papers to come up with new project ideas in transparency. Then its first outputs would probably use words like neurons and weights instead of some totally incomprehensible concepts, since those would be the very same concepts that would let it efficiently make sense of its training set. And new ideas about neurons and weights would then be things that we could independently reason about even if they’re very clever ideas that we didn’t think of ourselves, just like you and I can have a conversation about circuits even though we didn’t come up with the idea ourselves.
That seems reasonable.