scale up the experiment of Pretraining from Human Feedback by using larger data
AFAICT, PHF doesn’t solve any of the core problems of alignment. IIUC, PHF is still using an imperfect reward model trained on a finite amount of human signals-of-approval; I’d tentatively expect scaling up PHF (to ASI) to result in death-or-worse by Goodhart. Haven’t thought about PHF very thoroughly though, so I’m uncertain here.
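To gesture at why I expect Goodhart to bite, here’s a toy illustration (entirely my own construction, nothing to do with PHF’s actual setup; every name and number below is made up): treat the learned reward model as the true value plus estimation error, then optimize hard against it over a large space of options. The harder the search, the more the selected option is one where the reward model’s error happens to be large and positive.

```python
import random

random.seed(0)

N_OPTIONS = 100_000   # size of the space the optimizer searches over
MODEL_ERROR = 1.0     # stand-in for the reward model's imperfection

# True value of each option, and the imperfect reward model's estimate of it.
true_values = [random.gauss(0, 1) for _ in range(N_OPTIONS)]
proxy_values = [v + random.gauss(0, MODEL_ERROR) for v in true_values]

# Hard optimization against the proxy: pick whatever the reward model likes most.
chosen = max(range(N_OPTIONS), key=lambda i: proxy_values[i])

print("proxy score of chosen option:", round(proxy_values[chosen], 2))
print("true value of chosen option: ", round(true_values[chosen], 2))
print("best achievable true value:  ", round(max(true_values), 2))
```

The chosen option scores far better on the proxy than it is in reality, and worse than the best genuinely available option; that gap is the basic dynamic I’d expect to widen as optimization power scales up.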
we can even try to design a data set such that it uses words like freedom, justice, alignment and more value laden words
Did you mean something like “(somehow) design a data set such that, in order to predict token-sequences in that data set, the AI has to learn the real-world structure of things we care about, like freedom, justice, alignment, etc.”? [1]
can only learn legitimate generalizations, not deceptive generalizations leading to deceptive alignment
I don’t understand this. What difference are you pointing at with “deceptive” vs “legitimate” generalizations? How does {AI-human (and/or AI-env) interactions being limited to a simple interface} preclude {learning “deceptive” generalizations}?
I’m under the impression that entirely “legitimate” generalizations can (and a priori probably will) lead to “deception”; see e.g. https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness. Do you disagree with that? (If yes, how?)
Side note: I don’t understand what you mean by this (in the given context).
can’t [...] hack the human’s values
I don’t see how this follows. IIUC, the proposition here is something like
If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can’t hack the humans.
Is that a reasonable representation of what you’re saying?
If yes, consider: What if we replace “the AI” with “Anonymous” and “the humans” with “the web server”? Then we get
If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can’t hack the web server
...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn’t rely on hardware effects like rowhammering.
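To make that concrete with a mundane, non-AI example (mine, purely illustrative): here’s a service whose entire interface is one well-defined text field, which is nonetheless trivially exploitable, because the flaw lives behind the interface rather than in it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "alice-secret"), ("bob", "bob-secret")])

def lookup_secret(name: str) -> list:
    """The whole 'interface': one string in, rows out."""
    # The vulnerability is in the implementation behind the interface:
    # untrusted input is spliced directly into the query.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()

print(lookup_secret("alice"))        # intended use: alice's secret only
print(lookup_secret("' OR '1'='1"))  # same simple interface, dumps every secret
```

The point isn’t about SQL specifically; it’s that restricting interaction to a simple, well-understood channel constrains the channel, not the cleverness of what comes through it.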
(I agree that limiting human-AI interactions to a simple interface would be helpful, but I think it’s far from sufficient (to guarantee any form of safety).)
IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier. I’m still confused as to why you think that. To the extent that I understood the reasons you presented, I think they’re incorrect (as outlined above). (Maybe I’m misunderstanding something.)
I’m kinda low on bandwidth, so I might not engage with this further. But in any case, thanks for trying to share parts of your model!
[1] I think a naively designed data set containing lots of {words that are value-laden for English-speaking humans} would not cut it, for hopefully obvious reasons.
This will be my last comment here. Thank you for trying to explain why you disagree with me!
IIUC, a central theme here is the belief that {making learning offline vs online} and {limiting AI-human interfaces to be simple/understood} would solve large chunks of the whole alignment problem, or at least make it much easier.
I’m impressed that you passed my ITT. I think analogies to other alignment problems, like the human alignment problem, miss that the human case is the most difficult setting, and that we don’t need to play on that difficulty, because AI is very different from humans.
AFAICT, PHF doesn’t solve any of the core problems of alignment.
While I definitely overclaimed about how much it solves the core problems of alignment, I think this is underselling what it accomplished. It’s an incomplete solution, in that it doesn’t do everything on its own, but it does carry a lot of weight.
To talk about deceptive alignment more specifically: deceptive alignment is essentially where the AI isn’t aligned with human goals and tries to hide that fact. One of the key prerequisites for deceptive alignment is that the AI is optimizing a non-myopic goal. It’s the most dangerous form of misalignment, since we have an AI that is aligned only for instrumental reasons, not terminal ones.
What Pretraining from Human Feedback did was finally marry a myopic goal with competitive capabilities. Once the myopic objective of conditional training is in place, deceptive alignment goes away, since optimizing a non-myopic goal is one of its key prerequisites.
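For readers who haven’t looked at the paper, here is roughly what conditional training amounts to (a minimal sketch in the spirit of the paper’s <|good|>/<|bad|> setup; the scoring function, threshold, and generation call below are placeholders I’m assuming, not the paper’s exact code):

```python
# Sketch of conditional training: tag each pretraining segment with a control
# token based on a reward-model score, then train with ordinary next-token
# prediction on the tagged, fixed corpus. `reward_model` and THRESHOLD are
# placeholders, not the paper's exact setup.

GOOD, BAD = "<|good|>", "<|bad|>"
THRESHOLD = 0.0

def tag_segment(segment: str, reward_model) -> str:
    """Offline preprocessing step: label one pretraining segment."""
    token = GOOD if reward_model(segment) >= THRESHOLD else BAD
    return token + segment

def build_corpus(segments, reward_model):
    # Purely offline: the corpus is fixed before training begins, and the LM
    # never trains on text it generated itself.
    return [tag_segment(s, reward_model) for s in segments]

# Training on build_corpus(...) is plain per-token prediction, a myopic
# objective. At inference time, you condition on the desired behaviour:
#     completion = lm.generate(GOOD + user_prompt)
```

The load-bearing point for the deceptive-alignment argument is that the training signal stays per-token prediction over a fixed dataset, rather than optimizing for downstream consequences of the model’s own outputs.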
I don’t see how this follows. IIUC, the proposition here is something like
If the AI only interacts with the humans via a simple, well-defined, and thoroughly understood interface, then the AI can’t hack the humans.
Is that a reasonable representation of what you’re saying?
If yes, consider: What if we replace “the AI” with “Anonymous” and “the humans” with “the web server”? Then we get
If Anonymous only interacts with the web server via a simple, well-defined, and thoroughly understood interface, then Anonymous can’t hack the web server
...which is obviously false in the general case, right? Systems can definitely be hackable, even if interactions with them are limited to a simple interface; as evidence, we could consider any software exploit ever that didn’t rely on hardware effects like rowhammering.
This is definitely right, and I did overclaim here, though I do remember the Pretraining from Human Feedback paper claiming to address this:
Conditional training (as well as other PHF objectives) is purely offline: the LM is not able to affect its own training distribution. This is unlike RLHF, where the LM learns from self-generated data and thus is more likely to lead to risks from auto-induced distribution shift or gradient hacking.
That vindicated a narrower claim about the AI’s inability to hack or affect its own training distribution, though I don’t know how much that supports my broader thesis about immunity to hacking.
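To spell out the offline/online distinction that quote is pointing at, here’s a schematic toy (my sketch, not the paper’s code): in the offline setup the training distribution is fixed in advance, while in the RLHF-style online setup the model’s own outputs become its next batch, which is the only channel through which it could steer its own training distribution.

```python
import random

random.seed(0)

class ToyLM:
    """Stand-in 'model': just records what it was trained on."""
    def __init__(self):
        self.seen = []
    def update(self, batch):
        self.seen.extend(batch)
    def generate(self, prompt):
        # Output depends on the model's current state.
        return f"{prompt} -> continuation #{len(self.seen)}"

def train_offline(lm, fixed_corpus, steps=3):
    """Offline (PHF-style): batches come from a corpus fixed before training."""
    for _ in range(steps):
        lm.update(random.sample(fixed_corpus, 2))  # never depends on lm

def train_online(lm, prompts, steps=3):
    """Online (RLHF-style): the model's own outputs are what it trains on next."""
    for _ in range(steps):
        lm.update([lm.generate(p) for p in prompts])  # depends on current lm

offline_lm, online_lm = ToyLM(), ToyLM()
train_offline(offline_lm, ["doc A", "doc B", "doc C", "doc D"])
train_online(online_lm, ["prompt 1", "prompt 2"])
print(offline_lm.seen)  # only pre-existing documents
print(online_lm.seen)   # text the model itself produced
```

Nothing here says the offline model is therefore safe, only that the “affect your own training distribution” channel the quoted passage worries about in RLHF is structurally absent.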
To give another reason why I’m so optimistic about alignment: I think alignment is scalable. Put another way, while Pretraining from Human Feedback is imperfect right now (and even with its imperfections, my view is that it would avoid X-risk almost entirely), small, consistent improvements in the vein of empirical work will eventually make it far more aligned than the original Pretraining from Human Feedback work. In the case of more data, they tested this, and alignment increased with more data.
To edit a quote from Thoth Hermes:
“Yudkowsky was wrong in the tendency to assume that certain abstractions just don’t apply whenever intelligence or capability is scaled way up.”
This essentially explains my issues with the idea that alignment isn’t scalable.