Around the 1:25:00 mark, I’m not sure I agree with Yudkowsky’s point that AI won’t be able to help with alignment (only?) because those systems will be trained to get a thumbs up from the humans rather than to give the real answers.
For example, if the Wright brothers had asked me about how wings produce lift, I may have only told them “It’s Bernoulli’s principle, and here’s how that works...” and spoken nothing about the Coanda effect—which they also needed to know about—because it was just enough to get the thumbs up from them. But...
But that still would’ve been a big step in the right direction for them. They could’ve then run experiments, seen that Bernoulli’s principle doesn’t explain the full story, and asked me for more information, and at that point I would’ve had to tell them about the Coanda effect.
There’s also the possibility that what gets the thumbs up from the humans actually just is the truth.
For another example, if I ask a weak AGI for the cube root of 148,877, the only answer that gets a thumbs up is going to be 53, because I can easily check that answer.
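To make that verification asymmetry concrete, here’s a minimal sketch in Python (the numbers are just the ones from my example, and the helper name is made up):

```python
# Minimal sketch of the verification-vs-generation asymmetry:
# checking a proposed cube root is a single multiplication,
# no matter how the answer was produced.

def verify_cube_root(n: int, proposed: int) -> bool:
    """Return True if proposed**3 equals n."""
    return proposed ** 3 == n

# The suggestion could come from anywhere (an AI, a guess, a lookup);
# the check stays cheap and fully under the human's control.
print(verify_cube_root(148_877, 53))  # True  -> thumbs up
print(verify_cube_root(148_877, 52))  # False -> thumbs down
```

The point is that the thumbs up here can only go to the true answer, because checking it costs far less than trusting the system that produced it.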
So long as you remain skeptical and keep trying to learn more, I’m not seeing the issue. And of course, hanging over your head the entire time is the knowledge of exactly what the AGI is doing, so anyone with half a brain WOULD remain skeptical.
This could potentially also get you into a feedback loop of the weak explanations allowing you to slightly better align the AGI you’re using, which can then make it give you better answers.
Yudkowsky may have other reasons for thinking that weak AGI can’t help us in this way though, so IDK.
I agree it was a pretty weak point. I wonder if there is a longer form exploration of this topic from Eliezer or somebody else.
I think it is even contradictory. Eliezer says that AI alignment is solvable by humans and that verification is easier than the solution. But then he claims that humans wouldn’t even be able to verify answers.
I think a charitable interpretation could be “it is not going to be as usable as you think”. But perhaps I misunderstand something?
Humans, presumably, won’t have to deal with deception between themselves, so if there is sufficient time they can solve Alignment. If pressed for time (as it is now), they will have to implement less-understood solutions, because that’s the best they will have at the time.
Capabilities advance much faster than alignment, so there is likely no time to do meticulous research. And if you try to use weak AIs as a shortcut to outrun the current “capabilities timeline”, you will somehow have to deal with the suggester-and-verifier problem (with suggestions much harder to verify than simple math problems), which is not wholly about deception but also about filtering the somewhat-working stuff that may steer alignment in the right direction. And maybe not.
But I agree that this collaboration will be successfully used for patchwork (because of the shortcuts) alignment of weak AIs to placate the general public and politicians. All of this depends on how hard the Alignment problem is: as hard as EY thinks, or maybe harder, or easier.