Memory and coordination issues are one reason why this sort of approach doesn’t work as well for humans, but in the context of Carlsmith’s post, AIs probably have much lower coordination costs and can be assumed to have similar memories and memory capacities (at least within a given training run).
And here, a very useful empirical rule is that there’s an exponential drop-off in output as you get less intelligent, but there are no hard intelligence barriers to learning something, only a lower and lower chance of learning it within a given time period.
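Here is a minimal toy model of that rule, just to make the claim concrete. The functional form, the capability/difficulty scale, and the decay constant `k` are all illustrative assumptions of mine, not anything from Shulman or Carlsmith: the per-unit-time chance of learning something falls off exponentially as capability drops below the difficulty of the task, but it never hits zero, so a longer time budget can substitute for raw intelligence.

```python
import math

def p_learn_within(time_budget: float, capability: float, difficulty: float, k: float = 2.0) -> float:
    """Toy model: probability of learning a task of a given difficulty within a time budget.

    The per-unit-time learning rate decays exponentially as capability falls below
    the task's difficulty, but it never reaches zero, so there is no hard barrier,
    just a longer and longer expected time to learn.
    """
    rate = math.exp(-k * max(0.0, difficulty - capability))
    return 1.0 - math.exp(-rate * time_budget)

# A less capable learner still gets there eventually; it just needs far more time.
print(p_learn_within(time_budget=10, capability=5, difficulty=5))    # close to 1
print(p_learn_within(time_budget=10, capability=3, difficulty=5))    # much lower
print(p_learn_within(time_budget=1000, capability=3, difficulty=5))  # close to 1 again
```

On this model, lowering capability multiplies the expected time to learn rather than zeroing out the probability, which is the “no hard barrier” claim in toy form.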
This rule is what motivates some hope for the scalable oversight agenda: as long as you keep each capability gain small enough, or you have enough trusted AI labor (probably both in reality), you can use trusted, dumber labor to create a limited number of smarter AI systems that are also trusted, then copy the new, smarter AIs to research the alignment problem, and repeat.
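To make the control flow of that loop concrete, here is a rough sketch. Everything in it (the Model class, the placeholder training and oversight functions, the numeric capability scale) is a hypothetical stand-in of mine, not a real training API; the placeholder steps are of course where all the actual difficulty lives.

```python
from dataclasses import dataclass, replace

@dataclass
class Model:
    capability: int   # abstract capability level; the scale is purely illustrative
    trusted: bool

def train_smarter_model(overseer: Model, target: int) -> Model:
    """Placeholder: train a slightly more capable model under the overseer's supervision."""
    return Model(capability=target, trusted=False)

def passes_oversight(candidate: Model, overseer: Model, max_gap: int) -> bool:
    """Placeholder trust check: only sign off if the capability gap stayed small
    enough for the trusted overseer to evaluate the candidate's behavior."""
    return candidate.capability - overseer.capability <= max_gap

def do_alignment_research(workers: list) -> None:
    """Placeholder: many copies of the newly trusted model work on the alignment problem."""
    pass

def bootstrap(trusted: Model, ceiling: int, step: int = 1, copies: int = 100) -> Model:
    while trusted.capability < ceiling:
        # Keep each capability gain small enough for the current trusted labor to oversee.
        target = min(trusted.capability + step, ceiling)
        candidate = train_smarter_model(overseer=trusted, target=target)
        if not passes_oversight(candidate, trusted, max_gap=step):
            raise RuntimeError("Candidate failed oversight; stop and investigate.")
        candidate.trusted = True
        # Copy the new, smarter, trusted model and point the copies at alignment research.
        do_alignment_research([replace(candidate) for _ in range(copies)])
        # Repeat, with the new model as the trusted overseer for the next round.
        trusted = candidate
    return trusted

final_overseer = bootstrap(Model(capability=1, trusted=True), ceiling=10)
```

The interesting constraints in practice live inside the placeholders (how trust is actually verified, how much trusted labor each round requires), not in the loop itself.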
So unlike the case of lots of 4 year olds vs. weightlifting, where the 4 year olds have the problem of both not being as intelligent as adults and not being able to coordinate, here we only have the intelligence issue, and we know pretty well how far that scales in many scientific fields: more intelligence is basically always good, but there are no hard cutoffs or barriers to doing well (modulo memory capacity issues, which are a real problem, but we can usually assume that time, not memory, is the bottleneck to alignment via scalable oversight, especially in an intelligence explosion).
See these quotes from Carl Shulman (source linked below):
> Yeah. In science the association with things like scientific output, prizes, things like that, there’s a strong correlation and it seems like an exponential effect. It’s not a binary drop-off. There would be levels at which people cannot learn the relevant fields, they can’t keep the skills in mind faster than they forget them. It’s not a divide where there’s Einstein and the group that is 10 times as populous as that just can’t do it. Or the group that’s 100 times as populous as that suddenly can’t do it. The ability to do the things earlier with less evidence and such falls off at a faster rate in Mathematics and theoretical Physics and such than in most fields.
>
> Yes, people would have discovered general relativity just from the overwhelming data and other people would have done it after Einstein.
>
> No, that intuition is not necessarily correct. Machine learning certainly is an area that rewards ability but it’s also a field where empirics and engineering have been enormously influential. If you’re drawing the correlations compared to theoretical physics and pure mathematics, I think you’ll find a lower correlation with cognitive ability. Creating neural lie detectors that work involves generating hypotheses about new ways to do it and new ways to try and train AI systems to successfully classify the cases. The processes of generating the data sets of creating AIs doing their best to put forward truths versus falsehoods, to put forward software that is legit versus that has a trojan in it are experimental paradigms and in these experimental paradigms you can try different things that work. You can use different ways to generate hypotheses and you can follow an incremental experimental path. We’re less able to do that in the case of alignment and superintelligence because we’re considering having to do things on a very short timeline and it’s a case where really big failures are irrecoverable. If the AI starts rooting the servers and subverting the methods that we would use to keep it in check we may not be able to recover from that. We’re then less able to do the experimental procedures. But we can still do those in the weaker contexts where an error is less likely to be irrecoverable and then try and generalize and expand and build on that forward.
Here’s the link to the source of all these quotes:
https://www.lesswrong.com/posts/BdPjLDG3PBjZLd5QY/carl-shulman-on-dwarkesh-podcast-june-2023#Can_we_detect_deception_
So I expect coordination, and to a lesser extent interfaces, to be slack constraints for AIs by default (at least without AI control measures), compared to humans.